* mtd_info->size again (lengthy) @ 2008-06-08 5:38 Bruce_Leonard 2008-06-08 12:10 ` Jörn Engel 0 siblings, 1 reply; 17+ messages in thread From: Bruce_Leonard @ 2008-06-08 5:38 UTC (permalink / raw) To: linux-mtd There's been some chatter on the list in the last couple of months (mostly by me) about the limit on MTD devices of no more than 2GiB because struct mtd_info->size is a 32-bit value. A few weeks ago I asked for ideas on how to get around this limit, the most popular on being that I should change the meaning of mtd_info->size from "the size of the device in bytes" to "the size of the device in kilobytes". I took the suggestion to heart and when I started working on this last week, that was the direction I took. Now I'm in deep trouble. I'm having multiple problems with this solution. A quick grep of mtd->info turns up on the order of 225 hits in various and assorted drivers and core code, not counting other structures that have an mtd_info buried inside them. A lot of those hits are simple things like "if(off + len > mtd->info)". So if I change the meaning of mtd_info->size to KiB, then I have to change everyplace that touches/checkes/uses mtd_info->size to use KiB. A duanting task, especially as there are other places where the value of mtd_info->size is used to initialize other structures (JFFS2 leeps to mind). Would I then have to go through those other places and ripple the change in meaning there as well? This idea was getting out of hand in a hurry. So I stepped back and rethought this. Maybe just leaving the meaning as bytes and changing the type to 64-bits wasn't such a bad idea after all. As I said, a sizable percentage of the 225 hits are simple compares, the compiler would take care of my problem by promotion. Plus places like the NAND layer that does calculations by shifting mtd_info->size could safely be left alone. However, I'm still left with the problem of ripple effect into other layers like JFFS2 and some of the MTD map files. Plus I run into this problem with making it a 64-bit value: there are a few places (add_mtd_device() for example) where mtd_info->size is passed as a parameter of type size_t, which in my case on an mpc8347 is a 32-bit value. This is where I really start to get bogged down. size_t (please correct me if I'm wrong because I probably am) is defined to be a type that will tell you the largest item (in bytes) that can be accessed by that particular processor. In other words, the biggest structure you could access in the RAM on the local bus of an mpc8347 would be 4GiB, or 0x0 - 0xffffffff, a 32-bit value, and thus size_t on an 8347 is defined to be an unsigned long. That makes my poor little HW engineer brain hurt, though I think I get it. But wait....now I have 8GiB of NAND flash, connected to a PCI controller, handled by a single PCI driver and thus presented to the kernel as a single item, 8GiB in size which doesn't fit in size_t. So now, my call to mtd->unlock() in the function add_mtd_device() passes a length of 0 because mtd->size gets truncated to 32-bits! Ouch, my poor little HW engineer brain just had an aneurysm :(. We then thought, why don't we just break the NAND controller up into four controllers (we can do that becuase we rolled our own in an FPGA) with 2GiB of flash on each controller. Then we can just concatenate the four together in the MTD layer. Great idea. Not! Same problem. When you concatenate devices together in MTD, each devices size gets added to mtd_info->size, and again I've overflowed my 32-bit size. Then we thought, okay just have four 2GiB devices on /dev/mtd0-3. After all, we can mount as much stuff on the root directory as we want. Except my SW folks are telling me that, due to archiving requirements we will probably end up with sub-directories in excess of 2GiB which will make file management a real nightmare. Well, fine, so I'm back to trying to get the MTD layer to support >2GiB devices. And to be honest, I really believe it's the best solution for all. As flash densities continue to grow, I think there are going to be more and more embedded designs where using Compact Flash or some other SSD hidden behind an IDE interface is not going to be acceptible. This is a good thing to change and now is a good time. But folks, I'm stuck. I'm just a dumb HW engineer trying to get my box running. I'm willing to make the change and I want to make the change, but I need help. I hate asking, I know that the etiquette on the lists frowns on asking for specific help. And usually after a few days of poking I can come up with my own solution. It's not pretty but it works :). But right now I'm totally lost in the subtle complexeties and the shear size of MTD. It's way beyond me. What I need is a concrete code example from someone on what and how to go about expanding MTD to support large sizes. Can someone please help me out? If I can get a good solid idea of how to fix it, I'm more than happy to do the work of making all the code changes and putting together a patch. The open source community has been good to me and I'd love nothing more than being able to contribute something back. Thanks for listening. Bruce ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-08 5:38 mtd_info->size again (lengthy) Bruce_Leonard @ 2008-06-08 12:10 ` Jörn Engel 2008-06-08 12:13 ` Jörn Engel ` (4 more replies) 0 siblings, 5 replies; 17+ messages in thread From: Jörn Engel @ 2008-06-08 12:10 UTC (permalink / raw) To: Bruce_Leonard; +Cc: linux-mtd On Sat, 7 June 2008 22:38:21 -0700, Bruce_Leonard@selinc.com wrote: > > But folks, I'm stuck. Sometimes the best answer comes in diff -u form, see below. This has been on my TODO list for a while. Long-term we want to get rid of read, write and erase in drivers, replacing them with submit_fio (or some equivalent function, if people dislike the name). Submit_fio has the advantage of allowing asynchronous operations, queueing, out-of-order completion and all those goodies. If you have several chips in a device, I'm sure you want those as well. And while at it, replace the 32bit size with a 64bit blockno and 32bit offset. Unless someone is stupid enough to create devices with 4G erasesize, that should last for a human lifetime or something close. It still leaves the problem of converting all driver. I've only started on mtdram - because it is simple and everyone can test it. Other drivers can follow over time. Users like jffs2 could be converted as well, but I'd rather leave them for now. We want synchronous wrappers around submit_fio anyway, as those are useful for development and for trivial performance-agnostic code. So for the time being, users can simply use the wrappers. Which leaves four quadrants: old driver new driver old user 1 2 new user 3 4 1 is fine, 4 is fine. 2 needs to deal with the synchronous wrappers and 3 will also need some plumbing. Comments? Jörn -- Don't patch bad code, rewrite it. -- Kernigham and Pike, according to Rusty diff --git a/drivers/mtd/devices/mtdram.c b/drivers/mtd/devices/mtdram.c index 0399be1..6ddefa7 100644 --- a/drivers/mtd/devices/mtdram.c +++ b/drivers/mtd/devices/mtdram.c @@ -90,6 +90,50 @@ static int ram_write(struct mtd_info *mtd, loff_t to, size_t len, return 0; } +static int ram_submit_fio(struct mtd_info *mtd, struct fio *fio) +{ + int i, err = 0; + u32 ofs; + size_t *retlen; + const u_char *buf; + struct fio_vec *vec; + + /* This driver cannot deal with device >32bit yet */ + if (fio->fi_blockno * mtd->erasesize + fio->block_ofs + fio->fi_len + > 0xffffffff) + return -EIO; + + ofs = fio->fi_blockno * mtd->erasesize + fio->block_ofs; + + if (fio->fi_command == FI_ERASE) { + memset(mtd->priv + ofs, 0xff, fio->fi_len); + goto out; + } + + for (i = 0; i < fio->fi_vcnt; i++) { + vec = fio->fi_io_vec + i; + buf = kmap_atomic(vec->fv_page, KM_USER0); + switch (fio->fi_command) { + case FI_READ: + err = ram_read(mtd, ofs, vec->fv_len, &retlen, + buf + vec->fv_offset); + break; + case FI_WRITE: + err = ram_write(mtd, ofs, vec->fv_len, &retlen, + buf + vec->fv_offset); + default: + BUG(); + } + kunmap_atomic(buf, KM_USER0); + ofs += vec->fv_len; + if (err) + goto out; + } +out: + fio->fi_end_io(fio, err); + return 0; +} + static void __exit cleanup_mtdram(void) { if (mtd_info) { @@ -109,8 +153,10 @@ int mtdram_init_device(struct mtd_info *mtd, void *mapped_address, mtd->type = MTD_RAM; mtd->flags = MTD_CAP_RAM; mtd->size = size; - mtd->writesize = 1; + mtd->no_eraseblocks = size; + do_div(mtd->no_eraseblocks, MTDRAM_ERASE_SIZE); mtd->erasesize = MTDRAM_ERASE_SIZE; + mtd->writesize = 1; mtd->priv = mapped_address; mtd->owner = THIS_MODULE; diff --git a/include/linux/mtd/mtd.h b/include/linux/mtd/mtd.h index 245f909..9729a17 100644 --- a/include/linux/mtd/mtd.h +++ b/include/linux/mtd/mtd.h @@ -98,10 +98,51 @@ struct mtd_oob_ops { uint8_t *oobbuf; }; +enum fi_command { + FI_READ, + FI_WRITE, + FI_ERASE, +}; + +struct fio_vec { + struct page *fv_page; + unsigned int fv_len; + unsigned int fv_offset; +}; + +typedef void (fio_end_io_t) (struct fio *, int); +/* + * struct fio (flash io) is modeled after struct bio and where possible the + * fields have been kept identical. Some fields have gone missing and others + * been added. + */ +struct fio { + /* Fields shared with struct bio */ + struct fio *fi_next; + unsigned long fi_flags; + unsigned long fi_command; + unsigned short fi_vcnt; + struct fio_vec fi_io_vec; + fio_end_io_t fi_end_io; + void *fi_private; + /* Fields specific to flash */ + u64 fi_blockno; + u32 fi_blockofs; + u32 fi_len; +}; + struct mtd_info { u_char type; u_int32_t flags; - u_int32_t size; // Total size of the MTD + /* + * size is becoming redundant, as it can be calculated from + * no_eraseblocks * erasesize. There is a fair chance it will die + * in the near future. For flashes with variable erase sizes, + * no_eraseblocks signifies the "normal" size, so above calculation + * remains valid. + */ + u_int64_t size; // Total size of the MTD + u_int64_t no_eraseblocks; /* "Major" erase size for the device. Naïve users may take this * to be the only erase size available, or may use the more detailed @@ -151,6 +192,7 @@ struct mtd_info { void (*unpoint) (struct mtd_info *mtd, loff_t from, size_t len); + int (*submit_bio) (struct mtd_info *mtd, struct fio *fio); int (*read) (struct mtd_info *mtd, loff_t from, size_t len, size_t *retlen, u_char *buf); int (*write) (struct mtd_info *mtd, loff_t to, size_t len, size_t *retlen, const u_char *buf); ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-08 12:10 ` Jörn Engel @ 2008-06-08 12:13 ` Jörn Engel 2008-06-09 2:51 ` Bruce_Leonard ` (3 subsequent siblings) 4 siblings, 0 replies; 17+ messages in thread From: Jörn Engel @ 2008-06-08 12:13 UTC (permalink / raw) To: Bruce_Leonard; +Cc: linux-mtd On Sun, 8 June 2008 14:10:20 +0200, Jörn Engel wrote: > > Comments? And before anyone bothers complaining, the patch is completely untested, I haven't even compiled it. Jörn -- One of my most productive days was throwing away 1000 lines of code. -- Ken Thompson. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-08 12:10 ` Jörn Engel 2008-06-08 12:13 ` Jörn Engel @ 2008-06-09 2:51 ` Bruce_Leonard 2008-06-09 5:21 ` Jörn Engel 2008-06-09 4:43 ` Bruce_Leonard ` (2 subsequent siblings) 4 siblings, 1 reply; 17+ messages in thread From: Bruce_Leonard @ 2008-06-09 2:51 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd Jörn Engel <joern@logfs.org> wrote on 06/08/2008 05:10:20 AM: > On Sat, 7 June 2008 22:38:21 -0700, Bruce_Leonard@selinc.com wrote: > > > > But folks, I'm stuck. > > Sometimes the best answer comes in diff -u form, see below. > > This has been on my TODO list for a while. Long-term we want to get rid Jorn, Thanks so much for the feedback. I think I get what you're saying, but let me paraphrase it back to make sure. My approach should be to make a new interface that is 64-bit clean, using one or more functions (I'm not sure I should shove ALL the functions that exist in mtd_info into a single submit_fio(), but I don't know). For my driver I should write the new function(s) to be 64-bit clean and just use the new interface. I can then change the core code that impacts my specific case to take advantage of the new interface (being carefull to make sure I don't break the old interface). Existing drivers can continue to use the existing interface but will eventually migrate to the new interface. If they can't be made 64-bit clean for whatever reason, the new interface just becomes a wrapper to the old functions. Sooner or later everything will be on the new interface and the old one can be removed. Does that sound about right? Just a few questions about the details of your (admitidley un-compiled and un-tested :) ) solution: > > +static int ram_submit_fio(struct mtd_info *mtd, struct fio *fio) > +{ I gather fio is something that's to be filled in each time ram_submit_fio is called rather than being a static thing in mtd_info? Sort of like struct mtd_oob_ops in the NAND layer gets filled in based on what's being called? > + int i, err = 0; > + u32 ofs; > + size_t *retlen; > + const u_char *buf; > + struct fio_vec *vec; > + > + /* This driver cannot deal with device >32bit yet */ > + if (fio->fi_blockno * mtd->erasesize + fio->block_ofs + fio->fi_len > + > 0xffffffff) > + return -EIO; > + > + ofs = fio->fi_blockno * mtd->erasesize + fio->block_ofs; > + > + if (fio->fi_command == FI_ERASE) { > + memset(mtd->priv + ofs, 0xff, fio->fi_len); > + goto out; > + } > + > + for (i = 0; i < fio->fi_vcnt; i++) { > + vec = fio->fi_io_vec + i; > + buf = kmap_atomic(vec->fv_page, KM_USER0); > + switch (fio->fi_command) { > + case FI_READ: > + err = ram_read(mtd, ofs, vec->fv_len, &retlen, > + buf + vec->fv_offset); I'm not famliar enough with block I/O to know what advantages or disadvantages is has, but since all the upper layers already allocate a buffer for putting date into or taking data out of, wouldn't it make more sense/be simpler to just keep using it? For example, mtd_read() in mtdchar.c does a kmalloc to get a buffer that is then passed to mtd->read(). Wouldn't it be simpler to just keep using that rather than switching to struct page? Or is there something that I'm missing that struct page brings to the game? > + break; > + case FI_WRITE: > + err = ram_write(mtd, ofs, vec->fv_len, &retlen, > + buf + vec->fv_offset); > + default: > + BUG(); > + } > + kunmap_atomic(buf, KM_USER0); > + ofs += vec->fv_len; > + if (err) > + goto out; > + } > +out: > + fio->fi_end_io(fio, err); Not sure I follow the logic of the typedef here and what it is you're trying to accomplish. <snip> > + > +typedef void (fio_end_io_t) (struct fio *, int); > +/* <snip> > struct mtd_info { > u_char type; > u_int32_t flags; > - u_int32_t size; // Total size of the MTD > + /* > + * size is becoming redundant, as it can be calculated from > + * no_eraseblocks * erasesize. There is a fair chance it will die > + * in the near future. For flashes with variable erase sizes, > + * no_eraseblocks signifies the "normal" size, so above calculation > + * remains valid. > + */ Which calculation are you refering to? > + u_int64_t size; // Total size of the MTD > + u_int64_t no_eraseblocks; > In my original email I mentioned that I ran into a few problems makeing size a 64-bit value. Did you have any problems with it? Thanks again for the feedback. It's given me something I think I can sink my teeth into and finally make some progress. At least I hope so :) Bruce ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-09 2:51 ` Bruce_Leonard @ 2008-06-09 5:21 ` Jörn Engel 0 siblings, 0 replies; 17+ messages in thread From: Jörn Engel @ 2008-06-09 5:21 UTC (permalink / raw) To: Bruce_Leonard; +Cc: linux-mtd On Sun, 8 June 2008 19:51:47 -0700, Bruce_Leonard@selinc.com wrote: > Jörn Engel <joern@logfs.org> wrote on 06/08/2008 05:10:20 AM: > > Thanks so much for the feedback. I think I get what you're saying, but > let me paraphrase it back to make sure. My approach should be to make a > new interface that is 64-bit clean, using one or more functions (I'm not > sure I should shove ALL the functions that exist in mtd_info into a single > submit_fio(), but I don't know). For my driver I should write the new > function(s) to be 64-bit clean and just use the new interface. I can then > change the core code that impacts my specific case to take advantage of > the new interface (being carefull to make sure I don't break the old > interface). Existing drivers can continue to use the existing interface > but will eventually migrate to the new interface. Pretty much. > If they can't be made > 64-bit clean for whatever reason, the new interface just becomes a wrapper > to the old functions. Everything can be made 64bit clean. The only reason against it is a crufty driver for obsolete hardware that noone wants to touch with a 10' pole. So such plumbing is an indication of a social problem (laziness) rather than a technical one. ;) > Sooner or later everything will be on the new > interface and the old one can be removed. Does that sound about right? Yes. > > +static int ram_submit_fio(struct mtd_info *mtd, struct fio *fio) > > +{ > > I gather fio is something that's to be filled in each time ram_submit_fio > is called rather than being a static thing in mtd_info? Sort of like > struct mtd_oob_ops in the NAND layer gets filled in based on what's being > called? Yes. > > + int i, err = 0; > > + u32 ofs; > > + size_t *retlen; > > + const u_char *buf; > > + struct fio_vec *vec; > > + > > + /* This driver cannot deal with device >32bit yet */ > > + if (fio->fi_blockno * mtd->erasesize + fio->block_ofs + fio->fi_len > > + > 0xffffffff) > > + return -EIO; > > + > > + ofs = fio->fi_blockno * mtd->erasesize + fio->block_ofs; > > + > > + if (fio->fi_command == FI_ERASE) { > > + memset(mtd->priv + ofs, 0xff, fio->fi_len); > > + goto out; > > + } > > + > > + for (i = 0; i < fio->fi_vcnt; i++) { > > + vec = fio->fi_io_vec + i; > > + buf = kmap_atomic(vec->fv_page, KM_USER0); > > + switch (fio->fi_command) { > > + case FI_READ: > > + err = ram_read(mtd, ofs, vec->fv_len, &retlen, > > + buf + vec->fv_offset); > > I'm not famliar enough with block I/O to know what advantages or > disadvantages is has, but since all the upper layers already allocate a > buffer for putting date into or taking data out of, wouldn't it make more > sense/be simpler to just keep using it? For example, mtd_read() in > mtdchar.c does a kmalloc to get a buffer that is then passed to > mtd->read(). Wouldn't it be simpler to just keep using that rather than > switching to struct page? Or is there something that I'm missing that > struct page brings to the game? Performance. If you want the same path to deal with data from both userspace and kernelspace, the only alternative would be memcpy, which isn't exactly known to aid in benchmarks. > > + break; > > + case FI_WRITE: > > + err = ram_write(mtd, ofs, vec->fv_len, &retlen, > > + buf + vec->fv_offset); > > + default: > > + BUG(); > > + } > > + kunmap_atomic(buf, KM_USER0); > > + ofs += vec->fv_len; > > + if (err) > > + goto out; > > + } > > +out: > > + fio->fi_end_io(fio, err); > > Not sure I follow the logic of the typedef here and what it is you're > trying to accomplish. Asynchronous completion. mtdram is a simple driver. But if you have a piece of hardware that actually takes time to do things, preferrably with a dma engine and generating interrupts, you have to queue those requests. An even better example is a device consisting of four independent chips. Here you want four independent queues. When a new request comes in, you decide which chip it belongs to and queue it for that chip. If one chip is busy and three are idle, one queue fills up and the odd request for the other three is handled fairly quickly. So io requests and io completion are in a different order. You may even want to have two queues per chip and always handle reads before dealing with any writes or erases. All this gives you extra performance. It just doesn't matter much for the stupid driver I used as an example. > > + > > +typedef void (fio_end_io_t) (struct fio *, int); > > +/* Or do you ask why I used a typedef instead of directly using the type in struct fio? No reason, really. bio works that way and I simply copied it for consistency. > > struct mtd_info { > > u_char type; > > u_int32_t flags; > > - u_int32_t size; // Total size of the MTD > > + /* > > + * size is becoming redundant, as it can be calculated from > > + * no_eraseblocks * erasesize. There is a fair chance it will die > > + * in the near future. For flashes with variable erase sizes, > > + * no_eraseblocks signifies the "normal" size, so above calculation > > + * remains valid. > > + */ > > Which calculation are you refering to? no_eraseblocks * erasesize > > + u_int64_t size; // Total size of the MTD > > + u_int64_t no_eraseblocks; > > In my original email I mentioned that I ran into a few problems makeing > size a 64-bit value. Did you have any problems with it? I didn't even compile. ;) Basically, you could pick either of the approaches in your mail - the result would be the same. You need to work your way through every single driver and every single user and fix things up. All work and no play. If we are going through the trouble anyway, we might as well add asynchronous stuff while at it. Which I want, badly. Plus, by having a new function (or set of functions, doesn't matter), we can easily check at runtime whether we're dealing with an old driver or a new one. Jörn -- Joern's library part 2: http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-08 12:10 ` Jörn Engel 2008-06-08 12:13 ` Jörn Engel 2008-06-09 2:51 ` Bruce_Leonard @ 2008-06-09 4:43 ` Bruce_Leonard 2008-06-09 5:28 ` Jörn Engel 2008-06-09 13:21 ` Jamie Lokier 2008-06-11 0:09 ` Bruce_Leonard 4 siblings, 1 reply; 17+ messages in thread From: Bruce_Leonard @ 2008-06-09 4:43 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd Jörn Engel <joern@logfs.org> wrote on 06/08/2008 05:10:20 AM: > > And while at it, replace the 32bit size with a 64bit blockno and 32bit > offset. Unless someone is stupid enough to create devices with 4G > erasesize, that should last for a human lifetime or something close. > What kind of block is blockno? An erase block (which is 128K in my case) or a write block (2K)? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-09 4:43 ` Bruce_Leonard @ 2008-06-09 5:28 ` Jörn Engel 0 siblings, 0 replies; 17+ messages in thread From: Jörn Engel @ 2008-06-09 5:28 UTC (permalink / raw) To: Bruce_Leonard; +Cc: linux-mtd On Sun, 8 June 2008 21:43:33 -0700, Bruce_Leonard@selinc.com wrote: > Jörn Engel <joern@logfs.org> wrote on 06/08/2008 05:10:20 AM: > > > And while at it, replace the 32bit size with a 64bit blockno and 32bit > > offset. Unless someone is stupid enough to create devices with 4G > > erasesize, that should last for a human lifetime or something close. > > What kind of block is blockno? An erase block (which is 128K in my case) > or a write block (2K)? eraseblock Jörn -- Those who come seeking peace without a treaty are plotting. -- Sun Tzu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-08 12:10 ` Jörn Engel ` (2 preceding siblings ...) 2008-06-09 4:43 ` Bruce_Leonard @ 2008-06-09 13:21 ` Jamie Lokier 2008-06-09 14:36 ` Jörn Engel 2008-06-11 0:09 ` Bruce_Leonard 4 siblings, 1 reply; 17+ messages in thread From: Jamie Lokier @ 2008-06-09 13:21 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Bruce_Leonard Jörn Engel wrote: > replacing them with submit_fio (or some equivalent function, if > people dislike the name). May I suggest submit_mtdio? One concept to one name, you know. F means flash, sure. But we don't call the devices /dev/flash0, /dev/flash1. Nor do we call it the Flash subsystem. Or, maybe we should change the name to flash everywhere. Most people recognise that nowadays - flash is a common term. > Submit_fio has the advantage of allowing asynchronous operations, > queueing, out-of-order completion and all those goodies. If you > have several chips in a device, I'm sure you want those as well. Yes! I always wondered why it wasn't done like that from the beginning. > And while at it, replace the 32bit size with a 64bit blockno and 32bit > offset. Unless someone is stupid enough to create devices with 4G > erasesize, that should last for a human lifetime or something close. Although a single chip is unlikely to have 4GiB erasesize for a while, a cluster of chips might, and it should be possible to represent them as a single device. I strongly suggest supporting erasesizes comparable with the device size. That doesn't mean writesize has to be that big - writesize should fit in size_t (actually ssize_t), just like sys_write(). > > It still leaves the problem of converting all driver. I've only started > on mtdram - because it is simple and everyone can test it. Other > drivers can follow over time. Users like jffs2 could be converted as > well, but I'd rather leave them for now. > We want synchronous wrappers around submit_fio anyway, as those are > useful for development and for trivial performance-agnostic code. > Comments? I'm thinking we want asynchronous wrappers around the synchronous methods too, using a dispatch thread. So that drivers can be converted one by one, and old drivers don't need to be. Something like generic_submit_fio. -- Jamie ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-09 13:21 ` Jamie Lokier @ 2008-06-09 14:36 ` Jörn Engel 2008-06-09 16:31 ` Jamie Lokier 0 siblings, 1 reply; 17+ messages in thread From: Jörn Engel @ 2008-06-09 14:36 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mtd, Bruce_Leonard On Mon, 9 June 2008 14:21:17 +0100, Jamie Lokier wrote: > Jörn Engel wrote: > > replacing them with submit_fio (or some equivalent function, if > > people dislike the name). > > May I suggest submit_mtdio? One concept to one name, you know. > > F means flash, sure. But we don't call the devices /dev/flash0, /dev/flash1. > Nor do we call it the Flash subsystem. I never liked the name "mtd". It is unpronouncable and reeks of the kind of premature overgeneralization big corporations usually come up with. "One day this will cover all memory, including hard disks, cpu caches and those five bytes in the rtc." Or some such madness. But mtdio is even worse. Dude, what are you smoking? ;) > Or, maybe we should change the name to flash everywhere. Most people > recognise that nowadays - flash is a common term. The old name - as stupid as it is - has become part of the eternal interfaces. You'll never hunt it to extinction. > > And while at it, replace the 32bit size with a 64bit blockno and 32bit > > offset. Unless someone is stupid enough to create devices with 4G > > erasesize, that should last for a human lifetime or something close. > > Although a single chip is unlikely to have 4GiB erasesize for a while, > a cluster of chips might, and it should be possible to represent them > as a single device. When you have a cluster of chips, striping them and increasing writesize and erasesize (and having each bad block kill perfectly good blocks in the other stripes) may be the first idea you have. But I hope it won't be the last. Part of the reason for asynchronous interfaces is to make parallel chips work fast without striping them. Or rather, make them work even faster than stripes. > I'm thinking we want asynchronous wrappers around the synchronous > methods too, using a dispatch thread. So that drivers can be > converted one by one, and old drivers don't need to be. Something like > generic_submit_fio. Maybe. That depends on when and how the users will take advantage of the new interface. I have plans for logfs, but don't intend to convert jffs2, mtdchar, mtdblock or ubi. If others do, please send patches. Jörn -- Public Domain - Free as in Beer General Public - Free as in Speech BSD License - Free as in Enterprise Shared Source - Free as in "Work will make you..." ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-09 14:36 ` Jörn Engel @ 2008-06-09 16:31 ` Jamie Lokier 2008-06-09 17:22 ` Jörn Engel 0 siblings, 1 reply; 17+ messages in thread From: Jamie Lokier @ 2008-06-09 16:31 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Bruce_Leonard Jörn Engel wrote: > > May I suggest submit_mtdio? One concept to one name, you know. > > > > F means flash, sure. But we don't call the devices /dev/flash0, /dev/flash1. > > Nor do we call it the Flash subsystem. > > I never liked the name "mtd". Me neither. > But mtdio is even worse. Dude, what are you smoking? ;) submit_fio suggests "file I/O" to me.... Even if I knew better, it looks just that bit too much like submit_bio for something file-ish. submit_mtd_request or async_mtd, then! :-) > > > And while at it, replace the 32bit size with a 64bit blockno and 32bit > > > offset. Unless someone is stupid enough to create devices with 4G > > > erasesize, that should last for a human lifetime or something close. > > > > Although a single chip is unlikely to have 4GiB erasesize for a while, > > a cluster of chips might, and it should be possible to represent them > > as a single device. > > When you have a cluster of chips, striping them and increasing writesize > and erasesize (and having each bad block kill perfectly good blocks in > the other stripes) may be the first idea you have. But I hope it won't > be the last. I didn't say it's a _good_ idea! Just that I won't be surprised when someone produces hardware like that. And, there will probably be a reason you haven't thought of why they did it that way. > Part of the reason for asynchronous interfaces is to make parallel chips > work fast without striping them. Or rather, make them work even faster > than stripes. And erasing simultaneous with streaming writes, let the CPU get on with something else, etc. What should the behaviour be with devices which block the CPU during synchronous writes and erases? Right now, you can be sure the temporary CPU blockage (if there is one) happens when you request the I/O. If it's asynchronous, depending on how it's implemented, the CPU could block at an unexpected time later. -- Jamie ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-09 16:31 ` Jamie Lokier @ 2008-06-09 17:22 ` Jörn Engel 2008-06-10 10:56 ` Jamie Lokier 0 siblings, 1 reply; 17+ messages in thread From: Jörn Engel @ 2008-06-09 17:22 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mtd, Bruce_Leonard On Mon, 9 June 2008 17:31:49 +0100, Jamie Lokier wrote: > > > But mtdio is even worse. Dude, what are you smoking? ;) > > submit_fio suggests "file I/O" to me.... Even if I knew better, it > looks just that bit too much like submit_bio for something file-ish. > > submit_mtd_request or async_mtd, then! :-) No, the similarities to submit_bio are _not_ accidental. In a perfect world, I would have used submit_bio. In our less than perfect world, I want to retain as many similarities as possible. We have two storage subsystems in the kernel that have lived their own seperate lives for all too long. They should get combined by some brave soul. And while I am not that brave soul, at least I can make his/her work a little easier. > I didn't say it's a _good_ idea! Just that I won't be surprised when > someone produces hardware like that. And, there will probably be a > reason you haven't thought of why they did it that way. If someone is stupid enough to create such hardware, I don't mind if they suffer. Quite the contrary. Send them the darwin award of hardware engineering! > > Part of the reason for asynchronous interfaces is to make parallel chips > > work fast without striping them. Or rather, make them work even faster > > than stripes. > > And erasing simultaneous with streaming writes, let the CPU get on > with something else, etc. Same with asynchronous erases on all chips - except that the blocks being erased don't need any special relationship. > What should the behaviour be with devices which block the CPU during > synchronous writes and erases? Right now, you can be sure the > temporary CPU blockage (if there is one) happens when you request the > I/O. If it's asynchronous, depending on how it's implemented, the CPU > could block at an unexpected time later. Good devices should have a dma engine and an interrupt line. Lesser devices can set a timer in the driver. If the cpu has to actively do some work during erase, you won't win any benchmark crowns with it anyway and I care fairly little. If you are designing hardware and want it to be fast, make sure it can handle at least one request autonomously. Better yet add a small queue with - say - 16 entries and simply handle them in order. Reads/writes should dma the data to/from main memory. Each completed request yields an interrupt. This is not rocket science. If you have any trouble implementing something like this, just talk to some network/ide/scsi person. Crap design like dual plane chips will disappear over time. Once you get to 64 parallel planes, it is obvious that the same approach no longer works. Jörn -- Doubt is not a pleasant condition, but certainty is an absurd one. -- Voltaire ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-09 17:22 ` Jörn Engel @ 2008-06-10 10:56 ` Jamie Lokier 2008-06-10 11:46 ` Jörn Engel 0 siblings, 1 reply; 17+ messages in thread From: Jamie Lokier @ 2008-06-10 10:56 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Bruce_Leonard Jörn Engel wrote: > > I didn't say it's a _good_ idea! Just that I won't be surprised when > > someone produces hardware like that. And, there will probably be a > > reason you haven't thought of why they did it that way. > > If someone is stupid enough to create such hardware, I don't mind if > they suffer. Quite the contrary. Send them the darwin award of > hardware engineering! Erm, I think a good reason will crop up in a few years that you haven't thought of. Like a fast system with 128TiB molecular flash where the erase power circuit is 10^7 x larger than each storage bit. I just don't see any advantage to assuming "nobody will ever use 4GiB erase blocks", when changing the API with larger sizes. > > What should the behaviour be with devices which block the CPU during > > synchronous writes and erases? Right now, you can be sure the > > temporary CPU blockage (if there is one) happens when you request the > > I/O. If it's asynchronous, depending on how it's implemented, the CPU > > could block at an unexpected time later. > > Good devices should have a dma engine and an interrupt line. Lesser > devices can set a timer in the driver. If the cpu has to actively do > some work during erase, you won't win any benchmark crowns with it > anyway and I care fairly little. > > If you are designing hardware ... Back in the real world, we want to run current kernels on hardware which is already built, or which is not designed for current kernels. We also want to run current kernels on hardware designed by people who care about minimising hardware cost, or design time, or using the cheapest available old chips. Fine, the async interface isn't designed for that. If it's only used for high-performance chips, that's fine with me. But if you want to make the async interface the _central_ interface for kernel flash API in future, around which everything should revolve, please at least specify what behaviour to expect with non-async devices - particularly, how should chip drivers behave. There are plenty of slow, crappy parts in use for good reasons, cost, size, and compatibility with other software or reference designs, primarily. I have a particularly crappy one where erase blocks the CPU if the CPU reads from the chip during the erase - so the cfi_cmdset_0002.c file needs a patch to avoid polling the status register for this board. This is topic for another post, though. -- Jamie ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-10 10:56 ` Jamie Lokier @ 2008-06-10 11:46 ` Jörn Engel 0 siblings, 0 replies; 17+ messages in thread From: Jörn Engel @ 2008-06-10 11:46 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mtd, Bruce_Leonard On Tue, 10 June 2008 11:56:48 +0100, Jamie Lokier wrote: > > I just don't see any advantage to assuming "nobody will ever use 4GiB > erase blocks", when changing the API with larger sizes. *shrug* I use u64 for the number of eraseblocks because it doesn't take a lot of imagination to see u32 overflow before I retire. It didn't take a lot of imagination to see the u32 size problem coming either. Relying on "a good reason that I haven't thought of" is a bit of a stretch. Let me put it this way: if you spend the time and create all the patches, you can use u128 for all I care. > I have a particularly crappy one where erase blocks the CPU if the CPU > reads from the chip during the erase - so the cfi_cmdset_0002.c file > needs a patch to avoid polling the status register for this board. > This is topic for another post, though. Nothing can prevent bad drivers. At least they are easy to fix. Jörn -- "Security vulnerabilities are here to stay." -- Scott Culp, Manager of the Microsoft Security Response Center, 2001 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-08 12:10 ` Jörn Engel ` (3 preceding siblings ...) 2008-06-09 13:21 ` Jamie Lokier @ 2008-06-11 0:09 ` Bruce_Leonard 2008-06-11 7:18 ` Jörn Engel 4 siblings, 1 reply; 17+ messages in thread From: Bruce_Leonard @ 2008-06-11 0:09 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd Jorn, Well I'm glad I started a lively discussion :). Most of what you and Jamie are talking about are beyond me though so I don't think I'll try and contribute. I'll just keep asking questions to try and understand. I think I'm starting to glean some things and one of those things is that I think I'm in over my head :(. Are you suggesting that the direction we want to move is to change the MTD layer over to function the same way as the Block I/O layer? With all of it's attendant infrastructure? I've read through the 'Block Device Drivers' chapter of 'Understanding the Linux Kernal'. It talks about the I/O scheduler and queues of requests consisting of multiple bios which can consist of multiple segments (bio_vec?), and a WHOLE lot of 'stuff'. If that's what we're talking about then it's WAY beyond my scope or my abilities. I'm trying to solve a very narrow problem (i.e., I want to access 8GiB of flash as a single device using UBI/UBIFS). So here's what I suggest I can do for you and you can tell me if it'll be of any help. I can continue to ask questions about the code you posted so I understand both the plumbing end and the calling end (I'm still a bit confused on the struct page part and how to use it), implement basically just what you showed me (not anything like I've read about BIO), and try to make it work for my UBI/MTD/NAND case. That would be a start to the changes without being beyond my abilities and I can do it in a way that should't break any existing code. Sound resonable or just a waste of your's and the list's time? On to some questions: > +static int ram_submit_fio(struct mtd_info *mtd, struct fio *fio) > +{ > + int i, err = 0; > + u32 ofs; > + size_t *retlen; > + const u_char *buf; > + struct fio_vec *vec; > + > + /* This driver cannot deal with device >32bit yet */ > + if (fio->fi_blockno * mtd->erasesize + fio->block_ofs + fio->fi_len > + > 0xffffffff) > + return -EIO; > + > + ofs = fio->fi_blockno * mtd->erasesize + fio->block_ofs; > + > + if (fio->fi_command == FI_ERASE) { > + memset(mtd->priv + ofs, 0xff, fio->fi_len); > + goto out; > + } > + > + for (i = 0; i < fio->fi_vcnt; i++) { > + vec = fio->fi_io_vec + i; > + buf = kmap_atomic(vec->fv_page, KM_USER0); > + switch (fio->fi_command) { > + case FI_READ: > + err = ram_read(mtd, ofs, vec->fv_len, &retlen, > + buf + vec->fv_offset); I don't understand 'struct page' which is probably where my confusion comes from. 'ofs' passed to ram_read() is calculated from fi_blockno and fi_blockofs. This makes sense as ram_read() is going to use this as the starting location in the "mtd" device to read from. 'buf' is calculated from vec->fv_page and vec->fv_offset. I think this makes sense, as this is used by ram_read() is the starting location in fv_page to write to. 'len' confuses me a bit as it comes from vec->fv_len rather than fi_len. Why use the length from the vector rather than the struct fio? Should they be initialized to the same thing? I would think to use fi_len as the calling function should have loaded that at the same time it figured out where to start from (i.e., when it loaded fi_blockno and fi_blockofs). I'm sure I'm missing something because I've never used struct page before and it's probably obvious to someone with more kernel experiance. Does vec->fv_len represent the remaining length of vec->fv_page? And vec->fv_offset the offset to start accessing vec->fv_page from? Sorry to be asking dumb questions. Bruce ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-11 0:09 ` Bruce_Leonard @ 2008-06-11 7:18 ` Jörn Engel 2008-06-11 7:46 ` Bruce_Leonard 0 siblings, 1 reply; 17+ messages in thread From: Jörn Engel @ 2008-06-11 7:18 UTC (permalink / raw) To: Bruce_Leonard; +Cc: linux-mtd On Tue, 10 June 2008 17:09:35 -0700, Bruce_Leonard@selinc.com wrote: > > I think I'm starting to glean some things and one of those things is that > I think I'm in over my head :(. Are you suggesting that the direction we > want to move is to change the MTD layer over to function the same way as > the Block I/O layer? Either that or move the block io layer to function the same way as mtd. Effectively the two should meet somewhere in between, but it's going to be a lot closer to current block io than current mtd. We want all the goodies of both worlds, and block io simply has more. > With all of it's attendant infrastructure? I've > read through the 'Block Device Drivers' chapter of 'Understanding the > Linux Kernal'. It talks about the I/O scheduler and queues of requests > consisting of multiple bios which can consist of multiple segments > (bio_vec?), and a WHOLE lot of 'stuff'. If that's what we're talking > about then it's WAY beyond my scope or my abilities. That's ok. I don't expect you to do it all. What I would like is to solve your problem in a way that brings us closer to block io instead of farther away. > > +static int ram_submit_fio(struct mtd_info *mtd, struct fio *fio) > > +{ > > + int i, err = 0; > > + u32 ofs; > > + size_t *retlen; > > + const u_char *buf; > > + struct fio_vec *vec; > > + > > + /* This driver cannot deal with device >32bit yet */ > > + if (fio->fi_blockno * mtd->erasesize + fio->block_ofs + fio->fi_len > > + > 0xffffffff) > > + return -EIO; > > + > > + ofs = fio->fi_blockno * mtd->erasesize + fio->block_ofs; > > + > > + if (fio->fi_command == FI_ERASE) { > > + memset(mtd->priv + ofs, 0xff, fio->fi_len); > > + goto out; > > + } > > + > > + for (i = 0; i < fio->fi_vcnt; i++) { > > + vec = fio->fi_io_vec + i; > > + buf = kmap_atomic(vec->fv_page, KM_USER0); > > + switch (fio->fi_command) { > > + case FI_READ: > > + err = ram_read(mtd, ofs, vec->fv_len, &retlen, > > + buf + vec->fv_offset); > > 'len' confuses me a bit as it comes from vec->fv_len rather than fi_len. > Why use the length from the vector rather than the struct fio? Should > they be initialized to the same thing? I would think to use fi_len as the > calling function should have loaded that at the same time it figured out > where to start from (i.e., when it loaded fi_blockno and fi_blockofs). I'm > sure I'm missing something because I've never used struct page before and > it's probably obvious to someone with more kernel experiance. An io-vector is used to scatter-gather io. Basically you tell a driver to write those 4 meg of data in one huge chunk to the device. And, by the way, the data is currently scattered all over memory and the individual bits are here, here, here, here, here,... That's why we have a loop and within each iteration of the loop only one fragment is processed. Each fragment consists of the page, starting offset within the page and length. The sum of all fragments length fields should equal fio->fi_len. (And it might be a good idea to check that to catch bugs.) > Does vec->fv_len represent the remaining length of vec->fv_page? And > vec->fv_offset the offset to start accessing vec->fv_page from? Sorry to > be asking dumb questions. Just keep asking. Your questions are far from dumb. And lack of experience is often a good thing. Sometimes all the adults strongly believe the emperor is wearing clothes and it takes a child to point out the obvious facts. :) Jörn -- Homo Sapiens is a goal, not a description. -- unknown ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-11 7:18 ` Jörn Engel @ 2008-06-11 7:46 ` Bruce_Leonard 2008-06-11 9:13 ` Jörn Engel 0 siblings, 1 reply; 17+ messages in thread From: Bruce_Leonard @ 2008-06-11 7:46 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd > > > I think I'm in over my head :(. Are you suggesting that the direction we > > want to move is to change the MTD layer over to function the same way as > > the Block I/O layer? > > Either that or move the block io layer to function the same way as mtd. > Effectively the two should meet somewhere in between, but it's going to > be a lot closer to current block io than current mtd. We want all the > goodies of both worlds, and block io simply has more. > Okay, that makes sense. I think I'm actually starting to understand some of this :\. How tied up in the block layer is the I/O scheduler? With my limited understanding, it seems that is going to be the real sticking point in moving block and mtd io towards each other. If the I/O scheduler is largely decoupled from bio it may make this easier than I think. > > That's ok. I don't expect you to do it all. What I would like is to > solve your problem in a way that brings us closer to block io instead of > farther away. > Fortunately, I think I'm well on my way to doing that. I've taken your code snippets (okay pretty much stolen them wholesale, I hope that's okay) and started makeing changes based on them. The changes aren't really radical, I don't make use of the struct fio_vec and I'm just making submit_fio() a wrapper around existing NAND functions, simple stuff like that for now. That will at least get me working and maybe some proof of concept. I hope to have a patch set in the next day or two for the list to look over. Thanks again for all the help. Bruce ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: mtd_info->size again (lengthy) 2008-06-11 7:46 ` Bruce_Leonard @ 2008-06-11 9:13 ` Jörn Engel 0 siblings, 0 replies; 17+ messages in thread From: Jörn Engel @ 2008-06-11 9:13 UTC (permalink / raw) To: Bruce_Leonard; +Cc: linux-mtd On Wed, 11 June 2008 00:46:30 -0700, Bruce_Leonard@selinc.com wrote: > > Okay, that makes sense. I think I'm actually starting to understand some > of this :\. How tied up in the block layer is the I/O scheduler? With my > limited understanding, it seems that is going to be the real sticking > point in moving block and mtd io towards each other. If the I/O scheduler > is largely decoupled from bio it may make this easier than I think. The block io schedulers do two things. First they implement some kind of elevator, combining operations and reordering them for better disk head utilization. That is fairly useless in flash context. Some schedulers also add policy, like giving all process a fair share of io. That might become useful for flash as well. > Fortunately, I think I'm well on my way to doing that. I've taken your > code snippets (okay pretty much stolen them wholesale, I hope that's okay) > and started makeing changes based on them. The changes aren't really > radical, I don't make use of the struct fio_vec and I'm just making > submit_fio() a wrapper around existing NAND functions, simple stuff like > that for now. That will at least get me working and maybe some proof of > concept. I hope to have a patch set in the next day or two for the list > to look over. Cool. If you solve your problem, don't break anything existing and the infrastructure allows to handle existing drivers one by one, I am happy. Bonus points for changing existing drivers, of course. But I think a one-by-one migration is better than a flag-day. If the conversion introduces bugs to individual drivers, we can simply revert those bits and redo them at our leasure, instead of dealing with everything at once. Jörn -- I don't understand it. Nobody does. -- Richard P. Feynman ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2008-06-11 9:14 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-06-08 5:38 mtd_info->size again (lengthy) Bruce_Leonard 2008-06-08 12:10 ` Jörn Engel 2008-06-08 12:13 ` Jörn Engel 2008-06-09 2:51 ` Bruce_Leonard 2008-06-09 5:21 ` Jörn Engel 2008-06-09 4:43 ` Bruce_Leonard 2008-06-09 5:28 ` Jörn Engel 2008-06-09 13:21 ` Jamie Lokier 2008-06-09 14:36 ` Jörn Engel 2008-06-09 16:31 ` Jamie Lokier 2008-06-09 17:22 ` Jörn Engel 2008-06-10 10:56 ` Jamie Lokier 2008-06-10 11:46 ` Jörn Engel 2008-06-11 0:09 ` Bruce_Leonard 2008-06-11 7:18 ` Jörn Engel 2008-06-11 7:46 ` Bruce_Leonard 2008-06-11 9:13 ` Jörn Engel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox