* Jffs2 and big file = very slow jffs2_garbage_collect_pass
@ 2008-01-17 16:12 Matthieu CASTET
  2008-01-17 16:26 ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-17 16:12 UTC (permalink / raw)
To: linux-mtd, David Woodhouse

Hi,

we have a 240 MB jffs2 partition with summary enabled and no compression. We use the 2ad8ee713566671875216ebcec64f2eda47bd19d git jffs2 version (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d).

On this partition we have several files (each less than 1 MB) and one big file in the root (200 MB).

The big file is a FAT image that is exported with usb-storage (as a usb device) or mounted on a loopback device.

After some FAT operations, we manage to get into a situation where jffs2_garbage_collect_pass takes 12 minutes.

jffs2_lookup for the big file (triggered with an ls in the root) takes 12 minutes.

If we do an ls without waiting for jffs2_garbage_collect_pass to finish, the ls takes 12 minutes to complete.

We applied the 4 patches from "Trigger garbage collection when very_dirty_list size becomes excessive" to "Don't count all 'very dirty' blocks except in debug mode", but it doesn't change anything.

Why does jffs2 take so much time in jffs2_garbage_collect_pass for checking the nodes? Reading the whole raw flash takes about 40s-1min. Does it read the flash in a random order?

What does jffs2_lookup do? Why is it so slow?

What are the alternatives? Trying yaffs2? Exporting a smaller file?

Matthieu

PS: if the big file is moved into a subdirectory, then the ls in the root dir is fast, but access to the big file is slow (12 minutes).

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:12 Jffs2 and big file = very slow jffs2_garbage_collect_pass Matthieu CASTET
@ 2008-01-17 16:26 ` Jörn Engel
  2008-01-17 17:43   ` Josh Boyer
  ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Jörn Engel @ 2008-01-17 16:26 UTC (permalink / raw)
To: Matthieu CASTET; +Cc: David Woodhouse, linux-mtd

On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote:
> we have a 240 MB jffs2 partition with summary enabled and no compression. We use the 2ad8ee713566671875216ebcec64f2eda47bd19d git jffs2 version (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d)
>
> On this partition we have several files (each less than 1 MB) and one big file in the root (200 MB).
>
> The big file is a FAT image that is exported with usb-storage (as a usb device) or mounted on a loopback device.
>
> After some FAT operations, we manage to get into a situation where jffs2_garbage_collect_pass takes 12 minutes.
>
> jffs2_lookup for the big file (triggered with an ls in the root) takes 12 minutes.
>
> If we do an ls without waiting for jffs2_garbage_collect_pass to finish, the ls takes 12 minutes to complete.

Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not sure who cares enough to look at this. My approach would be to

$ echo t > /proc/sysrq-trigger

several times during those 12 minutes and take a close look at the code paths showing up. Most likely it will spend 99% of the time in one place.

Jörn

--
...one more straw can't possibly matter...
-- Kirby Bakken

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:26 ` Jörn Engel
@ 2008-01-17 17:43   ` Josh Boyer
  2008-01-18  9:39     ` Matthieu CASTET
  2008-01-18 17:20     ` Glenn Henshaw
  2008-01-17 23:22   ` David Woodhouse
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 34+ messages in thread
From: Josh Boyer @ 2008-01-17 17:43 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd, David Woodhouse, Matthieu CASTET

On Thu, 17 Jan 2008 17:26:01 +0100 Jörn Engel <joern@logfs.org> wrote:

> On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote:
> > we have a 240 MB jffs2 partition with summary enabled and no compression. [...]
> >
> > On this partition we have several files (each less than 1 MB) and one big file in the root (200 MB). [...]
> >
> > If we do an ls without waiting for jffs2_garbage_collect_pass to finish, the ls takes 12 minutes to complete.
>
> Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not

How do you know? A 200MiB file will likely have around 50,000 nodes. If the summary stuff is incorrect, and since we have no idea what kind of platform is being used here, it may well be within reason.

> sure who cares enough to look at this. My approach would be to
> $ echo t > /proc/sysrq-trigger
> several times during those 12 minutes and take a close look at the code paths showing up. Most likely it will spend 99% of the time in one place.

That's sound advice in any case.

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 17:43 ` Josh Boyer
@ 2008-01-18  9:39   ` Matthieu CASTET
  2008-01-18 12:48     ` Josh Boyer
  2008-01-18 17:20   ` Glenn Henshaw
  1 sibling, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-18 9:39 UTC (permalink / raw)
To: Josh Boyer; +Cc: linux-mtd, Jörn Engel, David Woodhouse

Josh Boyer wrote:
> On Thu, 17 Jan 2008 17:26:01 +0100 Jörn Engel <joern@logfs.org> wrote:
>
>>> If we do an ls without waiting for jffs2_garbage_collect_pass to finish, the ls takes 12 minutes to complete.
>>
>> Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not
>
> How do you know? A 200MiB file will likely have around 50,000 nodes.

Yes, the file has 41324 nodes.

> If the summary stuff is incorrect, and since we have no idea what kind of platform is being used here, it may well be within reason.

The summary stuff is correct (I checked it with a parser on a dump of the image). Besides, if the summary weren't correct, wouldn't only the mount time grow?

In my case the mount is ok: less than 5-10s.
The platform used is an arm926 @ 247 MHz.

Matthieu

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18  9:39 ` Matthieu CASTET
@ 2008-01-18 12:48   ` Josh Boyer
  2008-01-18 16:17     ` Matthieu CASTET
  0 siblings, 1 reply; 34+ messages in thread
From: Josh Boyer @ 2008-01-18 12:48 UTC (permalink / raw)
To: Matthieu CASTET; +Cc: David Woodhouse, Jörn Engel, linux-mtd

On Fri, 18 Jan 2008 10:39:29 +0100 Matthieu CASTET <matthieu.castet@parrot.com> wrote:

> Josh Boyer wrote:
> > How do you know? A 200MiB file will likely have around 50,000 nodes.
>
> Yes, the file has 41324 nodes.

Wow, I'm surprised I actually got close with my top of the head math :)

> > If the summary stuff is incorrect, and since we have no idea what kind of platform is being used here, it may well be within reason.
>
> The summary stuff is correct (I checked it with a parser on a dump of the image). Besides, if the summary weren't correct, wouldn't only the mount time grow?

Yes, you're correct.

> In my case the mount is ok: less than 5-10s.
> The platform used is an arm926 @ 247 MHz.

Ok, so running at 247MHz your board has to calculate the CRCs on 41324 nodes before the file can be opened. I have no idea how long that should really take, but you're doing about 57 nodes per second if it's taking 12 minutes.

As Jörn and David suggested, do some profiling to see where it is spending most of its time.

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 12:48 ` Josh Boyer
@ 2008-01-18 16:17   ` Matthieu CASTET
  2008-01-18 17:55     ` Josh Boyer
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-18 16:17 UTC (permalink / raw)
To: Josh Boyer; +Cc: David Woodhouse, Jörn Engel, linux-mtd

Josh Boyer wrote:
> On Fri, 18 Jan 2008 10:39:29 +0100 Matthieu CASTET <matthieu.castet@parrot.com> wrote:
>
>> In my case the mount is ok: less than 5-10s.
>> The platform used is an arm926 @ 247 MHz.
>
> Ok, so running at 247MHz your board has to calculate the CRCs on 41324 nodes before the file can be opened. I have no idea how long that should really take, but you're doing about 57 nodes per second if it's taking 12 minutes.
>
> As Jörn and David suggested, do some profiling to see where it is spending most of its time.

I sent a mail, but because the message has a suspicious header, it is waiting for moderator approval.

In summary, the code spends lots of time in the rbtree code (7 minutes) and 4 minutes in jffs2_get_inode_nodes.

Matthieu

 54366  rb_prev                   543,6600
 28345  rb_next                   283,4500
  8602  default_idle               71,6833
 10251  __raw_readsl               40,0430
 49648  jffs2_get_inode_nodes      11,8097
   251  s3c2412_nand_devready       7,8438
  1222  crc32_le                    4,8492

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 16:17 ` Matthieu CASTET
@ 2008-01-18 17:55   ` Josh Boyer
  2008-01-18 18:17     ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Josh Boyer @ 2008-01-18 17:55 UTC (permalink / raw)
To: Matthieu CASTET; +Cc: David Woodhouse, Jörn Engel, linux-mtd

On Fri, 18 Jan 2008 17:17:34 +0100 Matthieu CASTET <matthieu.castet@parrot.com> wrote:

> I sent a mail, but because the message has a suspicious header, it is waiting for moderator approval.
>
> In summary, the code spends lots of time in the rbtree code (7 minutes) and 4 minutes in jffs2_get_inode_nodes.
>
>  54366  rb_prev                   543,6600
>  28345  rb_next                   283,4500
>   8602  default_idle               71,6833
>  10251  __raw_readsl               40,0430
>  49648  jffs2_get_inode_nodes      11,8097
>    251  s3c2412_nand_devready       7,8438
>   1222  crc32_le                    4,8492

That seems consistent with JFFS2 doing the CRC checks and constructing the in-memory representation of your large file. I suspect the older list-based in-memory implementation would have taken even longer, but there could be something amiss with the rb-tree stuff perhaps.

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 17:55 ` Josh Boyer
@ 2008-01-18 18:17   ` Jörn Engel
  2008-01-21 15:57     ` Matthieu CASTET
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-18 18:17 UTC (permalink / raw)
To: Josh Boyer; +Cc: David Woodhouse, Jörn Engel, linux-mtd, Matthieu CASTET

On Fri, 18 January 2008 11:55:31 -0600, Josh Boyer wrote:
> That seems consistent with JFFS2 doing the CRC checks and constructing the in-memory representation of your large file. I suspect the older list-based in-memory implementation would have taken even longer, but there could be something amiss with the rb-tree stuff perhaps.

There is something conceptually amiss with rb-trees. Each node effectively occupies its own cacheline. With those 40k+ nodes, you would need a rather sizeable cache with at least 20k cachelines to have an impact. No one does. So for all practical purposes, every single lookup will go to main memory.

Maybe it is about time to suggest trying logfs?

Jörn

--
People will accept your ideas much more readily if you tell them that Benjamin Franklin said it first.
-- unknown

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 18:17 ` Jörn Engel
@ 2008-01-21 15:57   ` Matthieu CASTET
  2008-01-21 21:25     ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-21 15:57 UTC (permalink / raw)
To: Jörn Engel; +Cc: David Woodhouse, Josh Boyer, linux-mtd

Hi,

Jörn Engel wrote:
> On Fri, 18 January 2008 11:55:31 -0600, Josh Boyer wrote:
>> That seems consistent with JFFS2 doing the CRC checks and constructing the in-memory representation of your large file. I suspect the older list-based in-memory implementation would have taken even longer, but there could be something amiss with the rb-tree stuff perhaps.
>
> There is something conceptually amiss with rb-trees. Each node effectively occupies its own cacheline. With those 40k+ nodes, you would need a rather sizeable cache with at least 20k cachelines to have an impact. No one does. So for all practical purposes, every single lookup will go to main memory.
>
> Maybe it is about time to suggest trying logfs?

What's the status of logfs on NAND?

Last time I checked, it didn't manage bad blocks.

Matthieu

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-21 15:57 ` Matthieu CASTET @ 2008-01-21 21:25 ` Jörn Engel 2008-01-21 22:16 ` Josh Boyer 2008-01-21 22:36 ` Glenn Henshaw 0 siblings, 2 replies; 34+ messages in thread From: Jörn Engel @ 2008-01-21 21:25 UTC (permalink / raw) To: Matthieu CASTET; +Cc: David Woodhouse, Jörn Engel, linux-mtd, Josh Boyer On Mon, 21 January 2008 16:57:59 +0100, Matthieu CASTET wrote: > > What's the status of logfs on NAND ? > > Last time I check, it didn't manage badblock. Is that the only thing stopping you from using logfs? mklogfs handles bad blocks. Blocks rotting during lifetime are handled half-heartedly. If that is a real problem for you I wouldn't be surprised if you caught logfs on the wrong foot once or twice. Jörn -- Joern's library part 10: http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-21 21:25 ` Jörn Engel @ 2008-01-21 22:16 ` Josh Boyer 2008-01-21 22:29 ` Jörn Engel 2008-01-21 22:36 ` Glenn Henshaw 1 sibling, 1 reply; 34+ messages in thread From: Josh Boyer @ 2008-01-21 22:16 UTC (permalink / raw) To: Jörn Engel Cc: linux-mtd, Jörn Engel, David Woodhouse, Matthieu CASTET On Mon, 21 Jan 2008 22:25:56 +0100 Jörn Engel <joern@logfs.org> wrote: > On Mon, 21 January 2008 16:57:59 +0100, Matthieu CASTET wrote: > > > > What's the status of logfs on NAND ? > > > > Last time I check, it didn't manage badblock. > > Is that the only thing stopping you from using logfs? > > mklogfs handles bad blocks. Blocks rotting during lifetime are handled > half-heartedly. If that is a real problem for you I wouldn't be > surprised if you caught logfs on the wrong foot once or twice. Wait... you're writing a flash filesystem that doesn't really deal with bad blocks? josh ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 22:16 ` Josh Boyer
@ 2008-01-21 22:29   ` Jörn Engel
  2008-01-22  8:57     ` Matthieu CASTET
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-21 22:29 UTC (permalink / raw)
To: Josh Boyer; +Cc: linux-mtd, Jörn Engel, David Woodhouse, Matthieu CASTET

On Mon, 21 January 2008 16:16:12 -0600, Josh Boyer wrote:
> Wait... you're writing a flash filesystem that doesn't really deal with bad blocks?

I never said that. Like any other new piece of code, logfs has bugs. Plain and simple. And having blocks rot underneath you is something I don't have automated tests for, so don't be surprised to find bugs in this area.

If you send a patch or a nice test setup or even just a bug report, the chances of getting those bugs fixed improve.

Jörn

--
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 22:29 ` Jörn Engel
@ 2008-01-22  8:57   ` Matthieu CASTET
  2008-01-22 12:03     ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-22 8:57 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse

Jörn Engel wrote:
> On Mon, 21 January 2008 16:16:12 -0600, Josh Boyer wrote:
>> Wait... you're writing a flash filesystem that doesn't really deal with bad blocks?
>
> I never said that. Like any other new piece of code, logfs has bugs. Plain and simple. And having blocks rot underneath you is something I don't have automated tests for, so don't be surprised to find bugs in this area.

In mtd->read I see no checking for EBADMSG or EUCLEAN.

There is no call to mtd->block_markbad or mtd->block_isbad (it is only called in mtd_find_sb). So I don't see how this can work on NAND flash.

A good test could be to add bad block simulation to nandsim. There is a patch for this (http://lists.infradead.org/pipermail/linux-mtd/2006-December/017107.html). Note it doesn't simulate bit-flips on read.

Matthieu

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-22 8:57 ` Matthieu CASTET @ 2008-01-22 12:03 ` Jörn Engel 2008-01-22 13:24 ` Ricard Wanderlof 0 siblings, 1 reply; 34+ messages in thread From: Jörn Engel @ 2008-01-22 12:03 UTC (permalink / raw) To: Matthieu CASTET; +Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer On Tue, 22 January 2008 09:57:07 +0100, Matthieu CASTET wrote: > > On mtd->read I have see no checking for EBADMSG or EUCLEAN. Correct. The CRC check will barf when uncorrectable errors are encountered. Using -EUCLEAN as a trigger to scrub the blocks would be useful. > There no call to mtd->block_markbad or mtd->block_isbad (it is only > called in mtd_find_sb). Used to be there and was removed. mtd->erase() does the same as mtd->block_isbad(). Calling both would be redundant and a waste of time. And logfs has its own bad block table (bad segment table, actually), so mtd->block_markbad could only be called to play nice with others after filesystem gets nuked and the flash reused for something else. Not all devices define that method. For a while I carried a patch that would add a dummy noop call in add_mtd_device (noop call is faster than a conditional), but dropped it because it just doesn't matter enough. > Good test could be to add bad block simulation to nandsim. > There is some patch for this > (http://lists.infradead.org/pipermail/linux-mtd/2006-December/017107.html). > Note they don't simulate bit-flip on read. Ramtd can simulate bit-flips as well. A nice test setup needs a bit more than that, some way to do random, yet repeatable errors. Jörn -- ticks = jiffies; while (ticks == jiffies); ticks = jiffies; -- /usr/src/linux/init/main.c ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22 12:03 ` Jörn Engel
@ 2008-01-22 13:24   ` Ricard Wanderlof
  2008-01-22 15:05     ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-22 13:24 UTC (permalink / raw)
To: Jörn Engel; +Cc: David Woodhouse, Josh Boyer, linux-mtd, Matthieu CASTET

On Tue, 22 Jan 2008, Jörn Engel wrote:

> On Tue, 22 January 2008 09:57:07 +0100, Matthieu CASTET wrote:
>> There is no call to mtd->block_markbad or mtd->block_isbad (it is only called in mtd_find_sb).
>
> Used to be there and was removed. mtd->erase() does the same as mtd->block_isbad().

How do you mean? mtd->erase() will not erase a bad block, that is true. However, while it seems that mtd->erase() can mark a block bad if it fails, the fact that a block is erasable without errors does not imply that it is good. I've seen NAND flash blocks which are way past their specified maximum number of write/erase cycles still be successfully erased and subsequently written without errors, but the data retention was lousy (blocks started to show bit flips after a few thousand reads).

/Ricard

--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22 13:24 ` Ricard Wanderlof
@ 2008-01-22 15:05   ` Jörn Engel
  2008-01-23  9:23     ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-22 15:05 UTC (permalink / raw)
To: Ricard Wanderlof; +Cc: David Woodhouse, Jörn Engel, linux-mtd, Josh Boyer, Matthieu CASTET

On Tue, 22 January 2008 14:24:56 +0100, Ricard Wanderlof wrote:
>> Used to be there and was removed. mtd->erase() does the same as mtd->block_isbad().
>
> How do you mean? mtd->erase() will not erase a bad block, that is true.

Exactly. The only use logfs has for block_isbad() is to skip bad blocks in the beginning when looking for the superblock.

> However, while it seems that mtd->erase() can mark a block bad if it fails, the fact that a block is erasable without errors does not imply that it is good. I've seen NAND flash blocks which are way past their specified maximum number of write/erase cycles still be successfully erased and subsequently written without errors, but the data retention was lousy (blocks started to show bit flips after a few thousand reads).

I think we are being silly[1]. The question is not whether logfs handles bad blocks (it does), but which particular failure case it doesn't handle well enough.

- Easy: blocks are initially marked bad, erase returns an error.
  Mklogfs erases the complete device once; any bad blocks get stored in the bad segment table. Segments can span multiple eraseblocks; one bad block will spoil the complete segment.

- Impossible: data rots without early warning.
  If the device is that bad, you can either have a RAID or replace the device. Nothing the filesystem could or should do about it.

- Moderate: one block continuously spews -EUCLEAN, then becomes terminally bad.
  If those are just random bitflips, garbage collection will move the data sooner or later. Logfs does not force GC to happen soon when encountering -EUCLEAN, which it should. Are correctable errors an indication of the block going bad in the near future? If yes, I should do something about it.

The list probably goes on and on. And I am sure that I would miss at least half the interesting cases if I had to create it on my own. But if Matthieu or you or anyone else is willing to compose an extensive list of such failure cases, I will walk through it and try to handle them one by one.

[1] Silly in the way politicians are talking about freedom and security. Everyone agrees that both are valuable goals and any policy can be justified in the name of one or the other. Ensures heated talkshow discussions, but otherwise useless.

Jörn

--
The story so far: In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.
-- Douglas Adams

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22 15:05 ` Jörn Engel
@ 2008-01-23  9:23   ` Ricard Wanderlof
  2008-01-23 10:19     ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-23 9:23 UTC (permalink / raw)
To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET

On Tue, 22 Jan 2008, Jörn Engel wrote:

> - Moderate: one block continuously spews -EUCLEAN, then becomes terminally bad.
>   If those are just random bitflips, garbage collection will move the data sooner or later. Logfs does not force GC to happen soon when encountering -EUCLEAN, which it should. Are correctable errors an indication of the block going bad in the near future? If yes, I should do something about it.

I would say that correctable errors occurring "soon" after writing are an indication that the block is going bad. My experience has been that extensive reading can cause bitflips (and it probably happens over time too), but that for fresh blocks, billions of read operations need to be done before a bit flips. For blocks that are nearing their best-before date, a couple of hundred thousand reads can cause a bit to flip. So if I were implementing some sort of 'when is this block considered bad' algorithm, I'd try to keep tabs on how often the block has been (read-)accessed in relation to when it was last written. If this number is "low", the block should be considered bad and not used again.

I also think that when (if) logfs decides a block is bad, it should mark it bad using mtd->block_markbad(). That way, if the flash is rewritten by something else than logfs (say during a firmware upgrade), bad blocks can be handled in a consistent and standard way.

> The list probably goes on and on. And I am sure that I would miss at least half the interesting cases if I had to create it on my own. But if Matthieu or you or anyone else is willing to compose an extensive list of such failure cases, I will walk through it and try to handle them one by one.

We ran some tests here on a particular flash chip type to try and determine at least some of the failure modes that are related to block wear (due to write/erase) and bit decay (due to reading). The end result was basically what I tried to describe above, but I can go into more detail if you're interested.

/Ricard

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23  9:23 ` Ricard Wanderlof
@ 2008-01-23 10:19   ` Jörn Engel
  2008-01-23 10:41     ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-23 10:19 UTC (permalink / raw)
To: Ricard Wanderlof; +Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer, Matthieu CASTET

On Wed, 23 January 2008 10:23:55 +0100, Ricard Wanderlof wrote:
> I would say that correctable errors occurring "soon" after writing are an indication that the block is going bad. My experience has been that extensive reading can cause bitflips (and it probably happens over time too), but that for fresh blocks, billions of read operations need to be done before a bit flips. For blocks that are nearing their best-before date, a couple of hundred thousand reads can cause a bit to flip. So if I were implementing some sort of 'when is this block considered bad' algorithm, I'd try to keep tabs on how often the block has been (read-)accessed in relation to when it was last written. If this number is "low", the block should be considered bad and not used again.

That sounds like an impossible strategy. Causing a write for every read will significantly increase write pressure, thereby reduce flash lifetime, reduce performance etc.

What would be possible is a counter for soft/hard errors per physical block. On soft error, move data elsewhere and reuse the block, but increment the error counter. If the counter increases beyond 17 (or any other random number), mark the block as bad. The limit can be an mkfs option.

> I also think that when (if) logfs decides a block is bad, it should mark it bad using mtd->block_markbad(). That way, if the flash is rewritten by something else than logfs (say during a firmware upgrade), bad blocks can be handled in a consistent and standard way.

Maybe I should revive the old patch then. I don't think it matters much either way.

> We ran some tests here on a particular flash chip type to try and determine at least some of the failure modes that are related to block wear (due to write/erase) and bit decay (due to reading). The end result was basically what I tried to describe above, but I can go into more detail if you're interested.

I do remember your mail describing the test. One of the interesting conclusions is that even an awfully worn-out block is still good enough to store short-lived information. It appears to be a surprisingly robust strategy to have a high wear-out, as long as you keep the wear constantly high and replace block contents at a high rate.

Jörn

--
You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
-- Rob Pike

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 10:19 ` Jörn Engel @ 2008-01-23 10:41 ` Ricard Wanderlof 2008-01-23 10:57 ` Jörn Engel 0 siblings, 1 reply; 34+ messages in thread From: Ricard Wanderlof @ 2008-01-23 10:41 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET On Wed, 23 Jan 2008, Jörn Engel wrote: >> I would say that correctable errors occurring "soon" after writing are an >> indication that the block is going bad. My experience has been that >> extensive reading can cause bitflips (and it probably happens over time >> too), but that for fresh blocks, billions of read operations need to be >> done before a bit flips. For blocks that are nearing their best before >> date, a couple of hundred thousand reads can cause a bit to flip. So if I >> was implementing some sort of 'when is this block considered >> bad'-algorithm, I'd try to keep tabs on how often the block has been >> (read-) accessed in relation to when it was last writen. If this number is >> "low", the block should be considered bad and not used again. > > That sounds like an impossible strategy. Causing a write for every read > will significantly increase write pressure, thereby reduce flash > lifetime, reduce performance etc. > > What would be possible was a counter for soft/hard errors per physical > block. On soft error, move data elsewhere and reuse the block, but > increment the error counter. If the counter increases beyond 17 (or any > other random number), mark the block as bad. Limit can be an mkfs > option. Sorry, I didn't express myself clearly. I should have said '...keep tabs on how _many_times_ the block has been read accessed in relation to when it was last written.' If a page has been read, say, 100 000 times since it was last written, and starts to show bit flips, it is a sign that the block is wearing out. 
If it has been read, say, 100 000 000 times since it was written and starts showing bit flips, it's probably sufficient just to do a garbage collect and rewrite the data (in the same block or elsewhere).

The algorithm you suggest also sounds reasonable. Repeatedly occurring bit flips (-EUCLEAN) are an indication that the block is wearing out. It is probably more efficient than logging the number of read accesses somewhere.

One problem may be what to do when the system is powered down. If we don't store the error counters in the flash (or some other non-volatile place), then each time the system is powered up, all the error counters will be reset.

>> We ran some tests here on a particular flash chip type to try and determine at least some of the failure modes that are related to block wear (due to write/erase) and bit decay (due to reading). The end result was basically what I tried to describe above, but I can go into more detail if you're interested.
>
> I do remember your mail describing the test. One of the interesting conclusions is that even an awfully worn-out block is still good enough to store short-lived information. It appears to be a surprisingly robust strategy to have a high wear-out, as long as you keep the wear constantly high and replace block contents at a high rate.

You're probably right, but I'm not sure I understand what you mean.

/Ricard

--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 10:41 ` Ricard Wanderlof @ 2008-01-23 10:57 ` Jörn Engel 2008-01-23 11:57 ` Ricard Wanderlof 0 siblings, 1 reply; 34+ messages in thread From: Jörn Engel @ 2008-01-23 10:57 UTC (permalink / raw) To: Ricard Wanderlof Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer, Matthieu CASTET On Wed, 23 January 2008 11:41:14 +0100, Ricard Wanderlof wrote: > > One problem may be what to do when the system is powered down. If we don't > store the error counters in the flash (or some other non-volatile place), > then each time the system is powered up, all the error counters will be > reset. Exactly. If the information we need to detect problems is not stored in the filesystem, it is useless. Maybe it would work for a very specialized system to keep that information in DRAM or NVRAM, but in general it will get lost. As a matter of principle logfs does not do any special things for special systems. If you have a nice optimization, it has to work for everyone. If it doesn't, your system behaves differently from everyone else's systems. So whatever bugs you have cannot be reproduced by anyone else. For example, when the system crashes, some data may get written that isn't accounted for in the journal. On reboot/remount that gets detected and the wasted space is skipped. Writes continue beyond it. On hard disks or consumer flash media, it is legal to rewrite the same location without erase. But logfs still skips that space and is deliberately inefficient. > >I do remember your mail describing the test. One of the interesting > >conclusions is that even awefully worn out block is still good enough to > >store short-lived information. It appears to be a surprisingly robust > >strategy to have a high wear-out, as long as you keep the wear > >constantly high and replace block contents at a high rate. > > You're probably right, but I'm not sure I understand what you mean. 
High wearout means short time between two writes. Which also means few reads between two writes. As long as the write rate (wear rate) remains roughly constant, high wear doesn't seem to cause any problems. The block's ability to retain information degrades, but that doesn't matter because it doesn't have to retain the information for a long time. Jörn -- Victory in war is not repetitious. -- Sun Tzu ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 10:57 ` Jörn Engel @ 2008-01-23 11:57 ` Ricard Wanderlof 2008-01-23 13:01 ` Jörn Engel 0 siblings, 1 reply; 34+ messages in thread From: Ricard Wanderlof @ 2008-01-23 11:57 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET On Wed, 23 Jan 2008, Jörn Engel wrote: >> One problem may be what to do when the system is powered down. If we don't >> store the error counters in the flash (or some other non-volatile place), >> then each time the system is powered up, all the error counters will be >> reset. > > Exactly. If the information we need to detect problems is not stored in > the filesystem, it is useless. Maybe it would work for a very > specialized system to keep that information in DRAM or NVRAM, but in > general it will get lost. > > As a matter of principle logfs does not do any special things for special > systems. If you have a nice optimization, it has to work for everyone. > If it doesn't, your systems behaves differently from everyone else's > systems. So whatever bugs you have cannot be reproduced by anyone else. Very true. Perhaps it's possible to devise something that at least accomplishes part of the goal. Such as when writing a new block, also write some statistical information such as the number of read accesses since the previous write (or power up), or the reason for writing (new data, gc because of bitflips, ...) and a write counter. Something of that nature. > High wearout means short time between two writes. Which also means few > reads between two writes. > > As long as the write rate (wear rate) remains roughly constant, high > wear doesn't seem to cause any problems. The block's ability to retain > information degrades, but that doesn't matter because it doesn't have to > retain the information for a long time. 
I'd be a bit wary of this with NAND chips, some of which have a 100 000 maximum erase/write cycle specification, though. And I think that, especially when nearing the maximum value and going beyond it, there is some bit decay occurring over time and not just from reading. On the other hand, with 100 000 write cycles total, and assuming a product lifetime of 3 years, we end up with over 90 permitted write/erase cycles per day. Depending on the situation, it might be quite ok to take advantage of this. /Ricard -- Ricard Wolf Wanderlöf ricardw(at)axis.com Axis Communications AB, Lund, Sweden www.axis.com Phone +46 46 272 2016 Fax +46 46 13 61 30 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 11:57 ` Ricard Wanderlof @ 2008-01-23 13:01 ` Jörn Engel 2008-01-23 13:16 ` Ricard Wanderlof 0 siblings, 1 reply; 34+ messages in thread From: Jörn Engel @ 2008-01-23 13:01 UTC (permalink / raw) To: Ricard Wanderlof Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer, Matthieu CASTET On Wed, 23 January 2008 12:57:09 +0100, Ricard Wanderlof wrote: > > Perhaps it's possible to devise something that at least accomplishes part > of the goal. Such as when writing a new block, also write some statistical > information such as the number of read accesses since the previous write > (or power up), or the reason for writing (new data, gc because of > bitflips, ...) and a write counter. Something of that nature. I'm still fairly unconvinced about the read accounting. We could do something purely stochastic like accounting _every_ read, but just with a probability of, say, 1:100,000. That would still, within statistical jitter, behave the same for everyone. But once we depend on the average mount time of systems, I'm quite unhappy with the solution. Also, logfs stores a very limited amount of data for each segment (read: eraseblock). Currently this is just the erase count, used for wear leveling and the segment number. The latter can be used to detect blocks being moved around by "something", be it an image flasher, bootloader, FTL or whatever. There are still 16 bytes of padding in the structure, so we could add an error counter without breaking the format. > I'd be a bit wary of this with NAND chips some of which have a 100 000 > maximum erase/write cycle specification, though. And I think that > especially when nearing the maximum value and going beyond it, that there > is some bit decay occurring over time and not just from reading. It doesn't really matter whether the data degrades from a number of reads or from time passing. 
With a constantly high write rate, there is less time for degradation than with a low write rate. Problematic would be to have a high write rate for a while, then a very low write rate that allows data to rot for a long time. And also this depends on your numbers being representative for every flash chip. ;) Jörn -- The only real mistake is the one from which we learn nothing. -- John Powell ^ permalink raw reply [flat|nested] 34+ messages in thread
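The per-segment record described above (erase count plus segment number, with 16 spare bytes that could hold an error counter without breaking the format) could be laid out roughly like this. The field names and widths here are illustrative guesses, not the actual logfs on-flash format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical on-flash per-segment (eraseblock) record, loosely
 * following the mail: erase count for wear leveling, segment number
 * to detect blocks moved by a flasher/bootloader/FTL, and part of
 * the reserved padding repurposed as a correctable-error tally. */
struct segment_header {
	uint32_t erase_count;	/* used for wear leveling */
	uint32_t segment_no;	/* detects relocated segments */
	uint32_t error_count;	/* proposed: soft errors seen so far */
	uint8_t  pad[12];	/* remaining reserved space */
};
```

Because the counter is stored with the segment itself, it survives power cycles, which addresses the reset-on-power-up problem raised above.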
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 13:01 ` Jörn Engel @ 2008-01-23 13:16 ` Ricard Wanderlof 2008-01-23 14:06 ` Jörn Engel 0 siblings, 1 reply; 34+ messages in thread From: Ricard Wanderlof @ 2008-01-23 13:16 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET On Wed, 23 Jan 2008, Jörn Engel wrote: > On Wed, 23 January 2008 12:57:09 +0100, Ricard Wanderlof wrote: >> >> Perhaps it's possible to devise something that at least accomplishes part >> of the goal. Such as when writing a new block, also write some statistical >> information such as the number of read accesses since the previous write >> (or power up), or the reason for writing (new data, gc because of >> bitflips, ...) and a write counter. Something of that nature. > > I'm still fairly unconvinced about the read accounting. We could do > something purely stochastic like accounting _every_ read, but just with > a probability of, say, 1:100,000. That would still, within statistical > jitter, behave the same for everyone. But once we depend on the average > mount time of systems, I'm quite unhappy with the solution. I think you are right. An error counter should be sufficient to get enough statistics to determine if a block has begun to go bad. >> I'd be a bit wary of this with NAND chips some of which have a 100 000 >> maximum erase/write cycle specification, though. And I think that >> especially when nearing the maximum value and going beyond it, that there >> is some bit decay occurring over time and not just from reading. > > It doesn't really matter whether the data degrades from a number of > reads or from time passing. With a constantly high write rate, there is > less time for degradations then with a low write rate. If we have a system that is only used (= powered on) rarely, then any degradation from time passing could become significant. 
> Problematic would be to have a high write rate for a while, then a very > low write rate that allows data to rot for a long time. And also this > depends on your numbers being representative for every flash chip. ;) Yes. And the latter is very true. Our tests were only of a certain chip type from a certain manufacturer, and of course other chips might behave differently. The only input I have got from chip manufacturers regarding this issue is that with increasing bit densities and decreasing bit cell sizes in the future, things like the probability of random bit flips are likely to increase. (Somewhere there is a limit when the amount of error correction needed to handle these things grows too large to make the chip practically useful; say 10 error correction bits per stored bit or whatever). /Ricard -- Ricard Wolf Wanderlöf ricardw(at)axis.com Axis Communications AB, Lund, Sweden www.axis.com Phone +46 46 272 2016 Fax +46 46 13 61 30 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 13:16 ` Ricard Wanderlof @ 2008-01-23 14:06 ` Jörn Engel 2008-01-23 14:25 ` Ricard Wanderlof 0 siblings, 1 reply; 34+ messages in thread From: Jörn Engel @ 2008-01-23 14:06 UTC (permalink / raw) To: Ricard Wanderlof Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer, Matthieu CASTET On Wed, 23 January 2008 14:16:12 +0100, Ricard Wanderlof wrote: > > > >It doesn't really matter whether the data degrades from a number of > >reads or from time passing. With a constantly high write rate, there is > >less time for degradations then with a low write rate. > > If we have a system that is only used (= powered on) rarely, then any > degradation from time passing could become significant. In that case the write rate wouldn't be _constantly_ high. ;) > The only input I have got from chip manufacturers regarding this issue is > that with inreasing bit densities and decreasing bit cell sizes in the > future, things like the probability of random bit flips are likely to > increase. (Somewhere there is a limit when the amount of error correction > needed to handle this things grows too large to make the chip practically > useful; say 10 error correction bits per stored bit or whatever). If error rates increase, device drivers have to do stronger error correction. Quality after error correction has been done should stay roughly the same. Jörn -- The cost of changing business rules is much more expensive for software than for a secretaty. -- unknown ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-23 14:06 ` Jörn Engel @ 2008-01-23 14:25 ` Ricard Wanderlof 0 siblings, 0 replies; 34+ messages in thread From: Ricard Wanderlof @ 2008-01-23 14:25 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET On Wed, 23 Jan 2008, Jörn Engel wrote: >> The only input I have got from chip manufacturers regarding this issue is >> that with increasing bit densities and decreasing bit cell sizes in the >> future, things like the probability of random bit flips are likely to >> increase. (Somewhere there is a limit when the amount of error correction >> needed to handle these things grows too large to make the chip practically >> useful; say 10 error correction bits per stored bit or whatever). > > If error rates increase, device drivers have to do stronger error > correction. Quality after error correction has been done should stay > roughly the same. Yes, true, the first step is to increase the error correction capabilities, but there comes a point when there are so many error correction bits required per data bit that there is no point in increasing the memory size. Today we have 3 ECC bytes per 256 data bytes in an ordinary NAND flash. If geometries decrease we might at some point need, say, 128 ECC bytes, and further down the line perhaps even more ECC bytes than data bytes. It then eventually comes to a point of diminishing returns; if the geometries are decreased and error rates go up, the increase in the number of ECC bits might be more than the gain in the number of data bits. This is far down the line, and partly speculative, I agree. /Ricard -- Ricard Wolf Wanderlöf ricardw(at)axis.com Axis Communications AB, Lund, Sweden www.axis.com Phone +46 46 272 2016 Fax +46 46 13 61 30 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-21 21:25 ` Jörn Engel 2008-01-21 22:16 ` Josh Boyer @ 2008-01-21 22:36 ` Glenn Henshaw 1 sibling, 0 replies; 34+ messages in thread From: Glenn Henshaw @ 2008-01-21 22:36 UTC (permalink / raw) To: linux-mtd On 21-Jan-08, at 4:25 PM, Jörn Engel wrote: > On Mon, 21 January 2008 16:57:59 +0100, Matthieu CASTET wrote: >> >> What's the status of logfs on NAND ? >> >> Last time I check, it didn't manage badblock. > > Is that the only thing stopping you from using logfs? > > mklogfs handles bad blocks. Blocks rotting during lifetime are > handled > half-heartedly. If that is a real problem for you I wouldn't be > surprised if you caught logfs on the wrong foot once or twice. > > Has it been ported to a 2.4 kernel? I can't upgrade due to the amount of work necessary to rewrite drivers. -- Glenn Henshaw Logical Outcome Ltd. e: thraxisp@logicaloutcome.ca w: www.logicaloutcome.ca ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-17 17:43 ` Josh Boyer 2008-01-18 9:39 ` Matthieu CASTET @ 2008-01-18 17:20 ` Glenn Henshaw 2008-01-18 18:39 ` Jamie Lokier 1 sibling, 1 reply; 34+ messages in thread From: Glenn Henshaw @ 2008-01-18 17:20 UTC (permalink / raw) To: linux-mtd On 17-Jan-08, at 12:43 PM, Josh Boyer wrote: > On Thu, 17 Jan 2008 17:26:01 +0100 > Jörn Engel <joern@logfs.org> wrote: > >> On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote: >>> >>> we have a 240 MB jffs2 partition with summary enabled and no >>> compression. We use 2ad8ee713566671875216ebcec64f2eda47bd19d git >>> jffs2 >>> version >>> (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d >>> ) >>> >>> >>> On this partition we have several file (less than 1 MB) and a big >>> file >>> in the root (200 MB). >>> >>> The big file is a FAT image that is exported with usb-storage >>> (with usb >>> device) or mounted on a loopback device. >>> >>> After some FAT operations, we manage to get in a situation were the >>> jffs2_garbage_collect_pass take 12 minutes. >>> >>> jffs2_lookup for the big file (triggered with a ls in the root) >>> take 12 >>> minutes. >>> >>> If we do a ls without waiting that jffs2_garbage_collect_pass >>> finish, ls >>> takes 12 minutes to complete. >> >> Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not > > How do you know? A 200MiB file will likely have around 50,000 nodes. > If the summary stuff is incorrect, and since we have no idea what kind > of platform is being used here, it may well be within reason. I found a similar problem on an older 2.4.27 based system. We have a 64k JFFS2 partition (1024 blocks of 4kbytes). As the file system fills up, the time for any operation increases exponentially. When it reaches 90% full, it takes minutes to write a file. After a cursory inspection, it seems to block doing garbage collection and compressing blocks. 
We gave up and limited the capacity to 60% full at the application level. I'd appreciate any pointer to fix this, as migrating to a 2.6 kernel is not an option. -- Glenn Henshaw Logical Outcome Ltd. e: thraxisp@logicaloutcome.ca w: www.logicaloutcome.ca ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-18 17:20 ` Glenn Henshaw @ 2008-01-18 18:39 ` Jamie Lokier 2008-01-18 21:00 ` Jörn Engel 0 siblings, 1 reply; 34+ messages in thread From: Jamie Lokier @ 2008-01-18 18:39 UTC (permalink / raw) To: Glenn Henshaw; +Cc: linux-mtd Glenn Henshaw wrote: > I found a similar problem on an older 2.4.27 based system. We have > a 64k JFFS2 partition (1024 blocks of 4kbytes). As the file system > fills up, the time for any operation increases exponentially. When it > reaches 90% full, it takes minutes to write a file. After a cursory > inspection, it seems to block doing garbage collection and compressing > blocks. > > We gave up and limited the capacity to 60% full at the application > level. > > I'd appreciate any pointer to fix this, as migrating to a 2.6 > kernel is not an option. Yes! I have exactly the same problem, except I'm using 2.4.26-uc0, and it's a 1MB partition (16 blocks of 64kbytes). I am tempted to modify the JFFS2 code to implement a hard limit of 50% full at the kernel level. The JFFS2 docs suggest 5 free blocks are enough to ensure GC is working. In my experience that does often work, but occasionally there's a catastrophically long and CPU intensive GC. -- Jamie ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-18 18:39 ` Jamie Lokier @ 2008-01-18 21:00 ` Jörn Engel 2008-01-19 0:23 ` Jamie Lokier 0 siblings, 1 reply; 34+ messages in thread From: Jörn Engel @ 2008-01-18 21:00 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mtd, Glenn Henshaw On Fri, 18 January 2008 18:39:01 +0000, Jamie Lokier wrote: > > Yes! I have exactly the same problem, except I'm using 2.4.26-uc0, > and it's a 1MB partition (16 blocks of 64kbytes). > > I am tempted to modify the JFFS2 code to implement a hard limit of 50% > full at the kernel level. > > The JFFS2 docs suggest 5 free blocks are enough to ensure GC is > working. In my experience that does often work, but occasionally > there's a catastrophically long and CPU intensive GC. If you want to make GC go berserk, here's a simple recipe: 1. Fill filesystem 100%. 2. Randomly replace single blocks. There are two ways to solve this problem: 1. Reserve some amount of free space for GC performance. 2. Write in some non-random fashion. Solution 2 works even better if the filesystem actually sorts data very roughly by life expectancy. That requires writing to several blocks in parallel, i.e. one for long-lived data, one for short-lived data. Made an impressive difference in logfs when I implemented that. And of course academics can write many papers about good heuristics to predict life expectancy. In fact, they already have. Jörn -- "Security vulnerabilities are here to stay." -- Scott Culp, Manager of the Microsoft Security Response Center, 2001 ^ permalink raw reply [flat|nested] 34+ messages in thread
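The "several blocks in parallel" idea can be sketched as an allocator with one open block per lifetime class, so short- and long-lived data never share an erase block and blocks full of short-lived data become fully dirty (and cheap to collect) quickly. All names here are hypothetical, and the hard part in practice is the heuristic that supplies the lifetime hint:

```c
#include <assert.h>

#define SLOTS_PER_BLOCK 64	/* writable units per erase block */

enum lifetime { SHORT_LIVED = 0, LONG_LIVED = 1 };

struct open_block {
	int id;		/* erase block currently open for this class */
	int used;	/* slots already written in it */
};

struct allocator {
	struct open_block head[2];	/* one write head per class */
	int next_block;			/* next free erase block */
};

/* Pick the next slot for new data of the given expected lifetime. */
static int alloc_slot(struct allocator *a, enum lifetime lt)
{
	struct open_block *b = &a->head[lt];

	if (b->used == SLOTS_PER_BLOCK) {	/* head full: open fresh */
		b->id = a->next_block++;
		b->used = 0;
	}
	return b->id * SLOTS_PER_BLOCK + b->used++;
}
```

With this segregation, GC of a "short-lived" block tends to find little live data left to copy, which is exactly the effect described for logfs above.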
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-18 21:00 ` Jörn Engel @ 2008-01-19 0:23 ` Jamie Lokier 2008-01-19 2:38 ` Jörn Engel 0 siblings, 1 reply; 34+ messages in thread From: Jamie Lokier @ 2008-01-19 0:23 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Glenn Henshaw Jörn Engel wrote: > If you want to make GC go berzerk, here's a simple recipe: > 1. Fill filesystem 100%. > 2. Randomly replace single blocks. > > There are two ways to solve this problem: > 1. Reserve some amount of free space for GC performance. The real difficulty is that it's not clear how much to reserve for _reliable_ performance. We're left guessing based on experience, and that gives only limited confidence. The 5 blocks suggested in JFFS2 docs seemed promising, but didn't work out. Perhaps it does work with 5 blocks, but you have to count all potential metadata overhead and misalignment overhead when working out how much free "file" data that translates to? Really, some of us just want JFFS2 to return -ENOSPC at _some_ sensible deterministic point before the GC might behave peculiarly, rather than trying to squeeze as much as possible onto the partition. > 2. Write in some non-random fashion. > > Solution 2 works even better if the filesystem actually sorts data > very roughly by life expectency. That requires writing to several > blocks in parallel, i.e. one for long-lived data, one for short-lived > data. Made an impressive difference in logfs when I implemented that. Ah, a bit like generational GC :-) -- Jamie ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-19 0:23 ` Jamie Lokier @ 2008-01-19 2:38 ` Jörn Engel 0 siblings, 0 replies; 34+ messages in thread From: Jörn Engel @ 2008-01-19 2:38 UTC (permalink / raw) To: Jamie Lokier; +Cc: Jörn Engel, linux-mtd, Glenn Henshaw On Sat, 19 January 2008 00:23:02 +0000, Jamie Lokier wrote: > Jörn Engel wrote: > > > > There are two ways to solve this problem: > > 1. Reserve some amount of free space for GC performance. > > The real difficulty is that it's not clear how much to reserve for > _reliable_ performance. We're left guessing based on experience, and > that gives only limited confidence. The 5 blocks suggested in JFFS2 > docs seemed promising, but didn't work out. Perhaps it does work with > 5 blocks, but you have to count all potential metadata overhead and > misalignment overhead when working out how much free "file" data that > translates to? The five blocks work well enough if your goal is that GC will return _eventually_. Now you come along and even want it to return within a reasonable amount of time. That is a different problem. ;) Math is fairly simple. The worst case is when the write pattern is completely random and every block contains the same amount of data. Let us pick a 99% full filesystem for starters. In order to write one block worth of data, GC needs to move 99 blocks worth of old data around, before it has freed a full block. So on average 99% of all writes handle GC data and only 1% handle the data you - the user - care about. If your filesystem is 80% full, 80% of all writes are GC data and 20% are user data. Very simple. Latency is a different problem. Depending on your design, those 80% or 99% GC writes can happen continuously or in huge batches.
Logfs has a field defined for GC reserve space. I know the problem and I care about it. Although I have to admit that mkfs doesn't allow setting this field yet. > > 2. Write in some non-random fashion. > > > > Solution 2 works even better if the filesystem actually sorts data > > very roughly by life expectency. That requires writing to several > > blocks in parallel, i.e. one for long-lived data, one for short-lived > > data. Made an impressive difference in logfs when I implemented that. > > Ah, a bit like generational GC :-) Actually, no. The different levels of the tree, which JFFS2 doesn't store on the medium, also happen to have vastly different lifetimes. Generational GC is the logical next step, which I haven't done yet. Jörn -- Science is like sex: sometimes something useful comes out, but that is not the reason we are doing it. -- Richard Feynman ^ permalink raw reply [flat|nested] 34+ messages in thread
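The arithmetic in the previous mail can be written as a single formula: at fill fraction u, a fraction u of all writes goes to GC, so the total number of blocks written per block of user data is 1 / (1 - u). A plain illustration of that math, not code from any filesystem:

```c
#include <assert.h>

/* Worst-case write amplification for a log-structured filesystem
 * with a uniformly random write pattern and every block equally
 * full: to free one block of space, GC copies fill_fraction blocks
 * of live data per block erased, leaving (1 - fill_fraction) for
 * user data. */
static double write_amplification(double fill_fraction)
{
	return 1.0 / (1.0 - fill_fraction);
}
```

At 99% full this gives the 99-to-1 ratio from the mail (100 blocks written in total per block of user data); at 80% full, 5 blocks written per block of user data.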
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-17 16:26 ` Jörn Engel 2008-01-17 17:43 ` Josh Boyer @ 2008-01-17 23:22 ` David Woodhouse 2008-01-18 9:45 ` Matthieu CASTET 2008-01-18 18:20 ` Jamie Lokier 3 siblings, 0 replies; 34+ messages in thread From: David Woodhouse @ 2008-01-17 23:22 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, Matthieu CASTET On Thu, 2008-01-17 at 17:26 +0100, Jörn Engel wrote: > > Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not > sure who cares enough to look at this. My approach would be to > $ echo t > /proc/sysrq_trigger > several times during those 12 minutes and take a close look at the > code > paths showing up. Most likely it will spend 99% of the time in one > place. I was going to suggest booting with 'profile=1' and using readprofile, which is a slightly more reliable way of getting the same information. -- dwmw2 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-17 16:26 ` Jörn Engel 2008-01-17 17:43 ` Josh Boyer 2008-01-17 23:22 ` David Woodhouse @ 2008-01-18 9:45 ` Matthieu CASTET 2008-01-18 18:20 ` Jamie Lokier 3 siblings, 0 replies; 34+ messages in thread From: Matthieu CASTET @ 2008-01-18 9:45 UTC (permalink / raw) To: Jörn Engel; +Cc: David Woodhouse, linux-mtd [-- Attachment #1: Type: text/plain, Size: 3210 bytes --] Hi, Jörn Engel wrote: > On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote: >> we have a 240 MB jffs2 partition with summary enabled and no >> compression. We use 2ad8ee713566671875216ebcec64f2eda47bd19d git jffs2 >> version >> (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d) >> If we do a ls without waiting that jffs2_garbage_collect_pass finish, ls >> takes 12 minutes to complete. > > Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not > sure who cares enough to look at this. My approach would be to > $ echo t > /proc/sysrq_trigger > several times during those 12 minutes and take a close look at the code > paths showing up. Most likely it will spend 99% of the time in one > place. I have a JTAG debugger that allows me to see where the code spends its time. When I mount the partition, thanks to the summary the mount is very short (less than 10s). Then the garbage collector starts to check nodes [1]. It spends 12 minutes in jffs2_garbage_collect_pass. Then the system goes idle. Then, if I try to access the file [2], it takes 12 minutes to finish jffs2_lookup. I have attached the result of booting with 'profile=1'. (HZ=200) The code spends a lot of time in the rbtree code (7 minutes) and 4 minutes in jffs2_get_inode_nodes. 
Matthieu [1] #0 rb_next (node=0xc1c76e80) at lib/rbtree.c:325 #1 0xc00c5568 in jffs2_get_inode_nodes (c=0xc0a5a800, f=0xc0a5a200, rii=0xc1c19dbc) at fs/jffs2/readinode.c:317 #2 0xc00c59d4 in jffs2_do_read_inode_internal (c=0xc0a5a800, f=0xc0a5a200, latest_node=0xc1c19e14) at fs/jffs2/readinode.c:1124 #3 0xc00c63a0 in jffs2_do_crccheck_inode (c=0xc0a5a800, ic=0xc03993c8) at fs/jffs2/readinode.c:1379 #4 0xc00c9afc in jffs2_garbage_collect_pass (c=0xc0a5a800) at fs/jffs2/gc.c:208 #5 0xc00cc56c in jffs2_garbage_collect_thread (_c=<value optimized out>) at fs/jffs2/background.c:138 #6 0xc003766c in sys_waitid (which=19019, pid=20115456, infop=0x4a0e, options=-1044275912, ru=0x0) at kernel/exit.c:1634 [2] #0 0xc00e8c14 in rb_prev (node=<value optimized out>) at lib/rbtree.c:368 #1 0xc00c5624 in jffs2_get_inode_nodes (c=0xc0a5a800, f=0xc1c16ca0, rii=0xc0fadbf4) at fs/jffs2/readinode.c:355 #2 0xc00c59d4 in jffs2_do_read_inode_internal (c=0xc0a5a800, f=0xc1c16ca0, latest_node=0xc0fadca8) at fs/jffs2/readinode.c:1124 #3 0xc00c6604 in jffs2_do_read_inode (c=0xc0a5a800, f=0xc1c16ca0, ino=165, latest_node=0xc0fadca8) at fs/jffs2/readinode.c:1364 #4 0xc00cd5c8 in jffs2_read_inode (inode=0xc1c16cd0) at fs/jffs2/fs.c:247 #5 0xc00c0204 in jffs2_lookup (dir_i=0xc1c16310, target=0xc1c0d0d8, nd=<value optimized out>) at include/linux/fs.h:1670 #6 0xc0080100 in do_lookup (nd=0xc0fadf08, name=0xc0fadd8c, path=0xc0fadd98) at fs/namei.c:494 #7 0xc0081e24 in __link_path_walk (name=0xc085300f "", nd=0xc0fadf08) at fs/namei.c:940 #8 0xc008245c in link_path_walk (name=0xc0853000 "/mnt/toto/media", nd=0xc0fadf08) at fs/namei.c:1011 #9 0xc00829b0 in do_path_lookup (dfd=<value optimized out>, name=0xc0853000 "/mnt/toto/media", flags=<value optimized out>, nd=0xc0fadf08) at fs/namei.c:1157 [-- Attachment #2: profile.txt --] [-- Type: text/plain, Size: 1710 bytes --] 54366 rb_prev 543,6600 28345 rb_next 283,4500 8602 default_idle 71,6833 10251 __raw_readsl 40,0430 49648 jffs2_get_inode_nodes 
11,8097 251 s3c2412_nand_devready 7,8438 1222 crc32_le 4,8492 58 __delay 4,8333 164 touch_softlockup_watchdog 4,1000 245 nand_wait_ready 2,7841 78 s3c2440_nand_hwcontrol 1,6250 37 s3c2412_nand_enable_hwecc 1,0278 44 s3c2412_nand_calculate_ecc 0,8462 30 mutex_lock 0,7500 65 kmem_cache_alloc 0,6250 12 down_read 0,6000 13 __aeabi_uidivmod 0,5417 163 nand_read_page_hwecc 0,4970 53 s3c2412_nand_read_buf 0,4907 46 jffs2_lookup_node_frag 0,4423 24 s3c2412_clkcon_enable 0,4286 39 clk_disable 0,3750 10 __const_udelay 0,3571 39 clk_enable 0,3362 18 strcmp 0,3214 32 sysfs_dirent_exist 0,2759 33 __wake_up 0,2750 9 mutex_unlock 0,2500 37 kmem_cache_free 0,2202 177 memcpy 0,2169 ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass 2008-01-17 16:26 ` Jörn Engel ` (2 preceding siblings ...) 2008-01-18 9:45 ` Matthieu CASTET @ 2008-01-18 18:20 ` Jamie Lokier 3 siblings, 0 replies; 34+ messages in thread From: Jamie Lokier @ 2008-01-18 18:20 UTC (permalink / raw) To: Jörn Engel; +Cc: linux-mtd, David Woodhouse, Matthieu CASTET Jörn Engel wrote: > > If we do a ls without waiting that jffs2_garbage_collect_pass finish, ls > > takes 12 minutes to complete. > > Impressive! JFFS2 may be slow, but it shouldn't be _that_ slow. Not > sure who cares enough to look at this. My approach would be to > $ echo t > /proc/sysrq_trigger > several times during those 12 minutes and take a close look at the code > paths showing up. Most likely it will spend 99% of the time in one > place. I have seen similar slow GCs with JFFS2 on a 2.4.26-uc0 kernel (which is very old now), just 1MB size, and of course no summary support. In this case it wasn't 12 minutes, but about 1 minute with the GC thread using 100% CPU. I saw it a couple of times. But that's much slower than erasing and writing the whole 1MB, so it's possible there has been a GC bug doing excessive flash operations which remains unfixed for a very long time. -- Jamie ^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2008-01-23 14:26 UTC | newest] Thread overview: 34+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-01-17 16:12 Jffs2 and big file = very slow jffs2_garbage_collect_pass Matthieu CASTET 2008-01-17 16:26 ` Jörn Engel 2008-01-17 17:43 ` Josh Boyer 2008-01-18 9:39 ` Matthieu CASTET 2008-01-18 12:48 ` Josh Boyer 2008-01-18 16:17 ` Matthieu CASTET 2008-01-18 17:55 ` Josh Boyer 2008-01-18 18:17 ` Jörn Engel 2008-01-21 15:57 ` Matthieu CASTET 2008-01-21 21:25 ` Jörn Engel 2008-01-21 22:16 ` Josh Boyer 2008-01-21 22:29 ` Jörn Engel 2008-01-22 8:57 ` Matthieu CASTET 2008-01-22 12:03 ` Jörn Engel 2008-01-22 13:24 ` Ricard Wanderlof 2008-01-22 15:05 ` Jörn Engel 2008-01-23 9:23 ` Ricard Wanderlof 2008-01-23 10:19 ` Jörn Engel 2008-01-23 10:41 ` Ricard Wanderlof 2008-01-23 10:57 ` Jörn Engel 2008-01-23 11:57 ` Ricard Wanderlof 2008-01-23 13:01 ` Jörn Engel 2008-01-23 13:16 ` Ricard Wanderlof 2008-01-23 14:06 ` Jörn Engel 2008-01-23 14:25 ` Ricard Wanderlof 2008-01-21 22:36 ` Glenn Henshaw 2008-01-18 17:20 ` Glenn Henshaw 2008-01-18 18:39 ` Jamie Lokier 2008-01-18 21:00 ` Jörn Engel 2008-01-19 0:23 ` Jamie Lokier 2008-01-19 2:38 ` Jörn Engel 2008-01-17 23:22 ` David Woodhouse 2008-01-18 9:45 ` Matthieu CASTET 2008-01-18 18:20 ` Jamie Lokier