public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed
* Jffs2 and big file = very slow jffs2_garbage_collect_pass
@ 2008-01-17 16:12 Matthieu CASTET
  2008-01-17 16:26 ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-17 16:12 UTC (permalink / raw)
  To: linux-mtd, David Woodhouse

Hi,


we have a 240 MB jffs2 partition with summary enabled and no
compression. We use the jffs2 version from git commit
2ad8ee713566671875216ebcec64f2eda47bd19d
(http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d).


On this partition we have several files (each less than 1 MB) and a big
file in the root (200 MB).

The big file is a FAT image that is exported with usb-storage (in USB
device mode) or mounted on a loopback device.

After some FAT operations, we manage to get into a situation where
jffs2_garbage_collect_pass takes 12 minutes.

jffs2_lookup for the big file (triggered by an ls in the root) takes 12
minutes.

If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
the ls takes 12 minutes to complete.

We applied the 4 patches from "Trigger garbage collection when
very_dirty_list size becomes excessive" to "Don't count all 'very dirty'
blocks except in debug mode", but they don't change anything.


Why does jffs2 take so much time in jffs2_garbage_collect_pass checking
the nodes?
Reading the whole raw flash takes about 40s-1min.
Does it read the flash in a random order?

What does jffs2_lookup do?
Why is it so slow?


What are the alternatives?
Trying yaffs2?
Exporting a smaller file?

Matthieu


PS: if the big file is moved into a subdirectory, then the ls in the root
dir is fast, but access to the big file is slow (12 minutes).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:12 Jffs2 and big file = very slow jffs2_garbage_collect_pass Matthieu CASTET
@ 2008-01-17 16:26 ` Jörn Engel
  2008-01-17 17:43   ` Josh Boyer
                     ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Jörn Engel @ 2008-01-17 16:26 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: David Woodhouse, linux-mtd

On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote:
> 
> 
> we have a 240 MB jffs2 partition with summary enabled and no
> compression. We use the jffs2 version from git commit
> 2ad8ee713566671875216ebcec64f2eda47bd19d
> (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d).
> 
> 
> On this partition we have several files (each less than 1 MB) and a big
> file in the root (200 MB).
> 
> The big file is a FAT image that is exported with usb-storage (in USB
> device mode) or mounted on a loopback device.
> 
> After some FAT operations, we manage to get into a situation where
> jffs2_garbage_collect_pass takes 12 minutes.
> 
> jffs2_lookup for the big file (triggered by an ls in the root) takes 12
> minutes.
> 
> If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
> the ls takes 12 minutes to complete.

Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
sure who cares enough to look at this.  My approach would be to 
$ echo t > /proc/sysrq-trigger
several times during those 12 minutes and take a close look at the code
paths showing up.  Most likely it will spend 99% of the time in one
place.
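
A quick way to digest a pile of those sysrq-t dumps is to count how often
each function shows up in the captured stacks (a hypothetical sketch; the
exact frame format in dmesg varies by kernel version and architecture):

```python
import re
from collections import Counter

def hot_frames(dump_text):
    # Tally function names from sysrq-t style stack frames, assumed to
    # look like '[<c00c5568>] jffs2_get_inode_nodes+0x1c/0x2a0'.
    return Counter(re.findall(r"\[<[0-9a-f]+>\]\s+(\w+)", dump_text))

sample = """\
[<c00c5568>] jffs2_get_inode_nodes+0x1c/0x2a0
[<c00e8c14>] rb_prev+0x8/0x40
[<c00c5568>] jffs2_get_inode_nodes+0x1c/0x2a0
"""
print(hot_frames(sample).most_common(1))  # [('jffs2_get_inode_nodes', 2)]
```

If one function dominates the tally across several dumps, that is the
place eating the 12 minutes.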

Jörn

-- 
...one more straw can't possibly matter...
-- Kirby Bakken

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:26 ` Jörn Engel
@ 2008-01-17 17:43   ` Josh Boyer
  2008-01-18  9:39     ` Matthieu CASTET
  2008-01-18 17:20     ` Glenn Henshaw
  2008-01-17 23:22   ` David Woodhouse
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 34+ messages in thread
From: Josh Boyer @ 2008-01-17 17:43 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, David Woodhouse, Matthieu CASTET

On Thu, 17 Jan 2008 17:26:01 +0100
Jörn Engel <joern@logfs.org> wrote:

> On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote:
> > 
> > we have a 240 MB jffs2 partition with summary enabled and no
> > compression. We use the jffs2 version from git commit
> > 2ad8ee713566671875216ebcec64f2eda47bd19d
> > (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d).
> > 
> > 
> > On this partition we have several files (each less than 1 MB) and a big
> > file in the root (200 MB).
> > 
> > The big file is a FAT image that is exported with usb-storage (in USB
> > device mode) or mounted on a loopback device.
> > 
> > After some FAT operations, we manage to get into a situation where
> > jffs2_garbage_collect_pass takes 12 minutes.
> > 
> > jffs2_lookup for the big file (triggered by an ls in the root) takes 12
> > minutes.
> > 
> > If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
> > the ls takes 12 minutes to complete.
> 
> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not

How do you know?  A 200MiB file will likely have around 50,000 nodes.
If the summary data is incorrect, and given that we have no idea what
kind of platform is being used here, it may well be within reason.
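
Josh's estimate is easy to reproduce. Assuming, hypothetically, one data
node per 4 KiB of file data (the real figure depends on flash page size
and write pattern), a 200 MiB file lands in the same ballpark:

```python
# Back-of-envelope node count for a 200 MiB JFFS2 file.
# Assumption (not from the thread): one data node covers ~4 KiB of file
# data; actual node size varies with flash page size and write pattern.
file_size = 200 * 1024 * 1024   # 200 MiB in bytes
node_payload = 4096             # assumed bytes of file data per node

nodes = file_size // node_payload
print(nodes)  # 51200, close to the ~50,000 guess
```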

> sure who cares enough to look at this.  My approach would be to 
> $ echo t > /proc/sysrq-trigger
> several times during those 12 minutes and take a close look at the code
> paths showing up.  Most likely it will spend 99% of the time in one
> place.

That's sound advice in any case.

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:26 ` Jörn Engel
  2008-01-17 17:43   ` Josh Boyer
@ 2008-01-17 23:22   ` David Woodhouse
  2008-01-18  9:45   ` Matthieu CASTET
  2008-01-18 18:20   ` Jamie Lokier
  3 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2008-01-17 23:22 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Matthieu CASTET


On Thu, 2008-01-17 at 17:26 +0100, Jörn Engel wrote:
> 
> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
> sure who cares enough to look at this.  My approach would be to 
> $ echo t > /proc/sysrq-trigger
> several times during those 12 minutes and take a close look at the
> code
> paths showing up.  Most likely it will spend 99% of the time in one
> place.

I was going to suggest booting with 'profile=1' and using readprofile,
which is a slightly more reliable way of getting the same information.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 17:43   ` Josh Boyer
@ 2008-01-18  9:39     ` Matthieu CASTET
  2008-01-18 12:48       ` Josh Boyer
  2008-01-18 17:20     ` Glenn Henshaw
  1 sibling, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-18  9:39 UTC (permalink / raw)
  To: Josh Boyer; +Cc: linux-mtd, Jörn Engel, David Woodhouse

Josh Boyer wrote:
> On Thu, 17 Jan 2008 17:26:01 +0100
> Jörn Engel <joern@logfs.org> wrote:
> 
>>> If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
>>> the ls takes 12 minutes to complete.
>> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
> 
> How do you know?  A 200MiB file will likely have around 50,000 nodes.
Yes, the file has 41324 nodes.
> If the summary stuff is incorrect, and since we have no idea what kind
> of platform is being used here, it may well be within reason.
> 
The summary data is correct (I checked it with a parser on a dump of the
image). Also, if the summary weren't correct, wouldn't only the mount
time grow?
In my case the mount is fine: less than 5-10s.
The platform is an ARM926 @ 247 MHz.

Matthieu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:26 ` Jörn Engel
  2008-01-17 17:43   ` Josh Boyer
  2008-01-17 23:22   ` David Woodhouse
@ 2008-01-18  9:45   ` Matthieu CASTET
  2008-01-18 18:20   ` Jamie Lokier
  3 siblings, 0 replies; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-18  9:45 UTC (permalink / raw)
  To: Jörn Engel; +Cc: David Woodhouse, linux-mtd

[-- Attachment #1: Type: text/plain, Size: 3210 bytes --]

Hi,

Jörn Engel wrote:
> On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote:
>> we have a 240 MB jffs2 partition with summary enabled and no
>> compression. We use the jffs2 version from git commit
>> 2ad8ee713566671875216ebcec64f2eda47bd19d
>> (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d).
>> If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
>> the ls takes 12 minutes to complete.
> 
> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
> sure who cares enough to look at this.  My approach would be to 
> $ echo t > /proc/sysrq-trigger
> several times during those 12 minutes and take a close look at the code
> paths showing up.  Most likely it will spend 99% of the time in one
> place.
I have a JTAG debugger that lets me see where the code spends its time.
When I mount the partition, thanks to the summary the mount is very
short (less than 10s).

Then the garbage collector starts to check nodes [1]. It spends 12 minutes
in jffs2_garbage_collect_pass.

Then the system goes idle.

Then if I try to access the file [2], it takes 12 minutes to finish
jffs2_lookup.

I have attached the result of booting with 'profile=1'. (HZ=200)

The code spends lots of time in the rbtree code (7 minutes) and 4
minutes in jffs2_get_inode_nodes.


Matthieu

[1]
#0  rb_next (node=0xc1c76e80) at lib/rbtree.c:325
#1  0xc00c5568 in jffs2_get_inode_nodes (c=0xc0a5a800, f=0xc0a5a200,
     rii=0xc1c19dbc) at fs/jffs2/readinode.c:317
#2  0xc00c59d4 in jffs2_do_read_inode_internal (c=0xc0a5a800, f=0xc0a5a200,
     latest_node=0xc1c19e14) at fs/jffs2/readinode.c:1124
#3  0xc00c63a0 in jffs2_do_crccheck_inode (c=0xc0a5a800, ic=0xc03993c8)
     at fs/jffs2/readinode.c:1379
#4  0xc00c9afc in jffs2_garbage_collect_pass (c=0xc0a5a800)
     at fs/jffs2/gc.c:208
#5  0xc00cc56c in jffs2_garbage_collect_thread (_c=<value optimized out>)
     at fs/jffs2/background.c:138
#6  0xc003766c in sys_waitid (which=19019, pid=20115456, infop=0x4a0e,
     options=-1044275912, ru=0x0) at kernel/exit.c:1634

[2]
#0  0xc00e8c14 in rb_prev (node=<value optimized out>) at lib/rbtree.c:368
#1  0xc00c5624 in jffs2_get_inode_nodes (c=0xc0a5a800, f=0xc1c16ca0,
     rii=0xc0fadbf4) at fs/jffs2/readinode.c:355
#2  0xc00c59d4 in jffs2_do_read_inode_internal (c=0xc0a5a800, f=0xc1c16ca0,
     latest_node=0xc0fadca8) at fs/jffs2/readinode.c:1124
#3  0xc00c6604 in jffs2_do_read_inode (c=0xc0a5a800, f=0xc1c16ca0, ino=165,
     latest_node=0xc0fadca8) at fs/jffs2/readinode.c:1364
#4  0xc00cd5c8 in jffs2_read_inode (inode=0xc1c16cd0) at fs/jffs2/fs.c:247
#5  0xc00c0204 in jffs2_lookup (dir_i=0xc1c16310, target=0xc1c0d0d8,
     nd=<value optimized out>) at include/linux/fs.h:1670
#6  0xc0080100 in do_lookup (nd=0xc0fadf08, name=0xc0fadd8c, 
path=0xc0fadd98)
     at fs/namei.c:494
#7  0xc0081e24 in __link_path_walk (name=0xc085300f "", nd=0xc0fadf08)
     at fs/namei.c:940
#8  0xc008245c in link_path_walk (name=0xc0853000 "/mnt/toto/media",
     nd=0xc0fadf08) at fs/namei.c:1011
#9  0xc00829b0 in do_path_lookup (dfd=<value optimized out>,
     name=0xc0853000 "/mnt/toto/media", flags=<value optimized out>,
     nd=0xc0fadf08) at fs/namei.c:1157

[-- Attachment #2: profile.txt --]
[-- Type: text/plain, Size: 1710 bytes --]

 54366 rb_prev                                  543,6600
 28345 rb_next                                  283,4500
  8602 default_idle                              71,6833
 10251 __raw_readsl                              40,0430
 49648 jffs2_get_inode_nodes                     11,8097
   251 s3c2412_nand_devready                      7,8438
  1222 crc32_le                                   4,8492
    58 __delay                                    4,8333
   164 touch_softlockup_watchdog                  4,1000
   245 nand_wait_ready                            2,7841
    78 s3c2440_nand_hwcontrol                     1,6250
    37 s3c2412_nand_enable_hwecc                  1,0278
    44 s3c2412_nand_calculate_ecc                 0,8462
    30 mutex_lock                                 0,7500
    65 kmem_cache_alloc                           0,6250
    12 down_read                                  0,6000
    13 __aeabi_uidivmod                           0,5417
   163 nand_read_page_hwecc                       0,4970
    53 s3c2412_nand_read_buf                      0,4907
    46 jffs2_lookup_node_frag                     0,4423
    24 s3c2412_clkcon_enable                      0,4286
    39 clk_disable                                0,3750
    10 __const_udelay                             0,3571
    39 clk_enable                                 0,3362
    18 strcmp                                     0,3214
    32 sysfs_dirent_exist                         0,2759
    33 __wake_up                                  0,2750
     9 mutex_unlock                               0,2500
    37 kmem_cache_free                            0,2202
   177 memcpy                                     0,2169
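
The tick counts above are consistent with the quoted timings: with HZ=200
each profile tick is 5 ms, so the rbtree helpers and jffs2_get_inode_nodes
account for roughly 7 and 4 minutes respectively:

```python
# Convert profile ticks to wall time (readings from the profile above;
# kernel booted with profile=1 and HZ=200, i.e. 5 ms per tick).
HZ = 200

rbtree_ticks = 54366 + 28345   # rb_prev + rb_next
inode_ticks = 49648            # jffs2_get_inode_nodes

print(round(rbtree_ticks / HZ / 60, 1))  # ~6.9 minutes in the rbtree code
print(round(inode_ticks / HZ / 60, 1))   # ~4.1 minutes in jffs2_get_inode_nodes
```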

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18  9:39     ` Matthieu CASTET
@ 2008-01-18 12:48       ` Josh Boyer
  2008-01-18 16:17         ` Matthieu CASTET
  0 siblings, 1 reply; 34+ messages in thread
From: Josh Boyer @ 2008-01-18 12:48 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: David Woodhouse, Jörn Engel, linux-mtd

On Fri, 18 Jan 2008 10:39:29 +0100
Matthieu CASTET <matthieu.castet@parrot.com> wrote:

> Josh Boyer wrote:
> > On Thu, 17 Jan 2008 17:26:01 +0100
> > Jörn Engel <joern@logfs.org> wrote:
> > 
> >>> If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
> >>> the ls takes 12 minutes to complete.
> >> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
> > 
> > How do you know?  A 200MiB file will likely have around 50,000 nodes.
> Yes, the file has 41324 nodes.

Wow, I'm surprised I actually got close with my off-the-top-of-my-head math :)

> > If the summary stuff is incorrect, and since we have no idea what kind
> > of platform is being used here, it may well be within reason.
> > 
> The summary data is correct (I checked it with a parser on a dump of the
> image). Also, if the summary weren't correct, wouldn't only the mount time grow?

Yes, you're correct.

> In my case the mount is fine: less than 5-10s.
> The platform is an ARM926 @ 247 MHz.

Ok, so running at 247MHz your board has to calculate the CRCs on 41324
nodes before the file can be opened.  I have no idea how long that
should really take, but you're doing about 57 nodes per second if it's
taking 12 minutes.
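
Josh's rate follows directly from the numbers reported in the thread:

```python
# Node-checking rate implied by the reported figures.
nodes = 41324        # nodes in the 200 MB file (reported earlier)
seconds = 12 * 60    # the 12-minute jffs2_garbage_collect_pass

print(round(nodes / seconds, 1))  # ~57.4 nodes checked per second
```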

As Jörn and David suggested, do some profiling to see where it is
spending most of its time.

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 12:48       ` Josh Boyer
@ 2008-01-18 16:17         ` Matthieu CASTET
  2008-01-18 17:55           ` Josh Boyer
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-18 16:17 UTC (permalink / raw)
  To: Josh Boyer; +Cc: David Woodhouse, Jörn Engel, linux-mtd

Josh Boyer wrote:
> On Fri, 18 Jan 2008 10:39:29 +0100
> Matthieu CASTET <matthieu.castet@parrot.com> wrote:
> 
> 
>> In my case the mount is fine: less than 5-10s.
>> The platform is an ARM926 @ 247 MHz.
> 
> Ok, so running at 247MHz your board has to calculate the CRCs on 41324
> nodes before the file can be opened.  I have no idea how long that
> should really take, but you're doing about 57 nodes per second if it's
> taking 12 minutes.
> 
> As Jörn and David suggested, do some profiling to see where it is
> spending most of it's time.
> 
I sent a mail, but because the message has a suspicious header, it is
waiting for moderator approval.


In summary,
the code spends lots of time in the rbtree code (7 minutes) and 4
minutes in jffs2_get_inode_nodes.


Matthieu


  54366 rb_prev                                  543,6600
  28345 rb_next                                  283,4500
   8602 default_idle                              71,6833
  10251 __raw_readsl                              40,0430
  49648 jffs2_get_inode_nodes                     11,8097
    251 s3c2412_nand_devready                      7,8438
   1222 crc32_le                                   4,8492

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 17:43   ` Josh Boyer
  2008-01-18  9:39     ` Matthieu CASTET
@ 2008-01-18 17:20     ` Glenn Henshaw
  2008-01-18 18:39       ` Jamie Lokier
  1 sibling, 1 reply; 34+ messages in thread
From: Glenn Henshaw @ 2008-01-18 17:20 UTC (permalink / raw)
  To: linux-mtd


On 17-Jan-08, at 12:43 PM, Josh Boyer wrote:

> On Thu, 17 Jan 2008 17:26:01 +0100
> Jörn Engel <joern@logfs.org> wrote:
>
>> On Thu, 17 January 2008 17:12:29 +0100, Matthieu CASTET wrote:
>>>
>>> we have a 240 MB jffs2 partition with summary enabled and no
>>> compression. We use the jffs2 version from git commit
>>> 2ad8ee713566671875216ebcec64f2eda47bd19d
>>> (http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=2ad8ee713566671875216ebcec64f2eda47bd19d).
>>>
>>>
>>> On this partition we have several files (each less than 1 MB) and a
>>> big file in the root (200 MB).
>>>
>>> The big file is a FAT image that is exported with usb-storage (in USB
>>> device mode) or mounted on a loopback device.
>>>
>>> After some FAT operations, we manage to get into a situation where
>>> jffs2_garbage_collect_pass takes 12 minutes.
>>>
>>> jffs2_lookup for the big file (triggered by an ls in the root) takes
>>> 12 minutes.
>>>
>>> If we do an ls without waiting for jffs2_garbage_collect_pass to
>>> finish, the ls takes 12 minutes to complete.
>>
>> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
>
> How do you know?  A 200MiB file will likely have around 50,000 nodes.
> If the summary stuff is incorrect, and since we have no idea what kind
> of platform is being used here, it may well be within reason.

   I found a similar problem on an older 2.4.27 based system. We have
a 64k JFFS2 partition (1024 blocks of 4kbytes). As the file system
fills up, the time for any operation increases exponentially. When it
reaches 90% full, it takes minutes to write a file. After a cursory
inspection, it seems to block doing garbage collection and compressing
blocks.

   We gave up and limited the capacity to 60% full at the application
level.

   I'd appreciate any pointers to fixing this, as migrating to a 2.6
kernel is not an option.


-- 
Glenn Henshaw                     Logical Outcome Ltd.
e: thraxisp@logicaloutcome.ca     w: www.logicaloutcome.ca

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 16:17         ` Matthieu CASTET
@ 2008-01-18 17:55           ` Josh Boyer
  2008-01-18 18:17             ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Josh Boyer @ 2008-01-18 17:55 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: David Woodhouse, Jörn Engel, linux-mtd

On Fri, 18 Jan 2008 17:17:34 +0100
Matthieu CASTET <matthieu.castet@parrot.com> wrote:
 
> I sent a mail, but because the message has a suspicious header, it is
> waiting for moderator approval.
> 
> 
> In summary,
> the code spends lots of time in the rbtree code (7 minutes) and 4
> minutes in jffs2_get_inode_nodes.
> 
> 
> Matthieu
> 
> 
>   54366 rb_prev                                  543,6600
>   28345 rb_next                                  283,4500
>    8602 default_idle                              71,6833
>   10251 __raw_readsl                              40,0430
>   49648 jffs2_get_inode_nodes                     11,8097
>     251 s3c2412_nand_devready                      7,8438
>    1222 crc32_le                                   4,8492

That seems consistent with JFFS2 doing the CRC checks and constructing
the in-memory representation of your large file.  I suspect the older
list-based in-memory implementation would have taken even longer, but
perhaps there is something amiss with the rb-tree stuff.

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 17:55           ` Josh Boyer
@ 2008-01-18 18:17             ` Jörn Engel
  2008-01-21 15:57               ` Matthieu CASTET
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-18 18:17 UTC (permalink / raw)
  To: Josh Boyer; +Cc: David Woodhouse, Jörn Engel, linux-mtd, Matthieu CASTET

On Fri, 18 January 2008 11:55:31 -0600, Josh Boyer wrote:
> 
> That seems consistent with JFFS2 doing the CRC checks and constructing
> the in-memory representation of your large file.  I suspect the older
> list-based in-memory implementation would have taken even longer, but
> there could be something amiss with the rb-tree stuff perhaps.

There is something conceptually amiss with rb-trees.  Each node
effectively occupies its own cacheline.  With those 40k+ nodes, you
would need a rather sizeable cache with at least 20k cachelines to have
an impact.  No one has that.  So for all practical purposes, every single
lookup will go to main memory.
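
The working-set arithmetic behind that claim (cacheline and D-cache sizes
below are assumptions, not figures from the thread; ARM926 cores commonly
have 32-byte lines and 8-16 KiB of data cache):

```python
# Why 40k+ rb-tree nodes defeat the cache: the tree's working set vastly
# exceeds a small embedded D-cache.  Sizes are assumed, not measured.
nodes = 41324
cacheline = 32            # assumed bytes per cacheline
dcache = 16 * 1024        # assumed D-cache size in bytes

working_set = nodes * cacheline
print(working_set // 1024)   # ~1291 KiB touched walking the tree
print(dcache // cacheline)   # only 512 cachelines available to hold it
```

With the working set roughly 80x the cache, nearly every rb_next/rb_prev
step is a main-memory access.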

Maybe it is about time to suggest trying logfs?

Jörn

-- 
People will accept your ideas much more readily if you tell them
that Benjamin Franklin said it first.
-- unknown

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-17 16:26 ` Jörn Engel
                     ` (2 preceding siblings ...)
  2008-01-18  9:45   ` Matthieu CASTET
@ 2008-01-18 18:20   ` Jamie Lokier
  3 siblings, 0 replies; 34+ messages in thread
From: Jamie Lokier @ 2008-01-18 18:20 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, David Woodhouse, Matthieu CASTET

Jörn Engel wrote:
> > If we do an ls without waiting for jffs2_garbage_collect_pass to finish,
> > the ls takes 12 minutes to complete.
> 
> Impressive!  JFFS2 may be slow, but it shouldn't be _that_ slow.  Not
> sure who cares enough to look at this.  My approach would be to 
> $ echo t > /proc/sysrq-trigger
> several times during those 12 minutes and take a close look at the code
> paths showing up.  Most likely it will spend 99% of the time in one
> place.

I have seen similar slow GCs with JFFS2 on a 2.4.26-uc0 kernel (which
is very old now), on a partition of just 1MB, and of course with no
summary support.  In this case it wasn't 12 minutes, but about 1 minute
with the GC thread using 100% CPU.  I saw it a couple of times.  But
that's much slower than erasing and writing the whole 1MB, so it's
possible there has been a GC bug doing excessive flash operations that
has remained unfixed for a very long time.

-- Jamie

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 17:20     ` Glenn Henshaw
@ 2008-01-18 18:39       ` Jamie Lokier
  2008-01-18 21:00         ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Jamie Lokier @ 2008-01-18 18:39 UTC (permalink / raw)
  To: Glenn Henshaw; +Cc: linux-mtd

Glenn Henshaw wrote:
>    I found a similar problem on an older 2.4.27 based system. We have
> a 64k JFFS2 partition (1024 blocks of 4kbytes). As the file system
> fills up, the time for any operation increases exponentially. When it
> reaches 90% full, it takes minutes to write a file. After a cursory
> inspection, it seems to block doing garbage collection and compressing
> blocks.
> 
>    We gave up and limited the capacity to 60% full at the application
> level.
> 
>    I'd appreciate any pointers to fixing this, as migrating to a 2.6
> kernel is not an option.

Yes!  I have exactly the same problem, except I'm using 2.4.26-uc0,
and it's a 1MB partition (16 blocks of 64kbytes).

I am tempted to modify the JFFS2 code to implement a hard limit of 50%
full at the kernel level.

The JFFS2 docs suggest 5 free blocks are enough to ensure GC is
working.  In my experience that does often work, but occasionally
there's a catastrophically long and CPU-intensive GC.

-- Jamie

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 18:39       ` Jamie Lokier
@ 2008-01-18 21:00         ` Jörn Engel
  2008-01-19  0:23           ` Jamie Lokier
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-18 21:00 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-mtd, Glenn Henshaw

On Fri, 18 January 2008 18:39:01 +0000, Jamie Lokier wrote:
> 
> Yes!  I have exactly the same problem, except I'm using 2.4.26-uc0,
> and it's a 1MB partition (16 blocks of 64kbytes).
> 
> I am tempted to modify the JFFS2 code to implement a hard limit of 50%
> full at the kernel level.
> 
> The JFFS2 docs suggest 5 free blocks are enough to ensure GC is
> working.  In my experience that does often work, but occasionally
> there's a catastrophically long and CPU intensive GC.

If you want to make GC go berserk, here's a simple recipe:
1. Fill filesystem 100%.
2. Randomly replace single blocks.

There are two ways to solve this problem:
1. Reserve some amount of free space for GC performance.
2. Write in some non-random fashion.

Solution 2 works even better if the filesystem actually sorts data
very roughly by life expectancy.  That requires writing to several
blocks in parallel, i.e. one for long-lived data, one for short-lived
data.  It made an impressive difference in logfs when I implemented that.

And of course academics can write many papers about good heuristics for
predicting life expectancy.  In fact, they already have.

Jörn

-- 
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 21:00         ` Jörn Engel
@ 2008-01-19  0:23           ` Jamie Lokier
  2008-01-19  2:38             ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Jamie Lokier @ 2008-01-19  0:23 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Glenn Henshaw

Jörn Engel wrote:
> If you want to make GC go berserk, here's a simple recipe:
> 1. Fill filesystem 100%.
> 2. Randomly replace single blocks.
> 
> There are two ways to solve this problem:
> 1. Reserve some amount of free space for GC performance.

The real difficulty is that it's not clear how much to reserve for
_reliable_ performance.  We're left guessing based on experience, and
that gives only limited confidence.  The 5 blocks suggested in JFFS2
docs seemed promising, but didn't work out.  Perhaps it does work with
5 blocks, but you have to count all potential metadata overhead and
misalignment overhead when working out how much free "file" data that
translates to?  Really, some of us just want JFFS2 to return -ENOSPC
at _some_ sensible deterministic point before the GC might behave
peculiarly, rather than trying to squeeze as much as possible onto the
partition.

> 2. Write in some non-random fashion.
> 
> Solution 2 works even better if the filesystem actually sorts data
> very roughly by life expectency.  That requires writing to several
> blocks in parallel, i.e. one for long-lived data, one for short-lived
> data.  Made an impressive difference in logfs when I implemented that.

Ah, a bit like generational GC :-)

-- Jamie

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-19  0:23           ` Jamie Lokier
@ 2008-01-19  2:38             ` Jörn Engel
  0 siblings, 0 replies; 34+ messages in thread
From: Jörn Engel @ 2008-01-19  2:38 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Jörn Engel, linux-mtd, Glenn Henshaw

On Sat, 19 January 2008 00:23:02 +0000, Jamie Lokier wrote:
> Jörn Engel wrote:
> > 
> > There are two ways to solve this problem:
> > 1. Reserve some amount of free space for GC performance.
> 
> The real difficulty is that it's not clear how much to reserve for
> _reliable_ performance.  We're left guessing based on experience, and
> that gives only limited confidence.  The 5 blocks suggested in JFFS2
> docs seemed promising, but didn't work out.  Perhaps it does work with
> 5 blocks, but you have to count all potential metadata overhead and
> misalignment overhead when working out how much free "file" data that
> translates to?

The five blocks work well enough if your goal is that GC will return
_eventually_.  Now you come along and even want it to return within a
reasonable amount of time.  That is a different problem. ;)

The math is fairly simple.  The worst case is when the write pattern is
completely random and every block contains the same amount of data.  Let
us pick a 99% full filesystem for starters.

In order to write one block worth of data, GC needs to move 99 blocks
worth of old data before it has freed a full block.  So on average 99%
of all writes handle GC data and only 1% handle the data you, the user,
care about.  If your filesystem is 80% full, 80% of all writes are GC
data and 20% are user data.  Very simple.
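
That relationship is just a function of the fill fraction; the same
arithmetic as a sketch:

```python
def gc_overhead(utilization):
    # For a fully random write pattern on a filesystem that is
    # `utilization` full: how many blocks of old data GC must move to
    # free one block, and what share of all writes is GC traffic.
    moved_per_freed = utilization / (1 - utilization)
    gc_share = utilization
    return moved_per_freed, gc_share

print(gc_overhead(0.99))  # ~99 blocks moved per block freed, 99% GC writes
print(gc_overhead(0.80))  # ~4 blocks moved per block freed, 80% GC writes
```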

Latency is a different problem.  Depending on your design, those 80% or
99% GC writes can happen continuously or in huge batches.

> Really, some of us just want JFFS2 to return -ENOSPC
> at _some_ sensible deterministic point before the GC might behave
> peculiarly, rather than trying to squeeze as much as possible onto the
> partition.

Logfs has a field defined for GC reserve space.  I know about the
problem and I care about it, although I have to admit that mkfs doesn't
allow setting this field yet.

> > 2. Write in some non-random fashion.
> > 
> > Solution 2 works even better if the filesystem actually sorts data
> > very roughly by life expectancy.  That requires writing to several
> > blocks in parallel, i.e. one for long-lived data, one for short-lived
> > data.  Made an impressive difference in logfs when I implemented that.
> 
> Ah, a bit like generational GC :-)

Actually, no.  The different levels of the tree, which JFFS2 doesn't
store on the medium, also happen to have vastly different lifetimes.
Generational GC is the logical next step, which I haven't done yet.

Jörn

-- 
Science is like sex: sometimes something useful comes out,
but that is not the reason we are doing it.
-- Richard Feynman

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-18 18:17             ` Jörn Engel
@ 2008-01-21 15:57               ` Matthieu CASTET
  2008-01-21 21:25                 ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-21 15:57 UTC (permalink / raw)
  To: Jörn Engel; +Cc: David Woodhouse, Josh Boyer, linux-mtd

Hi,

Jörn Engel wrote:
> On Fri, 18 January 2008 11:55:31 -0600, Josh Boyer wrote:
>> That seems consistent with JFFS2 doing the CRC checks and constructing
>> the in-memory representation of your large file.  I suspect the older
>> list-based in-memory implementation would have taken even longer, but
>> there could be something amiss with the rb-tree stuff perhaps.
> 
> There is something conceptually amiss with rb-trees.  Each node
> effectively occupies its own cacheline.  With those 40k+ nodes, you
> would need a rather sizeable cache with at least 20k cachelines to have
> an impact.  No one has that.  So for all practical purposes, every single
> lookup will go to main memory.
> 
> Maybe it is about time to suggest trying logfs?
What's the status of logfs on NAND?

Last time I checked, it didn't handle bad blocks.


Matthieu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 15:57               ` Matthieu CASTET
@ 2008-01-21 21:25                 ` Jörn Engel
  2008-01-21 22:16                   ` Josh Boyer
  2008-01-21 22:36                   ` Glenn Henshaw
  0 siblings, 2 replies; 34+ messages in thread
From: Jörn Engel @ 2008-01-21 21:25 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: David Woodhouse, Jörn Engel, linux-mtd, Josh Boyer

On Mon, 21 January 2008 16:57:59 +0100, Matthieu CASTET wrote:
>
> What's the status of logfs on NAND?
> 
> Last time I checked, it didn't handle bad blocks.

Is that the only thing stopping you from using logfs?

mklogfs handles bad blocks.  Blocks rotting during lifetime are handled
half-heartedly.  If that is a real problem for you I wouldn't be
surprised if you caught logfs on the wrong foot once or twice.

Jörn

-- 
Joern's library part 10:
http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 21:25                 ` Jörn Engel
@ 2008-01-21 22:16                   ` Josh Boyer
  2008-01-21 22:29                     ` Jörn Engel
  2008-01-21 22:36                   ` Glenn Henshaw
  1 sibling, 1 reply; 34+ messages in thread
From: Josh Boyer @ 2008-01-21 22:16 UTC (permalink / raw)
  To: Jörn Engel
  Cc: linux-mtd, Jörn Engel, David Woodhouse, Matthieu CASTET

On Mon, 21 Jan 2008 22:25:56 +0100
Jörn Engel <joern@logfs.org> wrote:

> On Mon, 21 January 2008 16:57:59 +0100, Matthieu CASTET wrote:
> >
> > What's the status of logfs on NAND?
> > 
> > Last time I checked, it didn't handle bad blocks.
> 
> Is that the only thing stopping you from using logfs?
> 
> mklogfs handles bad blocks.  Blocks rotting during lifetime are handled
> half-heartedly.  If that is a real problem for you I wouldn't be
> surprised if you caught logfs on the wrong foot once or twice.

Wait... you're writing a flash filesystem that doesn't really deal with
bad blocks?

josh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 22:16                   ` Josh Boyer
@ 2008-01-21 22:29                     ` Jörn Engel
  2008-01-22  8:57                       ` Matthieu CASTET
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-21 22:29 UTC (permalink / raw)
  To: Josh Boyer; +Cc: linux-mtd, Jörn Engel, David Woodhouse, Matthieu CASTET

On Mon, 21 January 2008 16:16:12 -0600, Josh Boyer wrote:
> 
> Wait... you're writing a flash filesystem that doesn't really deal with
> bad blocks?

I never said that.  Like any other new piece of code, logfs has bugs.
Plain and simple.  And having blocks rot underneath you is something I
don't have automated tests for, so don't be surprised to find bugs in
this area.

If you send a patch or a nice test setup or even just a bug report, the
chances of getting those bugs fixed improve.

Jörn

-- 
One of my most productive days was throwing away 1000 lines of code.
-- Ken Thompson.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 21:25                 ` Jörn Engel
  2008-01-21 22:16                   ` Josh Boyer
@ 2008-01-21 22:36                   ` Glenn Henshaw
  1 sibling, 0 replies; 34+ messages in thread
From: Glenn Henshaw @ 2008-01-21 22:36 UTC (permalink / raw)
  To: linux-mtd


On 21-Jan-08, at 4:25 PM, Jörn Engel wrote:

> On Mon, 21 January 2008 16:57:59 +0100, Matthieu CASTET wrote:
>>
>> What's the status of logfs on NAND?
>>
>> Last time I checked, it didn't handle bad blocks.
>
> Is that the only thing stopping you from using logfs?
>
> mklogfs handles bad blocks.  Blocks rotting during lifetime are handled
> half-heartedly.  If that is a real problem for you I wouldn't be
> surprised if you caught logfs on the wrong foot once or twice.
>
>

   Has it been ported to a 2.4 kernel? I can't upgrade due to the  
amount of work necessary to rewrite drivers.

-- 
Glenn Henshaw                     Logical Outcome Ltd.
e: thraxisp@logicaloutcome.ca     w: www.logicaloutcome.ca

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-21 22:29                     ` Jörn Engel
@ 2008-01-22  8:57                       ` Matthieu CASTET
  2008-01-22 12:03                         ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Matthieu CASTET @ 2008-01-22  8:57 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse

Jörn Engel wrote:
> On Mon, 21 January 2008 16:16:12 -0600, Josh Boyer wrote:
>> Wait... you're writing a flash filesystem that doesn't really deal with
>> bad blocks?
> 
> I never said that.  Like any other new piece of code, logfs has bugs.
> Plain and simple.  And having blocks rot underneath you is something I
> don't have automated tests for, so don't be surprised to find bugs in
> this area.
On mtd->read I see no checking for EBADMSG or EUCLEAN.
There is no call to mtd->block_markbad or mtd->block_isbad (it is only 
called in mtd_find_sb).

So I don't see how this can work on NAND flash.


A good test could be to add bad-block simulation to nandsim.
There is a patch for this 
(http://lists.infradead.org/pipermail/linux-mtd/2006-December/017107.html). 
Note they don't simulate bit-flip on read.


Matthieu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22  8:57                       ` Matthieu CASTET
@ 2008-01-22 12:03                         ` Jörn Engel
  2008-01-22 13:24                           ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-22 12:03 UTC (permalink / raw)
  To: Matthieu CASTET; +Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer

On Tue, 22 January 2008 09:57:07 +0100, Matthieu CASTET wrote:
>
> On mtd->read I see no checking for EBADMSG or EUCLEAN.

Correct.  The CRC check will barf when uncorrectable errors are
encountered.  Using -EUCLEAN as a trigger to scrub the blocks would be
useful.
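
A minimal sketch of that trigger, with hypothetical names (this is not
the actual logfs read path): classify the MTD return code and let
-EUCLEAN schedule a scrub of the block instead of failing the read.

```c
#include <assert.h>
#include <errno.h>

enum read_action { READ_OK, READ_SCRUB, READ_FAIL };

/* Classify an mtd->read() result.  -EUCLEAN means ECC corrected a
 * bitflip (data is good, block is degrading); -EBADMSG means the
 * error was uncorrectable and the CRC check will barf anyway. */
static enum read_action classify_mtd_read(int err)
{
	switch (err) {
	case 0:
		return READ_OK;
	case -EUCLEAN:
		return READ_SCRUB;	/* move data via GC, erase the block */
	case -EBADMSG:
	default:
		return READ_FAIL;
	}
}
```

The point is only that -EUCLEAN is a distinct, non-fatal case that the
read path can route to garbage collection.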

> There is no call to mtd->block_markbad or mtd->block_isbad (it is only 
> called in mtd_find_sb).

Used to be there and was removed.  mtd->erase() does the same as
mtd->block_isbad().  Calling both would be redundant and a waste of
time.  And logfs has its own bad block table (bad segment table,
actually), so mtd->block_markbad could only be called to play nice with
others after the filesystem gets nuked and the flash is reused for something
else.  Not all devices define that method.  For a while I carried a
patch that would add a dummy noop call in add_mtd_device (noop call is
faster than a conditional), but dropped it because it just doesn't
matter enough.
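
As an illustration of that argument (entirely hypothetical names, not
the actual logfs code): a failed erase already proves the block is bad,
so no separate block_isbad() probe is needed.

```c
#include <assert.h>

/* Stand-in for mtd->erase(): in this illustration, segment 3 is bad
 * and erasing it fails with a negative errno. */
static int fake_erase(unsigned int seg)
{
	return seg == 3 ? -5 /* -EIO */ : 0;
}

/* Claim a segment for writing.  If the erase fails, record the segment
 * in the filesystem's own bad segment table and let the caller pick
 * another one; calling block_isbad() first would be redundant. */
static int claim_segment(int (*erase)(unsigned int), unsigned int seg,
			 unsigned char *bad_table)
{
	if (erase(seg) < 0) {
		bad_table[seg] = 1;
		return -1;
	}
	return 0;
}
```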

> Good test could be to add bad block simulation to nandsim.
> There is some patch  for this 
> (http://lists.infradead.org/pipermail/linux-mtd/2006-December/017107.html). 
> Note they don't simulate bit-flip on read.

Ramtd can simulate bit-flips as well.  A nice test setup needs a bit
more than that, some way to do random, yet repeatable errors.

Jörn

-- 
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22 12:03                         ` Jörn Engel
@ 2008-01-22 13:24                           ` Ricard Wanderlof
  2008-01-22 15:05                             ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-22 13:24 UTC (permalink / raw)
  To: Jörn Engel; +Cc: David Woodhouse, Josh Boyer, linux-mtd, Matthieu CASTET


> On Tue, 22 Jan 2008, Jörn Engel wrote:
> 
> > On Tue, 22 January 2008 09:57:07 +0100, Matthieu CASTET wrote:
> >
> ... 
> > There is no call to mtd->block_markbad or mtd->block_isbad (it is only 
> > called in mtd_find_sb).
> 
> Used to be there and was removed.  mtd->erase() does the same as
> mtd->block_isbad().

How do you mean? mtd->erase() will not erase a bad block, that is true.

However, while it seems that mtd->erase() can mark a block bad if it 
fails, the fact that a block is erasable without errors does not imply 
that it is good. I've seen NAND flash blocks which are way past their 
specified number of max write/erase cycles still be successfully erased 
and subsequently written without errors, but the data retention was lousy 
(blocks started to show bit flips after a few thousand reads).

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22 13:24                           ` Ricard Wanderlof
@ 2008-01-22 15:05                             ` Jörn Engel
  2008-01-23  9:23                               ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-22 15:05 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: David Woodhouse, Jörn Engel, linux-mtd, Josh Boyer,
	Matthieu CASTET

On Tue, 22 January 2008 14:24:56 +0100, Ricard Wanderlof wrote:
> >On Tue, 22 Jan 2008, Jörn Engel wrote:
> >
> >Used to be there and was removed.  mtd->erase() does the same as
> >mtd->block_isbad().
> 
> How do you mean? mtd->erase() will not erase a bad block, that is true.

Exactly.  The only use logfs has for block_isbad() is to skip bad blocks
in the beginning when looking for the superblock.

> However, while it seems that mtd->erase() can mark a block bad if it 
> fails, the fact that a block is eraseble without errors does not imply 
> that it is good. I've seen NAND flash blocks which have way passed their 
> specified number of max write/erase cycles still be successfully erased 
> and subsequently written without errors, but the data retention was lousy 
> (blocks started to show bit flips after a few thousand reads).

I think we are being silly[1].  The question is not whether logfs
handles bad blocks (it does), but which particular failure case it
doesn't handle well enough.

- Easy: blocks are initially marked bad, erase returns an error.
  Mklogfs erases the complete device once, any bad blocks get stored in
  the bad segment table.  Segments can span multiple eraseblocks, one
  bad block will spoil the complete segment.

- Impossible: data rots without early warning.
  If the device is that bad, you can either have a RAID or replace the
  device.  Nothing the filesystem could or should do about it.

- Moderate: one block continuously spews -EUCLEAN, then becomes
  terminally bad.
  If those are just random bitflips, garbage collection will move the
  data sooner or later.  Logfs does not force GC to happen soon when
  encountering -EUCLEAN, which it should.  Are correctable errors an
  indication of block going bad in the near future?  If yes, I should do
  something about it.

The list probably goes on and on.  And I am sure that I would miss at
least half the interesting cases if I had to create it on my own.  But
if Matthieu or you or anyone else is willing to compose an extensive
list of such failure cases, I will walk through it and try to handle
them one by one.


[1] Silly in the way politicians are talking about freedom and security.
Everyone agrees that both are valuable goals and any policy can be
justified in the name of one or the other.  Ensures heated talkshow
discussions, but otherwise useless.

Jörn

-- 
The story so far:
In the beginning the Universe was created.  This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-22 15:05                             ` Jörn Engel
@ 2008-01-23  9:23                               ` Ricard Wanderlof
  2008-01-23 10:19                                 ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-23  9:23 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET


On Tue, 22 Jan 2008, Jörn Engel wrote:

> - Moderate: one block continuously spews -EUCLEAN, then becomes
>  terminally bad.
>  If those are just random bitflips, garbage collection will move the
>  data sooner or later.  Logfs does not force GC to happen soon when
>  encountering -EUCLEAN, which it should.  Are correctable errors an
>  indication of block going bad in the near future?  If yes, I should do
>  something about it.

I would say that correctable errors occurring "soon" after writing are an 
indication that the block is going bad. My experience has been that 
extensive reading can cause bitflips (and it probably happens over time 
too), but that for fresh blocks, billions of read operations need to be 
done before a bit flips. For blocks that are nearing their best before 
date, a couple of hundred thousand reads can cause a bit to flip. So if I 
was implementing some sort of 'when is this block considered 
bad'-algorithm, I'd try to keep tabs on how often the block has been 
(read-) accessed in relation to when it was last written. If this number is 
"low", the block should be considered bad and not used again.

I also think that when (if) logfs decides a block is bad, it should mark 
it bad using mtd->block_markbad(). That way, if the flash is rewritten by 
something other than logfs (say during a firmware upgrade), bad blocks can 
be handled in a consistent and standard way.

>The list probably goes on and on.  And I am sure that I would miss at
>least half the interesting cases if I had to create it on my own.  But
>if Matthieu or you or anyone else is willing to compose an extensive
>list of such failure cases, I will walk through it and try to handle
>them one by one.

We ran some tests here on a particular flash chip type to try and 
determine at least some of the failure modes that are related to block 
wear (due to write/erase) and bit decay (due to reading). The end result 
was basically what I tried to describe above, but I can go into more 
detail if you're interested.

/Ricard

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23  9:23                               ` Ricard Wanderlof
@ 2008-01-23 10:19                                 ` Jörn Engel
  2008-01-23 10:41                                   ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-23 10:19 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer,
	Matthieu CASTET

On Wed, 23 January 2008 10:23:55 +0100, Ricard Wanderlof wrote:
> On Tue, 22 Jan 2008, Jörn Engel wrote:
> 
> >- Moderate: one block continuously spews -EUCLEAN, then becomes
> > terminally bad.
> > If those are just random bitflips, garbage collection will move the
> > data sooner or later.  Logfs does not force GC to happen soon when
> > encountering -EUCLEAN, which it should.  Are correctable errors an
> > indication of block going bad in the near future?  If yes, I should do
> > something about it.
> 
> I would say that correctable errors occurring "soon" after writing are an 
> indication that the block is going bad. My experience has been that 
> extensive reading can cause bitflips (and it probably happens over time 
> too), but that for fresh blocks, billions of read operations need to be 
> done before a bit flips. For blocks that are nearing their best before 
> date, a couple of hundred thousand reads can cause a bit to flip. So if I 
> was implementing some sort of 'when is this block considered 
> bad'-algorithm, I'd try to keep tabs on how often the block has been 
> (read-) accessed in relation to when it was last written. If this number is 
> "low", the block should be considered bad and not used again.

That sounds like an impossible strategy.  Causing a write for every read
will significantly increase write pressure, thereby reducing flash
lifetime, reducing performance, etc.

What would be possible is a counter for soft/hard errors per physical
block.  On soft error, move data elsewhere and reuse the block, but
increment the error counter.  If the counter increases beyond 17 (or any
other random number), mark the block as bad.  Limit can be an mkfs
option.
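
A sketch of that counter logic (the names and the limit of 17 are
illustrative, not part of any real logfs format):

```c
#include <assert.h>
#include <stdbool.h>

#define SOFT_ERROR_LIMIT 17	/* arbitrary; could be an mkfs option */

struct seg_state {
	unsigned int soft_errors;	/* -EUCLEAN events seen so far */
	bool bad;
};

/* Called after a read of this segment returned -EUCLEAN: the data gets
 * moved elsewhere, the block is reused, and the counter is bumped.
 * Returns true once the segment should be marked bad for good. */
static bool note_soft_error(struct seg_state *s)
{
	if (++s->soft_errors > SOFT_ERROR_LIMIT)
		s->bad = true;
	return s->bad;
}
```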

> I also think that when (if) logfs decides a block is bad, it should mark 
> it bad using mtd->block_markbad(). That way, if the flash is rewritten by 
> something other than logfs (say during a firmware upgrade), bad blocks can 
> be handled in a consistent and standard way.

Maybe I should revive the old patch then.  I don't think it matters much
either way.

> We ran some tests here on a particular flash chip type to try and 
> determine at least some of the failure modes that are related to block 
> wear (due to write/erase) and bit decay (due to reading). The end result 
> was basically what I tried to describe above, but I can go into more 
> detail if you're interested.

I do remember your mail describing the test.  One of the interesting
conclusions is that even an awfully worn-out block is still good enough to
store short-lived information.  It appears to be a surprisingly robust
strategy to have a high wear-out, as long as you keep the wear
constantly high and replace block contents at a high rate.

Jörn

-- 
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 10:19                                 ` Jörn Engel
@ 2008-01-23 10:41                                   ` Ricard Wanderlof
  2008-01-23 10:57                                     ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-23 10:41 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET


On Wed, 23 Jan 2008, Jörn Engel wrote:

>> I would say that correctable errors occurring "soon" after writing are an
>> indication that the block is going bad. My experience has been that
>> extensive reading can cause bitflips (and it probably happens over time
>> too), but that for fresh blocks, billions of read operations need to be
>> done before a bit flips. For blocks that are nearing their best before
>> date, a couple of hundred thousand reads can cause a bit to flip. So if I
>> was implementing some sort of 'when is this block considered
>> bad'-algorithm, I'd try to keep tabs on how often the block has been
>> (read-) accessed in relation to when it was last written. If this number is
>> "low", the block should be considered bad and not used again.
>
> That sounds like an impossible strategy.  Causing a write for every read
> will significantly increase write pressure, thereby reducing flash
> lifetime, reducing performance, etc.
>
> What would be possible is a counter for soft/hard errors per physical
> block.  On soft error, move data elsewhere and reuse the block, but
> increment the error counter.  If the counter increases beyond 17 (or any
> other random number), mark the block as bad.  Limit can be an mkfs
> option.

Sorry, I didn't express myself clearly. I should have said '...keep tabs 
on how _many times_ the block has been read-accessed in relation to when 
it was last written.' If a page has been read, say, 100 000 times since it 
was last written, and starts to show bit flips, it is a sign that the 
block is wearing out. If it has been read, say, 100 000 000 times since it 
was written and starts showing bit flips, it's probably sufficient just to 
do a garbage collect and rewrite the data (in the same block or 
elsewhere).

The algorithm you suggest also sounds reasonable. Repeatedly occurring bit 
flips (-EUCLEAN) are an indication that the block is wearing out. Probably 
more efficient than logging the number of read accesses somewhere.

One problem may be what to do when the system is powered down. If we don't 
store the error counters in the flash (or some other non-volatile place), 
then each time the system is powered up, all the error counters will be 
reset.

>> We ran some tests here on a particular flash chip type to try and
>> determine at least some of the failure modes that are related to block
>> wear (due to write/erase) and bit decay (due to reading). The end result
>> was basically what I tried to describe above, but I can go into more
>> detail if you're interested.
>
> I do remember your mail describing the test.  One of the interesting
> >conclusions is that even an awfully worn-out block is still good enough to
> store short-lived information.  It appears to be a surprisingly robust
> strategy to have a high wear-out, as long as you keep the wear
> constantly high and replace block contents at a high rate.

You're probably right, but I'm not sure I understand what you mean.

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 10:41                                   ` Ricard Wanderlof
@ 2008-01-23 10:57                                     ` Jörn Engel
  2008-01-23 11:57                                       ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-23 10:57 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer,
	Matthieu CASTET

On Wed, 23 January 2008 11:41:14 +0100, Ricard Wanderlof wrote:
> 
> One problem may be what to do when the system is powered down. If we don't 
> store the error counters in the flash (or some other non-volatile place), 
> then each time the system is powered up, all the error counters will be 
> reset.

Exactly.  If the information we need to detect problems is not stored in
the filesystem, it is useless.  Maybe it would work for a very
specialized system to keep that information in DRAM or NVRAM, but in
general it will get lost.

As a matter of principle logfs does not do any special things for special
systems.  If you have a nice optimization, it has to work for everyone.
If it doesn't, your system behaves differently from everyone else's
systems.  So whatever bugs you have cannot be reproduced by anyone else.

For example, when the system crashes, some data may get written that
isn't accounted for in the journal.  On reboot/remount that gets
detected and the wasted space is skipped.  Writes continue beyond it.
On hard disks or consumer flash media, it is legal to rewrite the same
location without erase.  But logfs still skips that space and is
deliberately inefficient.

> >I do remember your mail describing the test.  One of the interesting
> >conclusions is that even an awfully worn-out block is still good enough to
> >store short-lived information.  It appears to be a surprisingly robust
> >strategy to have a high wear-out, as long as you keep the wear
> >constantly high and replace block contents at a high rate.
> 
> You're probably right, but I'm not sure I understand what you mean.

High wearout means short time between two writes.  Which also means few
reads between two writes.

As long as the write rate (wear rate) remains roughly constant, high
wear doesn't seem to cause any problems.  The block's ability to retain
information degrades, but that doesn't matter because it doesn't have to
retain the information for a long time.

Jörn

-- 
Victory in war is not repetitious.
-- Sun Tzu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 10:57                                     ` Jörn Engel
@ 2008-01-23 11:57                                       ` Ricard Wanderlof
  2008-01-23 13:01                                         ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-23 11:57 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET


On Wed, 23 Jan 2008, Jörn Engel wrote:

>> One problem may be what to do when the system is powered down. If we don't
>> store the error counters in the flash (or some other non-volatile place),
>> then each time the system is powered up, all the error counters will be
>> reset.
>
> Exactly.  If the information we need to detect problems is not stored in
> the filesystem, it is useless.  Maybe it would work for a very
> specialized system to keep that information in DRAM or NVRAM, but in
> general it will get lost.
>
> As a matter of principle logfs does not do any special things for special
> systems.  If you have a nice optimization, it has to work for everyone.
> If it doesn't, your system behaves differently from everyone else's
> systems.  So whatever bugs you have cannot be reproduced by anyone else.

Very true.

Perhaps it's possible to devise something that at least accomplishes part 
of the goal. Such as when writing a new block, also write some statistical 
information such as the number of read accesses since the previous write 
(or power up), or the reason for writing (new data, gc because of 
bitflips, ...) and a write counter. Something of that nature.

> High wearout means short time between two writes.  Which also means few
> reads between two writes.
>
> As long as the write rate (wear rate) remains roughly constant, high
> wear doesn't seem to cause any problems.  The block's ability to retain
> information degrades, but that doesn't matter because it doesn't have to
> retain the information for a long time.

I'd be a bit wary of this with NAND chips, some of which have a 100 000 
maximum erase/write cycle specification, though. And I think that, 
especially when nearing the maximum value and going beyond it, there 
is some bit decay occurring over time and not just from reading.

On the other hand, with 100 000 write cycles total, and assuming a product 
lifetime of 3 years, we end up with over 90 permitted write/erase cycles 
per day. Depending on the situation, it might be quite OK to take 
advantage of this.
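
The budget arithmetic above, spelled out (illustrative helper, not
kernel code): 100 000 cycles spread over a 3-year lifetime.

```c
#include <assert.h>

/* Permitted erase/write cycles per day for a given endurance and
 * product lifetime: 100 000 cycles over 3 years is just over 91/day. */
static unsigned int cycles_per_day(unsigned int max_cycles, unsigned int years)
{
	return max_cycles / (years * 365);
}
```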

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 11:57                                       ` Ricard Wanderlof
@ 2008-01-23 13:01                                         ` Jörn Engel
  2008-01-23 13:16                                           ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-23 13:01 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer,
	Matthieu CASTET

On Wed, 23 January 2008 12:57:09 +0100, Ricard Wanderlof wrote:
> 
> Perhaps it's possible to devise something that at least accomplishes part 
> of the goal. Such as when writing a new block, also write some statistical 
> information such as the number of read accesses since the previous write 
> (or power up), or the reason for writing (new data, gc because of 
> bitflips, ...) and a write counter. Something of that nature.

I'm still fairly unconvinced about the read accounting.  We could do
something purely stochastic like accounting _every_ read, but just with
a probability of, say, 1:100,000.  That would still, within statistical
jitter, behave the same for everyone.  But once we depend on the average
mount time of systems, I'm quite unhappy with the solution.
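
A sketch of that stochastic accounting (probability and names are
illustrative; a real implementation would use the kernel's RNG rather
than rand()):

```c
#include <assert.h>
#include <stdlib.h>

#define SAMPLE_RATE 100000	/* account roughly 1 read in 100,000 */

struct seg_counters {
	unsigned long sampled_reads;	/* true reads ~= this * SAMPLE_RATE */
};

/* Called on every read, but only rarely touches the counter, so the
 * counter itself only rarely needs to be written back to flash. */
static void account_read(struct seg_counters *c)
{
	if (rand() % SAMPLE_RATE == 0)
		c->sampled_reads++;
}
```

Within statistical jitter the sampled count behaves the same on every
system, which is the property argued for above.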

Also, logfs stores a very limited amount of data for each segment (read:
eraseblock).  Currently this is just the erase count, used for wear
leveling, and the segment number.  The latter can be used to detect
blocks being moved around by "something", be it an image flasher,
bootloader, FTL or whatever.  There are still 16 bytes of padding in the
structure, so we could add an error counter without breaking the format.

> I'd be a bit wary of this with NAND chips some of which have a 100 000 
> maximum erase/write cycle specification, though. And I think that 
> especially when nearing the maximum value and going beyond it, that there 
> is some bit decay occurring over time and not just from reading.

It doesn't really matter whether the data degrades from a number of
reads or from time passing.  With a constantly high write rate, there is
less time for degradation than with a low write rate.

Problematic would be to have a high write rate for a while, then a very
low write rate that allows data to rot for a long time.  And also this
depends on your numbers being representative for every flash chip. ;)

Jörn

-- 
The only real mistake is the one from which we learn nothing.
-- John Powell

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 13:01                                         ` Jörn Engel
@ 2008-01-23 13:16                                           ` Ricard Wanderlof
  2008-01-23 14:06                                             ` Jörn Engel
  0 siblings, 1 reply; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-23 13:16 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET


On Wed, 23 Jan 2008, Jörn Engel wrote:

> On Wed, 23 January 2008 12:57:09 +0100, Ricard Wanderlof wrote:
>>
>> Perhaps it's possible to devise something that at least accomplishes part
>> of the goal. Such as when writing a new block, also write some statistical
>> information such as the number of read accesses since the previous write
>> (or power up), or the reason for writing (new data, gc because of
>> bitflips, ...) and a write counter. Something of that nature.
>
> I'm still fairly unconvinced about the read accounting.  We could do
> something purely stochastic like accounting _every_ read, but just with
> a probability of, say, 1:100,000.  That would still, within statistical
> jitter, behave the same for everyone.  But once we depend on the average
> mount time of systems, I'm quite unhappy with the solution.

I think you are right. An error counter should be sufficient to get enough 
statistics to determine if a block has begun to go bad.

>> I'd be a bit wary of this with NAND chips, some of which have a 100 000
>> maximum erase/write cycle specification, though. And I think that,
>> especially when nearing the maximum value and going beyond it, there
>> is some bit decay occurring over time and not just from reading.
>
> It doesn't really matter whether the data degrades from a number of
> reads or from time passing.  With a constantly high write rate, there is
> less time for degradation than with a low write rate.

If we have a system that is only used (= powered on) rarely, then any 
degradation from time passing could become significant.

> Problematic would be to have a high write rate for a while, then a very
> low write rate that allows data to rot for a long time.  And also this
> depends on your numbers being representative for every flash chip. ;)

Yes. And the latter is very true. Our tests were only of a certain chip 
type from a certain manufacturer, and of course other chips might behave 
differently.

The only input I have got from chip manufacturers regarding this issue is 
that with increasing bit densities and decreasing bit cell sizes in the 
future, things like the probability of random bit flips are likely to 
increase. (Somewhere there is a limit when the amount of error correction 
needed to handle these things grows too large to make the chip practically 
useful; say 10 error correction bits per stored bit or whatever).

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 13:16                                           ` Ricard Wanderlof
@ 2008-01-23 14:06                                             ` Jörn Engel
  2008-01-23 14:25                                               ` Ricard Wanderlof
  0 siblings, 1 reply; 34+ messages in thread
From: Jörn Engel @ 2008-01-23 14:06 UTC (permalink / raw)
  To: Ricard Wanderlof
  Cc: linux-mtd, Jörn Engel, David Woodhouse, Josh Boyer,
	Matthieu CASTET

On Wed, 23 January 2008 14:16:12 +0100, Ricard Wanderlof wrote:
> >
> >It doesn't really matter whether the data degrades from a number of
> >reads or from time passing.  With a constantly high write rate, there is
> >less time for degradation than with a low write rate.
> 
> If we have a system that is only used (= powered on) rarely, then any 
> degradation from time passing could become significant.

In that case the write rate wouldn't be _constantly_ high. ;)

> The only input I have got from chip manufacturers regarding this issue is 
> that with increasing bit densities and decreasing bit cell sizes in the 
> future, things like the probability of random bit flips are likely to 
> increase. (Somewhere there is a limit when the amount of error correction 
> needed to handle these things grows too large to make the chip practically 
> useful; say 10 error correction bits per stored bit or whatever).

If error rates increase, device drivers have to do stronger error
correction.  Quality after error correction has been done should stay
roughly the same.

Jörn

-- 
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Jffs2 and big file = very slow jffs2_garbage_collect_pass
  2008-01-23 14:06                                             ` Jörn Engel
@ 2008-01-23 14:25                                               ` Ricard Wanderlof
  0 siblings, 0 replies; 34+ messages in thread
From: Ricard Wanderlof @ 2008-01-23 14:25 UTC (permalink / raw)
  To: Jörn Engel; +Cc: linux-mtd, Josh Boyer, David Woodhouse, Matthieu CASTET


On Wed, 23 Jan 2008, Jörn Engel wrote:

>> The only input I have got from chip manufacturers regarding this issue is
>> that with increasing bit densities and decreasing bit cell sizes in the
>> future, things like the probability of random bit flips are likely to
>> increase. (Somewhere there is a limit when the amount of error correction
>> needed to handle these things grows too large to make the chip practically
>> useful; say 10 error correction bits per stored bit or whatever).
>
> If error rates increase, device drivers have to do stronger error
> correction.  Quality after error correction has been done should stay
> roughly the same.

Yes, true, the first step is to increase the error correction capabilities, 
but there comes a point when there are so many error correction bits 
required per data bit that there is no point in increasing the memory 
size.

Today we have 3 ECC bytes per 256 data bytes in an ordinary NAND flash. If 
geometries decrease we might at some point need, say, 128 ECC bytes, and 
further down the line perhaps even more ECC bytes than data bytes. It then 
eventually comes to a point of diminishing returns; if the geometries are 
decreased and error rates go up, the increase in the number of ECC bits 
might be more than the gain in the number of data bits. This is far down 
the line, and partly speculative, I agree.
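A rough back-of-the-envelope check of these figures, assuming a plain single-error-correcting Hamming code (real NAND ECC schemes also detect double-bit errors and differ in detail): p parity bits cover k data bits when 2^p >= k + p + 1.

```python
# Back-of-the-envelope check of the ECC overhead figures above, assuming
# a plain single-error-correcting Hamming code (real NAND ECC schemes
# add double-error detection and differ in detail): p parity bits cover
# k data bits when 2**p >= k + p + 1.

def hamming_parity_bits(data_bits):
    """Minimal parity bits p with 2**p >= data_bits + p + 1."""
    p = 0
    while 2 ** p < data_bits + p + 1:
        p += 1
    return p

sector_bits = 256 * 8                    # a 256-byte NAND sector
p = hamming_parity_bits(sector_bits)     # 12 bits for 2048 data bits

overhead_today = 3 / 256      # 3 ECC bytes per 256 data bytes ~ 1.2 %
overhead_future = 128 / 256   # hypothetical 128 ECC bytes -> 50 %
```

So today's 3 bytes per 256 is comfortably above the Hamming minimum, while the hypothetical 128 bytes per 256 would mean half the chip is parity, which is where the diminishing-returns argument bites.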

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2008-01-23 14:26 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-17 16:12 Jffs2 and big file = very slow jffs2_garbage_collect_pass Matthieu CASTET
2008-01-17 16:26 ` Jörn Engel
2008-01-17 17:43   ` Josh Boyer
2008-01-18  9:39     ` Matthieu CASTET
2008-01-18 12:48       ` Josh Boyer
2008-01-18 16:17         ` Matthieu CASTET
2008-01-18 17:55           ` Josh Boyer
2008-01-18 18:17             ` Jörn Engel
2008-01-21 15:57               ` Matthieu CASTET
2008-01-21 21:25                 ` Jörn Engel
2008-01-21 22:16                   ` Josh Boyer
2008-01-21 22:29                     ` Jörn Engel
2008-01-22  8:57                       ` Matthieu CASTET
2008-01-22 12:03                         ` Jörn Engel
2008-01-22 13:24                           ` Ricard Wanderlof
2008-01-22 15:05                             ` Jörn Engel
2008-01-23  9:23                               ` Ricard Wanderlof
2008-01-23 10:19                                 ` Jörn Engel
2008-01-23 10:41                                   ` Ricard Wanderlof
2008-01-23 10:57                                     ` Jörn Engel
2008-01-23 11:57                                       ` Ricard Wanderlof
2008-01-23 13:01                                         ` Jörn Engel
2008-01-23 13:16                                           ` Ricard Wanderlof
2008-01-23 14:06                                             ` Jörn Engel
2008-01-23 14:25                                               ` Ricard Wanderlof
2008-01-21 22:36                   ` Glenn Henshaw
2008-01-18 17:20     ` Glenn Henshaw
2008-01-18 18:39       ` Jamie Lokier
2008-01-18 21:00         ` Jörn Engel
2008-01-19  0:23           ` Jamie Lokier
2008-01-19  2:38             ` Jörn Engel
2008-01-17 23:22   ` David Woodhouse
2008-01-18  9:45   ` Matthieu CASTET
2008-01-18 18:20   ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox