* The problem that I didn't think out
@ 2005-11-25 8:36 Zhao, Forrest
2005-11-25 11:58 ` Artem B. Bityutskiy
0 siblings, 1 reply; 8+ messages in thread
From: Zhao, Forrest @ 2005-11-25 8:36 UTC (permalink / raw)
To: Artem B. Bityutskiy; +Cc: linux-mtd
Hi, Artem
I have been thinking about GC in JFFS3 these days, but I can't
come up with a perfect answer for it, so I'd like to discuss it
with you.
I think the basic problem is: we need to keep track of both erase
block usage (which pages on a block are obsolete) and the erase blocks
with small erase counts, in order to achieve reclamation efficiency
and wear-leveling in GC.
1. If we store the erase block usage information in a file on flash,
we need to sort it in RAM in order to find the dirtiest block.
This sort operation is O(N), where N is the number of erase blocks,
and the memory consumed by the sorted data structure is also O(N).
2. Whether the erase count is stored in the erase block header or in a
file on flash, we need to sort it in RAM in order to find the block
with the least erase count for WL. Again this sort operation is O(N),
where N is the number of erase blocks, and the memory consumed by the
sorted data structure is also O(N).
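The O(N) candidate selection can be sketched as a user-level toy like
this (the per-block record fields here are invented for illustration;
they are not real JFFS3 structures):

```python
# Toy sketch of the O(N) candidate selection discussed above: one pass
# over all N per-eraseblock records finds the dirtiest block (for
# reclamation) and the least-erased block (for wear-leveling).
# The record fields are invented for illustration.

def pick_candidates(blocks):
    dirtiest = max(blocks, key=lambda b: b["dirty_pages"])
    least_worn = min(blocks, key=lambda b: b["erase_count"])
    return dirtiest, least_worn

blocks = [
    {"num": 0, "dirty_pages": 12, "erase_count": 40},
    {"num": 1, "dirty_pages": 55, "erase_count": 7},
    {"num": 2, "dirty_pages": 3,  "erase_count": 90},
]
d, w = pick_candidates(blocks)       # both happen to be block 1 here
```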
I used to think that we could put the sorted data structure on flash,
but this would incur an enormous number of update operations.
It seems there is no perfect solution to this. What's your idea
about it?
BTW, I read some news that Samsung plans to use large-volume NAND
(8G, 16G) to replace hard disks. But I think the out-of-place update
and limited-erase-count nature of flash memory makes this replacement
impossible (at least in the near future). Although flash has a faster
R/W speed than HD, I don't think real write operations (running a file
system) on flash will perform better than on HD, since one update
operation on HD incurs multiple update operations on flash. I think FS
write performance on flash and FS scalability contradict each other.
What's your opinion about NAND replacing HD?
Thanks,
Forrest
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: The problem that I didn't think out
2005-11-25 8:36 The problem that I didn't think out Zhao, Forrest
@ 2005-11-25 11:58 ` Artem B. Bityutskiy
2005-11-27 13:01 ` Ferenc Havasi
2005-11-28 2:20 ` zhao, forrest
0 siblings, 2 replies; 8+ messages in thread
From: Artem B. Bityutskiy @ 2005-11-25 11:58 UTC (permalink / raw)
To: Zhao, Forrest; +Cc: linux-mtd
Hello Zhao,
> I have been thinking about GC in JFFS3 these days, but I can't
> come up with a perfect answer for it, so I'd like to discuss it
> with you.
Yeah, we still can't come up with anything perfect.
> I think the basic problem is: we need to keep track of both erase
> block usage (which pages on a block are obsolete) and the erase blocks
> with small erase counts, in order to achieve reclamation efficiency
> and wear-leveling in GC.
Imagine that this is not needed for now, or that we keep this
information in RAM. Does that help in inventing a GC algorithm?
> If we store the erase block usage information in a file on flash,
> we need to sort it in RAM in order to find the dirtiest block.
> This sort operation is O(N), where N is the number of erase blocks,
> and the memory consumed by the sorted data structure is also O(N).
We may divide this file into constant-size chunks and search only one
chunk at a time. And we may keep a list of the 10 most worn-out
eraseblocks in the superblock. So the scalability will be O(1) IMO.
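The chunk-at-a-time idea can be sketched as a user-level toy roughly
like this (the chunk size, list size, and all names are invented; this
is not the real superblock format):

```python
import heapq

# Toy sketch of the chunked-scan idea: examine one fixed-size chunk of
# the erase-count records per GC pass, and keep only a small bounded
# list of the most worn-out eraseblocks. Per-pass cost depends on the
# chunk and list sizes, not on the total number of eraseblocks.

CHUNK = 4      # eraseblocks examined per GC pass (invented value)
KEEP = 10      # size of the "most worn out" list kept in the superblock

def gc_pass(erase_counts, start, worst):
    """Scan one chunk starting at 'start'; merge it into the bounded
    'worst' list of (erase_count, block_number) pairs."""
    chunk = [(cnt, blk) for blk, cnt in
             enumerate(erase_counts[start:start + CHUNK], start)]
    worst = heapq.nlargest(KEEP, worst + chunk)
    next_start = (start + CHUNK) % len(erase_counts)
    return worst, next_start

counts = [5, 80, 3, 42, 17, 99, 1, 60]
worst, start = [], 0
for _ in range(len(counts) // CHUNK):   # two passes cover all blocks
    worst, start = gc_pass(counts, start, worst)
# 'worst' now holds the most worn blocks, largest erase counts first
```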
> Whether the erase count is stored in the erase block header or in a
> file on flash, we need to sort it in RAM in order to find the block
> with the least erase count for WL. Again this sort operation is O(N),
> where N is the number of erase blocks, and the memory consumed by the
> sorted data structure is also O(N).
Something similar is applicable here, I think.
For the erase count we may maintain a separate 'file' which is not part
of the tree. The data of this file is put into distinct, separate
eraseblocks. The file is referred to from the SB.
> I used to think that we could put the sorted data structure on flash,
> but this would incur an enormous number of update operations.
Enormous? This needs measuring and evaluating. Let's create a user-level
prototype and see.
> It seems there is no perfect solution to this. What's your idea
> about it?
Something like described above. Let's at least invent a GC algorithm
without thinking about wear-levelling/accounting for now. I believe that
may be added independently afterwards.
> BTW, I read some news that Samsung plans to use large-volume NAND
> (8G, 16G) to replace hard disks. But I think the out-of-place update
> and limited-erase-count nature of flash memory makes this replacement
> impossible (at least in the near future).
Why? FTL-like schemes will work; just add more RAM.
> Although flash has a faster R/W speed than HD, I don't think real
> write operations (running a file system) on flash will perform better
> than on HD, since one update operation on HD incurs multiple update
> operations on flash.
Maybe. Tests are needed.
> I think FS write performance on flash and FS scalability contradict
> each other.
I still believe they don't.
> What's your opinion about the NAND replacing HD?
Well, flash consumes less energy, is more compact, and may operate in
environments with high vibration, so there are applications where flash
beats HDD.
I dare not argue about NAND vs. HDD in general; that requires too much
effort to analyze.
--
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.
* Re: The problem that I didn't think out
2005-11-25 11:58 ` Artem B. Bityutskiy
@ 2005-11-27 13:01 ` Ferenc Havasi
2005-11-27 13:20 ` Artem B. Bityutskiy
2005-11-28 2:20 ` zhao, forrest
1 sibling, 1 reply; 8+ messages in thread
From: Ferenc Havasi @ 2005-11-27 13:01 UTC (permalink / raw)
To: Artem B. Bityutskiy; +Cc: linux-mtd, Zhao, Forrest
Hi Artem (& Zhao),
We had some time, so we read the plan of JFFS3 (along with the
ReiserFS documentation).
There are many nice solutions in it. I like wandering trees, the journal
and the superblock management. Nice work, congratulation.
Key compression is the only thing in the plan that I think is better
left unused. I think making key management as simple and fast as
possible is more important than a few percent of flash usage. (If I am
right, the bottleneck in real products is not flash usage but speed.)
But that is a small technical question.
The big questions are the ones you are already thinking about: garbage
collection and wear leveling. Without solving them JFFS3 has no nice
future. I only want to say that we are also now thinking about these
important problems... and we will write if we find anything usable. (It
may feel better not to think about something alone... :) )
Ferenc
* Re: The problem that I didn't think out
2005-11-27 13:01 ` Ferenc Havasi
@ 2005-11-27 13:20 ` Artem B. Bityutskiy
2005-11-28 0:32 ` Ferenc Havasi
0 siblings, 1 reply; 8+ messages in thread
From: Artem B. Bityutskiy @ 2005-11-27 13:20 UTC (permalink / raw)
To: Ferenc Havasi; +Cc: linux-mtd, Zhao, Forrest
Ferenc, Greetings!
> We had some time, so we read the plan of JFFS3 (along with the
> ReiserFS documentation).
Oh, what delightful news! :-)
> Key compression is the only thing in the plan that I think is better
> left unused. I think making key management as simple and fast as
> possible is more important than a few percent of flash usage. (If I am
> right, the bottleneck in real products is not flash usage but speed.)
> But that is a small technical question.
Well, I would not agree, compression guru Ferenc though you be :-) When
you start thinking about GC, you'll notice that indexing nodes are
rewritten *a great deal* of times. So, the smaller the index, the faster
JFFS3 is. The main idea of key compression is *not* to save flash space,
but to lower the (index size)/(data size) ratio.
Nonetheless, tests should show the worth of key compression. There is
obviously a (CPU time) vs. (amount of flash I/O) trade-off. Thus, I
propose to wait for a JFFS3 prototype and evaluate this then. Let's mark
this stuff as "to be evaluated".
> The big questions are the ones you are already thinking about: garbage
> collection and wear leveling. Without solving them JFFS3 has no nice
> future. I only want to say that we are also now thinking about these
> important problems... and we will write if we find anything usable. (It
> may feel better not to think about something alone... :) )
Great! :-)
Thank you for this feedback.
--
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.
* Re: The problem that I didn't think out
2005-11-27 13:20 ` Artem B. Bityutskiy
@ 2005-11-28 0:32 ` Ferenc Havasi
2005-11-28 10:15 ` Artem B. Bityutskiy
0 siblings, 1 reply; 8+ messages in thread
From: Ferenc Havasi @ 2005-11-28 0:32 UTC (permalink / raw)
To: Artem B. Bityutskiy; +Cc: linux-mtd, Zhao, Forrest
Hi Artem,
> Nonetheless, tests should show the worth of key compression. There is
> obviously a (CPU time) vs. (amount of flash I/O) trade-off. Thus, I
> propose to wait for a JFFS3 prototype and evaluate this then. Let's
> mark this stuff as "to be evaluated".
Yes, I think the same.
>> The big questions are the ones you are already thinking about: garbage
>> collection and wear leveling. Without solving them JFFS3 has no nice
>> future. I only want to say that we are also now thinking about these
>> important problems... and we will write if we find anything usable. (It
>> may feel better not to think about something alone... :) )
>
> Great! :-)
I was thinking a lot. There is something in my mind, but I have to think
it over again tomorrow before writing it down.
Be that as it may, I have a NAND question; I think you know the answer
(NAND guru Artem :) ):
- if an erase block is already full, is there any possibility to mark
its pages as obsolete (individually)? Is it allowed to write to the OOB
(or data) area once more? (Or write the OOB later than the data... or
some similar trick?)
On NOR it is so easy.
Ferenc
* Re: The problem that I didn't think out
2005-11-25 11:58 ` Artem B. Bityutskiy
2005-11-27 13:01 ` Ferenc Havasi
@ 2005-11-28 2:20 ` zhao, forrest
2005-11-28 10:24 ` Artem B. Bityutskiy
1 sibling, 1 reply; 8+ messages in thread
From: zhao, forrest @ 2005-11-28 2:20 UTC (permalink / raw)
To: Artem B. Bityutskiy; +Cc: linux-mtd
> > If we store the erase block usage information in a file on flash,
> > we need to sort it in RAM in order to find the dirtiest block.
> > This sort operation is O(N), where N is the number of erase blocks,
> > and the memory consumed by the sorted data structure is also O(N).
> We may divide this file into constant-size chunks and search only one
> chunk at a time. And we may keep a list of the 10 most worn-out
> eraseblocks in the superblock. So the scalability will be O(1) IMO.
>
I think keeping a list of the 10 most worn-out eraseblocks
is hard to do in O(1), since the list changes dynamically.
In particular, we need to recalculate the list after every update
operation, and in order to recalculate it we need to keep the usage
information of all erase blocks in RAM in sorted order. So if that is
the case, the RAM consumption is O(N).
> > I used to think that we could put the sorted data structure on flash,
> > but this would incur an enormous number of update operations.
> Enormous? This needs measuring and evaluating. Let's create a user-level
> prototype and see.
Agreed. We need to do a lot of measuring, evaluating and comparing
before reaching the final design decision.
Do you have a timetable for the user-level prototype? I think I can
contribute to it.
Thanks,
Forrest
* Re: The problem that I didn't think out
2005-11-28 0:32 ` Ferenc Havasi
@ 2005-11-28 10:15 ` Artem B. Bityutskiy
0 siblings, 0 replies; 8+ messages in thread
From: Artem B. Bityutskiy @ 2005-11-28 10:15 UTC (permalink / raw)
To: Ferenc Havasi; +Cc: linux-mtd, Zhao, Forrest
Ferenc Havasi wrote:
> Be that as it may, I have a NAND question; I think you know the answer
> (NAND guru Artem :) ):
> - if an erase block is already full, is there any possibility to mark
> its pages as obsolete (individually)? Is it allowed to write to the OOB
> (or data) area once more? (Or write the OOB later than the data... or
> some similar trick?)
Yes, but this depends on the NAND type. In the worst case, you may
write there at least once; with some NANDs, more.
Note, JFFS2 writes a clean marker to the OOB area of the first page of a
newly erased eraseblock. But that doesn't prevent JFFS2 from writing to
this page later and storing ECC in the OOB, even though the OOB already
contains some data (the clean marker).
--
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.
* Re: The problem that I didn't think out
2005-11-28 2:20 ` zhao, forrest
@ 2005-11-28 10:24 ` Artem B. Bityutskiy
0 siblings, 0 replies; 8+ messages in thread
From: Artem B. Bityutskiy @ 2005-11-28 10:24 UTC (permalink / raw)
To: zhao, forrest; +Cc: linux-mtd
zhao, forrest wrote:
> I think keeping a list of the 10 most worn-out eraseblocks
> is hard to do in O(1), since the list changes dynamically.
> In particular, we need to recalculate the list after every update
> operation, and in order to recalculate it we need to keep the usage
> information of all erase blocks in RAM in sorted order. So if that is
> the case, the RAM consumption is O(N).
>
Hmm, I meant the following.
1. How to choose an eraseblock to garbage-collect.
We split all the eraseblocks into several (virtual) groups, N
eraseblocks in each. For example, 30 eraseblocks in each group: 0-29,
30-59, ... One of the groups is "current".
Suppose we have mounted JFFS3. Assign group 0 to be "current". GC asks
us to choose an eraseblock to GC. We search for the "best" candidate
only in the current group, and GC it. During the search we also find the
M "best" candidates. Then switch the current group to the next one. Next
time we'll search in the next group. We'll correct our "best" list. Etc.
Something like this. Note, this won't give us the absolutely best
candidates. But still.
The description is just an idea; don't take it as a serious design.
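As a user-level toy, the group-based selection could look something
like this (the group size, dirtiness metric, and all names are invented
for illustration, in the same spirit as the idea above):

```python
# Toy model of group-based GC selection: eraseblocks are split into
# fixed virtual groups, and each GC invocation searches only the
# "current" group for a victim, then advances to the next group.
# The cost per invocation is O(GROUP_SIZE), independent of the total
# number of eraseblocks, at the price of not picking the global best.

GROUP_SIZE = 3     # invented value; the text suggests e.g. 30

def pick_from_group(dirtiness, current_group):
    """Return the dirtiest block within one group, plus the next
    group to examine (round-robin)."""
    start = current_group * GROUP_SIZE
    group = range(start, min(start + GROUP_SIZE, len(dirtiness)))
    victim = max(group, key=lambda b: dirtiness[b])
    n_groups = (len(dirtiness) + GROUP_SIZE - 1) // GROUP_SIZE
    return victim, (current_group + 1) % n_groups

dirtiness = [1, 9, 4, 7, 2, 8]          # two groups: blocks 0-2 and 3-5
v1, g = pick_from_group(dirtiness, 0)   # dirtiest in group 0
v2, _ = pick_from_group(dirtiness, g)   # dirtiest in group 1
```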
--
Best Regards,
Artem B. Bityutskiy,
St.-Petersburg, Russia.