Flash erase groups and filesystems

public inbox for linux-kernel@vger.kernel.org

* Flash erase groups and filesystems
@ 2005-08-15 20:21 Pierre Ossman
  2005-08-16 16:27 ` Jörn Engel
  0 siblings, 1 reply; 9+ messages in thread
From: Pierre Ossman @ 2005-08-15 20:21 UTC (permalink / raw)
  To: LKML

If you know how flash erase groups behave then skip a bit ;)

As you may or may not be aware, flash memory tends to be designed so
that only large groups of sectors can be erased at a time. For most
systems this is handled automatically by having the onboard controller
cache everything but the sector being overwritten. If you write a
single sector at a time (or if the controller is stupid) this will
result in the flash being erased several times because of writes in
sectors close by. The end result being that your flash is worn out faster.
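
To put rough numbers on that, here is a toy model of a controller that
does no caching. It is only a sketch: the erase group size of 32 sectors
and the name erase_counts are assumptions made up for illustration, not
values from any real card.

```python
from collections import Counter

SECTORS_PER_GROUP = 32  # assumed erase group size; real cards differ


def erase_counts(writes):
    """Count erases per erase group for a list of (first_sector, count) writes.

    Assumes a controller without caching: every write request erases each
    erase group it touches exactly once.
    """
    counts = Counter()
    for first, count in writes:
        touched = {s // SECTORS_PER_GROUP for s in range(first, first + count)}
        for group in touched:
            counts[group] += 1
    return counts


# 128 sectors written one at a time versus in a single request.
one_at_a_time = erase_counts([(s, 1) for s in range(128)])
single_request = erase_counts([(0, 128)])
print(one_at_a_time)
print(single_request)
```

Under these assumptions each group is erased once per sector written to
it in the first case, and only once in total in the second.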

--8<--- skip to here ----

To minimise the number of erases the MMC protocol supports pre-erasing
blocks before you actually write to them. Now what I'm unclear on is how
this will interact with filesystems and the assumptions they make.

If the controller gets a request to write 128 sectors and this fails
after 20 sectors, the remaining 108 sectors will still have lost their
data because of the pre-erase. Will this break assumptions made in the
VFS layer? I.e. does it assume that only the failed sector has unknown data?
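
The failure case can be sketched as follows. This is a toy model, not
MMC code: the medium is a plain list, and write_with_pre_erase and its
fail_after parameter are hypothetical names used only to force the
failure.

```python
from collections import Counter

ERASED = "0xff"  # stand-in for a sector that reads back as all 0xff


def write_with_pre_erase(medium, first, new_data, fail_after=None):
    """Pre-erase the whole range, then write it sector by sector.

    fail_after is a hypothetical knob: the write aborts once that many
    sectors have been written. Returns the number of sectors written.
    """
    for sector in range(first, first + len(new_data)):
        medium[sector] = ERASED
    for offset, value in enumerate(new_data):
        if fail_after is not None and offset >= fail_after:
            return offset
        medium[first + offset] = value
    return len(new_data)


medium = ["old"] * 128
written = write_with_pre_erase(medium, 0, ["new"] * 128, fail_after=20)
print(written, Counter(medium))
```

After the failed request, 20 sectors hold new data and the other 108
hold neither old nor new data.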

I'm writing a patch that gives this functionality to the MMC layer and
since I'm no VFS expert I need some input into any side effects.

Rgds
Pierre


* Re: Flash erase groups and filesystems
  2005-08-15 20:21 Flash erase groups and filesystems Pierre Ossman
@ 2005-08-16 16:27 ` Jörn Engel
  2005-08-16 17:09   ` Pierre Ossman
  0 siblings, 1 reply; 9+ messages in thread
From: Jörn Engel @ 2005-08-16 16:27 UTC (permalink / raw)
  To: Pierre Ossman; +Cc: LKML

On Mon, 15 August 2005 22:21:55 +0200, Pierre Ossman wrote:
> 
> To minimise the number of erases the MMC protocol supports pre-erasing
> blocks before you actually write to them. Now what I'm unclear on is how
> this will interact with filesystems and the assumptions they make.
> 
> If the controller gets a request to write 128 sectors and this fails
> after 20 sectors, the remaining 108 sectors will still have lost their
> data because of the pre-erase. Will this break assumptions made in the
> VFS layer? I.e. does it assume that only the failed sector has unknown data?
> 
> I'm writing a patch that gives this functionality to the MMC layer and
> since I'm no VFS expert I need some input into any side effects.

Question came up before, albeit with a different phrasing.  One
possible approach to benefit from this ability would be to create a
"forget" operation.  When a filesystem already knows that some data is
unneeded (after a truncate or erase operation), it will ask the device
to forget previously occupied blocks.

The device then has the _option_ of handling the forget operation.
Further reads on these blocks may return random data.

And since no one stepped up to implement this yet, you can still get
all the fame and glory yourself! ;)

Jörn

-- 
All art is but imitation of nature.
-- Lucius Annaeus Seneca


* Re: Flash erase groups and filesystems
  2005-08-16 16:27 ` Jörn Engel
@ 2005-08-16 17:09   ` Pierre Ossman
  2005-08-16 18:13     ` Jörn Engel
  2005-08-17 14:35     ` Pavel Machek
  0 siblings, 2 replies; 9+ messages in thread
From: Pierre Ossman @ 2005-08-16 17:09 UTC (permalink / raw)
  To: Jörn Engel; +Cc: LKML

Jörn Engel wrote:

>Question came up before, albeit with a different phrasing.  One
>possible approach to benefit from this ability would be to create a
>"forget" operation.  When a filesystem already knows that some data is
>unneeded (after a truncate or erase operation), it will ask the device
>to forget previously occupied blocks.
>
>The device then has the _option_ of handling the forget operation.
>Further reads on these blocks may return random data.
>
>And since no one stepped up to implement this yet, you can still get
>all the fame and glory yourself! ;)
>  
>

I'm not sure we're talking about the same thing. I'm not suggesting new
features in the VFS layer. I want to know if something breaks if I
implement this erase feature in the MMC layer. In essence the file
system has marked the sectors as "forget" by issuing a write to them.
The question is if it is assumed that they are unchanged if the write
fails half-way through.

I'd have to say that this is a dangerous assumption to make already
today since some systems might not be able to tell where it fails if a
large chunk of data is given to it, perhaps because of a deep pipeline
before it actually reaches the physical storage.

Rgds
Pierre


* Re: Flash erase groups and filesystems
  2005-08-16 17:09   ` Pierre Ossman
@ 2005-08-16 18:13     ` Jörn Engel
  2005-08-16 18:52       ` Jörn Engel
  2005-08-17 14:35     ` Pavel Machek
  1 sibling, 1 reply; 9+ messages in thread
From: Jörn Engel @ 2005-08-16 18:13 UTC (permalink / raw)
  To: Pierre Ossman; +Cc: LKML

On Tue, 16 August 2005 19:09:12 +0200, Pierre Ossman wrote:
> 
> I'm not sure we're talking about the same thing. I'm not suggesting new
> features in the VFS layer. I want to know if something breaks if I
> implement this erase feature in the MMC layer. In essence the file
> system has marked the sectors as "forget" by issuing a write to them.
> The question is if it is assumed that they are unchanged if the write
> fails half-way through.

Yes.  Most filesystems expect to find either 1) old data or 2) new
data.  Blocks full of 0xff are non-expected.

These expectations are quite reasonable for hard disk media, which is
what the filesystems were designed for.

> I'd have to say that this is a dangerous assumption to make already
> today since some systems might not be able to tell where it fails if a
> large chunk of data is given to it, perhaps because of a deep pipeline
> before it actually reaches the physical storage.

The assumption is merely that, at no time, there will be random data
on the medium.  Both old and new data are somewhat well-defined.  It
doesn't take a PhD to see a potential problem when moving to flash.

Jörn

-- 
Fancy algorithms are buggier than simple ones, and they're much harder
to implement. Use simple algorithms as well as simple data structures.
-- Rob Pike


* Re: Flash erase groups and filesystems
  2005-08-16 18:13     ` Jörn Engel
@ 2005-08-16 18:52       ` Jörn Engel
  2005-08-17 11:35         ` Pierre Ossman
  0 siblings, 1 reply; 9+ messages in thread
From: Jörn Engel @ 2005-08-16 18:52 UTC (permalink / raw)
  To: Pierre Ossman; +Cc: LKML

On Tue, 16 August 2005 20:13:36 +0200, Jörn Engel wrote:
> On Tue, 16 August 2005 19:09:12 +0200, Pierre Ossman wrote:
> > 
> > I'm not sure we're talking about the same thing. I'm not suggesting new
> > features in the VFS layer. I want to know if something breaks if I
> > implement this erase feature in the MMC layer. In essence the file
> > system has marked the sectors as "forget" by issuing a write to them.
> > The question is if it is assumed that they are unchanged if the write
> > fails half-way through.
> 
> Yes.  Most filesystems expect to find either 1) old data or 2) new
> data.  Blocks full of 0xff are non-expected.

Maybe this isn't obvious.  Because of this expectation, it is
absolutely not safe to pre-erase blocks, just because the fs will
write them anyway.  Unless you can guarantee that the write will
always succeed, even in case of power outage, you just broke the
expectation.

Fixing all filesystems is also not an option, even ignoring the
question whether such a change would be a fix, a change of behaviour
or a plain bug.

So the only remaining option is to add a new interface that lets
filesystems decide to support pre-erase in some form.  And one such
interface would be the "forget" operation.  Nice attribute of forget
is the fact that it would also help some FTL layers in the kernel.
There is nothing MMC-specific about it.
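
A rough sketch of what such an interface could look like, purely to
illustrate the idea. No such operation exists yet, so every name here
(BlockDevice, PreErasingDevice, forget, free_blocks) is hypothetical,
and one byte stands in for a whole sector.

```python
ERASED = b"\xff"  # one byte stands in for a whole erased sector


class BlockDevice:
    """Hypothetical block device; forget() is an optional hint."""

    def __init__(self, nsectors):
        self.sectors = [b"\x00"] * nsectors

    def write(self, sector, data):
        self.sectors[sector] = data

    def read(self, sector):
        return self.sectors[sector]

    def forget(self, first, count):
        # Default: ignore the hint, so the old data stays readable.
        return False


class PreErasingDevice(BlockDevice):
    """Device that uses the hint to erase ahead of the next write."""

    def forget(self, first, count):
        for sector in range(first, first + count):
            self.sectors[sector] = ERASED
        return True


def free_blocks(dev, first, count):
    """Filesystem side: called only once the blocks are really unneeded."""
    return dev.forget(first, count)


for dev in (BlockDevice(8), PreErasingDevice(8)):
    dev.write(3, b"A")
    handled = free_blocks(dev, 3, 1)
    print(type(dev).__name__, handled, dev.read(3))
```

The difference from pre-erasing on write is who decides: here the
filesystem states that the contents are disposable, so a later read
returning something other than the old data breaks no expectation.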

Jörn

-- 
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens


* Re: Flash erase groups and filesystems
  2005-08-16 18:52       ` Jörn Engel
@ 2005-08-17 11:35         ` Pierre Ossman
  2005-08-17 11:45           ` Jörn Engel
  0 siblings, 1 reply; 9+ messages in thread
From: Pierre Ossman @ 2005-08-17 11:35 UTC (permalink / raw)
  To: Jörn Engel; +Cc: LKML

Jörn Engel wrote:

>On Tue, 16 August 2005 20:13:36 +0200, Jörn Engel wrote:
>  
>
>>Yes.  Most filesystems expect to find either 1) old data or 2) new
>>data.  Blocks full of 0xff are non-expected.
>>    
>>
>
>Maybe this isn't obvious.  Because of this expectation, it is
>absolutely not safe to pre-erase blocks, just because the fs will
>write them anyway.  Unless you can guarantee that the write will
>always succeed, even in case of power outage, you just broke the
>expectation.
>
>  
>

Darn. I suspected as much. I guess the erase function will have to be
scrapped for now...

Whilst we're on the subject, do the filesystems assume that the device
can tell them exactly where the write failed? I.e. if the driver knows
that 5 sectors were written correctly, but that it failed somewhere
beyond that. It might have failed at sector 6, but it might also have
failed at sector 10. The assumption that sectors contain either old or
new data is still true, we're just unsure which. This can be the case
when you feed a controller a lot of data and it can only report back
success or failure.

>Fixing all filesystems is also not an option, even ignoring the
>question whether such a change would be a fix, a change of behaviour
>or a plain bug.
>
>So the only remaining option is to add a new interface that lets
>filesystems decide to support pre-erase in some form.  And one such
>interface would be the "forget" operation.  Nice attribute of forget
>is the fact that it would also help some FTL layers in the kernel.
>There is nothing MMC-specific about it.
>
>  
>

A bit too much work for me right now. But I'll be there with my erase
patch when someone implements it. :)

Rgds
Pierre


* Re: Flash erase groups and filesystems
  2005-08-17 11:35         ` Pierre Ossman
@ 2005-08-17 11:45           ` Jörn Engel
  0 siblings, 0 replies; 9+ messages in thread
From: Jörn Engel @ 2005-08-17 11:45 UTC (permalink / raw)
  To: Pierre Ossman; +Cc: LKML

On Wed, 17 August 2005 13:35:11 +0200, Pierre Ossman wrote:
> 
> Whilst we're on the subject, do the filesystems assume that the device
> can tell them exactly where the write failed? I.e. if the driver knows
> that 5 sectors were written correctly, but that it failed somewhere
> beyond that. It might have failed at sector 6, but it might also have
> failed at sector 10. The assumption that sectors contain either old or
> new data is still true, we're just unsure which. This can be the case
> when you feed a controller a lot of data and it can only report back
> success or failure.

Not really.  In the most common case, things have failed because the
system died unexpectedly, either through power loss or kernel bugs or
the like.  After such an unclean unmount, a journal replay or fsck,
depending on the fs type, will fix things for you.  That works without
any knowledge of where the last write failed.

If the error is really an IO error, the behaviour is heavily dependent
on the fs you used.  Ext[23] will usually remount the fs read-only, so
you can hopefully retrieve all your data from the failing "hard
drive".  In that case, again, it doesn't matter much where things
broke.

> >So the only remaining option is to add a new interface that lets
> >filesystems decide to support pre-erase in some form.  And one such
> >interface would be the "forget" operation.  Nice attribute of forget
> >is the fact that it would also help some FTL layers in the kernel.
> >There is nothing MMC-specific about it.
> 
> A bit too much work for me right now. But I'll be there with my erase
> patch when someone implements it. :)

Good to know.

Jörn

-- 
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scaramouche


* Re: Flash erase groups and filesystems
  2005-08-16 17:09   ` Pierre Ossman
  2005-08-16 18:13     ` Jörn Engel
@ 2005-08-17 14:35     ` Pavel Machek
  2005-08-18 11:09       ` linux-os (Dick Johnson)
  1 sibling, 1 reply; 9+ messages in thread
From: Pavel Machek @ 2005-08-17 14:35 UTC (permalink / raw)
  To: Pierre Ossman; +Cc: Jörn Engel, LKML

Hi!

> >Question came up before, albeit with a different phrasing.  One
> >possible approach to benefit from this ability would be to create a
> >"forget" operation.  When a filesystem already knows that some data is
> >unneeded (after a truncate or erase operation), it will ask the device
> >to forget previously occupied blocks.
> >
> >The device then has the _option_ of handling the forget operation.
> >Further reads on these blocks may return random data.
> >
> >And since no one stepped up to implement this yet, you can still get
> >all the fame and glory yourself! ;)
> >  
> >
> 
> I'm not sure we're talking about the same thing. I'm not suggesting new
> features in the VFS layer. I want to know if something breaks if I
> implement this erase feature in the MMC layer. In essence the file
> system has marked the sectors as "forget" by issuing a write to them.
> The question is if it is assumed that they are unchanged if the write
> fails half-way through.

Journaling filesystems may not like finding 0xff's all over their journal...
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         



* Re: Flash erase groups and filesystems
  2005-08-17 14:35     ` Pavel Machek
@ 2005-08-18 11:09       ` linux-os (Dick Johnson)
  0 siblings, 0 replies; 9+ messages in thread
From: linux-os (Dick Johnson) @ 2005-08-18 11:09 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Pierre Ossman, Jörn Engel, LKML

On Wed, 17 Aug 2005, Pavel Machek wrote:

> Hi!
>
>>> Question came up before, albeit with a different phrasing.  One
>>> possible approach to benefit from this ability would be to create a
>>> "forget" operation.  When a filesystem already knows that some data is
>>> unneeded (after a truncate or erase operation), it will ask the device
>>> to forget previously occupied blocks.
>>>
>>> The device then has the _option_ of handling the forget operation.
>>> Further reads on these blocks may return random data.
>>>
>>> And since no one stepped up to implement this yet, you can still get
>>> all the fame and glory yourself! ;)
>>>
>> I'm not sure we're talking about the same thing. I'm not suggesting new
>> features in the VFS layer. I want to know if something breaks if I
>> implement this erase feature in the MMC layer. In essence the file
>> system has marked the sectors as "forget" by issuing a write to them.
>> The question is if it is assumed that they are unchanged if the write
>> fails half-way through.
>
> Journaling filesystems may not like finding 0xff's all over their journal...
> --

Then they are broken. A file-system can't assume an unwritten block
contains anything of importance. The method of writing these devices
involves erasing a sector, then writing new data. The "commit" can't
have happened until the write succeeds.

Power can fail or the system can crash at any time. The fact that
a write is a two-step process must be hidden from the file-system
code.

Erasing a sector ahead of time is a waste of time. There is no way
that the writer can know that the sector was previously erased except
by reading the whole sector (a waste of time). Some "forget" would
be forgotten after a power failure because one can't write some
"forget" flag to the device except by erasing then writing a
sector.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.12 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.

