public inbox for linux-mtd@lists.infradead.org
* atomic file operations
@ 2005-03-22 21:57 Sergei Sharonov
  2005-03-23  9:39 ` Estelle HAMMACHE
  2005-03-25 16:18 ` Sergei Sharonov
  0 siblings, 2 replies; 13+ messages in thread
From: Sergei Sharonov @ 2005-03-22 21:57 UTC (permalink / raw)
  To: linux-mtd

Hello,

I am working on a logging application where a (large) log file is
appended with 1 kByte data chunks. I cannot afford to miss or duplicate
a chunk in case of a power failure. Kermit will be used to ensure
atomicity for incoming data chunks. Now, the question is: which file
operations on JFFS2 are guaranteed to be atomic/transactional?
Is a write of 1024 bytes atomic? 
Does it relate to the page size in any way? BTW I am using NAND and the page 
may vary between 512 and 2048 bytes depending on a device.
Is file rename atomic?
Other file operations?

Second issue: how badly will these small chunks affect my mount time?

Thanks in advance

Sergei Sharonov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: atomic file operations
  2005-03-22 21:57 atomic file operations Sergei Sharonov
@ 2005-03-23  9:39 ` Estelle HAMMACHE
  2005-03-23 20:50   ` Sergei Sharonov
  2005-03-25 16:18 ` Sergei Sharonov
  1 sibling, 1 reply; 13+ messages in thread
From: Estelle HAMMACHE @ 2005-03-23  9:39 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

Sergei Sharonov wrote:
> Is a write of 1024 bytes atomic?
> Does it relate to the page size in any way? BTW I am using NAND and the page
> may vary between 512 and 2048 bytes depending on a device.

No write operation is guaranteed to be atomic. Have a look at
jffs2_write_inode_range in write.c: if there is not enough space in
the current block for the whole data, it may be split into several
chunks. Additionally, write ops that overlap a cache-page boundary
(not a flash page) are always split at the page limit.

If you want to have atomic writes, you could:
1) Mandatorily: ensure that your application will not
issue write ops which overlap a page boundary. 
You should not tweak the JFFS2 code to write such 
overlapping nodes, otherwise you must also tweak 
the GC and it gets difficult.
2) Either tweak jffs2_write_inode_range to forbid 
splitting data which does not overlap a page boundary
or adjust JFFS2_MIN_DATA_LEN to reserve enough 
space (difficult to estimate maybe if you have
compression...).

The above tweaking should ensure that an input buffer
is written to JFFS2 FS as a single CRC-protected
data node.

You should be aware that on NAND flash JFFS2 uses
a (nand flash) page buffer (wbuf.c), which is flushed 
only on fsync/sync/umount. So even though your write
ops will be atomic (with above code tweaks), 
there is no guarantee that a buffer is effectively 
committed to flash when write() returns, because the
end of the data node may remain in the buffer.
If you want that also, you can tweak JFFS2 again 
by requiring a  wbuf flush after each "atomic write", 
or you can have your application call fsync after 
each write.

> Is file rename atomic?
See jffs2_rename in dir.c. There are two steps:
make the new hard link, remove the old hard link.
You may end up with two names for the same inode if
there is a powerdown, so no it is not atomic.

See dir.c, file.c, fs.c for other ops. Generally speaking
write_inode_range is not an atomic operation, write_dnode
and write_dirent are atomic ops. The order of operations
in a file-level operation should ensure global atomicity
in most cases. I don't know if there are other file-operations
besides rename which are not atomic.

> Second issue is: How badly these small chunks will affect my mount time?
There have been previous threads about this.
Some people proposed some (application-side) workaround, 
you can find it in the archive or maybe someone will point 
it to you.

Bye
Estelle


* Re: atomic file operations
  2005-03-23  9:39 ` Estelle HAMMACHE
@ 2005-03-23 20:50   ` Sergei Sharonov
  2005-03-24 10:11     ` Estelle HAMMACHE
  2005-03-24 21:59     ` David Woodhouse
  0 siblings, 2 replies; 13+ messages in thread
From: Sergei Sharonov @ 2005-03-23 20:50 UTC (permalink / raw)
  To: linux-mtd

Estelle,

thanks, appreciate your help.

> 
> Sergei Sharonov wrote:
> > Is a write of 1024 bytes atomic?
> > Does it relate to the page size in any way? BTW I am using NAND and the 
> > page may vary between 512 and 2048 bytes depending on a device.
> 
> No write operation is guaranteed to be atomic. Have a look
> at jffs2_write_inode_range in write.c : if there is not enough
> space in the current block for the whole data, it may be split
> into several chunks. Additionally write ops that overlap a
> cache page boundary (not a flash page) are always split at 
> the page limit.

That means that one write may have several CRCs corresponding to the
split chunks?

> If you want to have atomic writes, you could:
> 1) Mandatorily: ensure that your application will not
> issue write ops which overlap a page boundary. 
> You should not tweak the JFFS2 code to write such 
> overlapping nodes, otherwise you must also tweak 
> the GC and it gets difficult.
> 2) Either tweak jffs2_write_inode_range to forbid 
> splitting data which does not overlap a page boundary
> or adjust JFFS2_MIN_DATA_LEN to reserve enough 
> space (difficult to estimate maybe if you have
> compression...).
> 
> The above tweaking should ensure that an input buffer
> is written to JFFS2 FS as a single CRC-protected
> data node.

Ok, got that. Does not seem like a promising idea considering how fast
JFFS2 evolves and therefore how costly forking would be. Thanks for the
suggestion anyway.

> You should be aware that on NAND flash JFFS2 uses
> a (nand flash) page buffer (wbuf.c), which is flushed 
> only on fsync/sync/umount. So even though your write
> ops will be atomic (with above code tweaks), 
> there is no guarantee that a buffer is effectively 
> committed to flash when write() returns, because the
> end of the data node may remain in the buffer.
> If you want that also, you can tweak JFFS2 again 
> by requiring a  wbuf flush after each "atomic write", 
> or you can have your application call fsync after 
> each write.

Beg pardon if this is a FAQ, but if I open the file with the O_SYNC
flag, wouldn't that guarantee a synchronous write that does not
return until all the data is in flash?

> > Is file rename atomic?
> See jffs2_rename in dir.c. There are two steps:
> make the new hard link, remove the old hard link.
> You may end up with two names for the same inode if
> there is a powerdown, so no it is not atomic.

Did not see that coming. Usually people assume the rename operation
is atomic.

> > Second issue is: How badly these small chunks will affect my mount time?
> There have been previous threads about this.
> Some people proposed some (application-side) workaround, 
> you can find it in the archive or maybe someone will point 
> it to you.

I believe I saw a proposal to save small chunks as separate files, then
append them to a temp file and rename the temp file to the real log
file. The problems are (1) the log file is huge, (2) rename is not
atomic per your reply.
 
Sergei Sharonov


* Re: atomic file operations
  2005-03-23 20:50   ` Sergei Sharonov
@ 2005-03-24 10:11     ` Estelle HAMMACHE
  2005-03-24 10:53       ` Artem B. Bityuckiy
  2005-03-24 21:59     ` David Woodhouse
  1 sibling, 1 reply; 13+ messages in thread
From: Estelle HAMMACHE @ 2005-03-24 10:11 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

Hi Sergei,

more info below.

Sergei Sharonov wrote:
> > No write operation is guaranteed to be atomic. Have a look
> > at jffs2_write_inode_range in write.c : if there is not enough
> > space in the current block for the whole data, it may be split
> > into several chunks. Additionally write ops that overlap a
> > cache page boundary (not a flash page) are always split at
> > the page limit.
> 
> That means that one write may have several CRCs corresponding to the
> split chunks?

Yes, when I write that the input buffer is split it means that
several data nodes are written to the flash - each data node
is an independent piece of data complete with header and CRC.
If a data node is only partly written to flash, its CRC check
will fail so the partial data will not be taken into account
when building the file at next mount. In this sense each data
node is an atomic write - but JFFS2 does not guarantee that
a write() input buffer will be written as a single data node.


> > If you want to have atomic writes, you could:
> > 1) Mandatorily: ensure that your application will not
> > issue write ops which overlap a page boundary.
> > You should not tweak the JFFS2 code to write such
> > overlapping nodes, otherwise you must also tweak
> > the GC and it gets difficult.
> > 2) Either tweak jffs2_write_inode_range to forbid
> > splitting data which does not overlap a page boundary
> > or adjust JFFS2_MIN_DATA_LEN to reserve enough
> > space (difficult to estimate maybe if you have
> > compression...).
> >
> > The above tweaking should ensure that an input buffer
> > is written to JFFS2 FS as a single CRC-protected
> > data node.
> 
> Ok, got that. Does not seem like a promising idea considering how fast
> JFFS2 evolves and therefore how costly forking would be. Thanks for
> the suggestion anyway.

You can always submit your patch to the list and then either someone
will merge it for you, or you can ask for a CVS account to do it
yourself. It could be a conditionally-compiled option. Or maybe there
is an appropriate fcntl or open flag that could be implemented in
JFFS2? Anyway I think it would be an interesting option to have. The
main problem is the cache page boundary, which would require more
thought to solve and lots of testing...

> > You should be aware that on NAND flash JFFS2 uses
> > a (nand flash) page buffer (wbuf.c), which is flushed
> > only on fsync/sync/umount. So even though your write
> > ops will be atomic (with above code tweaks),
> > there is no guarantee that a buffer is effectively
> > committed to flash when write() returns, because the
> > end of the data node may remain in the buffer.
> > If you want that also, you can tweak JFFS2 again
> > by requiring a  wbuf flush after each "atomic write",
> > or you can have your application call fsync after
> > each write.
> 
> Beg pardon if this is a FAQ, but if I open the file with the O_SYNC
> flag, wouldn't that guarantee a synchronous write that does not
> return until all the data is in flash?

I am not familiar with Linux VFS, however from previous 
discussion on the list I was led to understand that
it doesn't work with JFFS2. Probably you could implement 
O_SYNC yourself without too much trouble.

bye
Estelle


* Re: atomic file operations
  2005-03-24 10:11     ` Estelle HAMMACHE
@ 2005-03-24 10:53       ` Artem B. Bityuckiy
  2005-03-24 11:59         ` Estelle HAMMACHE
                           ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Artem B. Bityuckiy @ 2005-03-24 10:53 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

Estelle HAMMACHE wrote:
> Yes, when I write that the input buffer is split it means that
> several data nodes are written to the flash - each data node
> is an independent piece of data complete with header and CRC.
> If a data node is only partly written to flash, its CRC check
> will fail so the partial data will not be taken into account
> when building the file at next mount. In this sense each data
> node is an atomic write - but JFFS2 does not guarantee that
> a write() input buffer will be written as a single data node.

But if you:

1. write only 0-PAGE_SIZE bytes;
2. do not overlap n*PAGE_SIZE borders (n is 1,2, ...)
3. do fsync after write.

then you have the guarantee that you either have written all or
nothing. JFFS2 does guarantee that due to its implementation.

To put it differently: write in small pieces, do not overlap page
boundaries, do fsync, and be happy with atomic writes :-)

Examples:
(assume PAGE_SIZE is 4K)

write 4K to offset 16K is OK
write 1 byte anywhere is OK
write 10 bytes at offset 4095 is not OK.

HTH.

-- 
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: atomic file operations
  2005-03-24 10:53       ` Artem B. Bityuckiy
@ 2005-03-24 11:59         ` Estelle HAMMACHE
  2005-03-24 12:17           ` Artem B. Bityuckiy
  2005-03-24 17:28         ` Sergei Sharonov
  2005-03-24 22:00         ` David Woodhouse
  2 siblings, 1 reply; 13+ messages in thread
From: Estelle HAMMACHE @ 2005-03-24 11:59 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: linux-mtd

Hi Artem,

"Artem B. Bityuckiy" wrote:
> 
> Estelle HAMMACHE wrote:
> > Yes, when I write that the input buffer is split it means that
> > several data nodes are written to the flash - each data node
> > is an independent piece of data complete with header and CRC.
> > If a data node is only partly written to flash, its CRC check
> > will fail so the partial data will not be taken into account
> > when building the file at next mount. In this sense each data
> > node is an atomic write - but JFFS2 does not guarantee that
> > a write() input buffer will be written as a single data node.
> 
> But if you:
> 
> 1. write only 0-PAGE_SIZE bytes;
> 2. do not overlap n*PAGE_SIZE borders (n is 1,2, ...)
> 3. do fsync after write.
> 
> then you have the guarantee that you either have written all or
> nothing. JFFS2 does guarantee that due to its implementation.

This is not what I understand from jffs2_write_inode_range.
When you reach the end of the block your data may be split at 
any offset because jffs2_reserve_space may return more than 
JFFS2_MIN_DATA_LEN but less than the data size (or not enough
space to compress the whole input buffer if you use compression).
Or is there some trick here that I don't understand ??

bye
Estelle


* Re: atomic file operations
  2005-03-24 11:59         ` Estelle HAMMACHE
@ 2005-03-24 12:17           ` Artem B. Bityuckiy
  0 siblings, 0 replies; 13+ messages in thread
From: Artem B. Bityuckiy @ 2005-03-24 12:17 UTC (permalink / raw)
  To: Estelle HAMMACHE; +Cc: Artem B. Bityuckiy, linux-mtd

> Hi Artem,
Hi Estelle,

> This is not what I understand from jffs2_write_inode_range.
> When you reach the end of the block your data may be split at 
> any offset because jffs2_reserve_space may return more than 
> JFFS2_MIN_DATA_LEN but less than the data size (or not enough
> space to compress the whole input buffer if you use compression).
> Or is there some trick here that I don't understand ??

Hmm, yes, you're right. jffs2_write_inode_range may do further splits;
I hadn't mentioned this. I was thinking at a higher level - commit_write
is called only once for the transactions I mentioned. But even in these
cases jffs2_write_inode_range spoils the atomicity.

Then I must add one more requirement:
4. Transactions must be <= 128 bytes.

Or JFFS2_MIN_DATA_LEN should be re-defined.

Thus, some atomic writes are still possible :-)

Better now? :-)

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: atomic file operations
  2005-03-24 10:53       ` Artem B. Bityuckiy
  2005-03-24 11:59         ` Estelle HAMMACHE
@ 2005-03-24 17:28         ` Sergei Sharonov
  2005-03-24 19:32           ` Artem B. Bityuckiy
  2005-03-24 22:00         ` David Woodhouse
  2 siblings, 1 reply; 13+ messages in thread
From: Sergei Sharonov @ 2005-03-24 17:28 UTC (permalink / raw)
  To: linux-mtd

Artem B. Bityuckiy <dedekind <at> yandex.ru> writes:


> 1. write only 0-PAGE_SIZE bytes;
> 2. do not overlap n*PAGE_SIZE borders (n is 1,2, ...)
> 3. do fsync after write.
........
> Examples:
> (assume PAGE_SIZE is 4K)
> 
> write 4K to offset 16K is OK
> write 1 byte anywhere is OK
> write 10 bytes at offset 4095 is not OK.

Artem, when figuring out if the write goes over the page boundary you
did not say anything about the CRC and other non-user data. Should it
be taken into account too, or am I missing something?

Sergei Sharonov


* Re: atomic file operations
  2005-03-24 17:28         ` Sergei Sharonov
@ 2005-03-24 19:32           ` Artem B. Bityuckiy
  0 siblings, 0 replies; 13+ messages in thread
From: Artem B. Bityuckiy @ 2005-03-24 19:32 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

> Artem, when figuring out if the write goes over the page boundary you
> did not say anything about the CRC and other non-user data. Should it
> be taken into account too, or am I missing something?
>

Well, page boundaries aren't the problem. We want a write operation
which goes to flash as one node.

There are two points of splitting:
1. The Linux page cache splits data into PAGE_SIZE chunks.
2. jffs2_write_inode_range may split further. This is what I missed
the first time.

Yes, jffs2_write_inode_range will not split if the node to write is
smaller than sizeof(struct jffs2_raw_inode) + JFFS2_MIN_DATA_LEN.

So, sizeof(struct jffs2_raw_inode) includes all the "CRC and other
non-user data", and you may write JFFS2_MIN_DATA_LEN bytes atomically.

So, I'll write the rules again:
1. write only 0-JFFS2_MIN_DATA_LEN bytes;
2. do not overlap n*PAGE_SIZE borders (n = 1, 2, ...);
3. do fsync after write.

--
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: atomic file operations
  2005-03-23 20:50   ` Sergei Sharonov
  2005-03-24 10:11     ` Estelle HAMMACHE
@ 2005-03-24 21:59     ` David Woodhouse
  1 sibling, 0 replies; 13+ messages in thread
From: David Woodhouse @ 2005-03-24 21:59 UTC (permalink / raw)
  To: Sergei Sharonov; +Cc: linux-mtd

On Wed, 2005-03-23 at 20:50 +0000, Sergei Sharonov wrote:
> I believe I saw a proposal to save small chunks as separate files,
> then append them as a temp file and rename temp file to real log
> file. 
> The problems are (1) the log file is huge (2) rename is not atomic per
> your reply.

The important part of rename is atomic. If you have 'log_file', and then
you create 'log_file.new' and rename that to 'log_file', then there is
never an instant where 'log_file' does not exist; it goes directly from
pointing to one inode, to pointing to the other.

-- 
dwmw2


* Re: atomic file operations
  2005-03-24 10:53       ` Artem B. Bityuckiy
  2005-03-24 11:59         ` Estelle HAMMACHE
  2005-03-24 17:28         ` Sergei Sharonov
@ 2005-03-24 22:00         ` David Woodhouse
  2005-03-25  8:18           ` Artem B. Bityuckiy
  2 siblings, 1 reply; 13+ messages in thread
From: David Woodhouse @ 2005-03-24 22:00 UTC (permalink / raw)
  To: Artem B. Bityuckiy; +Cc: linux-mtd

On Thu, 2005-03-24 at 13:53 +0300, Artem B. Bityuckiy wrote:
> then you have the guarantee that you either have written all or
> nothing. JFFS2 does guarantee that due to its implementation.

JFFS2 might happen to _do_ that due to its implementation, but it's not
a guarantee. POSIX doesn't require that writes are atomic w.r.t power
failure.

-- 
dwmw2


* Re: atomic file operations
  2005-03-24 22:00         ` David Woodhouse
@ 2005-03-25  8:18           ` Artem B. Bityuckiy
  0 siblings, 0 replies; 13+ messages in thread
From: Artem B. Bityuckiy @ 2005-03-25  8:18 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Artem B. Bityuckiy, linux-mtd

On Thu, 2005-03-24 at 22:00 +0000, David Woodhouse wrote:
> On Thu, 2005-03-24 at 13:53 +0300, Artem B. Bityuckiy wrote:
> > then you have the guarantee that you either have written all or
> > nothing. JFFS2 does guarantee that due to its implementation.
> 
> JFFS2 might happen to _do_ that due to its implementation, but it's not
> a guarantee. POSIX doesn't require that writes are atomic w.r.t power
> failure.

Absolutely. I noted that this just happens to be owing to JFFS2's
implementation. Nothing else.

-- 
Best Regards,
Artem B. Bityuckiy,
St.-Petersburg, Russia.


* Re: atomic file operations
  2005-03-22 21:57 atomic file operations Sergei Sharonov
  2005-03-23  9:39 ` Estelle HAMMACHE
@ 2005-03-25 16:18 ` Sergei Sharonov
  1 sibling, 0 replies; 13+ messages in thread
From: Sergei Sharonov @ 2005-03-25 16:18 UTC (permalink / raw)
  To: linux-mtd

Hi,

I was browsing the mailing lists and found that this topic has already
generated a fair bit of discussion, started by Vipin Malik around 2001.
The consensus then was that transactions are not the responsibility of
the filesystem but of the user-space application. Nevertheless this
question keeps coming back again and again, because obviously people
with "real life" applications want a way to store data in a
transactional manner. Nowadays there are also filesystems out there
that offer transactional functionality to address this need, despite
the fact that "POSIX does not require it" and such. Obviously there are
also transactional databases out there (duh!), but that does not help
much if one needs to expose the data as a filesystem.

I apologize, since it now appears that the original post had little to
do with JFFS2 ;-) so the question may be rephrased as: is anybody aware
of a generic transaction library that can be run on top of JFFS2 to
provide transactions for filesystem operations? Or is it every man for
himself?


Sergei Sharonov


end of thread, other threads:[~2005-03-25 16:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-03-22 21:57 atomic file operations Sergei Sharonov
2005-03-23  9:39 ` Estelle HAMMACHE
2005-03-23 20:50   ` Sergei Sharonov
2005-03-24 10:11     ` Estelle HAMMACHE
2005-03-24 10:53       ` Artem B. Bityuckiy
2005-03-24 11:59         ` Estelle HAMMACHE
2005-03-24 12:17           ` Artem B. Bityuckiy
2005-03-24 17:28         ` Sergei Sharonov
2005-03-24 19:32           ` Artem B. Bityuckiy
2005-03-24 22:00         ` David Woodhouse
2005-03-25  8:18           ` Artem B. Bityuckiy
2005-03-24 21:59     ` David Woodhouse
2005-03-25 16:18 ` Sergei Sharonov
