Atomic file data replace API

linux-btrfs.vger.kernel.org archive mirror

* Atomic file data replace API
@ 2011-01-06 20:01 Olaf van der Spek
  2011-01-07 13:55 ` Mike Fleetwood
  2011-01-07 14:58 ` Chris Mason
  0 siblings, 2 replies; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-06 20:01 UTC (permalink / raw)
  To: linux-btrfs

Hi,

Does btrfs support atomic file data replaces? Basically, the atomic
variant of this:
// old state
open(O_TRUNC)
write() // 0+ times
close()
// new state
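
For concreteness, the pattern above can be written out in C roughly as follows. This is a minimal sketch, not code from any project; the function name replace_in_place is a hypothetical helper. It only makes the non-atomic window visible: the old data is gone as soon as open() returns.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Replace the contents of `path` in place (hypothetical helper).
 * Not crash-safe: O_TRUNC discards the old data immediately, so a crash
 * before the new data reaches disk can leave an empty or partial file. */
int replace_in_place(const char *path, const char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < len) {                 /* write() may be short; loop */
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }
    return close(fd);
}
```
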
-- 
Olaf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-06 20:01 Atomic file data replace API Olaf van der Spek
@ 2011-01-07 13:55 ` Mike Fleetwood
  2011-01-07 14:01   ` Olaf van der Spek
  2011-01-07 14:58 ` Chris Mason
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Fleetwood @ 2011-01-07 13:55 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: linux-btrfs

On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> Hi,
>
> Does btrfs support atomic file data replaces?

Hi Olaf,

Yes btrfs does support atomic replace, since kernel 2.6.30 circa June 2009. [1]

Special handling was added to ext3, ext4, btrfs (and probably other
Linux FSs) for your replace-via-truncate and the alternative
replace-via-rename application patterns.  Try reading "Delayed
allocation and the zero-length file problem" article and comments by
Ted Ts'o for further discussion. [2]

Mike
-- 
[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5a3f23d515a2ebf0c750db80579ca57b28cbce6d
[2] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 13:55 ` Mike Fleetwood
@ 2011-01-07 14:01   ` Olaf van der Spek
  2011-01-07 14:10     ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 14:01 UTC (permalink / raw)
  To: Mike Fleetwood; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 2:55 PM, Mike Fleetwood
<mike.fleetwood@googlemail.com> wrote:
> On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote:
>> Hi,
>>
>> Does btrfs support atomic file data replaces?
>
> Hi Olaf,
>
> Yes btrfs does support atomic replace, since kernel 2.6.30 circa June 2009. [1]
>
> Special handling was added to ext3, ext4, btrfs (and probably other
> Linux FSs) for your replace-via-truncate and the alternative
> replace-via-rename application patterns.  Try reading "Delayed
> allocation and the zero-length file problem" article and comments by
> Ted Ts'o for further discussion. [2]

According to Ted, via-truncate and via-rename are unsafe. Only fsync
followed by rename is safe.
The disadvantages of rename are that it resets the file owner (if
non-root) and has issues with meta-data and other stuff.

My proposal was for an open flag, O_ATOMIC, to be introduced to tell
the FS the whole file update should be done atomically.
Ted says this is too hard in ext4, so I was wondering if this would be
possible in btrfs.
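
For reference, the fsync-then-rename sequence mentioned above can be sketched in C like this. It is a minimal sketch under stated assumptions: the helper name replace_via_rename and the ".tmp" suffix are hypothetical choices, and error handling is kept short.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the new contents to a temporary file next to `path`, flush it to
 * disk with fsync(), then rename() it over `path` (hypothetical helper). */
int replace_via_rename(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    assert(path != NULL && data != NULL);

    /* The ".tmp" suffix is an arbitrary naming choice for this sketch. */
    int w = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (w < 0 || w >= (int)sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        off += (size_t)n;
    }
    if (fsync(fd) != 0)                 /* data on disk before the rename */
        goto fail;
    if (close(fd) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

Note that the temp file is a new inode created by the calling user, which is where the owner and meta-data reset comes from. Some applications also fsync the containing directory afterwards; that step is left out here.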

Olaf
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 14:01   ` Olaf van der Spek
@ 2011-01-07 14:10     ` Olaf van der Spek
  0 siblings, 0 replies; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 14:10 UTC (permalink / raw)
  To: Mike Fleetwood; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 3:01 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> According to Ted, via-truncate and via-rename are unsafe. Only fsync,
> rename is safe.
> Disadvantage of rename is resetting file owner (if non-root), having
> issues with meta-data and other stuff.
>
> My proposal was for an open flag, O_ATOMIC, to be introduced to tell
> the FS the whole file update should be done atomically.
> Ted says this is too hard in ext4, so I was wondering if this would be
> possible in btrfs.

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2082
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2089
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2090

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-06 20:01 Atomic file data replace API Olaf van der Spek
  2011-01-07 13:55 ` Mike Fleetwood
@ 2011-01-07 14:58 ` Chris Mason
  2011-01-07 15:01   ` Olaf van der Spek
  2011-01-08  1:11   ` Phillip Susi
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Mason @ 2011-01-07 14:58 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: linux-btrfs

Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
> Hi,
> 
> Does btrfs support atomic file data replaces? Basically, the atomic
> variant of this:
> // old stage
> open(O_TRUNC)
> write() // 0+ times
> close()
> // new state

Yes and no.  We have a best effort mechanism where we try to guess that
since you've done this truncate and the write that you want the writes
to show up quickly.  But it's a guess.

The problem is the write() // 0+ times.  The kernel has no idea what
new result you want the file to contain because the application isn't
telling us.

What btrfs can do (but we haven't yet implemented) is make sure that the
results of a single write call are on disk atomically, even if they are
replacing existing bytes in the file.

Because we cow and because we don't update metadata pointers until the
IO is complete, we can wait until all the IO for a given write call is
on disk before we update any of the metadata.

This isn't hard, it's on my TODO list.
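
The idea of "write the new data first, switch the pointer last" can be illustrated with a toy in-memory model. This is not btrfs code and makes no claim about how btrfs implements it; struct toy_file and toy_cow_write are hypothetical names, and a malloc'd buffer stands in for an on-disk extent.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a file: the "metadata" is just a pointer to the
 * currently committed data. */
struct toy_file {
    char  *extent;
    size_t len;
};

/* Copy-on-write replace: build the new extent off to the side and switch
 * the pointer only once it is complete. If building it fails, the old
 * contents are still intact. */
int toy_cow_write(struct toy_file *f, const char *data, size_t len)
{
    assert(f != NULL && data != NULL);

    char *fresh = malloc(len ? len : 1);
    if (fresh == NULL)
        return -1;                /* old extent untouched */
    memcpy(fresh, data, len);     /* stands in for waiting on the data IO */

    char *old = f->extent;
    f->extent = fresh;            /* the single "metadata pointer" update */
    f->len = len;
    free(old);                    /* old extent can now be reclaimed */
    return 0;
}
```

A reader of the toy file sees either the old extent or the new one, never a mix, which is the property being described for a single write call.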

-chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 14:58 ` Chris Mason
@ 2011-01-07 15:01   ` Olaf van der Spek
  2011-01-07 15:05     ` Chris Mason
  2011-01-08  1:11   ` Phillip Susi
  1 sibling, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 15:01 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
>> Hi,
>>
>> Does btrfs support atomic file data replaces? Basically, the atomic
>> variant of this:
>> // old stage
>> open(O_TRUNC)
>> write() // 0+ times
>> close()
>> // new state
>
> Yes and no.  We have a best effort mechanism where we try to guess that
> since you've done this truncate and the write that you want the writes
> to show up quickly.  But its a guess.
>
> The problem is the write() // 0+ times.  The kernel has no idea what
> new result you want the file to contain because the application isn't
> telling us.

Isn't it safe for the kernel to wait until the first write or close
before writing anything to disk?

> What btrfs can do (but we haven't yet implemented) is make sure that the
> results of a single write file are on disk atomically, even if they are
> replacing existing bytes in the file.
>
> Because we cow and because we don't update metadata pointers until the
> IO is complete, we can wait until all the IO for a given write call is
> on disk before we update any of the metadata.
>
> This isn't hard, it's on my TODO list.

What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel?

Olaf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 15:01   ` Olaf van der Spek
@ 2011-01-07 15:05     ` Chris Mason
  2011-01-07 15:08       ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Mason @ 2011-01-07 15:05 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: linux-btrfs

Excerpts from Olaf van der Spek's message of 2011-01-07 10:01:59 -0500:
> On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500:
> >> Hi,
> >>
> >> Does btrfs support atomic file data replaces? Basically, the atomic
> >> variant of this:
> >> // old stage
> >> open(O_TRUNC)
> >> write() // 0+ times
> >> close()
> >> // new state
> >
> > Yes and no.  We have a best effort mechanism where we try to guess that
> > since you've done this truncate and the write that you want the writes
> > to show up quickly.  But its a guess.
> >
> > The problem is the write() // 0+ times.  The kernel has no idea what
> > new result you want the file to contain because the application isn't
> > telling us.
>
> Isn't it safe for the kernel to wait until the first write or close
> before writing anything to disk?

I'm afraid not.  Picture an application that opens a thousand files and
writes 1MB to each of them, and then doesn't close any.  If we waited
until close, you'd have 1GB of memory pinned or staged somehow.

>
> > What btrfs can do (but we haven't yet implemented) is make sure that the
> > results of a single write file are on disk atomically, even if they are
> > replacing existing bytes in the file.
> >
> > Because we cow and because we don't update metadata pointers until the
> > IO is complete, we can wait until all the IO for a given write call is
> > on disk before we update any of the metadata.
> >
> > This isn't hard, it's on my TODO list.
>
> What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel?

We can't guess beyond a single write call.  Otherwise we get into
the problem above where an application can force the kernel to wait
forever.  I'm not against O_ATOMIC to enable the new btrfs
functionality, but it will still be limited to one write.

-chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 15:05     ` Chris Mason
@ 2011-01-07 15:08       ` Olaf van der Spek
  2011-01-07 15:13         ` Chris Mason
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 15:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> > The problem is the write() // 0+ times.  The kernel has no idea what
>> > new result you want the file to contain because the application isn't
>> > telling us.
>>
>> Isn't it safe for the kernel to wait until the first write or close
>> before writing anything to disk?
>
> I'm afraid not.  Picture an application that opens a thousand files and
> writes 1MB to each of them, and then didn't close any.  If we waited
> until close, you'd have 1GB of memory pinned or staged somehow.

That's not what I asked. ;)
I asked to wait until the first write (or close). That way, you don't
get unintentional empty files.
One step further, you don't have to keep the data in memory, you're
free to write them to disk. You just wouldn't update the meta-data
(yet).

>> > This isn't hard, it's on my TODO list.
>>
>> What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel?
>
> We can't guess beyond a single write call.  Otherwise we get into
> the problem above where an application can force the kernel to wait
> forever.  I'm not against O_ATOMIC to enable the new btrfs
> functionality, but it will still be limited to one write.
>
> -chris
>

-- 
Olaf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 15:08       ` Olaf van der Spek
@ 2011-01-07 15:13         ` Chris Mason
  2011-01-07 15:17           ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Mason @ 2011-01-07 15:13 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: linux-btrfs

Excerpts from Olaf van der Spek's message of 2011-01-07 10:08:24 -0500:
> On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote:
> >> > The problem is the write() // 0+ times.  The kernel has no idea what
> >> > new result you want the file to contain because the application isn't
> >> > telling us.
> >>
> >> Isn't it safe for the kernel to wait until the first write or close
> >> before writing anything to disk?
> >
> > I'm afraid not.  Picture an application that opens a thousand files and
> > writes 1MB to each of them, and then didn't close any.  If we waited
> > until close, you'd have 1GB of memory pinned or staged somehow.
>
> That's not what I asked. ;)
> I asked to wait until the first write (or close). That way, you don't
> get unintentional empty files.
> One step further, you don't have to keep the data in memory, you're
> free to write them to disk. You just wouldn't update the meta-data
> (yet).

Sorry ;) Picture an application that truncates 1024 files without closing any
of them.  Basically any operation that includes the kernel waiting for
applications because they promise to do something soon is a denial of
service attack, or a really easy way to run out of memory on the box.

-chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 15:13         ` Chris Mason
@ 2011-01-07 15:17           ` Olaf van der Spek
  2011-01-07 16:12             ` Chris Mason
  2011-01-07 16:32             ` Massimo Maggi
  0 siblings, 2 replies; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 15:17 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> That's not what I asked. ;)
>> I asked to wait until the first write (or close). That way, you don't
>> get unintentional empty files.
>> One step further, you don't have to keep the data in memory, you're
>> free to write them to disk. You just wouldn't update the meta-data
>> (yet).
>
> Sorry ;) Picture an application that truncates 1024 files without closing any
> of them.  Basically any operation that includes the kernel waiting for
> applications because they promise to do something soon is a denial of
> service attack, or a really easy way to run out of memory on the box.

I'm not sure why you would run out of memory in that case.

O_ATOMIC would replace the rename workaround (write a temp file, then
rename it), with advantages like a much simpler API, no issues with
resetting meta-data, no issues with temp files and maybe better
performance.

Olaf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 15:17           ` Olaf van der Spek
@ 2011-01-07 16:12             ` Chris Mason
  2011-01-07 16:19               ` Olaf van der Spek
  2011-01-07 16:26               ` Hubert Kario
  2011-01-07 16:32             ` Massimo Maggi
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Mason @ 2011-01-07 16:12 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: linux-btrfs

Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
> On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
> >> That's not what I asked. ;)
> >> I asked to wait until the first write (or close). That way, you don't
> >> get unintentional empty files.
> >> One step further, you don't have to keep the data in memory, you're
> >> free to write them to disk. You just wouldn't update the meta-data
> >> (yet).
> >
> > Sorry ;) Picture an application that truncates 1024 files without closing any
> > of them.  Basically any operation that includes the kernel waiting for
> > applications because they promise to do something soon is a denial of
> > service attack, or a really easy way to run out of memory on the box.
>
> I'm not sure why you would run out of memory in that case.

Well, let's make sure I've got a good handle on the proposed interface:

1) fd = open(some_file, O_ATOMIC)
2) truncate(fd, 0)
3) write(fd, new data)

The semantics are that we promise not to let the truncate hit the disk
until the application does the write.

We have a few choices on how we do this:

1) Leave the disk untouched, but keep something in memory that says this
inode is really truncated

2) Record on disk that we've done our atomic truncate but it is still
pending.  We'd need some way to remove or invalidate this record after a
crash.

3) Go ahead and do the operation but don't allow the transaction to
commit until the write is done.

option #1: keep something in memory.  Well, any time we have a
requirement to pin something in memory until userland decides to do a
write, we risk oom.

option #2: disk format change.  Actually somewhat complex because if we
haven't crashed, we need to be able to read the inode in again without
invalidating the record but if we do crash, we have to invalidate the
record.  Not impossible, but not trivial.

option #3: Pin the whole transaction.  Depending on the FS this may be
impossible.  Certain operations require us to commit the transaction to
reclaim space, and we cannot allow userland to put that on hold without
deadlocking.

What most people don't realize about the crash safe filesystems is they
don't have fine grained transactions.  There is one single transaction
for all the operations done.  This is mostly because it is less complex
and much faster, but it also makes any 'pin the whole transaction' type
system unusable.

-chris

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 16:12             ` Chris Mason
@ 2011-01-07 16:19               ` Olaf van der Spek
  2011-01-07 16:26               ` Hubert Kario
  1 sibling, 0 replies; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 16:19 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> I'm not sure why you would run out of memory in that case.
>
> Well, lets make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)

No, O_TRUNC should be used in open. Maybe it works with a separate truncate too.

> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending.  We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> option #1: keep something in memory.  Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk oom.

Since the file is open, you have to keep something in memory anyway,
right? Adding a bit (or bool) does not make a difference IMO.
Isn't this comparable to opening a temp file?

> option #2: disk format change.  Actually somewhat complex because if we
> haven't crashed, we need to be able to read the inode in again without
> invalidating the record but if we do crash, we have to invalidate the
> record.  Not impossible, but not trivial.
>
> option #3: Pin the whole transaction.  Depending on the FS this may be
> impossible.  Certain operations require us to commit the transaction to
> reclaim space, and we cannot allow userland to put that on hold without
> deadlocking.

#1 is the only one that makes sense.

> What most people don't realize about the crash safe filesystems is they
> don't have fine grained transactions.  There is one single transaction
> for all the operations done.  This is mostly because it is less complex
> and much faster, but it also makes any 'pin the whole transaction' type
> system unusable.

AFAIK the cost is mostly more complex code / runtime. The cost is not
disk performance.

-- 
Olaf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 16:12             ` Chris Mason
  2011-01-07 16:19               ` Olaf van der Spek
@ 2011-01-07 16:26               ` Hubert Kario
  2011-01-07 19:29                 ` Chris Mason
  1 sibling, 1 reply; 28+ messages in thread
From: Hubert Kario @ 2011-01-07 16:26 UTC (permalink / raw)
  To: Chris Mason; +Cc: Olaf van der Spek, linux-btrfs

On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
> Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
> > On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > >> That's not what I asked. ;)
> > >> I asked to wait until the first write (or close). That way, you don't
> > >> get unintentional empty files.
> > >> One step further, you don't have to keep the data in memory, you're
> > >> free to write them to disk. You just wouldn't update the meta-data
> > >> (yet).
> > >
> > > Sorry ;) Picture an application that truncates 1024 files without
> > > closing any of them.  Basically any operation that includes the kernel
> > > waiting for applications because they promise to do something soon is
> > > a denial of service attack, or a really easy way to run out of memory
> > > on the box.
> >
> > I'm not sure why you would run out of memory in that case.
>
> Well, lets make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)
> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending.  We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> option #1: keep something in memory.  Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk oom.

Userland already has a file descriptor allocated (which can fail anyway
because of OOM), so I see no problem in increasing kernel memory usage by
4 bytes (if not less) just to note that the application wants to see
the file as truncated (1 bit) and the next write has to be atomic (2nd bit?).

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 15:17           ` Olaf van der Spek
  2011-01-07 16:12             ` Chris Mason
@ 2011-01-07 16:32             ` Massimo Maggi
  2011-01-07 16:34               ` Olaf van der Spek
  1 sibling, 1 reply; 28+ messages in thread
From: Massimo Maggi @ 2011-01-07 16:32 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: linux-btrfs

Are you suggesting the following?
1) fopen with O_TRUNC, O_ATOMIC: returns an fd to a temporary file.
2) The application writes to that fd, with one or more system calls, over a
short or a long time, at its discretion.
3) At fclose (or even at fsync), atomically swap the "data pointer" of the
"real file" with that of the "temp file", then delete the temp file,
transparently to userland (something similar to e4defrag).
Is this summary correct?

Massimo Maggi

On 07/01/2011 16:17, Olaf van der Spek wrote:
> On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
>>> That's not what I asked. ;)
>>> I asked to wait until the first write (or close). That way, you don't
>>> get unintentional empty files.
>>> One step further, you don't have to keep the data in memory, you're
>>> free to write them to disk. You just wouldn't update the meta-data
>>> (yet).
>> Sorry ;) Picture an application that truncates 1024 files without closing any
>> of them.  Basically any operation that includes the kernel waiting for
>> applications because they promise to do something soon is a denial of
>> service attack, or a really easy way to run out of memory on the box.
> I'm not sure why you would run out of memory in that case.
>
> O_ATOMIC would be the solution for the rename workaround: write temp
> file, rename
> With advantages like a way simpler API, no issues with resetting
> meta-data, no issues with temp file and maybe better performance.
>
> Olaf


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 16:32             ` Massimo Maggi
@ 2011-01-07 16:34               ` Olaf van der Spek
  2011-01-07 19:29                 ` Thomas Bellman
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-07 16:34 UTC (permalink / raw)
  To: Massimo Maggi; +Cc: linux-btrfs

On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote:
> Are you suggesting to do:
> 1)fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file
> 2)application writes to that fd, with one or more system calls, in a
> short time or in long time, at his will.
> 3)at fclose (or even at fsync ) atomically swap "data pointer" of "real
> file" with "temp file", then delete temp.In a transparent mode to
> userland.  (something similar to e4defrag).
> Is this sum up correct?

Almost. Swap should probably not be done at fsync time.
Other open references (for example running executables) should be swapped too.

The new-file case has to be handled too.

Olaf

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 16:34               ` Olaf van der Spek
@ 2011-01-07 19:29                 ` Thomas Bellman
  2011-01-08 14:36                   ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Bellman @ 2011-01-07 19:29 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: Massimo Maggi, linux-btrfs

Olaf van der Spek wrote:

> On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote:
>> Are you suggesting to do:
>> 1)fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file
>> 2)application writes to that fd, with one or more system calls, in a
>> short time or in long time, at his will.
>> 3)at fclose (or even at fsync ) atomically swap "data pointer" of "real
>> file" with "temp file", then delete temp.In a transparent mode to
>> userland.  (something similar to e4defrag).
>> Is this sum up correct?
> 
> Almost. Swap should probably not be done at fsync time.
> Other open references (for example running executables) should be swapped too.

What is the visibility of the changes for other processes supposed
to be in the meantime?  I.e., if things happen in this order:

1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
2. Process B does fdb = open("foo.txt", O_RDONLY)
3. B does read(fdb, buf, 4096)
4. A does write(fda, "NEW DATA\n", 9)
5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
6. C does read(fdc, buf, 4096)
7. A calls close(fda)

Does B see an empty file, or does it see the old contents of
the file?  Does C see "NEW DATA\n", or does it see the old
contents of the file, or perhaps an empty file?

	/Bellman

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Atomic file data replace API
  2011-01-07 16:26               ` Hubert Kario
@ 2011-01-07 19:29                 ` Chris Mason
  2011-01-08 14:40                   ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Mason @ 2011-01-07 19:29 UTC (permalink / raw)
  To: Hubert Kario; +Cc: Olaf van der Spek, linux-btrfs

Excerpts from Hubert Kario's message of 2011-01-07 11:26:02 -0500:
> On Friday, January 07, 2011 17:12:11 Chris Mason wrote:
> > Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500:
> > > On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > > >> That's not what I asked. ;)
> > > >> I asked to wait until the first write (or close). That way, you don't
> > > >> get unintentional empty files.
> > > >> One step further, you don't have to keep the data in memory, you're
> > > >> free to write them to disk. You just wouldn't update the meta-data
> > > >> (yet).
> > > > 
> > > > Sorry ;) Picture an application that truncates 1024 files without
> > > > closing any of them.  Basically any operation that includes the kernel
> > > > waiting for applications because they promise to do something soon is
> > > > a denial of service attack, or a really easy way to run out of memory
> > > > on the box.
> > > 
> > > I'm not sure why you would run out of memory in that case.
> > 
> > Well, lets make sure I've got a good handle on the proposed interface:
> > 
> > 1) fd = open(some_file, O_ATOMIC)
> > 2) truncate(fd, 0)
> > 3) write(fd, new data)
> > 
> > The semantics are that we promise not to let the truncate hit the disk
> > until the application does the write.
> > 
> > We have a few choices on how we do this:
> > 
> > 1) Leave the disk untouched, but keep something in memory that says this
> > inode is really truncated
> > 
> > 2) Record on disk that we've done our atomic truncate but it is still
> > pending.  We'd need some way to remove or invalidate this record after a
> > crash.
> > 
> > 3) Go ahead and do the operation but don't allow the transaction to
> > commit until the write is done.
> > 
> > option #1: keep something in memory.  Well, any time we have a
> > requirement to pin something in memory until userland decides to do a
> > write, we risk oom.
> 
> Userland has already a file descriptor allocated (which can fail anyway 
> because of OOM), I see no problem in increasing the size of kernel memory 
> usage by 4 bytes (if not less) just to note that the application wants to see 
> the file as truncated (1 bit) and the next write has to be atomic (2nd bit?).
> 

The exact amount of tracking is going to vary.  The reason why is that
actually doing the truncate is an O(size of the file) operation and so
you can't just flip a switch when the write or the close comes in.  You
have to run through all the metadata of the file and do something
temporary with each part that is only completed when the file IO is
actually done.

Honestly, there are many different ways to solve this in the application.
Requiring high speed atomic replacement of individual file contents is a
recipe for frustration.
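
One of the application-level approaches referred to here is the temp-file-plus-rename pattern that comes up elsewhere in this thread. A minimal sketch in Python, assuming a POSIX system; the helper name atomic_replace and the ".tmp-" prefix are made up for illustration and are not part of any existing API:

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of `path` with `data` via a temporary file.

    Hypothetical helper, not an existing library function.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the kernel to write the new data out
        os.replace(tmp_path, path)  # rename(2): switch the name to the new file
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory as well so the rename itself is recorded (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp() creates the file with mode 0600, so this sketch does not carry over the permissions, ownership or extended attributes of the file it replaces. That is one of the reasons the pattern is more awkward than a plain open(O_TRUNC).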

-chris


* Re: Atomic file data replace API
  2011-01-07 14:58 ` Chris Mason
  2011-01-07 15:01   ` Olaf van der Spek
@ 2011-01-08  1:11   ` Phillip Susi
  1 sibling, 0 replies; 28+ messages in thread
From: Phillip Susi @ 2011-01-08  1:11 UTC (permalink / raw)
  To: Chris Mason; +Cc: Olaf van der Spek, linux-btrfs

On 01/07/2011 09:58 AM, Chris Mason wrote:
> Yes and no.  We have a best effort mechanism where we try to guess that
> since you've done this truncate and the write that you want the writes
> to show up quickly.  But it's a guess.

It is a pretty good guess, and one that the NT kernel has been making 
for 15 years or so.  I've been following this issue for some time and I 
still don't understand why Ted is so hostile to this and can't make it 
work right on ext4.  When you get a rename() you just need to check if 
there are outstanding journal transactions and/or dirty cache pages, and 
hang the rename() transaction on the end of those.  That way if the 
system crashes after the new file has fully hit the disk, the old file 
is gone and you only have the new one, but if it crashes before, you 
still have the old one in place.

Both the writes and the rename can be delayed in the cache to an 
arbitrary point in the future; what matters is that their order is 
preserved.


* Re: Atomic file data replace API
  2011-01-07 19:29                 ` Thomas Bellman
@ 2011-01-08 14:36                   ` Olaf van der Spek
  2011-01-08 21:43                     ` Thomas Bellman
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-08 14:36 UTC (permalink / raw)
  To: Thomas Bellman; +Cc: Massimo Maggi, linux-btrfs

On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
> What is the visibility of the changes for other processes supposed
> to be in the meantime?  I.e., if things happen in this order:

Should be atomic too, at close time.

> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
> 2. Process B does fdb = open("foo.txt", O_RDONLY)
> 3. B does read(fdb, buf, 4096)
> 4. A does write(fda, "NEW DATA\n", 9)
> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
> 6. C does read(fdc, buf, 4096)
> 7. A calls close(fda)
>
> Does B see an empty file, or does it see the old contents of
> the file?

Old file, otherwise A wouldn't be atomic.

> Does C see "NEW DATA\n", or does it see the old
> contents of the file, or perhaps an empty file?

Old file again, as the 'transaction' isn't finished until close.
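
The visibility rules described above can be tried out in userspace today. O_ATOMIC is only a proposal in this thread and does not exist, so the following sketch stands in for it with a private temporary file and a rename at "close" time; the function name demo_visibility and the sample file contents are assumptions made for the example:

```python
import os
import tempfile


def demo_visibility(directory):
    """Replay steps 1-7 with a temp file standing in for O_ATOMIC."""
    path = os.path.join(directory, "foo.txt")
    with open(path, "wb") as f:
        f.write(b"OLD DATA\n")

    # 1. A starts its "transaction": new data goes to a private temp file.
    fda, tmp_path = tempfile.mkstemp(dir=directory)
    # 2-3. B opens the file and reads from it.
    fdb = os.open(path, os.O_RDONLY)
    seen_by_b = os.read(fdb, 4096)
    # 4. A writes the new contents.
    os.write(fda, b"NEW DATA\n")
    # 5-6. C opens the file and reads from it.
    fdc = os.open(path, os.O_RDONLY)
    seen_by_c = os.read(fdc, 4096)
    # 7. A "closes": flush the data, then publish it with a rename.
    os.fsync(fda)
    os.close(fda)
    os.replace(tmp_path, path)

    # A reader that arrives after the close sees the new contents.
    with open(path, "rb") as f:
        seen_after_close = f.read()
    os.close(fdb)
    os.close(fdc)
    return seen_by_b, seen_by_c, seen_after_close
```

Both B and C get the old contents, and only a reader that opens the file after step 7 gets "NEW DATA\n", which matches the answers given above.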

-- 
Olaf
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Atomic file data replace API
  2011-01-07 19:29                 ` Chris Mason
@ 2011-01-08 14:40                   ` Olaf van der Spek
  2011-01-26 18:30                     ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-08 14:40 UTC (permalink / raw)
  To: Chris Mason; +Cc: Hubert Kario, linux-btrfs

On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
> The exact amount of tracking is going to vary.  The reason why is that
> actually doing the truncate is an O(size of the file) operation and so
> you can't just flip a switch when the write or the close comes in.  You
> have to run through all the metadata of the file and do something
> temporary with each part that is only completed when the file IO is
> actually done.

That's true. Maybe the proper way, via O_ATOMIC, is better.

> Honestly, there are many different ways to solve this in the application.
> Requiring high speed atomic replacement of individual file contents is a
> recipe for frustration.

Did you see Massimo's message? That'd be the ideal way from an app
point of view.
Not solving this properly in the FS moves the problem to userspace
where it's even harder to solve and is not as performant.

Replacing file data is a common operation that IMO the FS should
support in a safe way.
-- 
Olaf

* Re: Atomic file data replace API
  2011-01-08 14:36                   ` Olaf van der Spek
@ 2011-01-08 21:43                     ` Thomas Bellman
  2011-01-09 15:16                       ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Bellman @ 2011-01-08 21:43 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: Massimo Maggi, linux-btrfs

Olaf van der Spek wrote:

> On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> What is the visibility of the changes for other processes supposed
>> to be in the meantime?  I.e., if things happen in this order:
> 
> Should be atomic too, at close time.
> 
>> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC)
>> 2. Process B does fdb = open("foo.txt", O_RDONLY)
>> 3. B does read(fdb, buf, 4096)
>> 4. A does write(fda, "NEW DATA\n", 9)
>> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY)
>> 6. C does read(fdc, buf, 4096)
>> 7. A calls close(fda)
>>
>> Does B see an empty file, or does it see the old contents of
>> the file?
> 
> Old file, otherwise A wouldn't be atomic.
> 
>> Does C see "NEW DATA\n", or does it see the old
>> contents of the file, or perhaps an empty file?
> 
> Old file again, as the 'transaction' isn't finished until close.

So, basically database transactions with an isolation level of
"committed read", for file operations.  That's something I have
wanted for a long time, especially if I also get a rollback()
operation, but have never heard of any Unix that implemented it.

A separate commit() operation would be better than conflating it
with close().  And as I said, we want a rollback() as well.  And
a process that terminates without committing the transaction that
it is performing, should have the transaction automatically rolled
back.
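
An explicit commit()/rollback() pair of this kind can be approximated in userspace for a single file. The sketch below is an illustration only: the class name FileTransaction is hypothetical, the staging is done with a temporary file and rename, and commit() ends the transaction rather than allowing further writes afterwards.

```python
import os
import tempfile


class FileTransaction:
    """Hypothetical helper: stage writes, then commit() or rollback()."""

    def __init__(self, path):
        self.path = path
        directory = os.path.dirname(os.path.abspath(path))
        fd, self._tmp_path = tempfile.mkstemp(dir=directory)
        self._tmp = os.fdopen(fd, "wb")
        self._open = True

    def write(self, data):
        # Writes only touch the staging file; readers of `path` are unaffected.
        self._tmp.write(data)

    def commit(self):
        # Flush the staged data, then publish it under the real name.
        self._tmp.flush()
        os.fsync(self._tmp.fileno())
        self._tmp.close()
        os.replace(self._tmp_path, self.path)
        self._open = False

    def rollback(self):
        # Throw the staged data away; the original file was never modified.
        self._tmp.close()
        os.unlink(self._tmp_path)
        self._open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the block without commit() rolls the transaction back.
        if self._open:
            self.rollback()
        return False
```

The automatic rollback here only covers leaving the `with` block, including via an exception. A process that is killed outright leaves a stray temporary file behind, which is exactly the kind of cleanup a kernel-side implementation could do and a userspace one cannot.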

I only have a very shallow knowledge about the internals of the
Linux kernel in regards to filesystems, but I suspect that this
could be implemented almost entirely within the VFS, and not need
to touch the actual filesystems, as long as you are satisfied
with a limited amount of transaction space (what fits in RAM +
swap).

I'm looking forward to your implementation. :-)  Even though I
suspect that it would be a rather large undertaking to implement...

	/Bellman


* Re: Atomic file data replace API
  2011-01-08 21:43                     ` Thomas Bellman
@ 2011-01-09 15:16                       ` Olaf van der Spek
  2011-01-09 18:56                         ` Thomas Bellman
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-09 15:16 UTC (permalink / raw)
  To: Thomas Bellman; +Cc: Massimo Maggi, linux-btrfs

On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
> So, basically database transactions with an isolation level of
> "committed read", for file operations.  That's something I have
> wanted for a long time, especially if I also get a rollback()
> operation, but have never heard of any Unix that implemented it.

True, that's why this feature request is here.
Note that it's (ATM) only about single file data replace.

> A separate commit() operation would be better than conflating it
> with close().  And as I said, we want a rollback() as well.  And
> a process that terminates without committing the transaction that
> it is performing, should have the transaction automatically rolled
> back.

What could you do between commit and close?

> I only have a very shallow knowledge about the internals of the
> Linux kernel in regards to filesystems, but I suspect that this
> could be implemented almost entirely within the VFS, and not need
> to touch the actual filesystems, as long as you are satisfied
> with a limited amount of transaction space (what fits in RAM +
> swap).
>
> I'm looking forward to your implementation. :-)  Even though I
> suspect that it would be a rather large undertaking to implement...

I have no plans to work on an implementation.

-- 
Olaf

* Re: Atomic file data replace API
  2011-01-09 15:16                       ` Olaf van der Spek
@ 2011-01-09 18:56                         ` Thomas Bellman
  2011-01-09 19:06                           ` Olaf van der Spek
  2011-01-09 20:13                           ` Phillip Susi
  0 siblings, 2 replies; 28+ messages in thread
From: Thomas Bellman @ 2011-01-09 18:56 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: Massimo Maggi, linux-btrfs

Olaf van der Spek wrote:

> On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> So, basically database transactions with an isolation level of
>> "committed read", for file operations.  That's something I have
>> wanted for a long time, especially if I also get a rollback()
>> operation, but have never heard of any Unix that implemented it.
> 
> True, that's why this feature request is here.
> Note that it's (ATM) only about single file data replace.

That particular problem was solved with the introduction of the
rename(2) system call in 4.2BSD a bit more than a quarter of a
century ago.  There is no need to introduce another, less flexible,
API for doing the same thing.

>> A separate commit() operation would be better than conflating it
>> with close().  And as I said, we want a rollback() as well.  And
>> a process that terminates without committing the transaction that
>> it is performing, should have the transaction automatically rolled
>> back.
> 
> What could you do between commit and close?

More write() operations, of course.  Just like you can continue
with more transactions after a COMMIT WORK call without having
to close and re-open the database in SQL.


	/Bellman


* Re: Atomic file data replace API
  2011-01-09 18:56                         ` Thomas Bellman
@ 2011-01-09 19:06                           ` Olaf van der Spek
  2011-01-09 20:13                           ` Phillip Susi
  1 sibling, 0 replies; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-09 19:06 UTC (permalink / raw)
  To: Thomas Bellman; +Cc: Massimo Maggi, linux-btrfs

On Sun, Jan 9, 2011 at 7:56 PM, Thomas Bellman <bellman@nsc.liu.se> wrote:
>> True, that's why this feature request is here.
>> Note that it's (ATM) only about single file data replace.
>
> That particular problem was solved with the introduction of the
> rename(2) system call in 4.2BSD a bit more than a quarter of a
> century ago.  There is no need to introduce another, less flexible,
> API for doing the same thing.

You might want to read about the problems with that workaround.

>> What could you do between commit and close?
>
> More write() operations, of course.  Just like you can continue
> with more transactions after a COMMIT WORK call without having
> to close and re-open the database in SQL.

The transaction is defined as beginning with open and ending with close.
-- 
Olaf

* Re: Atomic file data replace API
  2011-01-09 18:56                         ` Thomas Bellman
  2011-01-09 19:06                           ` Olaf van der Spek
@ 2011-01-09 20:13                           ` Phillip Susi
  1 sibling, 0 replies; 28+ messages in thread
From: Phillip Susi @ 2011-01-09 20:13 UTC (permalink / raw)
  To: Thomas Bellman; +Cc: Olaf van der Spek, Massimo Maggi, linux-btrfs

On 01/09/2011 01:56 PM, Thomas Bellman wrote:
> That particular problem was solved with the introduction of the
> rename(2) system call in 4.2BSD a bit more than a quarter of a
> century ago. There is no need to introduce another, less flexible,
> API for doing the same thing.

I'm curious if there are any BSD specifications that state that rename() 
has this behavior.  Ted Ts'o has been claiming that POSIX does not 
require this behavior in the face of a crash and that as a result, an 
application that relies on such behavior is broken, and needs to fsync() 
before rename().  This of course, makes replacing numerous files much 
slower, glacially so on btrfs.  There has been a great deal of 
discussion on the dpkg mailing lists about it since plenty of people are 
upset that dpkg runs much slower these days than it used to, because it 
now calls fsync() before rename() in order to avoid breakage on ext4.

You can read more, including the rationale of why POSIX does not require 
this behavior at http://lwn.net/Articles/323607/.

I still say that preserving the order of the writes and rename is the 
only sane thing to do, whether POSIX requires it or not.


* Re: Atomic file data replace API
  2011-01-08 14:40                   ` Olaf van der Spek
@ 2011-01-26 18:30                     ` Olaf van der Spek
  2011-01-26 19:30                       ` Chris Mason
  0 siblings, 1 reply; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-26 18:30 UTC (permalink / raw)
  To: Chris Mason; +Cc: Hubert Kario, linux-btrfs

On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> The exact amount of tracking is going to vary.  The reason why is that
>> actually doing the truncate is an O(size of the file) operation and so
>> you can't just flip a switch when the write or the close comes in.  You
>> have to run through all the metadata of the file and do something
>> temporary with each part that is only completed when the file IO is
>> actually done.
>
> That's true. Maybe the proper way, via O_ATOMIC, is better.
>
>> Honestly, there are many different ways to solve this in the application.
>> Requiring high speed atomic replacement of individual file contents is a
>> recipe for frustration.
>
> Did you see Massimo's message? That'd be the ideal way from an app
> point of view.
> Not solving this properly in the FS moves the problem to userspace
> where it's even harder to solve and is not as performant.
>
> Replacing file data is a common operation that IMO the FS should
> support in a safe way.

Chris?


-- 
Olaf

* Re: Atomic file data replace API
  2011-01-26 18:30                     ` Olaf van der Spek
@ 2011-01-26 19:30                       ` Chris Mason
  2011-01-26 21:56                         ` Olaf van der Spek
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Mason @ 2011-01-26 19:30 UTC (permalink / raw)
  To: Olaf van der Spek; +Cc: Hubert Kario, linux-btrfs

Excerpts from Olaf van der Spek's message of 2011-01-26 13:30:08 -0500:
> On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> > On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote:
> >> The exact amount of tracking is going to vary.  The reason why is that
> >> actually doing the truncate is an O(size of the file) operation and so
> >> you can't just flip a switch when the write or the close comes in.  You
> >> have to run through all the metadata of the file and do something
> >> temporary with each part that is only completed when the file IO is
> >> actually done.
> >
> > That's true. Maybe the proper way, via O_ATOMIC, is better.
> >
> >> Honestly, there are many different ways to solve this in the application.
> >> Requiring high speed atomic replacement of individual file contents is a
> >> recipe for frustration.
> >
> > Did you see Massimo's message? That'd be the ideal way from an app
> > point of view.
> > Not solving this properly in the FS moves the problem to userspace
> > where it's even harder to solve and is not as performant.
> >
> > Replacing file data is a common operation that IMO the FS should
> > support in a safe way.
>
> Chris?
>

My answer hasn't really changed ;)  Replacing file data is a common
operation, but it is still surprisingly complex.  Again, the truncate is
O(size of the file) and it is actually impossible to do this atomically
in most filesystems.

You don't notice this because xfs/ext34/btrfs (and many others) have
code that makes sure a truncate is restarted if you crash.  So, it
appears to be atomic even though we're really just restarting the
operation.  In order to have a truncate + replacement of data operation,
we'd have to do a disk format change that includes both the truncate and
the new data.

It would look a lot like echo data > file.new ; truncate file ; mv
file.new file, but recorded in the FS metadata.

I don't have this in the btrfs roadmap.  It would be nice but most
people use databases for things that require atomic operations.  I
think what ext4 and btrfs do today fall into the category of best
effort and least surprise, and I think it is as good as we can get
without huge performance penalties for normal use.

Now, if you want to talk about atomic replacement of file data without
changing the file size, that's much easier.  At least it's easier for
those of us with cows in our pockets.

-chris

* Re: Atomic file data replace API
  2011-01-26 19:30                       ` Chris Mason
@ 2011-01-26 21:56                         ` Olaf van der Spek
  0 siblings, 0 replies; 28+ messages in thread
From: Olaf van der Spek @ 2011-01-26 21:56 UTC (permalink / raw)
  To: Chris Mason; +Cc: Hubert Kario, linux-btrfs

On Wed, Jan 26, 2011 at 8:30 PM, Chris Mason <chris.mason@oracle.com> wrote:
> My answer hasn't really changed ;)  Replacing file data is a common
> operation, but it is still surprisingly complex.  Again, the truncate is
> O(size of the file) and it is actually impossible to do this atomically
> in most filesystems.

Unfortunately life isn't trivial. ;)
Given that it's common, it doesn't make sense to have code duplication
in lots of apps to implement the temp file rename pattern.
If it's too complex to implement in the FS (ATM), would it be possible
to implement it in a higher layer?

> You don't notice this because xfs/ext34/btrfs (and many others) have
> code that makes sure a truncate is restarted if you crash.  So, it
> appears to be atomic even though we're really just restarting the
> operation.  In order to have a truncate + replacement of data operation,
> we'd have to do a disk format change that includes both the truncate and
> the new data.

I'm not sure why the disk format would have to change.
Conceptually, just like the temp file case, you'd write the new data
to newly allocated blocks.
After (and I guess that's the complex part) they're safely on disk,
you update the meta data, in an atomic way.
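
The two-step idea (write the new data to fresh blocks, then switch the metadata over in one step) can be modelled with a toy in-memory structure. This is only a conceptual sketch: the class ToyCowFile and its methods are invented for the example, it says nothing about how btrfs lays out data on disk, and it does not address the objection that freeing the old blocks is O(size of the file).

```python
class ToyCowFile:
    """Toy model: a dict of blocks plus one 'root' pointer as the metadata."""

    def __init__(self, data=b""):
        self._blocks = {}  # block id -> bytes; stands in for the disk
        self._next_id = 0
        self._root = self._store(data)  # which block is currently live

    def _store(self, data):
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = bytes(data)
        return block_id

    def read(self):
        # Readers always follow the root pointer.
        return self._blocks[self._root]

    def stage(self, data):
        """Write new data to a fresh block; readers still see the old one."""
        return self._store(data)

    def commit(self, block_id):
        """Publish a staged block by updating the single root pointer."""
        old = self._root
        self._root = block_id  # the one step that changes what readers see
        del self._blocks[old]  # the old data is only freed afterwards
```

Until commit() runs, read() keeps returning the old contents; afterwards it returns the new ones, and at no point does it return an empty or half-written file.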

> It would look a lot like echo data > file.new ; truncate file ; mv
> file.new file, but recorded in the FS metadata.
>
> I don't have this in the btrfs roadmap.  It would be nice but most
> people use databases for things that require atomic operations.  I

Executables and files shouldn't be in a DB.

Olaf

Thread overview: 28+ messages
2011-01-06 20:01 Atomic file data replace API Olaf van der Spek
2011-01-07 13:55 ` Mike Fleetwood
2011-01-07 14:01   ` Olaf van der Spek
2011-01-07 14:10     ` Olaf van der Spek
2011-01-07 14:58 ` Chris Mason
2011-01-07 15:01   ` Olaf van der Spek
2011-01-07 15:05     ` Chris Mason
2011-01-07 15:08       ` Olaf van der Spek
2011-01-07 15:13         ` Chris Mason
2011-01-07 15:17           ` Olaf van der Spek
2011-01-07 16:12             ` Chris Mason
2011-01-07 16:19               ` Olaf van der Spek
2011-01-07 16:26               ` Hubert Kario
2011-01-07 19:29                 ` Chris Mason
2011-01-08 14:40                   ` Olaf van der Spek
2011-01-26 18:30                     ` Olaf van der Spek
2011-01-26 19:30                       ` Chris Mason
2011-01-26 21:56                         ` Olaf van der Spek
2011-01-07 16:32             ` Massimo Maggi
2011-01-07 16:34               ` Olaf van der Spek
2011-01-07 19:29                 ` Thomas Bellman
2011-01-08 14:36                   ` Olaf van der Spek
2011-01-08 21:43                     ` Thomas Bellman
2011-01-09 15:16                       ` Olaf van der Spek
2011-01-09 18:56                         ` Thomas Bellman
2011-01-09 19:06                           ` Olaf van der Spek
2011-01-09 20:13                           ` Phillip Susi
2011-01-08  1:11   ` Phillip Susi
