* Atomic file data replace API @ 2010-12-27 11:51 Olaf van der Spek 2010-12-27 13:20 ` Amir Goldstein 2010-12-28 2:59 ` Ted Ts'o 0 siblings, 2 replies; 47+ messages in thread From: Olaf van der Spek @ 2010-12-27 11:51 UTC (permalink / raw) To: linux-fsdevel, linux-ext4 Hi, Since non-durable appears to be controversial, let's consider the case without that aspect. Since the introduction of ext4, some apps/users have had issues with file corruption after a system crash. It's not a bug in the FS AFAIK and it's not exclusive to ext4. Writing a temp file, fsync, rename is often proposed. But how does one preserve meta-data, including file owner? Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-27 11:51 Atomic file data replace API Olaf van der Spek @ 2010-12-27 13:20 ` Amir Goldstein 2010-12-27 15:53 ` Olaf van der Spek 2010-12-28 2:59 ` Ted Ts'o 1 sibling, 1 reply; 47+ messages in thread From: Amir Goldstein @ 2010-12-27 13:20 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-fsdevel, linux-ext4 On Mon, Dec 27, 2010 at 1:51 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote: > Hi, > > Since non-durable appears to be controversial, let's consider the case > without that aspect. > > Since the introduction of ext4, some apps/users have had issues with > file corruption after a system crash. It's not a bug in the FS AFAIK > and it's not exclusive to ext4. > Writing a temp file, fsync, rename is often proposed. > But how does one preserve meta-data, including file owner? > So as I wrote you on the previous thread, in Ext4 you can probably accomplish that already by using the Ext4 specific EXT4_IOC_EXT_MOVE ioctl, which is used by e4defrag to atomically switch the fragmented copy of the data with a de-fragmented copy of the data. It is a more granular version of the exchangedata() BSD API mentioned in the previous thread: http://www.manpagez.com/man/2/exchangedata/ So the atomic update is: write(tempfd); fdatasync(tempfd); exchangedata(tempfd, fd) If you choose to pursue your campaign for "Atomic file data replace API", I recommend that you: 1. change the slogan to the more catchy "Implementing exchangedata() API" (you already have a man page for that) 2. convince VFS people to support the new generic system call / optional FS operation exchangedata() 3. if you can, post the relevant patches, so people can review and test them Implementation of exchangedata() operation in Ext4 should be trivial using the ext4_move_extents() function and I didn't check, but I bet that XFS has that functionality as well. Good luck, Amir. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-27 13:20 ` Amir Goldstein @ 2010-12-27 15:53 ` Olaf van der Spek 2010-12-27 17:20 ` Amir Goldstein 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2010-12-27 15:53 UTC (permalink / raw) To: Amir Goldstein; +Cc: linux-fsdevel, linux-ext4 On Mon, Dec 27, 2010 at 2:20 PM, Amir Goldstein <amir73il@gmail.com> wrote: > So as I wrote you on the previous thread, in Ext4 you can probably FS-specific code should of course be avoided in normal apps. > It is a more granular version of the exchangedata() BSD API mentioned > in the previous thread: > http://www.manpagez.com/man/2/exchangedata/ > > So the atomic update is: write(tempfd); fdatasync(tempfd); > exchangedata(tempfd, fd) Except exchangedata is not (widely) implemented? Don't you agree it's undesirable to lose meta-data? Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-27 15:53 ` Olaf van der Spek @ 2010-12-27 17:20 ` Amir Goldstein 2010-12-27 18:34 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Amir Goldstein @ 2010-12-27 17:20 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-fsdevel, linux-ext4 On Mon, Dec 27, 2010 at 5:53 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote: > On Mon, Dec 27, 2010 at 2:20 PM, Amir Goldstein <amir73il@gmail.com> wrote: >> So as I wrote you on the previous thread, in Ext4 you can probably > > FS-specific code should of course be avoided in normal apps. > >> It is a more granular version of the exchangedata() BSD API mentioned >> in the previous thread: >> http://www.manpagez.com/man/2/exchangedata/ >> >> So the atomic update is: write(tempfd); fdatasync(tempfd); >> exchangedata(tempfd, fd) > > Except exchangedata is not (widely) implemented? Not in Linux anyway. > Don't you agree it's undesirable to lose meta-data? Yes I agree. you can have my vote for "it's nice to have this", but the fact that we did without it for so long must mean something... Anyway, you need to convince someone to implement it (unless you do it yourself), some developers to review it and the maintainers to accept it, so unless you come up with 'a real world problem', the busy FS developers will not be bothered to accept 'the fix'. Accepting new API's has a huge price of testing them and maintaining them every release, so don't take the resistance personally. Now let's say that you decide to focus on the problem of: 'safe editor save to a file which is not owned by you but writable by you'. You may want to look for a specific editor which has 'safe save' functionality (maybe LibreOffice?) and query the developers if they would like the new feature and if they would support your proposal. That is the way kernel development works - and for good reasons. Amir. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-27 17:20 ` Amir Goldstein @ 2010-12-27 18:34 ` Olaf van der Spek 0 siblings, 0 replies; 47+ messages in thread From: Olaf van der Spek @ 2010-12-27 18:34 UTC (permalink / raw) To: Amir Goldstein; +Cc: linux-fsdevel, linux-ext4 On Mon, Dec 27, 2010 at 6:20 PM, Amir Goldstein <amir73il@gmail.com> wrote: >> Don't you agree it's undesirable to lose meta-data? > > Yes I agree. you can have my vote for "it's nice to have this", > but the fact that we did without it for so long must mean something... I'm not sure it means something positive. > Anyway, you need to convince someone to implement it > (unless you do it yourself), some developers to review it > and the maintainers to accept it, so unless you come up with 'a real > world problem', > the busy FS developers will not be bothered to accept 'the fix'. > Accepting new API's has a huge price of testing them and maintaining them > every release, so don't take the resistance personally. > > Now let's say that you decide to focus on the problem of: > 'safe editor save to a file which is not owned by you but writable by you'. > You may want to look for a specific editor which has 'safe save' functionality > (maybe LibreOffice?) and query the developers if they would like the new feature > and if they would support your proposal. > > That is the way kernel development works - and for good reasons. I agree in general you need a good use case. But AFAIK FS devs are aware of many apps not doing it the right way. So I expected them to have a FAQ entry that shows what this right way is. Ted says a huge performance hit is involved, but nobody has been able to tell why yet. There's also the problem of not having permission to create a temp file. Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-27 11:51 Atomic file data replace API Olaf van der Spek 2010-12-27 13:20 ` Amir Goldstein @ 2010-12-28 2:59 ` Ted Ts'o 2010-12-28 17:27 ` Olaf van der Spek 1 sibling, 1 reply; 47+ messages in thread From: Ted Ts'o @ 2010-12-28 2:59 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-fsdevel, linux-ext4 On Mon, Dec 27, 2010 at 12:51:54PM +0100, Olaf van der Spek wrote: > Since the introduction of ext4, some apps/users have had issues with > file corruption after a system crash. It's not a bug in the FS AFAIK > and it's not exclusive to ext4. > Writing a temp file, fsync, rename is often proposed. > But how does one preserve meta-data, including file owner? What's the use case where preserving file ownership matters? - Ted ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-28 2:59 ` Ted Ts'o @ 2010-12-28 17:27 ` Olaf van der Spek 2010-12-28 19:06 ` Ric Wheeler 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2010-12-28 17:27 UTC (permalink / raw) To: Ted Ts'o; +Cc: linux-fsdevel, linux-ext4 On Tue, Dec 28, 2010 at 3:59 AM, Ted Ts'o <tytso@mit.edu> wrote: > On Mon, Dec 27, 2010 at 12:51:54PM +0100, Olaf van der Spek wrote: >> Since the introduction of ext4, some apps/users have had issues with >> file corruption after a system crash. It's not a bug in the FS AFAIK >> and it's not exclusive to ext4. >> Writing a temp file, fsync, rename is often proposed. >> But how does one preserve meta-data, including file owner? > > What's the use case where preserving file ownership matters? Why is it you ignore most of the question and only challenge a tiny bit? I can't think of a problem case right now, but I sure can't guarantee always resetting file owner is never a problem. Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-28 17:27 ` Olaf van der Spek @ 2010-12-28 19:06 ` Ric Wheeler 2010-12-28 22:25 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Ric Wheeler @ 2010-12-28 19:06 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Ted Ts'o, linux-fsdevel, linux-ext4 On 12/28/2010 12:27 PM, Olaf van der Spek wrote: > On Tue, Dec 28, 2010 at 3:59 AM, Ted Ts'o<tytso@mit.edu> wrote: >> On Mon, Dec 27, 2010 at 12:51:54PM +0100, Olaf van der Spek wrote: >>> Since the introduction of ext4, some apps/users have had issues with >>> file corruption after a system crash. It's not a bug in the FS AFAIK >>> and it's not exclusive to ext4. >>> Writing a temp file, fsync, rename is often proposed. >>> But how does one preserve meta-data, including file owner? >> What's the use case where preserving file ownership matters? > Why is it you ignore most of the question and only challenge a tiny bit? > I can't think of a problem case right now, but I sure can't guarantee > always resetting file owner is never a problem. > > Olaf I really think that you have missed the point of this list. This list is for either developers (those who have downloaded the free code and work on it) or others who want to move things forward concretely. Perfectly fine to contribute ideas, but if you are not a coder or do not have the time or inclination to work on things yourself, you have to be *really* convincing. We continually get bombarded with ideas, wish list items, etc so we are not lacking in work to do. If you cannot explain the use case, you will not get any buy in... Regards, Ric ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-28 19:06 ` Ric Wheeler @ 2010-12-28 22:25 ` Olaf van der Spek 2010-12-28 22:36 ` Ric Wheeler 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2010-12-28 22:25 UTC (permalink / raw) To: Ric Wheeler; +Cc: Ted Ts'o, linux-fsdevel, linux-ext4 On Tue, Dec 28, 2010 at 8:06 PM, Ric Wheeler <rwheeler@redhat.com> wrote: > I really think that you have missed the point of this list. > > This list is for either developers (those who have downloaded the free code > and work on it) or others who want to move things forward concretely. Maybe. > Perfectly fine to contribute ideas, but if you are not a coder or do not > have the time or inclination to work on things yourself, you have to be > *really* convincing. > > We continually get bombarded with ideas, wish list items, etc so we are not > lacking in work to do. I understand. > If you cannot explain the use case, you will not get any buy in... I assumed that preserving file owner would be a normal feature and would not require additional explanation. One use case would be updating a file in a safe way when you have write access to that file but not to anything else. Also, according to Ted, a lot of app devs get saving a file in a safe way wrong. So I'm asking what the recommended way to do it is. Is that strange? Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-28 22:25 ` Olaf van der Spek @ 2010-12-28 22:36 ` Ric Wheeler 2010-12-28 22:58 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Ric Wheeler @ 2010-12-28 22:36 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Ted Ts'o, linux-fsdevel, linux-ext4 On 12/28/2010 05:25 PM, Olaf van der Spek wrote: > On Tue, Dec 28, 2010 at 8:06 PM, Ric Wheeler<rwheeler@redhat.com> wrote: >> I really think that you have missed the point of this list. >> >> This list is for either developers (those who have downloaded the free code >> and work on it) or others who want to move things forward concretely. > Maybe. > >> Perfectly fine to contribute ideas, but if you are not a coder or do not >> have the time or inclination to work on things yourself, you have to be >> *really* convincing. >> >> We continually get bombarded with ideas, wish list items, etc so we are not >> lacking in work to do. > I understand. > >> If you cannot explain the use case, you will not get any buy in... > I assumed that preserving file owner would be a normal feature and > would not require additional explanation. > One use case would be updating a file in a save way when you have > write access to that file but not to anything else. > > Also, according to Ted, a lot of app devs get saving a file in a safe > way wrong. So I'm asking what the recommended way to do it is. Is that > strange? > > Olaf I think that various developers have answered this for you several times. As a suggestion, if you are not a kernel developer, show us specifically a bit of application code that demonstrates something that you want to have work differently. Test it with power failure (buy an external e-sata or USB disk and pull power while running your app). Ric ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-28 22:36 ` Ric Wheeler @ 2010-12-28 22:58 ` Olaf van der Spek 2010-12-29 9:20 ` Amir Goldstein 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2010-12-28 22:58 UTC (permalink / raw) To: Ric Wheeler; +Cc: Ted Ts'o, linux-fsdevel, linux-ext4 On Tue, Dec 28, 2010 at 11:36 PM, Ric Wheeler <rwheeler@redhat.com> wrote: > I think that various developers have answered this for you several times. Not really, unfortunately. Haven't seen a single link to code that shows how to do it properly. Temp file, fsync, rename is often mentioned but that skips the preserving meta-data part and this part, which you also skipped: One use case would be updating a file in a safe way when you have write access to that file but not to anything else. > As a suggestion, if you are not a kernel developer, show us specifically a > bit of application code that demonstrates something that you want to have > work differently. I will. > Test it with power failure (buy an external e-sata or USB disk and pull > power while running your app). The current code? I think I'll use a VM instead of an external disk. ;) Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
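For reference, the "temp file, fsync, rename" sequence mentioned above can be sketched as follows. This is a minimal sketch and not code posted by anyone in this thread: the function name replace_file_data is hypothetical, and Python's os module is used only because its calls map directly onto the system calls being discussed. It copies the permission bits and tries to copy owner and group, which an unprivileged caller generally cannot do for a file owned by someone else. It does not copy ACLs or extended attributes, and it needs write permission on the containing directory, so the two gaps raised in this message remain open.

```python
import os
import tempfile


def replace_file_data(path, data):
    """Replace the contents of `path` with `data` (bytes) via temp file, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    st = os.stat(path)  # metadata of the file that is about to be replaced

    # The temp file lives in the same directory (same file system) so that the
    # final rename is atomic. This requires write permission on the directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        try:
            try:
                # Only root (or a caller that already owns the file) can do this.
                os.fchown(fd, st.st_uid, st.st_gid)
            except PermissionError:
                pass  # ownership is lost: the new file belongs to the caller
            # chmod after chown, because chown may clear setuid/setgid bits.
            os.fchmod(fd, st.st_mode & 0o7777)

            view = memoryview(data)
            while view:  # os.write may write fewer bytes than requested
                written = os.write(fd, view)
                view = view[written:]
            os.fsync(fd)  # data must be on disk before the rename is
        finally:
            os.close(fd)
        os.rename(tmp_path, path)  # atomically switch the name to the new file
    except BaseException:
        os.unlink(tmp_path)
        raise

    # fsync the directory so that the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

After a crash, a reader of `path` should see either the complete old contents or the complete new contents. Hard links to the old file keep pointing at the old data, which is another difference from a true in-place data replace.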
* Re: Atomic file data replace API 2010-12-28 22:58 ` Olaf van der Spek @ 2010-12-29 9:20 ` Amir Goldstein 2010-12-29 12:42 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Amir Goldstein @ 2010-12-29 9:20 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Ric Wheeler, Ted Ts'o, linux-fsdevel, linux-ext4

On Wed, Dec 29, 2010 at 12:58 AM, Olaf van der Spek <olafvdspek@gmail.com> wrote:
> On Tue, Dec 28, 2010 at 11:36 PM, Ric Wheeler <rwheeler@redhat.com> wrote:
>> I think that various developers have answered this for you several times.
>
> Not really, unfortunately. Haven't seen a single link to code that
> shows how to do it properly.
> Temp file, fsync, rename is often mentioned but that skips the
> preserving meta-data part and this part, which you also skipped:
> One use case would be updating a file in a safe way when you have
> write access to that file but not to anything else.
>

I think it is safe to say that the *only* option you have now is "temp file, fsync, rename". There is no "generic atomic file data replace API in Linux", though it is available via private ioctl for XFS and EXT4.

You have started a bit of a storm with your previous thread, which doesn't help you much in moving forward in the current thread (the previous thread is still more popular). I suggest that you humbly swallow your need to know WHY it is hard to implement a non-durable atomic API and focus your attention on the very achievable data replace API.

IMHO, implementing an atomic swap_inodes_data operation shouldn't be difficult in most file systems (only the implementation is simple; testing and maintaining it is not to be taken lightly). Something along the lines of:

1. acquire inodes write/truncate locks
2. start transaction
3. check/update quota limits
4. swap inodes i_data content
5. invalidate (or swap?) inodes page caches
6. mark inodes dirty
7. end transaction & release locks

The real challenge would be to get everyone to agree on a common API and carve it in stone in the kernel's ABI (is it just swap_inodes_data? maybe also swap_inode_data_ranges? what about some options?)

Also, as wacky and (some say) faulty as the UNIX permissions model is, current systems have grown old with it, and even 'improving' the behavior of some applications may wake up sleeping monsters, so it will not be done until enough people have pointed out security or usability issues which could not be solved otherwise.

In other words, until you find an *application* that wants to allow another user to modify the content of a file and preserve its metadata and ownership. And unless that application cannot find a better way to achieve what it wanted to do in the first place, or unless that application already has a large install base which suffers from *a problem*, you will not have proven *the need*.

Maybe preserving privileged extended attributes is *a need*. I wouldn't know myself.

Amir.

^ permalink raw reply [flat|nested] 47+ messages in thread
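The seven steps listed in this message can be illustrated with a toy user-space model. This is only a sketch under stated assumptions and not kernel code: ToyInode, QuotaExceeded and the quota_left dictionary are hypothetical names invented for the illustration, the "transaction" is reduced to doing all checks before any state changes, and page cache invalidation has no counterpart in the model. The point it shows is that only the data moves, while owner and other metadata stay attached to each inode.

```python
import threading
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    """Raised when the swap would push an owner over quota (hypothetical)."""


@dataclass
class ToyInode:
    ino: int
    owner: str
    data: bytes = b""
    dirty: bool = False
    lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )


def swap_inodes_data(a, b, quota_left):
    """Swap the data of two toy inodes; quota_left maps owner -> bytes remaining."""
    if a is b:
        return
    # 1. Acquire both locks in a fixed order (by inode number) to avoid deadlock.
    first, second = sorted((a, b), key=lambda inode: inode.ino)
    with first.lock, second.lock:
        # 2. "Start transaction": compute everything first, commit only at the end.
        # 3. Check quota: a's owner is charged for growth, b's owner is refunded.
        delta = len(b.data) - len(a.data)
        new_quota = dict(quota_left)
        new_quota[a.owner] -= delta
        new_quota[b.owner] += delta
        if new_quota[a.owner] < 0 or new_quota[b.owner] < 0:
            raise QuotaExceeded("swap would exceed quota; nothing was changed")
        # 4. Swap the data; owner and other metadata stay with each inode.
        a.data, b.data = b.data, a.data
        # 5. Page cache invalidation is not modelled here.
        # 6. Mark both inodes dirty.
        a.dirty = b.dirty = True
        # 7. "End transaction": publish the new quota, then release the locks.
        quota_left.update(new_quota)
```

In this model an application would write the new contents into a scratch inode it owns, then swap with the target; the target keeps its original owner even though the caller only had write access to it.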
* Re: Atomic file data replace API 2010-12-29 9:20 ` Amir Goldstein @ 2010-12-29 12:42 ` Olaf van der Spek 2010-12-29 15:30 ` Christian Stroetmann 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2010-12-29 12:42 UTC (permalink / raw) To: Amir Goldstein; +Cc: Ric Wheeler, Ted Ts'o, linux-fsdevel, linux-ext4 On Wed, Dec 29, 2010 at 10:20 AM, Amir Goldstein <amir73il@gmail.com> wrote: > On Wed, Dec 29, 2010 at 12:58 AM, Olaf van der Spek > <olafvdspek@gmail.com> wrote: >> On Tue, Dec 28, 2010 at 11:36 PM, Ric Wheeler <rwheeler@redhat.com> wrote: >>> I think that various developers have answered this for you several times. >> >> Not really, unfortunately. Haven't seen a single link to code that >> shows how to do it properly. >> Temp file, fsync, rename is often mentioned but that skips the >> preserving meta-data part and this part, which you also skipped: >> One use case would be updating a file in a safe way when you have >> write access to that file but not to anything else. >> > > I think it is safe to say that the *only* option you have now is "temp > file, fsync, rename". I'm really looking for a concrete code snippet/function that does this. For example, file permissions should definitely be preserved. > There is no "generic atomic file data replace API in Linux", though it > is available via > private ioctl for XFS and EXT4. > > You have started a bit of a storm with your previous thread, which > doesn't help you > much in moving forward in the current thread (previous thread is still > more popular). > I suggest that you humbly swallow you need to know WHY is it hard to implement > non-durable atomic API and focus your attention on the very achievable > data replace API. > > IMHO, implementing atomic swap_inodes_data operation shouldn't be difficult > in most file systems (only implementation is simple, but testing and > maintaining > is not to be taken lightly). > Something along the lines of: > 1. 
aquire inodes write/truncate locks > 2. start transaction > 3. check/update quota limits > 4. swap inodes i_data content > 5. invalidate (or swap?) inodes page caches > 6. mark inodes dirty > 7. end transaction & release locks > > The real challenge would be to get everyone to agree on a common API > and carve it in stone to the kernel's ABI (is it just swap_inodes_data? > maybe also swap_inode_data_ranges? what about some options?) Swapping data is an improvement but still not ideal. The API is also more complex than O_ATOMIC. > Also, as wacky and (some say) faulty the UNIX permissions models is, > current systems have grown old with it, and even 'improving' the behavior > of some applications, may wake up sleeping monsters, so it will not > be done until enough people have pointed out security or usability > issues, which could not be solved otherwise. Each app makes it's own decision about what API to use. Supporting atomic stuff doesn't change the behaviour of existing apps. > In other words, until you find an *application* that wants to allow other > user to modify the content of a file and preserve it's metadata and ownership. > And unless that application cannot find a better way to achieve what it wanted > to do in the first place, or unless that application already has a > large install base > which suffers from *a problem*, you will not have proven *the need*. Maybe I should ask devs of some large apps on their take of this issue. > Maybe preserving privileged extended attributes is *a need*. > I wouldn't know myself. Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-29 12:42 ` Olaf van der Spek @ 2010-12-29 15:30 ` Christian Stroetmann 2010-12-29 15:35 ` Olaf van der Spek 2010-12-29 17:15 ` Greg Freemyer 0 siblings, 2 replies; 47+ messages in thread From: Christian Stroetmann @ 2010-12-29 15:30 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-fsdevel, linux-ext4, Ric Wheeler, Amir Goldstein On the 29.12.2010 13:42, Olaf van der Spek wrote: > On Wed, Dec 29, 2010 at 10:20 AM, Amir Goldstein<amir73il@gmail.com> wrote: >> On Wed, Dec 29, 2010 at 12:58 AM, Olaf van der Spek >> <olafvdspek@gmail.com> wrote: >>> On Tue, Dec 28, 2010 at 11:36 PM, Ric Wheeler<rwheeler@redhat.com> wrote: >>>> I think that various developers have answered this for you several times. >>> Not really, unfortunately. Haven't seen a single link to code that >>> shows how to do it properly. No, not this way. You were, and still are, asked to deliver the code. Don't pervert the thread of the discussion. >>> Temp file, fsync, rename is often mentioned but that skips the >>> preserving meta-data part and this part, which you also skipped: >>> One use case would be updating a file in a safe way when you have >>> write access to that file but not to anything else. >>> >> I think it is safe to say that the *only* option you have now is "temp >> file, fsync, rename". > I'm really looking for a concrete code snippet/function that does this. > For example, file permissions should definitely be preserved. > >> There is no "generic atomic file data replace API in Linux", though it >> is available via >> private ioctl for XFS and EXT4. >> >> You have started a bit of a storm with your previous thread, which >> doesn't help you >> much in moving forward in the current thread (previous thread is still >> more popular). >> I suggest that you humbly swallow you need to know WHY is it hard to implement >> non-durable atomic API and focus your attention on the very achievable >> data replace API. 
>> >> IMHO, implementing atomic swap_inodes_data operation shouldn't be difficult >> in most file systems (only implementation is simple, but testing and >> maintaining >> is not to be taken lightly). >> Something along the lines of: >> 1. aquire inodes write/truncate locks >> 2. start transaction >> 3. check/update quota limits >> 4. swap inodes i_data content >> 5. invalidate (or swap?) inodes page caches >> 6. mark inodes dirty >> 7. end transaction& release locks >> >> The real challenge would be to get everyone to agree on a common API >> and carve it in stone to the kernel's ABI (is it just swap_inodes_data? >> maybe also swap_inode_data_ranges? what about some options?) > Swapping data is an improvement but still not ideal. The API is also > more complex than O_ATOMIC. > >> Also, as wacky and (some say) faulty the UNIX permissions models is, >> current systems have grown old with it, and even 'improving' the behavior >> of some applications, may wake up sleeping monsters, so it will not >> be done until enough people have pointed out security or usability >> issues, which could not be solved otherwise. > Each app makes it's own decision about what API to use. Supporting > atomic stuff doesn't change the behaviour of existing apps. Wrong, we are talking here in the first place about general atomic FS operations. And to guarantee atomicity you have to change general FS functions in such a way that in the end all other applications are affected, or otherwise you have to implement an own (larger part of an) FS. At this point there is no discussion anymore without code from you, because this subject is as well discussed to the maximum in information processing/informatics/computer science. >> In other words, until you find an *application* that wants to allow other >> user to modify the content of a file and preserve it's metadata and ownership. 
>> And unless that application cannot find a better way to achieve what it wanted >> to do in the first place, or unless that application already has a >> large install base >> which suffers from *a problem*, you will not have proven *the need*. > Maybe I should ask devs of some large apps on their take of this issue. Nonsense, because they are already using: a) the functions provided by an FS, b) the functions provided by a DBMS, or c) a proprietary special solution based on the available functions of the OS and additional functionality that they have developed and maintained themselves for their comparable use cases for decades, due to the cost vs. benefit ratio. >> Maybe preserving privileged extended attributes is *a need*. >> I wouldn't know myself. > Olaf Christian Stroetmann ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-29 15:30 ` Christian Stroetmann @ 2010-12-29 15:35 ` Olaf van der Spek 2010-12-29 16:30 ` Christian Stroetmann 2010-12-29 17:15 ` Greg Freemyer 1 sibling, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2010-12-29 15:35 UTC (permalink / raw) To: Christian Stroetmann Cc: linux-fsdevel, linux-ext4, Ric Wheeler, Amir Goldstein On Wed, Dec 29, 2010 at 4:30 PM, Christian Stroetmann <stroetmann@ontolinux.com> wrote: > On the 29.12.2010 13:42, Olaf van der Spek wrote: >>>> Not really, unfortunately. Haven't seen a single link to code that >>>> shows how to do it properly. > > No, not this way. You were and still are asked for delivering the code. > Don't pervert the threat of the discussion. I'm talking about the code for temp file, fsync, rename. Not about O_ATOMIC code. >> Each app makes it's own decision about what API to use. Supporting >> atomic stuff doesn't change the behaviour of existing apps. > > Wrong, we are talking here in the first place about general atomic FS > operations. And to guarantee atomicity you have to change general FS > functions in such a way that in the end all other applications are affected, Why's that? > or otherwise you have to implement an own (larger part of an) FS. > At this point there is no discussion anymore without code from you, because > this subject is as well discussed to the maximum in information > processing/informatics/computer science. This subject? Exactly what subject? >> Maybe I should ask devs of some large apps on their take of this issue. > > Nonsense, because they are already using: > a) the functions available by an FS, Of course. Does that mean the situation can't be improved for them? > b) the functions available by a DBMS, or > c) a propritary special solution based on the available functions of the OS > and additional functionality that they develope and maintain themselves > for their comparable use cases since decades due to the cost vs. benefit > ratio. 
Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-29 15:35 ` Olaf van der Spek @ 2010-12-29 16:30 ` Christian Stroetmann 2010-12-29 17:12 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Christian Stroetmann @ 2010-12-29 16:30 UTC (permalink / raw) To: Olaf van der Spek Cc: linux-fsdevel, linux-ext4, Ted Ts'o, Ric Wheeler, Amir Goldstein On the 29.12.2010 16:35, Olaf van der Spek wrote: > On Wed, Dec 29, 2010 at 4:30 PM, Christian Stroetmann > <stroetmann@ontolinux.com> wrote: >> On the 29.12.2010 13:42, Olaf van der Spek wrote: >>>>> Not really, unfortunately. Haven't seen a single link to code that >>>>> shows how to do it properly. >> No, not this way. You were and still are asked for delivering the code. >> Don't pervert the threat of the discussion. > I'm talking about the code for temp file, fsync, rename. Not about > O_ATOMIC code. Maybe you have not understood the hints: It doesn't matter anymore what you are talking about unless you present code. >>> Each app makes it's own decision about what API to use. Supporting >>> atomic stuff doesn't change the behaviour of existing apps. >> Wrong, we are talking here in the first place about general atomic FS >> operations. And to guarantee atomicity you have to change general FS >> functions in such a way that in the end all other applications are affected, > Why's that? Read the paragraph as a whole. >> or otherwise you have to implement an own (larger part of an) FS. >> At this point there is no discussion anymore without code from you, because >> this subject is as well discussed to the maximum in information >> processing/informatics/computer science. > This subject? Exactly what subject? Read the beginning of the paragraph. >>> Maybe I should ask devs of some large apps on their take of this issue. >> Nonsense, because they are already using: >> a) the functions available by an FS, > Of course. Does that mean the situation can't be improved for them? 
Do you have any code that improves the situation to discuss here? >> b) the functions available by a DBMS, or >> c) a propritary special solution based on the available functions of the OS >> and additional functionality that they develope and maintain themselves >> for their comparable use cases since decades due to the cost vs. benefit >> ratio. > Olaf Christian Stroetmann ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-29 16:30 ` Christian Stroetmann @ 2010-12-29 17:12 ` Olaf van der Spek 0 siblings, 0 replies; 47+ messages in thread From: Olaf van der Spek @ 2010-12-29 17:12 UTC (permalink / raw) To: Christian Stroetmann Cc: linux-fsdevel, linux-ext4, Ted Ts'o, Ric Wheeler, Amir Goldstein On Wed, Dec 29, 2010 at 5:30 PM, Christian Stroetmann <stroetmann@ontolinux.com> wrote: >> I'm talking about the code for temp file, fsync, rename. Not about >> O_ATOMIC code. > > Maybe you have not understood the hints: It doesn't matter anymore about > what you are talking unless you present code. What code? >>>> Each app makes it's own decision about what API to use. Supporting >>>> atomic stuff doesn't change the behaviour of existing apps. >>> >>> Wrong, we are talking here in the first place about general atomic FS >>> operations. And to guarantee atomicity you have to change general FS >>> functions in such a way that in the end all other applications are >>> affected, >> >> Why's that? > > read the paragraph as a whole I have. Still wondering why. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-29 15:30 ` Christian Stroetmann 2010-12-29 15:35 ` Olaf van der Spek @ 2010-12-29 17:15 ` Greg Freemyer 2010-12-29 19:30 ` Christian Stroetmann 1 sibling, 1 reply; 47+ messages in thread From: Greg Freemyer @ 2010-12-29 17:15 UTC (permalink / raw) To: Christian Stroetmann Cc: Olaf van der Spek, linux-fsdevel, linux-ext4, Ric Wheeler, Amir Goldstein On Wed, Dec 29, 2010 at 10:30 AM, Christian Stroetmann <stroetmann@ontolinux.com> wrote: > On the 29.12.2010 13:42, Olaf van der Spek wrote: >> >> On Wed, Dec 29, 2010 at 10:20 AM, Amir Goldstein<amir73il@gmail.com> >> wrote: >>> >>> On Wed, Dec 29, 2010 at 12:58 AM, Olaf van der Spek >>> <olafvdspek@gmail.com> wrote: >>>> >>>> On Tue, Dec 28, 2010 at 11:36 PM, Ric Wheeler<rwheeler@redhat.com> >>>> wrote: >>>>> >>>>> I think that various developers have answered this for you several >>>>> times. >>>> >>>> Not really, unfortunately. Haven't seen a single link to code that >>>> shows how to do it properly. > > No, not this way. You were and still are asked for delivering the code. > Don't pervert the threat of the discussion. > >>>> Temp file, fsync, rename is often mentioned but that skips the >>>> preserving meta-data part and this part, which you also skipped: >>>> One use case would be updating a file in a safe way when you have >>>> write access to that file but not to anything else. >>>> >>> I think it is safe to say that the *only* option you have now is "temp >>> file, fsync, rename". >> >> I'm really looking for a concrete code snippet/function that does this. >> For example, file permissions should definitely be preserved. >> >>> There is no "generic atomic file data replace API in Linux", though it >>> is available via >>> private ioctl for XFS and EXT4. >>> >>> You have started a bit of a storm with your previous thread, which >>> doesn't help you >>> much in moving forward in the current thread (previous thread is still >>> more popular). 
>>> I suggest that you humbly swallow your need to know WHY it is hard to >>> implement >>> non-durable atomic API and focus your attention on the very achievable >>> data replace API. >>> >>> IMHO, implementing atomic swap_inodes_data operation shouldn't be >>> difficult >>> in most file systems (only implementation is simple, but testing and >>> maintaining >>> is not to be taken lightly). >>> Something along the lines of: >>> 1. acquire inodes write/truncate locks >>> 2. start transaction >>> 3. check/update quota limits >>> 4. swap inodes i_data content >>> 5. invalidate (or swap?) inodes page caches >>> 6. mark inodes dirty >>> 7. end transaction & release locks >>> >>> The real challenge would be to get everyone to agree on a common API >>> and carve it in stone to the kernel's ABI (is it just swap_inodes_data? >>> maybe also swap_inode_data_ranges? what about some options?) >> >> Swapping data is an improvement but still not ideal. The API is also >> more complex than O_ATOMIC. >> >>> Also, as wacky and (some say) faulty the UNIX permissions model is, >>> current systems have grown old with it, and even 'improving' the behavior >>> of some applications, may wake up sleeping monsters, so it will not >>> be done until enough people have pointed out security or usability >>> issues, which could not be solved otherwise. >> >> Each app makes its own decision about what API to use. Supporting >> atomic stuff doesn't change the behaviour of existing apps. > > Wrong, we are talking here in the first place about general atomic FS > operations. And to guarantee atomicity you have to change general FS > functions in such a way that in the end all other applications are affected, > or otherwise you have to implement an own (larger part of an) FS. > At this point there is no discussion anymore without code from you, because > this subject is as well discussed to the maximum in information > processing/informatics/computer science. 
> >>> In other words, until you find an *application* that wants to allow other >>> users to modify the content of a file and preserve its metadata and >>> ownership. >>> And unless that application cannot find a better way to achieve what it >>> wanted >>> to do in the first place, or unless that application already has a >>> large install base >>> which suffers from *a problem*, you will not have proven *the need*. >> >> Maybe I should ask devs of some large apps on their take of this issue. > > Nonsense, because they are already using: > a) the functions available by an FS, > b) the functions available by a DBMS, or > c) a proprietary special solution based on the available functions of the OS > and additional functionality that they develop and maintain themselves > for their comparable use cases since decades due to the cost vs. benefit > ratio. <sarcasm> Olaf, clearly if you want to find issues / use cases for your new API you should not talk to developers of complex tools. They have it all figured out. It's only you that doesn't know how to code up a userspace solution to the problem. </sarcasm> Surely productivity suites like openoffice have to address the issue. How satisfied they are I don't know. And despite Neil's argument that only one user should be able to write to a given doc, that is just not how normal office suites work today. Also, I believe KDE and its myriad of config files has issues with major config file corruption due to unexpected shutdowns during the config file update process, so they certainly don't have it figured out. Why don't they use the temp file, fsync, rename process? Those are the 2 user-space suites I would go investigate first. I'm sure there are many others. Also, I believe Windows offers an API like you're proposing. How does Samba support it? 
Greg -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2010-12-29 17:15 ` Greg Freemyer @ 2010-12-29 19:30 ` Christian Stroetmann 0 siblings, 0 replies; 47+ messages in thread From: Christian Stroetmann @ 2010-12-29 19:30 UTC (permalink / raw) To: Greg Freemyer Cc: linux-fsdevel, linux-ext4, Olaf van der Spek, Ric Wheeler, Amir Goldstein, Neil Brown On the 29.12.2010 18:15, Greg Freemyer wrote: > On Wed, Dec 29, 2010 at 10:30 AM, Christian Stroetmann > <stroetmann@ontolinux.com> wrote: >> On the 29.12.2010 13:42, Olaf van der Spek wrote: >>> On Wed, Dec 29, 2010 at 10:20 AM, Amir Goldstein<amir73il@gmail.com> >>> wrote: >>>> On Wed, Dec 29, 2010 at 12:58 AM, Olaf van der Spek >>>> <olafvdspek@gmail.com> wrote: >>>>> On Tue, Dec 28, 2010 at 11:36 PM, Ric Wheeler<rwheeler@redhat.com> >>>>> wrote: >>>>>> I think that various developers have answered this for you several >>>>>> times. >>>>> Not really, unfortunately. Haven't seen a single link to code that >>>>> shows how to do it properly. >> No, not this way. You were and still are asked for delivering the code. >> Don't pervert the threat of the discussion. >> >>>>> Temp file, fsync, rename is often mentioned but that skips the >>>>> preserving meta-data part and this part, which you also skipped: >>>>> One use case would be updating a file in a safe way when you have >>>>> write access to that file but not to anything else. >>>>> >>>> I think it is safe to say that the *only* option you have now is "temp >>>> file, fsync, rename". >>> I'm really looking for a concrete code snippet/function that does this. >>> For example, file permissions should definitely be preserved. >>> >>>> There is no "generic atomic file data replace API in Linux", though it >>>> is available via >>>> private ioctl for XFS and EXT4. >>>> >>>> You have started a bit of a storm with your previous thread, which >>>> doesn't help you >>>> much in moving forward in the current thread (previous thread is still >>>> more popular). 
>>>> I suggest that you humbly swallow you need to know WHY is it hard to >>>> implement >>>> non-durable atomic API and focus your attention on the very achievable >>>> data replace API. >>>> >>>> IMHO, implementing atomic swap_inodes_data operation shouldn't be >>>> difficult >>>> in most file systems (only implementation is simple, but testing and >>>> maintaining >>>> is not to be taken lightly). >>>> Something along the lines of: >>>> 1. aquire inodes write/truncate locks >>>> 2. start transaction >>>> 3. check/update quota limits >>>> 4. swap inodes i_data content >>>> 5. invalidate (or swap?) inodes page caches >>>> 6. mark inodes dirty >>>> 7. end transaction& release locks >>>> >>>> The real challenge would be to get everyone to agree on a common API >>>> and carve it in stone to the kernel's ABI (is it just swap_inodes_data? >>>> maybe also swap_inode_data_ranges? what about some options?) >>> Swapping data is an improvement but still not ideal. The API is also >>> more complex than O_ATOMIC. >>> >>>> Also, as wacky and (some say) faulty the UNIX permissions models is, >>>> current systems have grown old with it, and even 'improving' the behavior >>>> of some applications, may wake up sleeping monsters, so it will not >>>> be done until enough people have pointed out security or usability >>>> issues, which could not be solved otherwise. >>> Each app makes it's own decision about what API to use. Supporting >>> atomic stuff doesn't change the behaviour of existing apps. >> Wrong, we are talking here in the first place about general atomic FS >> operations. And to guarantee atomicity you have to change general FS >> functions in such a way that in the end all other applications are affected, >> or otherwise you have to implement an own (larger part of an) FS. >> At this point there is no discussion anymore without code from you, because >> this subject is as well discussed to the maximum in information >> processing/informatics/computer science. 
>> >>>> In other words, until you find an *application* that wants to allow other >>>> user to modify the content of a file and preserve it's metadata and >>>> ownership. >>>> And unless that application cannot find a better way to achieve what it >>>> wanted >>>> to do in the first place, or unless that application already has a >>>> large install base >>>> which suffers from *a problem*, you will not have proven *the need*. >>> Maybe I should ask devs of some large apps on their take of this issue. >> Nonsense, because they are already using: >> a) the functions available by an FS, >> b) the functions available by a DBMS, or >> c) a propritary special solution based on the available functions of the OS >> and additional functionality that they develope and maintain themselves >> for their comparable use cases since decades due to the cost vs. benefit >> ratio. > <sarcasm> > Olaf, clearly if you want to find issues / use cases for your new API > you should not talk to developers of complex tools. They have it all > figured out. > > It's only you that doesn't know how to code up a userspace solution to > the problem. > <\sarcasm> <no_sarcasm> This is not the place for sarcasm. </no_sarcasm> > Surely productivity suites like openoffice have to address the issue. > How satisfied they are I don't know. And despite Neil's argument that > only one user should be able to write to a given doc, that is just not > how normal office suites work today. I think that Neil doesn't meant it in this way or context. > Also, I believe KDE and its myriad of config files has issues with > major config file corruption due to unexpected shutdowns during the > config file update process, so they certainly don't have it figured > out. > > Why don't they use the temp file, fsync, rename process? <no_sarcasm> Because they figured it out?! </no_sarcasm> > Those are the 2 user-space suites I would go investigate first. I'm > sure there are many others. 
> > Also, I believe Windows offers an API like you're proposing. How does > Samba support it? > > Greg > <no sarcasm> Furthermore, in conjunction with the given 2 user-space suites it was said: "I don't know" and "I believe". </no sarcasm> ==> leaving the thread Please don't To or Cc me anymore. E-mails that are related to this thread will be sorted by name and then deleted without reading on behalf of the receiver. Christian Stroetmann ^ permalink raw reply [flat|nested] 47+ messages in thread
* Atomic file data replace API @ 2011-01-06 20:01 Olaf van der Spek 2011-01-07 13:55 ` Mike Fleetwood 2011-01-07 14:58 ` Chris Mason 0 siblings, 2 replies; 47+ messages in thread From: Olaf van der Spek @ 2011-01-06 20:01 UTC (permalink / raw) To: linux-btrfs Hi, Does btrfs support atomic file data replaces? Basically, the atomic variant of this: // old stage open(O_TRUNC) write() // 0+ times close() // new state -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-06 20:01 Olaf van der Spek @ 2011-01-07 13:55 ` Mike Fleetwood 2011-01-07 14:01 ` Olaf van der Spek 2011-01-07 14:58 ` Chris Mason 1 sibling, 1 reply; 47+ messages in thread From: Mike Fleetwood @ 2011-01-07 13:55 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-btrfs On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote: > Hi, > > Does btrfs support atomic file data replaces? Hi Olaf, Yes btrfs does support atomic replace, since kernel 2.6.30 circa June 2009. [1] Special handling was added to ext3, ext4, btrfs (and probably other Linux FSs) for your replace-via-truncate and the alternative replace-via-rename application patterns. Try reading "Delayed allocation and the zero-length file problem" article and comments by Ted Ts'o for further discussion. [2] Mike -- [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5a3f23d515a2ebf0c750db80579ca57b28cbce6d [2] http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/ ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 13:55 ` Mike Fleetwood @ 2011-01-07 14:01 ` Olaf van der Spek 2011-01-07 14:10 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 14:01 UTC (permalink / raw) To: Mike Fleetwood; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 2:55 PM, Mike Fleetwood <mike.fleetwood@googlemail.com> wrote: > On 6 January 2011 20:01, Olaf van der Spek <olafvdspek@gmail.com> wrote: >> Hi, >> >> Does btrfs support atomic file data replaces? > > Hi Olaf, > > Yes btrfs does support atomic replace, since kernel 2.6.30 circa June 2009. [1] > > Special handling was added to ext3, ext4, btrfs (and probably other > Linux FSs) for your replace-via-truncate and the alternative > replace-via-rename application patterns. Try reading "Delayed > allocation and the zero-length file problem" article and comments by > Ted Ts'o for further discussion. [2] According to Ted, via-truncate and via-rename are unsafe. Only fsync, rename is safe. Disadvantage of rename is resetting file owner (if non-root), having issues with meta-data and other stuff. My proposal was for an open flag, O_ATOMIC, to be introduced to tell the FS the whole file update should be done atomically. Ted says this is too hard in ext4, so I was wondering if this would be possible in btrfs. Olaf -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 14:01 ` Olaf van der Spek @ 2011-01-07 14:10 ` Olaf van der Spek 0 siblings, 0 replies; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 14:10 UTC (permalink / raw) To: Mike Fleetwood; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 3:01 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote: > According to Ted, via-truncate and via-rename are unsafe. Only fsync, > rename is safe. > Disadvantage of rename is resetting file owner (if non-root), having > issues with meta-data and other stuff. > > My proposal was for an open flag, O_ATOMIC, to be introduced to tell > the FS the whole file update should be done atomically. > Ted says this is too hard in ext4, so I was wondering if this would be > possible in btrfs. http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2082 http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2089 http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2090 ^ permalink raw reply [flat|nested] 47+ messages in thread
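The "temp file, fsync, rename" sequence that keeps coming up in this thread can be sketched as follows. This is only a minimal illustration, not code from any poster: the function name replace_file_data and the ".tmp-" prefix are made up for the example, and Python's os module is used as a thin wrapper around the same system calls. It also shows the limitations raised above: the caller needs write access to the directory, the owner can only be restored when the caller is privileged enough for fchown, and extended attributes, ACLs and hard links are not handled at all.

```python
import contextlib
import os
import stat
import tempfile


def replace_file_data(path, data):
    """Replace the contents of `path` using temp file, fsync, rename.

    Hypothetical helper for illustration; `data` must be bytes.
    """
    directory = os.path.dirname(os.path.abspath(path))
    old = os.stat(path)
    # The temp file must live in the same directory (same filesystem),
    # otherwise rename() cannot replace the target atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        try:
            try:
                os.fchown(fd, old.st_uid, old.st_gid)
            except PermissionError:
                # Unprivileged caller: the new file keeps the caller's
                # uid/gid, which is the owner-reset problem from the thread.
                pass
            os.fchmod(fd, stat.S_IMODE(old.st_mode))
            view = memoryview(data)
            while view:
                view = view[os.write(fd, view):]
            os.fsync(fd)  # data must be on disk before the rename
        finally:
            os.close(fd)
        os.rename(tmp_path, path)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would use it as replace_file_data("/path/to/config", b"new contents"). The case of having write access to the file but not to its directory is not covered, since mkstemp and rename both need the directory to be writable.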
* Re: Atomic file data replace API 2011-01-06 20:01 Olaf van der Spek 2011-01-07 13:55 ` Mike Fleetwood @ 2011-01-07 14:58 ` Chris Mason 2011-01-07 15:01 ` Olaf van der Spek 2011-01-08 1:11 ` Phillip Susi 1 sibling, 2 replies; 47+ messages in thread From: Chris Mason @ 2011-01-07 14:58 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-btrfs Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500: > Hi, > > Does btrfs support atomic file data replaces? Basically, the atomic > variant of this: > // old stage > open(O_TRUNC) > write() // 0+ times > close() > // new state Yes and no. We have a best effort mechanism where we try to guess that since you've done this truncate and the write that you want the writes to show up quickly. But its a guess. The problem is the write() // 0+ times. The kernel has no idea what new result you want the file to contain because the application isn't telling us. What btrfs can do (but we haven't yet implemented) is make sure that the results of a single write file are on disk atomically, even if they are replacing existing bytes in the file. Because we cow and because we don't update metadata pointers until the IO is complete, we can wait until all the IO for a given write call is on disk before we update any of the metadata. This isn't hard, it's on my TODO list. -chris ^ permalink raw reply [flat|nested] 47+ messages in thread
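To make the open(O_TRUNC), write(), close() pattern concrete, here is a small sketch of the states a running system passes through. The function name observe_truncate_window is made up for this example. It only reports what stat() shows in the live system; it says nothing about what ends up on disk after a crash, which is the part that depends on the filesystem.

```python
import os


def observe_truncate_window(path, new_data):
    """Rewrite `path` in place and return the file sizes seen along the way.

    Hypothetical helper; assumes `new_data` is small enough that a single
    write() call stores all of it.
    """
    sizes = [os.stat(path).st_size]            # old state
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        sizes.append(os.fstat(fd).st_size)     # after open(O_TRUNC): empty
        os.write(fd, new_data)
        sizes.append(os.fstat(fd).st_size)     # new state
    finally:
        os.close(fd)
    return sizes
```

The middle entry is always 0: between the open and the first write the file is visibly empty, and with more than one write() call there are further intermediate states. Nothing in these calls tells the kernel which of those states the application considers complete.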
* Re: Atomic file data replace API 2011-01-07 14:58 ` Chris Mason @ 2011-01-07 15:01 ` Olaf van der Spek 2011-01-07 15:05 ` Chris Mason 2011-01-08 1:11 ` Phillip Susi 1 sibling, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 15:01 UTC (permalink / raw) To: Chris Mason; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote: > Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500: >> Hi, >> >> Does btrfs support atomic file data replaces? Basically, the atomic >> variant of this: >> // old stage >> open(O_TRUNC) >> write() // 0+ times >> close() >> // new state > > Yes and no. We have a best effort mechanism where we try to guess that > since you've done this truncate and the write that you want the writes > to show up quickly. But its a guess. > > The problem is the write() // 0+ times. The kernel has no idea what > new result you want the file to contain because the application isn't > telling us. Isn't it safe for the kernel to wait until the first write or close before writing anything to disk? > What btrfs can do (but we haven't yet implemented) is make sure that the > results of a single write file are on disk atomically, even if they are > replacing existing bytes in the file. > > Because we cow and because we don't update metadata pointers until the > IO is complete, we can wait until all the IO for a given write call is > on disk before we update any of the metadata. > > This isn't hard, it's on my TODO list. What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel? Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 15:01 ` Olaf van der Spek @ 2011-01-07 15:05 ` Chris Mason 2011-01-07 15:08 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Chris Mason @ 2011-01-07 15:05 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-btrfs Excerpts from Olaf van der Spek's message of 2011-01-07 10:01:59 -0500: > On Fri, Jan 7, 2011 at 3:58 PM, Chris Mason <chris.mason@oracle.com> wrote: > > Excerpts from Olaf van der Spek's message of 2011-01-06 15:01:15 -0500: > >> Hi, > >> > >> Does btrfs support atomic file data replaces? Basically, the atomic > >> variant of this: > >> // old stage > >> open(O_TRUNC) > >> write() // 0+ times > >> close() > >> // new state > > > > Yes and no. We have a best effort mechanism where we try to guess that > > since you've done this truncate and the write that you want the writes > > to show up quickly. But its a guess. > > > > The problem is the write() // 0+ times. The kernel has no idea what > > new result you want the file to contain because the application isn't > > telling us. > > Isn't it safe for the kernel to wait until the first write or close > before writing anything to disk? I'm afraid not. Picture an application that opens a thousand files and writes 1MB to each of them, and then didn't close any. If we waited until close, you'd have 1GB of memory pinned or staged somehow. > > > What btrfs can do (but we haven't yet implemented) is make sure that the > > results of a single write file are on disk atomically, even if they are > > replacing existing bytes in the file. > > > > Because we cow and because we don't update metadata pointers until the > > IO is complete, we can wait until all the IO for a given write call is > > on disk before we update any of the metadata. > > > > This isn't hard, it's on my TODO list. > > What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel? 
We can't guess beyond a single write call. Otherwise we get into the problem above where an application can force the kernel to wait forever. I'm not against O_ATOMIC to enable the new btrfs functionality, but it will still be limited to one write. -chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 15:05 ` Chris Mason @ 2011-01-07 15:08 ` Olaf van der Spek 2011-01-07 15:13 ` Chris Mason 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 15:08 UTC (permalink / raw) To: Chris Mason; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote: >> > The problem is the write() // 0+ times. The kernel has no idea what >> > new result you want the file to contain because the application isn't >> > telling us. >> >> Isn't it safe for the kernel to wait until the first write or close >> before writing anything to disk? > > I'm afraid not. Picture an application that opens a thousand files and > writes 1MB to each of them, and then didn't close any. If we waited > until close, you'd have 1GB of memory pinned or staged somehow. That's not what I asked. ;) I asked to wait until the first write (or close). That way, you don't get unintentional empty files. One step further, you don't have to keep the data in memory, you're free to write them to disk. You just wouldn't update the meta-data (yet). >> > This isn't hard, it's on my TODO list. >> >> What about a new flag: O_ATOMIC that'd take the guesswork out of the kernel? > > We can't guess beyond a single write call. Otherwise we get into > the problem above where an application can force the kernel to wait > forever. I'm not against O_ATOMIC to enable the new btrfs > functionality, but it will still be limited to one write. > > -chris > -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 15:08 ` Olaf van der Spek @ 2011-01-07 15:13 ` Chris Mason 2011-01-07 15:17 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Chris Mason @ 2011-01-07 15:13 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-btrfs Excerpts from Olaf van der Spek's message of 2011-01-07 10:08:24 -0500: > On Fri, Jan 7, 2011 at 4:05 PM, Chris Mason <chris.mason@oracle.com> wrote: > >> > The problem is the write() // 0+ times. The kernel has no idea what > >> > new result you want the file to contain because the application isn't > >> > telling us. > >> > >> Isn't it safe for the kernel to wait until the first write or close > >> before writing anything to disk? > > > > I'm afraid not. Picture an application that opens a thousand files and > > writes 1MB to each of them, and then didn't close any. If we waited > > until close, you'd have 1GB of memory pinned or staged somehow. > > That's not what I asked. ;) > I asked to wait until the first write (or close). That way, you don't > get unintentional empty files. > One step further, you don't have to keep the data in memory, you're > free to write them to disk. You just wouldn't update the meta-data > (yet). Sorry ;) Picture an application that truncates 1024 files without closing any of them. Basically any operation that includes the kernel waiting for applications because they promise to do something soon is a denial of service attack, or a really easy way to run out of memory on the box. -chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 15:13 ` Chris Mason @ 2011-01-07 15:17 ` Olaf van der Spek 2011-01-07 16:12 ` Chris Mason 2011-01-07 16:32 ` Massimo Maggi 0 siblings, 2 replies; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 15:17 UTC (permalink / raw) To: Chris Mason; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote: >> That's not what I asked. ;) >> I asked to wait until the first write (or close). That way, you don't >> get unintentional empty files. >> One step further, you don't have to keep the data in memory, you're >> free to write them to disk. You just wouldn't update the meta-data >> (yet). > > Sorry ;) Picture an application that truncates 1024 files without closing any > of them. Basically any operation that includes the kernel waiting for > applications because they promise to do something soon is a denial of > service attack, or a really easy way to run out of memory on the box. I'm not sure why you would run out of memory in that case. O_ATOMIC would be the solution for the rename workaround: write temp file, rename With advantages like a way simpler API, no issues with resetting meta-data, no issues with temp file and maybe better performance. Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 15:17 ` Olaf van der Spek @ 2011-01-07 16:12 ` Chris Mason 2011-01-07 16:19 ` Olaf van der Spek 2011-01-07 16:26 ` Hubert Kario 2011-01-07 16:32 ` Massimo Maggi 1 sibling, 2 replies; 47+ messages in thread From: Chris Mason @ 2011-01-07 16:12 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-btrfs Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500: > On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote: > >> That's not what I asked. ;) > >> I asked to wait until the first write (or close). That way, you don't > >> get unintentional empty files. > >> One step further, you don't have to keep the data in memory, you're > >> free to write them to disk. You just wouldn't update the meta-data > >> (yet). > > > > Sorry ;) Picture an application that truncates 1024 files without closing any > > of them. Basically any operation that includes the kernel waiting for > > applications because they promise to do something soon is a denial of > > service attack, or a really easy way to run out of memory on the box. > > I'm not sure why you would run out of memory in that case. Well, lets make sure I've got a good handle on the proposed interface: 1) fd = open(some_file, O_ATOMIC) 2) truncate(fd, 0) 3) write(fd, new data) The semantics are that we promise not to let the truncate hit the disk until the application does the write. We have a few choices on how we do this: 1) Leave the disk untouched, but keep something in memory that says this inode is really truncated 2) Record on disk that we've done our atomic truncate but it is still pending. We'd need some way to remove or invalidate this record after a crash. 3) Go ahead and do the operation but don't allow the transaction to commit until the write is done. option #1: keep something in memory. Well, any time we have a requirement to pin something in memory until userland decides to do a write, we risk oom. 
option #2: disk format change. Actually somewhat complex because if we haven't crashed, we need to be able to read the inode in again without invalidating the record but if we do crash, we have to invalidate the record. Not impossible, but not trivial. option #3: Pin the whole transaction. Depending on the FS this may be impossible. Certain operations require us to commit the transaction to reclaim space, and we cannot allow userland to put that on hold without deadlocking. What most people don't realize about the crash safe filesystems is they don't have fine grained transactions. There is one single transaction for all the operations done. This is mostly because it is less complex and much faster, but it also makes any 'pin the whole transaction' type system unusable. -chris ^ permalink raw reply [flat|nested] 47+ messages in thread
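Option #1 above (leave the disk untouched and remember the pending truncate in memory) can be pictured with a toy user-space model. This is purely illustrative and not how any kernel or filesystem implements it: the class DeferredTruncateFile and its attributes are invented for this sketch, "committed" stands in for what a crash would leave behind, and writes simply append.

```python
class DeferredTruncateFile:
    """Toy model of option #1: a truncate is only remembered in memory and
    is applied together with the first write, or at close."""

    def __init__(self, committed):
        self.committed = committed       # stand-in for the on-disk contents
        self.pending_truncate = False    # state that has to stay in memory

    def truncate(self):
        # Nothing "on disk" changes yet; only the in-memory flag is set.
        self.pending_truncate = True

    def write(self, data):
        # The truncate and the first write become visible together.
        base = b"" if self.pending_truncate else self.committed
        self.committed = base + data
        self.pending_truncate = False

    def close(self):
        # A truncate that was never followed by a write is applied now.
        if self.pending_truncate:
            self.committed = b""
            self.pending_truncate = False


f = DeferredTruncateFile(b"old")
f.truncate()
state_after_truncate = f.committed   # still the old data
f.write(b"new")
state_after_write = f.committed      # old data replaced in one step

# The concern from the thread: many files truncated and left open.
files = [DeferredTruncateFile(b"x") for _ in range(1024)]
for g in files:
    g.truncate()
pending = sum(g.pending_truncate for g in files)
```

In this model a crash between truncate() and write() would leave the old contents intact, which is the behaviour being asked for. The counter at the end shows the other side of the argument: every truncated file that stays open keeps pending state alive until the application decides to write or close. How much state that really is per file is exactly what the two sides disagree about.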
* Re: Atomic file data replace API 2011-01-07 16:12 ` Chris Mason @ 2011-01-07 16:19 ` Olaf van der Spek 2011-01-07 16:26 ` Hubert Kario 1 sibling, 0 replies; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 16:19 UTC (permalink / raw) To: Chris Mason; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason <chris.mason@oracle.com> wrote: >> I'm not sure why you would run out of memory in that case. > > Well, lets make sure I've got a good handle on the proposed interface: > > 1) fd = open(some_file, O_ATOMIC) No, O_TRUNC should be used in open. Maybe it works with a separate truncate too. > 2) truncate(fd, 0) > 3) write(fd, new data) > > The semantics are that we promise not to let the truncate hit the disk > until the application does the write. > > We have a few choices on how we do this: > > 1) Leave the disk untouched, but keep something in memory that says this > inode is really truncated > > 2) Record on disk that we've done our atomic truncate but it is still > pending. We'd need some way to remove or invalidate this record after a > crash. > > 3) Go ahead and do the operation but don't allow the transaction to > commit until the write is done. > > option #1: keep something in memory. Well, any time we have a > requirement to pin something in memory until userland decides to do a > write, we risk oom. Since the file is open, you have to keep something in memory anyway, right? Adding a bit (or bool) does not make a difference IMO. Isn't this comparable to opening a temp file? > option #2: disk format change. Actually somewhat complex because if we > haven't crashed, we need to be able to read the inode in again without > invalidating the record but if we do crash, we have to invalidate the > record. Not impossible, but not trivial. > > option #3: Pin the whole transaction. Depending on the FS this may be > impossible. 
=C2=A0Certain operations require us to commit the transac= tion to > reclaim space, and we cannot allow userland to put that on hold witho= ut > deadlocking. #1 is the only one that makes sense. > What most people don't realize about the crash safe filesystems is th= ey > don't have fine grained transactions. =C2=A0There is one single trans= action > for all the operations done. =C2=A0This is mostly because it is less = complex > and much faster, but it also makes any 'pin the whole transaction' ty= pe > system unusable. AFAIK the cost is mostly more complex code / runtime. The cost is not disk performance. --=20 Olaf -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" = in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 16:12 ` Chris Mason 2011-01-07 16:19 ` Olaf van der Spek @ 2011-01-07 16:26 ` Hubert Kario 2011-01-07 19:29 ` Chris Mason 1 sibling, 1 reply; 47+ messages in thread From: Hubert Kario @ 2011-01-07 16:26 UTC (permalink / raw) To: Chris Mason; +Cc: Olaf van der Spek, linux-btrfs On Friday, January 07, 2011 17:12:11 Chris Mason wrote: > Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500: > > On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote: > > >> That's not what I asked. ;) > > >> I asked to wait until the first write (or close). That way, you don't > > >> get unintentional empty files. > > >> One step further, you don't have to keep the data in memory, you're > > >> free to write them to disk. You just wouldn't update the meta-data > > >> (yet). > > > > > > Sorry ;) Picture an application that truncates 1024 files without > > > closing any of them. Basically any operation that includes the kernel > > > waiting for applications because they promise to do something soon is > > > a denial of service attack, or a really easy way to run out of memory > > > on the box. > > > > I'm not sure why you would run out of memory in that case. > > Well, lets make sure I've got a good handle on the proposed interface: > > 1) fd = open(some_file, O_ATOMIC) > 2) truncate(fd, 0) > 3) write(fd, new data) > > The semantics are that we promise not to let the truncate hit the disk > until the application does the write. > > We have a few choices on how we do this: > > 1) Leave the disk untouched, but keep something in memory that says this > inode is really truncated > > 2) Record on disk that we've done our atomic truncate but it is still > pending. We'd need some way to remove or invalidate this record after a > crash. > > 3) Go ahead and do the operation but don't allow the transaction to > commit until the write is done. > > option #1: keep something in memory. Well, any time we have a > requirement to pin something in memory until userland decides to do a > write, we risk oom. Userland has already a file descriptor allocated (which can fail anyway because of OOM), I see no problem in increasing the size of kernel memory usage by 4 bytes (if not less) just to note that the application wants to see the file as truncated (1 bit) and the next write has to be atomic (2nd bit?). -- Hubert Kario QBS - Quality Business Software 02-656 Warszawa, ul. Ksawerów 30/85 tel. +48 (22) 646-61-51, 646-74-24 www.qbs.com.pl ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 16:26 ` Hubert Kario @ 2011-01-07 19:29 ` Chris Mason 2011-01-08 14:40 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Chris Mason @ 2011-01-07 19:29 UTC (permalink / raw) To: Hubert Kario; +Cc: Olaf van der Spek, linux-btrfs Excerpts from Hubert Kario's message of 2011-01-07 11:26:02 -0500: > On Friday, January 07, 2011 17:12:11 Chris Mason wrote: > > Excerpts from Olaf van der Spek's message of 2011-01-07 10:17:31 -0500: > > > On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> > wrote: > > > >> That's not what I asked. ;) > > > >> I asked to wait until the first write (or close). That way, you don't > > > >> get unintentional empty files. > > > >> One step further, you don't have to keep the data in memory, you're > > > >> free to write them to disk. You just wouldn't update the meta-data > > > >> (yet). > > > > > > > > Sorry ;) Picture an application that truncates 1024 files without > > > > closing any of them. Basically any operation that includes the kernel > > > > waiting for applications because they promise to do something soon is > > > > a denial of service attack, or a really easy way to run out of memory > > > > on the box. > > > > > > I'm not sure why you would run out of memory in that case. > > > > Well, lets make sure I've got a good handle on the proposed interface: > > > > 1) fd = open(some_file, O_ATOMIC) > > 2) truncate(fd, 0) > > 3) write(fd, new data) > > > > The semantics are that we promise not to let the truncate hit the disk > > until the application does the write. > > > > We have a few choices on how we do this: > > > > 1) Leave the disk untouched, but keep something in memory that says this > > inode is really truncated > > > > 2) Record on disk that we've done our atomic truncate but it is still > > pending. We'd need some way to remove or invalidate this record after a > > crash. 
> > > > 3) Go ahead and do the operation but don't allow the transaction to > > commit until the write is done. > > > > option #1: keep something in memory. Well, any time we have a > > requirement to pin something in memory until userland decides to do a > > write, we risk oom. > > Userland has already a file descriptor allocated (which can fail anyway > because of OOM), I see no problem in increasing the size of kernel memory > usage by 4 bytes (if not less) just to note that the application wants to see > the file as truncated (1 bit) and the next write has to be atomic (2nd bit?). > The exact amount of tracking is going to vary. The reason why is that actually doing the truncate is an O(size of the file) operation and so you can't just flip a switch when the write or the close comes in. You have to run through all the metadata of the file and do something temporary with each part that is only completed when the file IO is actually done. Honestly, there are many different ways to solve this in the application. Requiring high speed atomic replacement of individual file contents is a recipe for frustration. -chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 19:29 ` Chris Mason @ 2011-01-08 14:40 ` Olaf van der Spek 2011-01-26 18:30 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-08 14:40 UTC (permalink / raw) To: Chris Mason; +Cc: Hubert Kario, linux-btrfs On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote: > The exact amount of tracking is going to vary. The reason why is that > actually doing the truncate is an O(size of the file) operation and so > you can't just flip a switch when the write or the close comes in. You > have to run through all the metadata of the file and do something > temporary with each part that is only completed when the file IO is > actually done. That's true. Maybe the proper way, via O_ATOMIC, is better. > Honestly, there are many different ways to solve this in the application. > Requiring high speed atomic replacement of individual file contents is a > recipe for frustration. Did you see Massimo's message? That'd be the ideal way from an app point of view. Not solving this properly in the FS moves the problem to userspace where it's even harder to solve and is not as performant. Replacing file data is a common operation that IMO the FS should support in a safe way. -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-08 14:40 ` Olaf van der Spek @ 2011-01-26 18:30 ` Olaf van der Spek 2011-01-26 19:30 ` Chris Mason 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-26 18:30 UTC (permalink / raw) To: Chris Mason; +Cc: Hubert Kario, linux-btrfs On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote: > On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote: >> The exact amount of tracking is going to vary. The reason why is that >> actually doing the truncate is an O(size of the file) operation and so >> you can't just flip a switch when the write or the close comes in. You >> have to run through all the metadata of the file and do something >> temporary with each part that is only completed when the file IO is >> actually done. > > That's true. Maybe the proper way, via O_ATOMIC, is better. > >> Honestly, there are many different ways to solve this in the application. >> Requiring high speed atomic replacement of individual file contents is a >> recipe for frustration. > > Did you see Massimo's message? That'd be the ideal way from an app > point of view. > Not solving this properly in the FS moves the problem to userspace > where it's even harder to solve and is not as performant. > > Replacing file data is a common operation that IMO the FS should > support in a safe way. Chris? -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-26 18:30 ` Olaf van der Spek @ 2011-01-26 19:30 ` Chris Mason 2011-01-26 21:56 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Chris Mason @ 2011-01-26 19:30 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Hubert Kario, linux-btrfs Excerpts from Olaf van der Spek's message of 2011-01-26 13:30:08 -0500: > On Sat, Jan 8, 2011 at 3:40 PM, Olaf van der Spek <olafvdspek@gmail.com> wrote: > > On Fri, Jan 7, 2011 at 8:29 PM, Chris Mason <chris.mason@oracle.com> wrote: > >> The exact amount of tracking is going to vary. The reason why is that > >> actually doing the truncate is an O(size of the file) operation and so > >> you can't just flip a switch when the write or the close comes in. You > >> have to run through all the metadata of the file and do something > >> temporary with each part that is only completed when the file IO is > >> actually done. > > > > That's true. Maybe the proper way, via O_ATOMIC, is better. > > > >> Honestly, there are many different ways to solve this in the application. > >> Requiring high speed atomic replacement of individual file contents is a > >> recipe for frustration. > > > > Did you see Massimo's message? That'd be the ideal way from an app > > point of view. > > Not solving this properly in the FS moves the problem to userspace > > where it's even harder to solve and is not as performant. > > > > Replacing file data is a common operation that IMO the FS should > > support in a safe way. > > Chris? > My answer hasn't really changed ;) Replacing file data is a common operation, but it is still surprisingly complex. Again, the truncate is O(size of the file) and it is actually impossible to do this atomically in most filesystems. You don't notice this because xfs/ext34/btrfs (and many others) have code that makes sure a truncate is restarted if you crash. So, it appears to be atomic even though we're really just restarting the operation. In order to have a truncate + replacement of data operation, we'd have to do a disk format change that includes both the truncate and the new data. It would look a lot like echo data > file.new ; truncate file ; mv file.new file, but recorded in the FS metadata. I don't have this in the btrfs roadmap. It would be nice but most people use databases for things that require atomic operations. I think what ext4 and btrfs do today fall into the category of best effort and least surprise, and I think it is as good as we can get without huge performance penalties for normal use. Now, if you want to talk about atomic replacement of file data without changing the file size, that's much easier. At least it's easier for those of us with cows in our pockets. -chris ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-26 19:30 ` Chris Mason @ 2011-01-26 21:56 ` Olaf van der Spek 0 siblings, 0 replies; 47+ messages in thread From: Olaf van der Spek @ 2011-01-26 21:56 UTC (permalink / raw) To: Chris Mason; +Cc: Hubert Kario, linux-btrfs On Wed, Jan 26, 2011 at 8:30 PM, Chris Mason <chris.mason@oracle.com> wrote: > My answer hasn't really changed ;) Replacing file data is a common > operation, but it is still surprisingly complex. Again, the truncate is > O(size of the file) and it is actually impossible to do this atomically > in most filesystems. Unfortunately life isn't trivial. ;) Given that it's common, it doesn't make sense to have code duplication in lots of apps to implement the temp file rename pattern. If it's too complex to implement in the FS (ATM), would it be possible to implement it in a higher layer? > You don't notice this because xfs/ext34/btrfs (and many others) have > code that makes sure a truncate is restarted if you crash. So, it > appears to be atomic even though we're really just restarting the > operation. In order to have a truncate + replacement of data operation, > we'd have to do a disk format change that includes both the truncate and > the new data. I'm not sure why the disk format would have to change. Conceptually, just like the temp file case, you'd write the new data to newly allocated blocks. After (and I guess that's the complex part) they're safely on disk, you update the meta data, in an atomic way. > It would look a lot like echo data > file.new ; truncate file ; mv > file.new file, but recorded in the FS metadata. > > I don't have this in the btrfs roadmap. It would be nice but most > people use databases for things that require atomic operations. I Executables and files shouldn't be in a DB. Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
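For reference, the temp file rename pattern mentioned above is usually written along the following lines. This is a minimal sketch and not code from the thread: `replace_file()` and `write_all()` are hypothetical helper names, and placing the temp file next to the target as `<path>.XXXXXX` is an assumption. It also shows where the meta-data question comes from: the new file starts out owned by the writer with mode 0600, so owner and mode have to be copied by hand, and fchown() can fail for unprivileged callers.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Replace the contents of `path` with `data` using the temp file,
 * fsync(), rename() sequence. replace_file() is a hypothetical helper,
 * not an existing API. Returns 0 on success, -1 on error.
 */
int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    struct stat st;
    int have_old = (stat(path, &st) == 0);

    if (snprintf(tmp, sizeof tmp, "%s.XXXXXX", path) >= (int)sizeof tmp)
        return -1;

    int fd = mkstemp(tmp);              /* created with mode 0600 */
    if (fd < 0)
        return -1;

    if (write_all(fd, data, len) < 0)
        goto fail;

    if (have_old) {
        /* Best effort: changing the owner needs privileges. */
        if (fchown(fd, st.st_uid, st.st_gid) < 0 && errno != EPERM)
            goto fail;
        if (fchmod(fd, st.st_mode & 07777) < 0)
            goto fail;
    }

    if (fsync(fd) < 0)                  /* new data on disk first ... */
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;
    if (rename(tmp, path) < 0)          /* ... then switch the name */
        goto fail;
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}
```

A caller would write `replace_file("foo.txt", "NEW DATA\n", 9)`. The sketch does not copy ACLs or extended attributes, and making the rename itself durable would additionally need an fsync() on the containing directory.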
* Re: Atomic file data replace API 2011-01-07 15:17 ` Olaf van der Spek 2011-01-07 16:12 ` Chris Mason @ 2011-01-07 16:32 ` Massimo Maggi 2011-01-07 16:34 ` Olaf van der Spek 1 sibling, 1 reply; 47+ messages in thread From: Massimo Maggi @ 2011-01-07 16:32 UTC (permalink / raw) To: Olaf van der Spek; +Cc: linux-btrfs Are you suggesting to do: 1) fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file 2) application writes to that fd, with one or more system calls, in a short time or in long time, at his will. 3) at fclose (or even at fsync) atomically swap "data pointer" of "real file" with "temp file", then delete temp. In a transparent mode to userland. (something similar to e4defrag). Is this sum up correct? Massimo Maggi On 07/01/2011 16:17, Olaf van der Spek wrote: > On Fri, Jan 7, 2011 at 4:13 PM, Chris Mason <chris.mason@oracle.com> wrote: >>> That's not what I asked. ;) >>> I asked to wait until the first write (or close). That way, you don't >>> get unintentional empty files. >>> One step further, you don't have to keep the data in memory, you're >>> free to write them to disk. You just wouldn't update the meta-data >>> (yet). >> Sorry ;) Picture an application that truncates 1024 files without closing any >> of them. Basically any operation that includes the kernel waiting for >> applications because they promise to do something soon is a denial of >> service attack, or a really easy way to run out of memory on the box. > I'm not sure why you would run out of memory in that case. > > O_ATOMIC would be the solution for the rename workaround: write temp > file, rename > With advantages like a way simpler API, no issues with resetting > meta-data, no issues with temp file and maybe better performance. > > Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 16:32 ` Massimo Maggi @ 2011-01-07 16:34 ` Olaf van der Spek 2011-01-07 19:29 ` Thomas Bellman 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-07 16:34 UTC (permalink / raw) To: Massimo Maggi; +Cc: linux-btrfs On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote: > Are you suggesting to do: > 1)fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file > 2)application writes to that fd, with one or more system calls, in a > short time or in long time, at his will. > 3)at fclose (or even at fsync ) atomically swap "data pointer" of "real > file" with "temp file", then delete temp.In a transparent mode to > userland. (something similar to e4defrag). > Is this sum up correct? Almost. Swap should probably not be done at fsync time. Other open references (for example running executables) should be swapped too. The new-file case has to be handled too. Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 16:34 ` Olaf van der Spek @ 2011-01-07 19:29 ` Thomas Bellman 2011-01-08 14:36 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Thomas Bellman @ 2011-01-07 19:29 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Massimo Maggi, linux-btrfs Olaf van der Spek wrote: > On Fri, Jan 7, 2011 at 5:32 PM, Massimo Maggi <massimo@mmmm.it> wrote: >> Are you suggesting to do: >> 1)fopen with O_TRUNC, O_ATOMIC: returns fd to a temporary file >> 2)application writes to that fd, with one or more system calls, in a >> short time or in long time, at his will. >> 3)at fclose (or even at fsync ) atomically swap "data pointer" of "real >> file" with "temp file", then delete temp.In a transparent mode to >> userland. (something similar to e4defrag). >> Is this sum up correct? > > Almost. Swap should probably not be done at fsync time. > Other open references (for example running executables) should be swapped too. What is the visibility of the changes for other processes supposed to be in the meantime? I.e., if things happen in this order: 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC) 2. Process B does fdb = open("foo.txt", O_RDONLY) 3. B does read(fdb, buf, 4096) 4. A does write(fda, "NEW DATA\n", 9) 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY) 6. C does read(fdc, buf, 4096) 7. A calls close(fda) Does B see an empty file, or does it see the old contents of the file? Does C see "NEW DATA\n", or does it see the old contents of the file, or perhaps an empty file? /Bellman ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 19:29 ` Thomas Bellman @ 2011-01-08 14:36 ` Olaf van der Spek 2011-01-08 21:43 ` Thomas Bellman 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-08 14:36 UTC (permalink / raw) To: Thomas Bellman; +Cc: Massimo Maggi, linux-btrfs On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote: > What is the visibility of the changes for other processes supposed > to be in the meantime? I.e., if things happen in this order: Should be atomic too, at close time. > 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC) > 2. Process B does fdb = open("foo.txt", O_RDONLY) > 3. B does read(fdb, buf, 4096) > 4. A does write(fda, "NEW DATA\n", 9) > 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY) > 6. C does read(fdc, buf, 4096) > 7. A calls close(fda) > > Does B see an empty file, or does it see the old contents of > the file? Old file, otherwise A wouldn't be atomic. > Does C see "NEW DATA\n", or does it see the old > contents of the file, or perhaps an empty file? Old file again, as the 'transaction' isn't finished until close. -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
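The visibility question can be tried against the existing rename() workaround. The following is a minimal sketch, not code from the thread: `visibility_demo()`, `put_file()` and `slurp()` are hypothetical names, and the `.new` suffix for the temp file is an assumption. One process plays all three roles: B opens the file before the replacement, A writes the new data aside and renames it into place, and C opens the file afterwards.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Create or overwrite `path` with the NUL-terminated string `text`. */
static int put_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    if (close(fd) < 0 || n != (ssize_t)len)
        return -1;
    return 0;
}

/* Read up to cap-1 bytes from fd into buf and NUL-terminate it. */
static int slurp(int fd, char *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/*
 * Hypothetical demo: B opens before the replace, C opens after it.
 * What each of them reads is stored in seen_b and seen_c.
 */
int visibility_demo(const char *path, char *seen_b, char *seen_c, size_t cap)
{
    char tmp[4096];
    int rc = -1, fdb = -1, fdc = -1;

    if (snprintf(tmp, sizeof tmp, "%s.new", path) >= (int)sizeof tmp)
        return -1;
    if (put_file(path, "OLD DATA\n") < 0)
        return -1;

    fdb = open(path, O_RDONLY);             /* B opens the old file */
    if (fdb < 0)
        goto out;
    if (put_file(tmp, "NEW DATA\n") < 0)    /* A writes the new data aside */
        goto out;
    if (rename(tmp, path) < 0)              /* A switches the name over */
        goto out;
    fdc = open(path, O_RDONLY);             /* C opens after the switch */
    if (fdc < 0)
        goto out;

    if (slurp(fdb, seen_b, cap) < 0 || slurp(fdc, seen_c, cap) < 0)
        goto out;
    rc = 0;
out:
    if (fdb >= 0)
        close(fdb);
    if (fdc >= 0)
        close(fdc);
    unlink(tmp);    /* fails harmlessly once the rename has succeeded */
    return rc;
}
```

With rename(), B keeps reading the old contents because its descriptor still refers to the old inode, while anyone opening the name after the rename gets the new data. That differs from the semantics proposed in the thread in one respect: there, C opens before A's close and would therefore still see the old file.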
* Re: Atomic file data replace API 2011-01-08 14:36 ` Olaf van der Spek @ 2011-01-08 21:43 ` Thomas Bellman 2011-01-09 15:16 ` Olaf van der Spek 0 siblings, 1 reply; 47+ messages in thread From: Thomas Bellman @ 2011-01-08 21:43 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Massimo Maggi, linux-btrfs Olaf van der Spek wrote: > On Fri, Jan 7, 2011 at 8:29 PM, Thomas Bellman <bellman@nsc.liu.se> wrote: >> What is the visibility of the changes for other processes supposed >> to be in the meantime? I.e., if things happen in this order: > > Should be atomic too, at close time. > >> 1. Process A does fda = open("foo.txt", O_TRUNC|O_ATOMIC) >> 2. Process B does fdb = open("foo.txt", O_RDONLY) >> 3. B does read(fdb, buf, 4096) >> 4. A does write(fda, "NEW DATA\n", 9) >> 5. Process C comes in and does fdc = open("foo.txt", O_RDONLY) >> 6. C does read(fdc, buf, 4096) >> 7. A calls close(fda) >> >> Does B see an empty file, or does it see the old contents of >> the file? > > Old file, otherwise A wouldn't be atomic. > >> Does C see "NEW DATA\n", or does it see the old >> contents of the file, or perhaps an empty file? > > Old file again, as the 'transaction' isn't finished until close. So, basically database transactions with an isolation level of "committed read", for file operations. That's something I have wanted for a long time, especially if I also get a rollback() operation, but have never heard of any Unix that implemented it. A separate commit() operation would be better than conflating it with close(). And as I said, we want a rollback() as well. And a process that terminates without committing the transaction that it is performing, should have the transaction automatically rolled back. 
I only have a very shallow knowledge about the internals of the Linux kernel in regards to filesystems, but I suspect that this could be implemented almost entirely within the VFS, and not need to touch the actual filesystems, as long as you are satisfied with a limited amount of transaction space (what fits in RAM + swap). I'm looking forward to your implementation. :-) Even though I suspect that it would be a rather large undertaking to implement... /Bellman ^ permalink raw reply [flat|nested] 47+ messages in thread
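A separate commit and rollback can be approximated in userspace today, again on top of a temp file and rename(). This is only a sketch under stated assumptions, not the VFS-level design suggested above: `struct file_txn`, `txn_begin()`, `txn_commit()` and `txn_rollback()` are hypothetical names, and the transaction covers a single file whose new contents are written from scratch.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical userspace "transaction" on a single file. */
struct file_txn {
    int  fd;                /* descriptor of the private temp file, or -1 */
    char path[4096];        /* file being replaced */
    char tmp[4096 + 8];     /* temp file holding the uncommitted data */
};

/* Start a transaction: writes go to txn->fd, `path` stays untouched. */
int txn_begin(struct file_txn *txn, const char *path)
{
    txn->fd = -1;
    if (snprintf(txn->path, sizeof txn->path, "%s", path)
            >= (int)sizeof txn->path)
        return -1;
    snprintf(txn->tmp, sizeof txn->tmp, "%s.XXXXXX", txn->path);
    txn->fd = mkstemp(txn->tmp);
    return txn->fd < 0 ? -1 : 0;
}

/* Commit: flush the new data, then publish it with rename(). */
int txn_commit(struct file_txn *txn)
{
    int ok = (fsync(txn->fd) == 0);
    if (close(txn->fd) < 0)
        ok = 0;
    txn->fd = -1;
    if (ok && rename(txn->tmp, txn->path) == 0)
        return 0;
    unlink(txn->tmp);
    return -1;
}

/* Rollback: discard the new data; `path` was never modified. */
void txn_rollback(struct file_txn *txn)
{
    if (txn->fd >= 0)
        close(txn->fd);
    txn->fd = -1;
    unlink(txn->tmp);
}
```

Two of the wishes above are not met by this sketch. The descriptor is closed at commit, so no further writes can follow without a new txn_begin(). A process that dies before committing leaves the target intact, but its temp file stays behind until something cleans it up.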
* Re: Atomic file data replace API 2011-01-08 21:43 ` Thomas Bellman @ 2011-01-09 15:16 ` Olaf van der Spek 2011-01-09 18:56 ` Thomas Bellman 0 siblings, 1 reply; 47+ messages in thread From: Olaf van der Spek @ 2011-01-09 15:16 UTC (permalink / raw) To: Thomas Bellman; +Cc: Massimo Maggi, linux-btrfs On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote: > So, basically database transactions with an isolation level of > "committed read", for file operations. That's something I have > wanted for a long time, especially if I also get a rollback() > operation, but have never heard of any Unix that implemented it. True, that's why this feature request is here. Note that it's (ATM) only about single file data replace. > A separate commit() operation would be better than conflating it > with close(). And as I said, we want a rollback() as well. And > a process that terminates without committing the transaction that > it is performing, should have the transaction automatically rolled > back. What could you do between commit and close? > I only have a very shallow knowledge about the internals of the > Linux kernel in regards to filesystems, but I suspect that this > could be implemented almost entirely within the VFS, and not need > to touch the actual filesystems, as long as you are satisfied > with a limited amount of transaction space (what fits in RAM + > swap). > > I'm looking forward to your implementation. :-) Even though I > suspect that it would be a rather large undertaking to implement... I have no plans to work on an implementation. -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-09 15:16 ` Olaf van der Spek @ 2011-01-09 18:56 ` Thomas Bellman 2011-01-09 19:06 ` Olaf van der Spek 2011-01-09 20:13 ` Phillip Susi 0 siblings, 2 replies; 47+ messages in thread From: Thomas Bellman @ 2011-01-09 18:56 UTC (permalink / raw) To: Olaf van der Spek; +Cc: Massimo Maggi, linux-btrfs Olaf van der Spek wrote: > On Sat, Jan 8, 2011 at 10:43 PM, Thomas Bellman <bellman@nsc.liu.se> wrote: >> So, basically database transactions with an isolation level of >> "committed read", for file operations. That's something I have >> wanted for a long time, especially if I also get a rollback() >> operation, but have never heard of any Unix that implemented it. > > True, that's why this feature request is here. > Note that it's (ATM) only about single file data replace. That particular problem was solved with the introduction of the rename(2) system call in 4.2BSD a bit more than a quarter of a century ago. There is no need to introduce another, less flexible, API for doing the same thing. >> A separate commit() operation would be better than conflating it >> with close(). And as I said, we want a rollback() as well. And >> a process that terminates without committing the transaction that >> it is performing, should have the transaction automatically rolled >> back. > > What could you do between commit and close? More write() operations, of course. Just like you can continue with more transactions after a COMMIT WORK call without having to close and re-open the database in SQL. /Bellman ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-09 18:56 ` Thomas Bellman @ 2011-01-09 19:06 ` Olaf van der Spek 2011-01-09 20:13 ` Phillip Susi 1 sibling, 0 replies; 47+ messages in thread From: Olaf van der Spek @ 2011-01-09 19:06 UTC (permalink / raw) To: Thomas Bellman; +Cc: Massimo Maggi, linux-btrfs On Sun, Jan 9, 2011 at 7:56 PM, Thomas Bellman <bellman@nsc.liu.se> wrote: >> True, that's why this feature request is here. >> Note that it's (ATM) only about single file data replace. > > That particular problem was solved with the introduction of the > rename(2) system call in 4.2BSD a bit more than a quarter of a > century ago. There is no need to introduce another, less flexible, > API for doing the same thing. You might want to read about the problems with that workaround. >> What could you do between commit and close? > > More write() operations, of course. Just like you can continue > with more transactions after a COMMIT WORK call without having > to close and re-open the database in SQL. The transaction is defined as beginning with open and ending with close. -- Olaf ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-09 18:56 ` Thomas Bellman 2011-01-09 19:06 ` Olaf van der Spek @ 2011-01-09 20:13 ` Phillip Susi 1 sibling, 0 replies; 47+ messages in thread From: Phillip Susi @ 2011-01-09 20:13 UTC (permalink / raw) To: Thomas Bellman; +Cc: Olaf van der Spek, Massimo Maggi, linux-btrfs On 01/09/2011 01:56 PM, Thomas Bellman wrote: > That particular problem was solved with the introduction of the > rename(2) system call in 4.2BSD a bit more than a quarter of a > century ago. There is no need to introduce another, less flexible, > API for doing the same thing. I'm curious if there are any BSD specifications that state that rename() has this behavior. Ted Ts'o has been claiming that POSIX does not require this behavior in the face of a crash and that as a result, an application that relies on such behavior is broken, and needs to fsync() before rename(). This, of course, makes replacing numerous files much slower, glacially so on btrfs. There has been a great deal of discussion on the dpkg mailing lists about it since plenty of people are upset that dpkg runs much slower these days than it used to, because it now calls fsync() before rename() in order to avoid breakage on ext4. You can read more, including the rationale of why POSIX does not require this behavior at http://lwn.net/Articles/323607/. I still say that preserving the order of the writes and rename is the only sane thing to do, whether POSIX requires it or not. ^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: Atomic file data replace API 2011-01-07 14:58 ` Chris Mason 2011-01-07 15:01 ` Olaf van der Spek @ 2011-01-08 1:11 ` Phillip Susi 1 sibling, 0 replies; 47+ messages in thread From: Phillip Susi @ 2011-01-08 1:11 UTC (permalink / raw) To: Chris Mason; +Cc: Olaf van der Spek, linux-btrfs On 01/07/2011 09:58 AM, Chris Mason wrote: > Yes and no. We have a best effort mechanism where we try to guess that > since you've done this truncate and the write that you want the writes > to show up quickly. But its a guess. It is a pretty good guess, and one that the NT kernel has been making for 15 years or so. I've been following this issue for some time and I still don't understand why Ted is so hostile to this and can't make it work right on ext4. When you get a rename() you just need to check if there are outstanding journal transactions and/or dirty cache pages, and hang the rename() transaction on the end of those. That way if the system crashes after the new file has fully hit the disk, the old file is gone and you only have the new one, but if it crashes before, you still have the old one in place. Both the writes and the rename can be delayed in the cache to an arbitrary point in the future; what matters is that their order is preserved. ^ permalink raw reply [flat|nested] 47+ messages in thread