questions regarding fsync in btrfs

* questions regarding fsync in btrfs
From: Aastha Mehta @ 2014-01-25  0:09 UTC
  To: linux-btrfs

Hello,

I would like to clarify how fsync works in btrfs. The log
tree journals only the metadata of the files that have been modified
prior to the fsync, correct? It does not log the data extents of
files, which are directly sync'ed to the disk. Also, if I understand
correctly, fsync and fdatasync are the same thing in btrfs currently.
Is it more like fsync or fdatasync?

What exactly happens once a file inode is in the tree log? Does it
mean it is guaranteed to be persisted on disk, or is it already on
disk? I see two flags in btrfs_sync_file -
BTRFS_INODE_HAS_ASYNC_EXTENT and BTRFS_INODE_NEEDS_FULL_SYNC. I do not
fully understand them. After a full sync, what do log_dentry_safe and
sync_log do?

Finally, Wikipedia says that "the items in the log tree are replayed
and deleted at the next full tree commit or (if there was a system
crash) at the next remount". Even if there is no crash, why is there a
need to replay the log?

Thanks,
Aastha.

* Re: questions regarding fsync in btrfs
From: Josef Bacik @ 2014-01-25 15:21 UTC
  To: Aastha Mehta, linux-btrfs


On 01/24/2014 07:09 PM, Aastha Mehta wrote:
> Hello,
>
> I would like to clarify how fsync works in btrfs. The log
> tree journals only the metadata of the files that have been modified
> prior to the fsync, correct? It does not log the data extents of
> files, which are directly sync'ed to the disk. Also, if I understand
> correctly, fsync and fdatasync are the same thing in btrfs currently.
> Is it more like fsync or fdatasync?

More like fsync.  Because we COW, we are always updating metadata, so
there is no separate "fdatasync"; we can't get away with just flushing
the data.
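From userspace the two calls look the same whichever filesystem sits
underneath. A minimal sketch of the calling side, assuming a Linux system
and a hypothetical helper name (write_durably); per the reply above, on
btrfs both variants end up flushing metadata as well:

```python
import os
import tempfile


def write_durably(path: str, payload: bytes, data_only: bool = False) -> None:
    """Write payload to path, then ask the kernel to make it durable."""
    # os.fdatasync is not available on every platform, so fall back to fsync.
    sync = getattr(os, "fdatasync", os.fsync) if data_only else os.fsync
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        sync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")
        write_durably(target, b"hello btrfs\n")                  # fsync
        write_durably(target, b"hello again\n", data_only=True)  # fdatasync
```

The helper only returns once the sync call has returned, which is the
point at which the guarantees discussed in this thread apply.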

> What exactly happens once a file inode is in the tree log? Does it
> mean it is guaranteed to be persisted on disk, or is it already on
> disk? I see two flags in btrfs_sync_file -
> BTRFS_INODE_HAS_ASYNC_EXTENT and BTRFS_INODE_NEEDS_FULL_SYNC. I do not
> fully understand them. After a full sync, what do log_dentry_safe and
> sync_log do?

It is guaranteed to be on disk.  We copy all of the inode metadata to
the log, then sync the log, the data, and the super block that points
to the tree log.  HAS_ASYNC_EXTENT is for compression, where we return
to writepages without actually having marked the pages as writeback, so
we need to go back and re-lock the pages to make sure they have passed
through the async compression threads and have been properly marked
writeback so we can wait on them.  NEEDS_FULL_SYNC means we can't do
our fancy tricks of only updating some of the metadata; we have to go
and copy all of the inode metadata (the inode, its references, its
xattrs) and all of its extents.  log_dentry_safe copies all the info
into the tree log, and sync_log syncs the tree log to disk and writes
out a super that points to the tree log.
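The difference NEEDS_FULL_SYNC makes can be illustrated with a toy model.
This is not kernel code: ToyInode, its fields and items_to_log are
hypothetical names, and the "fast path" branch (inode plus only the
changed extents) is an assumption about what "only updating some of the
metadata" amounts to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyInode:
    """Hypothetical stand-in for an inode; not a kernel structure."""
    refs: List[str] = field(default_factory=list)
    xattrs: List[str] = field(default_factory=list)
    extents: List[str] = field(default_factory=list)
    dirty_extents: List[str] = field(default_factory=list)
    needs_full_sync: bool = False


def items_to_log(inode: ToyInode) -> List[str]:
    """Return the items a toy fsync would copy into the log tree."""
    items = ["inode"]
    if inode.needs_full_sync:
        # Full sync: the inode, its references, its xattrs, all of its extents.
        items += [f"ref:{r}" for r in inode.refs]
        items += [f"xattr:{x}" for x in inode.xattrs]
        items += [f"extent:{e}" for e in inode.extents]
    else:
        # Assumed fast path: only the extents changed since the last log.
        items += [f"extent:{e}" for e in inode.dirty_extents]
    return items
```

With the flag set, the list grows with everything attached to the inode
rather than with what changed.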

> Finally, Wikipedia says that "the items in the log tree are replayed
> and deleted at the next full tree commit or (if there was a system
> crash) at the next remount". Even if there is no crash, why is there a
> need to replay the log?
>
There isn't.  Once we commit a transaction, we commit a super that
doesn't point to the tree log and we free up the blocks we used for the
tree log.  The tree log only exists for one transaction; if we crash
before a transaction commits, we will see that there is a tree log on
the next mount and replay it.  If we commit the transaction, we simply
free the tree log and carry on.  Thanks,

Josef
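The tree log lifetime described in this reply can be sketched as a small
simulation. Everything here (ToyFs, its attributes and method names) is
hypothetical and greatly simplified; it only models which state survives
a crash, not how btrfs stores anything.

```python
from typing import Dict, Optional


class ToyFs:
    """Toy model of the tree log lifetime; all names are hypothetical."""

    def __init__(self) -> None:
        # State reachable from the last committed super.
        self.committed: Dict[str, str] = {}
        # In-memory changes belonging to the running transaction.
        self.pending: Dict[str, str] = {}
        # Tree log the on-disk super points to, if any.
        self.log: Optional[Dict[str, str]] = None

    def write(self, name: str, data: str) -> None:
        self.pending[name] = data

    def fsync(self, name: str) -> None:
        # Copy the file's state into the log and point the super at the log.
        if name in self.pending:
            if self.log is None:
                self.log = {}
            self.log[name] = self.pending[name]

    def commit(self) -> None:
        # The new super does not point to a log; the log blocks are freed.
        self.committed.update(self.pending)
        self.pending.clear()
        self.log = None

    def crash_and_mount(self) -> None:
        # Unsynced in-memory changes are lost; a surviving log is replayed.
        self.pending.clear()
        if self.log is not None:
            self.committed.update(self.log)
            self.log = None
```

In this model a file that was fsynced before a crash is recovered by
replay at mount, a file that was only written is lost, and after a commit
there is no log left to replay.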

* Re: questions regarding fsync in btrfs
From: Aastha Mehta @ 2014-01-29 16:42 UTC
  To: Josef Bacik; +Cc: linux-btrfs

Thank you for your response. I ran a few small experiments and I see
that fsync on average leads to about 30-40KB of metadata being
written, irrespective of the amount of data changed. I wonder why it
is so much. Besides the superblocks and a couple of blocks in the tree
log, what else may be updated? Also, why does it seem to be
independent of the amount of writes?

Thanks,
Aastha.
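The message does not say how the experiments were measured. One possible
approach, sketched here as an assumption, is to read the block-layer
counters in /sys/block/<dev>/stat before and after the fsync and take the
difference in the "write sectors" field. The device name is hypothetical,
and the counter covers every write that reached the device in between
(data and metadata, from any process), so the test disk should be
otherwise idle.

```python
from pathlib import Path

SECTOR_BYTES = 512  # the block-layer stat file counts 512-byte sectors


def sectors_written(stat_text: str) -> int:
    """Extract the 'write sectors' field (7th column) from a stat line."""
    return int(stat_text.split()[6])


def bytes_written_between(before: str, after: str) -> int:
    """Bytes written to the device between two stat snapshots."""
    return (sectors_written(after) - sectors_written(before)) * SECTOR_BYTES


def read_stat(device: str) -> str:
    # device is a hypothetical name such as "sdb"; use the disk under test.
    return Path(f"/sys/block/{device}/stat").read_text()
```

A run would take one snapshot with read_stat, perform the write and
fsync, take a second snapshot, and pass both to bytes_written_between.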

* Re: questions regarding fsync in btrfs
From: Josef Bacik @ 2014-01-29 17:04 UTC
  To: Aastha Mehta; +Cc: linux-btrfs


On 01/29/2014 11:42 AM, Aastha Mehta wrote:
> Thank you for your response. I ran a few small experiments and I see
> that fsync on average leads to about 30-40KB of metadata being
> written, irrespective of the amount of data changed. I wonder why it
> is so much. Besides the superblocks and a couple of blocks in the tree
> log, what else may be updated? Also, why does it seem to be
> independent of the amount of writes?
>
I'm not sure; you'll have to figure that out.  With a small amount of
data and a few extents you should probably get:

1 block for the log root tree
2-3 blocks for the actual log root (this changes depending on how much
data you are logging)
1 block for your superblock

It's pretty easy to see: just put a printk every time we allocate a
block for the log tree and that should tell you how many blocks are
used for the tree, and then just the superblock should go out. Thanks,

Josef
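The block counts above translate into bytes once a tree block size is
fixed. A rough sketch of that arithmetic follows; the function and
parameter names are hypothetical, and the node size is an assumption,
since the thread does not say which one the test filesystem used. A btrfs
superblock is 4096 bytes.

```python
SUPERBLOCK_BYTES = 4096  # size of a btrfs superblock


def fsync_metadata_bytes(nodesize: int, log_tree_blocks: int = 2,
                         log_root_tree_blocks: int = 1) -> int:
    """Estimate bytes written per fsync from the block counts in the reply."""
    # Tree blocks are nodesize bytes each; the superblock has a fixed size.
    tree_blocks = log_root_tree_blocks + log_tree_blocks
    return tree_blocks * nodesize + SUPERBLOCK_BYTES


if __name__ == "__main__":
    for nodesize in (4096, 16384):  # assumed candidate node sizes
        low = fsync_metadata_bytes(nodesize, log_tree_blocks=2)
        high = fsync_metadata_bytes(nodesize, log_tree_blocks=3)
        print(f"nodesize={nodesize}: {low}-{high} bytes per fsync")
```

Under these assumptions the estimate is 16-20 KiB with 4 KiB nodes and
52-68 KiB with 16 KiB nodes. The total is driven by the number of tree
blocks touched, not by the number of data bytes written.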
