* Questions regarding logging upon fsync in btrfs @ 2013-09-28 23:35 Aastha Mehta 2013-09-28 23:46 ` Aastha Mehta 2013-09-29 0:42 ` Josef Bacik 0 siblings, 2 replies; 19+ messages in thread From: Aastha Mehta @ 2013-09-28 23:35 UTC (permalink / raw) To: linux-btrfs Hi, I have few questions regarding logging triggered by calling fsync in BTRFS: 1. If I understand correctly, fsync will call to log entire inode in the log tree. Does this mean that the data extents are also logged into the log tree? Are they copied into the log tree, or just referenced? Are they copied into the subvolume's extent tree again upon replay? 2. During replay, when the extents are added into the extent allocation tree, do they acquire the physical extent number during replay? Does they physical extent allocated to the data in the log tree differ from that in the subvolume? 3. I see there is a mount option of notreelog available. After disabling tree logging, does fsync still lead to flushing of buffers to the disk directly? 4. Is it possible to selectively identify certain files in the log tree and flush them to disk directly, without waiting for the replay to do it? Thanks -- Aastha Mehta ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-28 23:35 Questions regarding logging upon fsync in btrfs Aastha Mehta @ 2013-09-28 23:46 ` Aastha Mehta 2013-09-29 0:21 ` Hugo Mills 2013-09-29 0:42 ` Josef Bacik 1 sibling, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-09-28 23:46 UTC (permalink / raw) To: linux-btrfs I am using linux kernel 3.1.10-1.16, just to let you know. Thanks On 29 September 2013 01:35, Aastha Mehta <aasthakm@gmail.com> wrote: > Hi, > > I have few questions regarding logging triggered by calling fsync in BTRFS: > > 1. If I understand correctly, fsync will call to log entire inode in > the log tree. Does this mean that the data extents are also logged > into the log tree? Are they copied into the log tree, or just > referenced? Are they copied into the subvolume's extent tree again > upon replay? > > 2. During replay, when the extents are added into the extent > allocation tree, do they acquire the physical extent number during > replay? Does they physical extent allocated to the data in the log > tree differ from that in the subvolume? > > 3. I see there is a mount option of notreelog available. After > disabling tree logging, does fsync still lead to flushing of buffers > to the disk directly? > > 4. Is it possible to selectively identify certain files in the log > tree and flush them to disk directly, without waiting for the replay > to do it? > > Thanks > > -- > Aastha Mehta -- Aastha Mehta ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-28 23:46 ` Aastha Mehta @ 2013-09-29 0:21 ` Hugo Mills 0 siblings, 0 replies; 19+ messages in thread From: Hugo Mills @ 2013-09-29 0:21 UTC (permalink / raw) To: Aastha Mehta; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1988 bytes --] On Sun, Sep 29, 2013 at 01:46:23AM +0200, Aastha Mehta wrote: > I am using linux kernel 3.1.10-1.16, just to let you know. Not that it invalidates the questions below, but that's a really old kernel. You should update to something recent (3.11, or 3.12-rc2) as soon as possible. There are major problems in 3.1 (and most of the subsequent kernels) that have been fixed in 3.11. Of course, there are still major problems in 3.11 that haven't been fixed yet, but we don't know about very many of those. :) (And when we do, we'll be recommending that you upgrade to whatever has them fixed...) Hugo. > Thanks > > On 29 September 2013 01:35, Aastha Mehta <aasthakm@gmail.com> wrote: > > Hi, > > > > I have few questions regarding logging triggered by calling fsync in BTRFS: > > > > 1. If I understand correctly, fsync will call to log entire inode in > > the log tree. Does this mean that the data extents are also logged > > into the log tree? Are they copied into the log tree, or just > > referenced? Are they copied into the subvolume's extent tree again > > upon replay? > > > > 2. During replay, when the extents are added into the extent > > allocation tree, do they acquire the physical extent number during > > replay? Does they physical extent allocated to the data in the log > > tree differ from that in the subvolume? > > > > 3. I see there is a mount option of notreelog available. After > > disabling tree logging, does fsync still lead to flushing of buffers > > to the disk directly? > > > > 4. Is it possible to selectively identify certain files in the log > > tree and flush them to disk directly, without waiting for the replay > > to do it? > > > > Thanks > > -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- <Diablo-D3> My code is never released, it escapes from the --- git repo and kills a few beta testers on the way out. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-28 23:35 Questions regarding logging upon fsync in btrfs Aastha Mehta 2013-09-28 23:46 ` Aastha Mehta @ 2013-09-29 0:42 ` Josef Bacik 2013-09-29 9:22 ` Aastha Mehta 1 sibling, 1 reply; 19+ messages in thread From: Josef Bacik @ 2013-09-29 0:42 UTC (permalink / raw) To: Aastha Mehta; +Cc: linux-btrfs On Sun, Sep 29, 2013 at 01:35:15AM +0200, Aastha Mehta wrote: > Hi, > > I have few questions regarding logging triggered by calling fsync in BTRFS: > > 1. If I understand correctly, fsync will call to log entire inode in > the log tree. Does this mean that the data extents are also logged > into the log tree? Are they copied into the log tree, or just > referenced? Are they copied into the subvolume's extent tree again > upon replay? > The data extents are copied as well, as in the metadata that points to the data, not the actual data itself. For 3.1 it's all of the extents in the inode, in 3.8 on it's only the extents that have changed this transaction. > 2. During replay, when the extents are added into the extent > allocation tree, do they acquire the physical extent number during > replay? Does they physical extent allocated to the data in the log > tree differ from that in the subvolume? > No the physical location was picked when we wrote the data out during fsync. If we crash and re-mount the replay will just insert the ref into the extent tree for the disk offset as it replays the extents. > 3. I see there is a mount option of notreelog available. After > disabling tree logging, does fsync still lead to flushing of buffers > to the disk directly? > notreelog just means that we write the data and wait on the ordered data extents and then commit the transaction. So you get the data for the inode you are fsycning and all of the metadata for the entire file system that has changed in that transaction. > 4. Is it possible to selectively identify certain files in the log > tree and flush them to disk directly, without waiting for the replay > to do it? > I don't understand this question, replay only happens on mount after a crash/power loss, and everything is replayed that is in the log, there is no way to select which inode is replayed. Thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-29 0:42 ` Josef Bacik @ 2013-09-29 9:22 ` Aastha Mehta 2013-09-29 13:12 ` Josef Bacik 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-09-29 9:22 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs Thank you very much for the reply. That clarifies a lot of things. I was trying a small test case that opens a file, writes a block of data, calls fsync and then closes the file. If I understand correctly, fsync would return only after all in-memory buffers have been committed to disk. I have added few print statements in the __extent_writepage function, and I notice that the function gets called a bit later after fsync returns. It seems that I am not guaranteed to see the data going to disk by the time fsync returns. Am I doing something wrong, or am I looking at the wrong place for disk write? This happens both with tree logging enabled as well as with notreelog. Thanks On 29 September 2013 02:42, Josef Bacik <jbacik@fusionio.com> wrote: > On Sun, Sep 29, 2013 at 01:35:15AM +0200, Aastha Mehta wrote: >> Hi, >> >> I have few questions regarding logging triggered by calling fsync in BTRFS: >> >> 1. If I understand correctly, fsync will call to log entire inode in >> the log tree. Does this mean that the data extents are also logged >> into the log tree? Are they copied into the log tree, or just >> referenced? Are they copied into the subvolume's extent tree again >> upon replay? >> > > The data extents are copied as well, as in the metadata that points to the data, > not the actual data itself. For 3.1 it's all of the extents in the inode, in > 3.8 on it's only the extents that have changed this transaction. > >> 2. During replay, when the extents are added into the extent >> allocation tree, do they acquire the physical extent number during >> replay? Does they physical extent allocated to the data in the log >> tree differ from that in the subvolume? >> > > No the physical location was picked when we wrote the data out during fsync. If > we crash and re-mount the replay will just insert the ref into the extent tree > for the disk offset as it replays the extents. > >> 3. I see there is a mount option of notreelog available. After >> disabling tree logging, does fsync still lead to flushing of buffers >> to the disk directly? >> > > notreelog just means that we write the data and wait on the ordered data extents > and then commit the transaction. So you get the data for the inode you are > fsycning and all of the metadata for the entire file system that has changed in > that transaction. > >> 4. Is it possible to selectively identify certain files in the log >> tree and flush them to disk directly, without waiting for the replay >> to do it? >> > > I don't understand this question, replay only happens on mount after a > crash/power loss, and everything is replayed that is in the log, there is no way > to select which inode is replayed. Thanks, > > Josef -- Aastha Mehta MPI-SWS, Germany E-mail: aasthakm@mpi-sws.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-29 9:22 ` Aastha Mehta @ 2013-09-29 13:12 ` Josef Bacik 2013-09-30 19:32 ` Aastha Mehta 0 siblings, 1 reply; 19+ messages in thread From: Josef Bacik @ 2013-09-29 13:12 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > Thank you very much for the reply. That clarifies a lot of things. > > I was trying a small test case that opens a file, writes a block of > data, calls fsync and then closes the file. If I understand correctly, > fsync would return only after all in-memory buffers have been > committed to disk. I have added few print statements in the > __extent_writepage function, and I notice that the function gets > called a bit later after fsync returns. It seems that I am not > guaranteed to see the data going to disk by the time fsync returns. > > Am I doing something wrong, or am I looking at the wrong place for > disk write? This happens both with tree logging enabled as well as > with notreelog. > So 3.1 was a long time ago and to be sure it had issues I don't think it was _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 or just build btrfs-next from git. But if I were to make a guess I'd say that __extent_writepage was how both data and metadata was written out at the time (I don't think I changed it until 3.2 or something later) so what you are likely seeing is the normal transaction commit after the fsync. In the case of notreelog we are likely starting another transaction and you are seeing that commit (at the time the transaction kthread would start a transaction even if none had been started yet.) Thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-29 13:12 ` Josef Bacik @ 2013-09-30 19:32 ` Aastha Mehta 2013-09-30 20:11 ` Josef Bacik 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-09-30 19:32 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >> Thank you very much for the reply. That clarifies a lot of things. >> >> I was trying a small test case that opens a file, writes a block of >> data, calls fsync and then closes the file. If I understand correctly, >> fsync would return only after all in-memory buffers have been >> committed to disk. I have added few print statements in the >> __extent_writepage function, and I notice that the function gets >> called a bit later after fsync returns. It seems that I am not >> guaranteed to see the data going to disk by the time fsync returns. >> >> Am I doing something wrong, or am I looking at the wrong place for >> disk write? This happens both with tree logging enabled as well as >> with notreelog. >> > > So 3.1 was a long time ago and to be sure it had issues I don't think it was > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > or just build btrfs-next from git. But if I were to make a guess I'd say that > __extent_writepage was how both data and metadata was written out at the time (I > don't think I changed it until 3.2 or something later) so what you are likely > seeing is the normal transaction commit after the fsync. In the case of > notreelog we are likely starting another transaction and you are seeing that > commit (at the time the transaction kthread would start a transaction even if > none had been started yet.) Thanks, > > Josef Is there any special handling for very small file write, less than 4K? As I understand there is an optimization to inline the first extent in a file if it is smaller than 4K, does it affect the writeback on fsync as well? I did set the max_inline mount option to 0, but even then it seems there is some difference in fsync behaviour for writing first extent of less than 4K size and writing 4K or more. Thanks, Aastha. -- Aastha Mehta MPI-SWS, Germany E-mail: aasthakm@mpi-sws.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-30 19:32 ` Aastha Mehta @ 2013-09-30 20:11 ` Josef Bacik 2013-09-30 20:30 ` Aastha Mehta 0 siblings, 1 reply; 19+ messages in thread From: Josef Bacik @ 2013-09-30 20:11 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: > On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > >> Thank you very much for the reply. That clarifies a lot of things. > >> > >> I was trying a small test case that opens a file, writes a block of > >> data, calls fsync and then closes the file. If I understand correctly, > >> fsync would return only after all in-memory buffers have been > >> committed to disk. I have added few print statements in the > >> __extent_writepage function, and I notice that the function gets > >> called a bit later after fsync returns. It seems that I am not > >> guaranteed to see the data going to disk by the time fsync returns. > >> > >> Am I doing something wrong, or am I looking at the wrong place for > >> disk write? This happens both with tree logging enabled as well as > >> with notreelog. > >> > > > > So 3.1 was a long time ago and to be sure it had issues I don't think it was > > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > > or just build btrfs-next from git. But if I were to make a guess I'd say that > > __extent_writepage was how both data and metadata was written out at the time (I > > don't think I changed it until 3.2 or something later) so what you are likely > > seeing is the normal transaction commit after the fsync. In the case of > > notreelog we are likely starting another transaction and you are seeing that > > commit (at the time the transaction kthread would start a transaction even if > > none had been started yet.) Thanks, > > > > Josef > > Is there any special handling for very small file write, less than 4K? As > I understand there is an optimization to inline the first extent in a file if > it is smaller than 4K, does it affect the writeback on fsync as well? I did > set the max_inline mount option to 0, but even then it seems there is > some difference in fsync behaviour for writing first extent of less than 4K > size and writing 4K or more. > Yeah if the file is an inline extent then it will be copied into the log directly and the log will be written out, no going through the data write path at all. Max inline == 0 should make it so we don't inline, so if it isn't honoring that then that may be a bug. Thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-30 20:11 ` Josef Bacik @ 2013-09-30 20:30 ` Aastha Mehta 2013-09-30 20:47 ` Josef Bacik 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-09-30 20:30 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2700 bytes --] On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >> >> Thank you very much for the reply. That clarifies a lot of things. >> >> >> >> I was trying a small test case that opens a file, writes a block of >> >> data, calls fsync and then closes the file. If I understand correctly, >> >> fsync would return only after all in-memory buffers have been >> >> committed to disk. I have added few print statements in the >> >> __extent_writepage function, and I notice that the function gets >> >> called a bit later after fsync returns. It seems that I am not >> >> guaranteed to see the data going to disk by the time fsync returns. >> >> >> >> Am I doing something wrong, or am I looking at the wrong place for >> >> disk write? This happens both with tree logging enabled as well as >> >> with notreelog. >> >> >> > >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 >> > or just build btrfs-next from git. But if I were to make a guess I'd say that >> > __extent_writepage was how both data and metadata was written out at the time (I >> > don't think I changed it until 3.2 or something later) so what you are likely >> > seeing is the normal transaction commit after the fsync. In the case of >> > notreelog we are likely starting another transaction and you are seeing that >> > commit (at the time the transaction kthread would start a transaction even if >> > none had been started yet.) Thanks, >> > >> > Josef >> >> Is there any special handling for very small file write, less than 4K? As >> I understand there is an optimization to inline the first extent in a file if >> it is smaller than 4K, does it affect the writeback on fsync as well? I did >> set the max_inline mount option to 0, but even then it seems there is >> some difference in fsync behaviour for writing first extent of less than 4K >> size and writing 4K or more. >> > > Yeah if the file is an inline extent then it will be copied into the log > directly and the log will be written out, no going through the data write path > at all. Max inline == 0 should make it so we don't inline, so if it isn't > honoring that then that may be a bug. Thanks, > > Josef I tried it on 3.12-rc2 release, and it seems there is a bug then. Please find attached logs to confirm. Also, probably on the older release. Thanks, Aastha. -- Aastha Mehta MPI-SWS, Germany E-mail: aasthakm@mpi-sws.org [-- Attachment #2: fsync-3.12-rc2-4Kwrite.txt --] [-- Type: text/plain, Size: 6217 bytes --] [ 2808.125838] [do_writepages] inum: 260, nr_to_write: 9223372036854775807, writepages fn: ffffffff8129ce40 [ 2808.125977] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4 [ 2808.125980] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 2808.125983] ffff88003d6b5e60 ffff88003d6b5e28 ffffffff817a8a6f ffff88003fc0eb48 [ 2808.125987] ffff88002e325b48 ffff88003d6b5e48 ffffffff81105c1f ffff88002e325a00 [ 2808.125990] ffff880037b1f300 ffff88003d6b5e88 ffffffff810fb689 ffff88003fc12c00 [ 2808.125993] Call Trace: [ 2808.126003] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 2808.126010] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 2808.126013] [<ffffffff810fb689>] __filemap_fdatawrite_range+0x49/0x50 [ 2808.126013] [<ffffffff810fc41e>] filemap_fdatawrite_range+0xe/0x10 [ 2808.126013] [<ffffffff812aafdc>] btrfs_sync_file+0xac/0x380 [ 2808.126098] [<ffffffff817aedd2>] ? __schedule+0x692/0x730 [ 2808.126115] [<ffffffff811752d8>] do_fsync+0x58/0x90 [ 2808.126149] [<ffffffff81148a6d>] ? SyS_write+0x4d/0xa0 [ 2808.126165] [<ffffffff8117565b>] SyS_fsync+0xb/0x10 [ 2808.126182] [<ffffffff817b83a2>] system_call_fastpath+0x16/0x1b [ 2808.127447] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4 [ 2808.127450] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 2808.127452] ffffea0000802440 ffff88003d6b5ba8 ffffffff817a8a6f ffff88003d6b5de0 [ 2808.127456] ffff88002e325870 ffff88003d6b5cb8 ffffffff812b7c9f ffff88003e2083d0 [ 2808.127459] 0000000000000000 ffff88003d6b5c48 ffffffff817aedd2 ffff88003c3707b0 [ 2808.127462] Call Trace: [ 2808.127467] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 2808.127473] [<ffffffff812b7c9f>] __extent_writepage+0x6f/0x750 [ 2808.127479] [<ffffffff817aedd2>] ? __schedule+0x692/0x730 [ 2808.127483] [<ffffffff810fac71>] ? find_get_pages_tag+0x121/0x160 [ 2808.127487] [<ffffffff812b8552>] extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380 [ 2808.127492] [<ffffffff81005877>] ? show_trace_log_lvl+0x57/0x70 [ 2808.127496] [<ffffffff812b89d9>] extent_writepages+0x49/0x60 [ 2808.127501] [<ffffffff8129f160>] ? btrfs_submit_direct+0x6c0/0x6c0 [ 2808.127505] [<ffffffff8129ce63>] btrfs_writepages+0x23/0x30 [ 2808.127510] [<ffffffff81105c3e>] do_writepages+0x6e/0x90 [ 2808.127513] [<ffffffff810fb689>] __filemap_fdatawrite_range+0x49/0x50 [ 2808.127517] [<ffffffff810fc41e>] filemap_fdatawrite_range+0xe/0x10 [ 2808.127520] [<ffffffff812aafdc>] btrfs_sync_file+0xac/0x380 [ 2808.127524] [<ffffffff817aedd2>] ? __schedule+0x692/0x730 [ 2808.127529] [<ffffffff811752d8>] do_fsync+0x58/0x90 [ 2808.127533] [<ffffffff81148a6d>] ? SyS_write+0x4d/0xa0 [ 2808.127538] [<ffffffff8117565b>] SyS_fsync+0xb/0x10 [ 2808.127542] [<ffffffff817b83a2>] system_call_fastpath+0x16/0x1b [ 2808.127646] [btrfs_sync_file] inum: (260, 5), called write_range, start: 0, end: 9223372036854775807, ret: 0 [ 2808.127650] [btrfs_sync_file] inum: (260, 5), calling btrfs_wait_ordered_range, start: 0, end: 9223372036854775807 [ 2808.127653] [do_writepages] inum: 260, nr_to_write: 9223372036854775807, writepages fn: ffffffff8129ce40 [ 2808.127656] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4 [ 2808.127658] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 2808.127660] ffff88003d6b5e20 ffff88003d6b5de8 ffffffff817a8a6f 0000000000000007 [ 2808.127663] ffff88002e325b48 ffff88003d6b5e08 ffffffff81105c1f 0000000000000000 [ 2808.127666] ffff88002e325a00 ffff88003d6b5e48 ffffffff810fb689 ffff880037b1f300 [ 2808.127669] Call Trace: [ 2808.127673] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 2808.127678] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 2808.127681] [<ffffffff810fb689>] __filemap_fdatawrite_range+0x49/0x50 [ 2808.127684] [<ffffffff810fc41e>] filemap_fdatawrite_range+0xe/0x10 [ 2808.127688] [<ffffffff812b2a98>] btrfs_wait_ordered_range+0x58/0x100 [ 2808.127691] [<ffffffff812ab0d8>] btrfs_sync_file+0x1a8/0x380 [ 2808.127696] [<ffffffff817aedd2>] ? __schedule+0x692/0x730 [ 2808.127700] [<ffffffff811752d8>] do_fsync+0x58/0x90 [ 2808.127703] [<ffffffff81148a6d>] ? SyS_write+0x4d/0xa0 [ 2808.127707] [<ffffffff8117565b>] SyS_fsync+0xb/0x10 [ 2808.127712] [<ffffffff817b83a2>] system_call_fastpath+0x16/0x1b [ 2808.127913] [btrfs_sync_file] inum: (260, 5), return from log_dentry, start: 0, end: 9223372036854775807, ret: 0 [ 2808.127938] [btrfs_sync_log] 1. called write_marked_extents, ret: 0 [ 2808.127955] [btrfs_sync_log] 2. called write_marked_extents, ret: 0 [ 2838.176145] [do_writepages] inum: 260, nr_to_write: 3072, writepages fn: ffffffff8129ce40 [ 2838.176153] CPU: 0 PID: 711 Comm: kworker/u2:5 Not tainted 3.12.0-rc2-latest+ #4 [ 2838.176155] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 2838.176163] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1) [ 2838.176167] ffff88003c6bdb60 ffff88003c6bda88 ffffffff817a8a6f ffff88003fc0eb48 [ 2838.176170] ffff88002e325b48 ffff88003c6bdaa8 ffffffff81105c1f ffff88002e325a00 [ 2838.176173] ffff88003c6bdb60 ffff88003c6bdb08 ffffffff8116ea30 ffff88003dc257b0 [ 2838.176176] Call Trace: [ 2838.176184] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 2838.176190] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 2838.176194] [<ffffffff8116ea30>] __writeback_single_inode+0xa0/0x260 [ 2838.176199] [<ffffffff8116f741>] writeback_sb_inodes+0x2b1/0x430 [ 2838.176203] [<ffffffff8116ea77>] ? __writeback_single_inode+0xe7/0x260 [ 2838.176207] [<ffffffff8116f93e>] __writeback_inodes_wb+0x7e/0xc0 [ 2838.176211] [<ffffffff8117000b>] wb_writeback+0x18b/0x2e0 [ 2838.176215] [<ffffffff811701e3>] wb_check_old_data_flush+0x83/0xb0 [ 2838.176220] [<ffffffff81171aac>] bdi_writeback_workfn+0x1dc/0x440 [ 2838.176225] [<ffffffff8105f19c>] process_one_work+0x2bc/0x430 [ 2838.176229] [<ffffffff810600c5>] worker_thread+0x215/0x320 [ 2838.176233] [<ffffffff8105feb0>] ? manage_workers.isra.18+0x2b0/0x2b0 [ 2838.176237] [<ffffffff810666f6>] kthread+0xb6/0xc0 [ 2838.176241] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 2838.176246] [<ffffffff817b82fc>] ret_from_fork+0x7c/0xb0 [ 2838.176250] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [-- Attachment #3: fsync-3.12-rc2-54bytes.txt --] [-- Type: text/plain, Size: 5845 bytes --] I installed 3.12.0-rc2 and added dump_stack() at the beginning of do_writepages() and __extent_writepage(). When I run my test case (fopen, fwrite a small random string, fsync, fclose), it returns soon, and this is how my dmesg looks like: [ 5.471036] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready [ 10.848064] EXT4-fs (sda2): re-mounted. Opts: acl,user_xattr,commit=0 [ 10.859285] EXT4-fs (sdc1): re-mounted. Opts: acl,user_xattr,commit=0 [ 10.861832] EXT4-fs (sdb1): re-mounted. Opts: acl,user_xattr,commit=0 [ 33.763486] btrfs: device fsid 4d2767ea-85cc-443c-88bd-2d8c693f7c7c devid 1 transid 2072 /dev/sdd [ 33.764218] btrfs: max_inline at 0 [ 33.764220] btrfs: disk space caching is enabled [ 61.083202] [do_writepages] inum: 258, nr_to_write: 9223372036854775807, writepages fn: ffffffff8129ce40 [ 61.083204] CPU: 0 PID: 4165 Comm: bb Not tainted 3.12.0-rc2-latest+ #4 [ 61.083205] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 61.083205] ffff8800399ade20 ffff8800399adde8 ffffffff817a8a6f 0000000000000007 [ 61.083207] ffff88002e324f30 ffff8800399ade08 ffffffff81105c1f 0000000000000000 [ 61.083208] ffff88002e324de8 ffff8800399ade48 ffffffff810fb689 ffff88003d9f2c00 [ 61.083209] Call Trace: [ 61.083211] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 61.083218] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 61.083220] [<ffffffff810fb689>] __filemap_fdatawrite_range+0x49/0x50 [ 61.083221] [<ffffffff810fc41e>] filemap_fdatawrite_range+0xe/0x10 [ 61.083224] [<ffffffff812b2a98>] btrfs_wait_ordered_range+0x58/0x100 [ 61.083225] [<ffffffff812ab0d8>] btrfs_sync_file+0x1a8/0x380 [ 61.083227] [<ffffffff817aedd2>] ? __schedule+0x692/0x730 [ 61.083229] [<ffffffff811752d8>] do_fsync+0x58/0x90 [ 61.083231] [<ffffffff81148a6d>] ? SyS_write+0x4d/0xa0 [ 61.083232] [<ffffffff8117565b>] SyS_fsync+0xb/0x10 [ 61.083234] [<ffffffff817b83a2>] system_call_fastpath+0x16/0x1b <<<<<<<<< immediately after the test program exits [ 96.144059] [do_writepages] inum: 258, nr_to_write: 3072, writepages fn: ffffffff8129ce40 <<<<<<<<< appears after ~35ms [ 96.144064] CPU: 0 PID: 711 Comm: kworker/u2:5 Not tainted 3.12.0-rc2-latest+ #4 [ 96.144065] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 96.144070] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1) [ 96.144072] ffff88003c6bdb60 ffff88003c6bda88 ffffffff817a8a6f ffff88003fc0eb48 [ 96.144074] ffff88002e324f30 ffff88003c6bdaa8 ffffffff81105c1f ffff88002e324de8 [ 96.144075] ffff88003c6bdb60 ffff88003c6bdb08 ffffffff8116ea30 ffff88003dc25130 [ 96.144077] Call Trace: [ 96.144081] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 96.144084] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 96.144087] [<ffffffff8116ea30>] __writeback_single_inode+0xa0/0x260 [ 96.144089] [<ffffffff8116f741>] writeback_sb_inodes+0x2b1/0x430 [ 96.144091] [<ffffffff8116f93e>] __writeback_inodes_wb+0x7e/0xc0 [ 96.144093] [<ffffffff8117000b>] wb_writeback+0x18b/0x2e0 [ 96.144095] [<ffffffff811701e3>] wb_check_old_data_flush+0x83/0xb0 [ 96.144097] [<ffffffff81171aac>] bdi_writeback_workfn+0x1dc/0x440 [ 96.144100] [<ffffffff8105f19c>] process_one_work+0x2bc/0x430 [ 96.144102] [<ffffffff810600c5>] worker_thread+0x215/0x320 [ 96.144103] [<ffffffff8105feb0>] ? manage_workers.isra.18+0x2b0/0x2b0 [ 96.144106] [<ffffffff810666f6>] kthread+0xb6/0xc0 [ 96.144107] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 96.144110] [<ffffffff817b82fc>] ret_from_fork+0x7c/0xb0 [ 96.144111] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 96.144116] CPU: 0 PID: 711 Comm: kworker/u2:5 Not tainted 3.12.0-rc2-latest+ #4 [ 96.144116] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 96.144118] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1) [ 96.144119] ffffea00007eb100 ffff88003c6bd808 ffffffff817a8a6f ffff88003c6bda40 [ 96.144121] ffff88002e324c58 ffff88003c6bd918 ffffffff812b7c9f ffff88003c6bd878 [ 96.144122] ffffffff817a364e ffff88003c6bd888 ffffffff00000020 ffff88003c6bd888 [ 96.144123] Call Trace: [ 96.144125] [<ffffffff817a8a6f>] dump_stack+0x46/0x58 [ 96.144128] [<ffffffff812b7c9f>] __extent_writepage+0x6f/0x750 [ 96.144130] [<ffffffff817a364e>] ? printk+0x5c/0x5e [ 96.144131] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 96.144133] [<ffffffff810fac71>] ? find_get_pages_tag+0x121/0x160 [ 96.144135] [<ffffffff812b8552>] extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380 [ 96.144138] [<ffffffff81005877>] ? show_trace_log_lvl+0x57/0x70 [ 96.144139] [<ffffffff812b89d9>] extent_writepages+0x49/0x60 [ 96.144142] [<ffffffff8129f160>] ? btrfs_submit_direct+0x6c0/0x6c0 [ 96.144144] [<ffffffff8129ce63>] btrfs_writepages+0x23/0x30 [ 96.144146] [<ffffffff81105c3e>] do_writepages+0x6e/0x90 [ 96.144147] [<ffffffff8116ea30>] __writeback_single_inode+0xa0/0x260 [ 96.144149] [<ffffffff8116f741>] writeback_sb_inodes+0x2b1/0x430 [ 96.144151] [<ffffffff8116f93e>] __writeback_inodes_wb+0x7e/0xc0 [ 96.144153] [<ffffffff8117000b>] wb_writeback+0x18b/0x2e0 [ 96.144154] [<ffffffff811701e3>] wb_check_old_data_flush+0x83/0xb0 [ 96.144156] [<ffffffff81171aac>] bdi_writeback_workfn+0x1dc/0x440 [ 96.144164] [<ffffffff8105f19c>] process_one_work+0x2bc/0x430 [ 96.144166] [<ffffffff810600c5>] worker_thread+0x215/0x320 [ 96.144168] [<ffffffff8105feb0>] ? manage_workers.isra.18+0x2b0/0x2b0 [ 96.144169] [<ffffffff810666f6>] kthread+0xb6/0xc0 [ 96.144171] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 96.144173] [<ffffffff817b82fc>] ret_from_fork+0x7c/0xb0 [ 96.144174] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 This is a contiguous dmesg output. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-30 20:30 ` Aastha Mehta @ 2013-09-30 20:47 ` Josef Bacik 2013-09-30 21:07 ` Aastha Mehta 0 siblings, 1 reply; 19+ messages in thread From: Josef Bacik @ 2013-09-30 20:47 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: > On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: > > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: > >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > >> >> Thank you very much for the reply. That clarifies a lot of things. > >> >> > >> >> I was trying a small test case that opens a file, writes a block of > >> >> data, calls fsync and then closes the file. If I understand correctly, > >> >> fsync would return only after all in-memory buffers have been > >> >> committed to disk. I have added few print statements in the > >> >> __extent_writepage function, and I notice that the function gets > >> >> called a bit later after fsync returns. It seems that I am not > >> >> guaranteed to see the data going to disk by the time fsync returns. > >> >> > >> >> Am I doing something wrong, or am I looking at the wrong place for > >> >> disk write? This happens both with tree logging enabled as well as > >> >> with notreelog. > >> >> > >> > > >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was > >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > >> > or just build btrfs-next from git. But if I were to make a guess I'd say that > >> > __extent_writepage was how both data and metadata was written out at the time (I > >> > don't think I changed it until 3.2 or something later) so what you are likely > >> > seeing is the normal transaction commit after the fsync. In the case of > >> > notreelog we are likely starting another transaction and you are seeing that > >> > commit (at the time the transaction kthread would start a transaction even if > >> > none had been started yet.) Thanks, > >> > > >> > Josef > >> > >> Is there any special handling for very small file write, less than 4K? As > >> I understand there is an optimization to inline the first extent in a file if > >> it is smaller than 4K, does it affect the writeback on fsync as well? I did > >> set the max_inline mount option to 0, but even then it seems there is > >> some difference in fsync behaviour for writing first extent of less than 4K > >> size and writing 4K or more. > >> > > > > Yeah if the file is an inline extent then it will be copied into the log > > directly and the log will be written out, no going through the data write path > > at all. Max inline == 0 should make it so we don't inline, so if it isn't > > honoring that then that may be a bug. Thanks, > > > > Josef > > I tried it on 3.12-rc2 release, and it seems there is a bug then. > Please find attached logs to confirm. > Also, probably on the older release. > Oooh ok I understand, you have your printk's in the wrong place ;). do_writepages doesn't necessarily mean you are writing something. If you want to see if stuff got written to the disk I'd put a printk at run_delalloc_range and have it spit out the range it is writing out since thats what we think is actually dirty. Thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-30 20:47 ` Josef Bacik @ 2013-09-30 21:07 ` Aastha Mehta 2013-09-30 21:17 ` Josef Bacik 2013-10-01 17:34 ` Josef Bacik 0 siblings, 2 replies; 19+ messages in thread From: Aastha Mehta @ 2013-09-30 21:07 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >> >> >> Thank you very much for the reply. That clarifies a lot of things. >> >> >> >> >> >> I was trying a small test case that opens a file, writes a block of >> >> >> data, calls fsync and then closes the file. If I understand correctly, >> >> >> fsync would return only after all in-memory buffers have been >> >> >> committed to disk. I have added few print statements in the >> >> >> __extent_writepage function, and I notice that the function gets >> >> >> called a bit later after fsync returns. It seems that I am not >> >> >> guaranteed to see the data going to disk by the time fsync returns. >> >> >> >> >> >> Am I doing something wrong, or am I looking at the wrong place for >> >> >> disk write? This happens both with tree logging enabled as well as >> >> >> with notreelog. >> >> >> >> >> > >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that >> >> > __extent_writepage was how both data and metadata was written out at the time (I >> >> > don't think I changed it until 3.2 or something later) so what you are likely >> >> > seeing is the normal transaction commit after the fsync. In the case of >> >> > notreelog we are likely starting another transaction and you are seeing that >> >> > commit (at the time the transaction kthread would start a transaction even if >> >> > none had been started yet.) Thanks, >> >> > >> >> > Josef >> >> >> >> Is there any special handling for very small file write, less than 4K? As >> >> I understand there is an optimization to inline the first extent in a file if >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did >> >> set the max_inline mount option to 0, but even then it seems there is >> >> some difference in fsync behaviour for writing first extent of less than 4K >> >> size and writing 4K or more. >> >> >> > >> > Yeah if the file is an inline extent then it will be copied into the log >> > directly and the log will be written out, no going through the data write path >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't >> > honoring that then that may be a bug. Thanks, >> > >> > Josef >> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. >> Please find attached logs to confirm. >> Also, probably on the older release. >> > > Oooh ok I understand, you have your printk's in the wrong place ;). > do_writepages doesn't necessarily mean you are writing something. If you want > to see if stuff got written to the disk I'd put a printk at run_delalloc_range > and have it spit out the range it is writing out since thats what we think is > actually dirty. Thanks, > > Josef No, but I also placed dump_stack() in the beginning of __extent_writepage. run_delalloc_range is being called only from __extent_writepage, if it were to be called, the dump_stack() at the top of __extent_writepage would have printed as well, no? Thanks -- Aastha Mehta MPI-SWS, Germany E-mail: aasthakm@mpi-sws.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-30 21:07 ` Aastha Mehta @ 2013-09-30 21:17 ` Josef Bacik 2013-10-01 17:34 ` Josef Bacik 1 sibling, 0 replies; 19+ messages in thread From: Josef Bacik @ 2013-09-30 21:17 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: > On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: > > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: > >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: > >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: > >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > >> >> >> Thank you very much for the reply. That clarifies a lot of things. > >> >> >> > >> >> >> I was trying a small test case that opens a file, writes a block of > >> >> >> data, calls fsync and then closes the file. If I understand correctly, > >> >> >> fsync would return only after all in-memory buffers have been > >> >> >> committed to disk. I have added few print statements in the > >> >> >> __extent_writepage function, and I notice that the function gets > >> >> >> called a bit later after fsync returns. It seems that I am not > >> >> >> guaranteed to see the data going to disk by the time fsync returns. > >> >> >> > >> >> >> Am I doing something wrong, or am I looking at the wrong place for > >> >> >> disk write? This happens both with tree logging enabled as well as > >> >> >> with notreelog. > >> >> >> > >> >> > > >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was > >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that > >> >> > __extent_writepage was how both data and metadata was written out at the time (I > >> >> > don't think I changed it until 3.2 or something later) so what you are likely > >> >> > seeing is the normal transaction commit after the fsync. In the case of > >> >> > notreelog we are likely starting another transaction and you are seeing that > >> >> > commit (at the time the transaction kthread would start a transaction even if > >> >> > none had been started yet.) Thanks, > >> >> > > >> >> > Josef > >> >> > >> >> Is there any special handling for very small file write, less than 4K? As > >> >> I understand there is an optimization to inline the first extent in a file if > >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did > >> >> set the max_inline mount option to 0, but even then it seems there is > >> >> some difference in fsync behaviour for writing first extent of less than 4K > >> >> size and writing 4K or more. > >> >> > >> > > >> > Yeah if the file is an inline extent then it will be copied into the log > >> > directly and the log will be written out, no going through the data write path > >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't > >> > honoring that then that may be a bug. Thanks, > >> > > >> > Josef > >> > >> I tried it on 3.12-rc2 release, and it seems there is a bug then. > >> Please find attached logs to confirm. > >> Also, probably on the older release. > >> > > > > Oooh ok I understand, you have your printk's in the wrong place ;). > > do_writepages doesn't necessarily mean you are writing something. If you want > > to see if stuff got written to the disk I'd put a printk at run_delalloc_range > > and have it spit out the range it is writing out since thats what we think is > > actually dirty. Thanks, > > > > Josef > > No, but I also placed dump_stack() in the beginning of > __extent_writepage. run_delalloc_range is being called only from > __extent_writepage, if it were to be called, the dump_stack() at the > top of __extent_writepage would have printed as well, no? > Yeah, so I don't know whats going on and I'm in the middle of something, I'll look at it tomorrow and see if I can't figure out what is going on. I'm sure it's working, we have a xfstest to test this sort of thing and it's passing so we're definitely getting the data to disk properly, I'm probably just missing some peice around here somewhere. Thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-09-30 21:07 ` Aastha Mehta 2013-09-30 21:17 ` Josef Bacik @ 2013-10-01 17:34 ` Josef Bacik 2013-10-01 19:40 ` Aastha Mehta 1 sibling, 1 reply; 19+ messages in thread From: Josef Bacik @ 2013-10-01 17:34 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: > On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: > > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: > >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: > >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: > >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > >> >> >> Thank you very much for the reply. That clarifies a lot of things. > >> >> >> > >> >> >> I was trying a small test case that opens a file, writes a block of > >> >> >> data, calls fsync and then closes the file. If I understand correctly, > >> >> >> fsync would return only after all in-memory buffers have been > >> >> >> committed to disk. I have added few print statements in the > >> >> >> __extent_writepage function, and I notice that the function gets > >> >> >> called a bit later after fsync returns. It seems that I am not > >> >> >> guaranteed to see the data going to disk by the time fsync returns. > >> >> >> > >> >> >> Am I doing something wrong, or am I looking at the wrong place for > >> >> >> disk write? This happens both with tree logging enabled as well as > >> >> >> with notreelog. > >> >> >> > >> >> > > >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was > >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that > >> >> > __extent_writepage was how both data and metadata was written out at the time (I > >> >> > don't think I changed it until 3.2 or something later) so what you are likely > >> >> > seeing is the normal transaction commit after the fsync. In the case of > >> >> > notreelog we are likely starting another transaction and you are seeing that > >> >> > commit (at the time the transaction kthread would start a transaction even if > >> >> > none had been started yet.) Thanks, > >> >> > > >> >> > Josef > >> >> > >> >> Is there any special handling for very small file write, less than 4K? As > >> >> I understand there is an optimization to inline the first extent in a file if > >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did > >> >> set the max_inline mount option to 0, but even then it seems there is > >> >> some difference in fsync behaviour for writing first extent of less than 4K > >> >> size and writing 4K or more. > >> >> > >> > > >> > Yeah if the file is an inline extent then it will be copied into the log > >> > directly and the log will be written out, no going through the data write path > >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't > >> > honoring that then that may be a bug. Thanks, > >> > > >> > Josef > >> > >> I tried it on 3.12-rc2 release, and it seems there is a bug then. > >> Please find attached logs to confirm. > >> Also, probably on the older release. > >> > > > > Oooh ok I understand, you have your printk's in the wrong place ;). > > do_writepages doesn't necessarily mean you are writing something. If you want > > to see if stuff got written to the disk I'd put a printk at run_delalloc_range > > and have it spit out the range it is writing out since thats what we think is > > actually dirty. Thanks, > > > > Josef > > No, but I also placed dump_stack() in the beginning of > __extent_writepage. run_delalloc_range is being called only from > __extent_writepage, if it were to be called, the dump_stack() at the > top of __extent_writepage would have printed as well, no? > Ok I've done the same thing and I'm not seeing what you are seeing. Are you using any mount options other than notreelog and max_inline=0? Could you adjust your printk to print out the root objectid for the inode as well? It could be possible that this is the writeout for the space cache or inode cache. Thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-10-01 17:34 ` Josef Bacik @ 2013-10-01 19:40 ` Aastha Mehta 2013-10-01 19:42 ` Aastha Mehta 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-10-01 19:40 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4456 bytes --] On 1 October 2013 19:34, Josef Bacik <jbacik@fusionio.com> wrote: > On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: >> On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: >> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: >> >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: >> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: >> >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: >> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >> >> >> >> Thank you very much for the reply. That clarifies a lot of things. >> >> >> >> >> >> >> >> I was trying a small test case that opens a file, writes a block of >> >> >> >> data, calls fsync and then closes the file. If I understand correctly, >> >> >> >> fsync would return only after all in-memory buffers have been >> >> >> >> committed to disk. I have added few print statements in the >> >> >> >> __extent_writepage function, and I notice that the function gets >> >> >> >> called a bit later after fsync returns. It seems that I am not >> >> >> >> guaranteed to see the data going to disk by the time fsync returns. >> >> >> >> >> >> >> >> Am I doing something wrong, or am I looking at the wrong place for >> >> >> >> disk write? This happens both with tree logging enabled as well as >> >> >> >> with notreelog. >> >> >> >> >> >> >> > >> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was >> >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 >> >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that >> >> >> > __extent_writepage was how both data and metadata was written out at the time (I >> >> >> > don't think I changed it until 3.2 or something later) so what you are likely >> >> >> > seeing is the normal transaction commit after the fsync. In the case of >> >> >> > notreelog we are likely starting another transaction and you are seeing that >> >> >> > commit (at the time the transaction kthread would start a transaction even if >> >> >> > none had been started yet.) Thanks, >> >> >> > >> >> >> > Josef >> >> >> >> >> >> Is there any special handling for very small file write, less than 4K? As >> >> >> I understand there is an optimization to inline the first extent in a file if >> >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did >> >> >> set the max_inline mount option to 0, but even then it seems there is >> >> >> some difference in fsync behaviour for writing first extent of less than 4K >> >> >> size and writing 4K or more. >> >> >> >> >> > >> >> > Yeah if the file is an inline extent then it will be copied into the log >> >> > directly and the log will be written out, no going through the data write path >> >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't >> >> > honoring that then that may be a bug. Thanks, >> >> > >> >> > Josef >> >> >> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. >> >> Please find attached logs to confirm. >> >> Also, probably on the older release. >> >> >> > >> > Oooh ok I understand, you have your printk's in the wrong place ;). >> > do_writepages doesn't necessarily mean you are writing something. If you want >> > to see if stuff got written to the disk I'd put a printk at run_delalloc_range >> > and have it spit out the range it is writing out since thats what we think is >> > actually dirty. Thanks, >> > >> > Josef >> >> No, but I also placed dump_stack() in the beginning of >> __extent_writepage. run_delalloc_range is being called only from >> __extent_writepage, if it were to be called, the dump_stack() at the >> top of __extent_writepage would have printed as well, no? >> > > Ok I've done the same thing and I'm not seeing what you are seeing. Are you > using any mount options other than notreelog and max_inline=0? Could you adjust > your printk to print out the root objectid for the inode as well? It could be > possible that this is the writeout for the space cache or inode cache. Thanks, > > Josef I actually printed the stack only when the root objectid is 5. I have attached another log for writing the first 500 bytes in a file. I also print the root objectid for the inode in run_delalloc and __extent_writepage. Thanks -- Aastha Mehta MPI-SWS, Germany E-mail: aasthakm@mpi-sws.org [-- Attachment #2: fsync-3.12-rc2-500bytes-011013.txt --] [-- Type: text/plain, Size: 6945 bytes --] write success, size: 500 fsync:: Success file size: 500 fclose:: Success [ 829.847483] btrfs: device fsid 4d2767ea-85cc-443c-88bd-2d8c693f7c7c devid 1 transid 2201 /dev/sdd [ 829.848199] btrfs: max_inline at 0 [ 829.848201] btrfs: disabling tree log [ 829.848202] btrfs: disk space caching is enabled [ 835.266483] [do_writepages] inum: 258, nr_to_write: 9223372036854775807, writepages fn: ffffffff8129ce40 [ 835.266492] CPU: 0 PID: 10021 Comm: bb Not tainted 3.12.0-rc2-latest+ #7 [ 835.266494] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 835.266496] ffff88003cdc5e60 ffff88003cdc5e28 ffffffff817a8acf ffff88003fc0eb48 [ 835.266511] ffff88001f478b28 ffff88003cdc5e48 ffffffff81105c1f ffff88001f4789e0 [ 835.266515] ffff880037557400 ffff88003cdc5e88 ffffffff810fb689 ffff88003fc12c00 [ 835.266518] Call Trace: [ 835.266528] [<ffffffff817a8acf>] dump_stack+0x46/0x58 [ 835.266534] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 835.266538] [<ffffffff810fb689>] __filemap_fdatawrite_range+0x49/0x50 [ 835.266542] [<ffffffff810fc41e>] filemap_fdatawrite_range+0xe/0x10 [ 835.266546] [<ffffffff812ab01c>] btrfs_sync_file+0xac/0x380 [ 835.266553] [<ffffffff817aee32>] ? __schedule+0x692/0x730 [ 835.266559] [<ffffffff811752d8>] do_fsync+0x58/0x90 [ 835.266563] [<ffffffff81148a6d>] ? SyS_write+0x4d/0xa0 [ 835.266568] [<ffffffff8117565b>] SyS_fsync+0xb/0x10 [ 835.266572] [<ffffffff817b8422>] system_call_fastpath+0x16/0x1b [ 835.266579] [btrfs_sync_file] inum: (258, 5), called write_range, start: 0, end: 9223372036854775807, ret: 0 [ 835.266581] [btrfs_sync_file] inum: (258, 5), calling btrfs_wait_ordered_range, start: 0, end: 9223372036854775807 [ 835.266584] [do_writepages] inum: 258, nr_to_write: 9223372036854775807, writepages fn: ffffffff8129ce40 [ 835.266587] CPU: 0 PID: 10021 Comm: bb Not tainted 3.12.0-rc2-latest+ #7 [ 835.266589] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 835.266590] ffff88003cdc5e20 ffff88003cdc5de8 ffffffff817a8acf 0000000000000007 [ 835.266593] ffff88001f478b28 ffff88003cdc5e08 ffffffff81105c1f 0000000000000000 [ 835.266596] ffff88001f4789e0 ffff88003cdc5e48 ffffffff810fb689 ffff880037557400 [ 835.266599] Call Trace: [ 835.266603] [<ffffffff817a8acf>] dump_stack+0x46/0x58 [ 835.266607] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 835.266610] [<ffffffff810fb689>] __filemap_fdatawrite_range+0x49/0x50 [ 835.266613] [<ffffffff810fc41e>] filemap_fdatawrite_range+0xe/0x10 [ 835.266618] [<ffffffff812b2ad8>] btrfs_wait_ordered_range+0x58/0x100 [ 835.266621] [<ffffffff812ab118>] btrfs_sync_file+0x1a8/0x380 [ 835.266625] [<ffffffff817aee32>] ? __schedule+0x692/0x730 [ 835.266629] [<ffffffff811752d8>] do_fsync+0x58/0x90 [ 835.266632] [<ffffffff81148a6d>] ? SyS_write+0x4d/0xa0 [ 835.266636] [<ffffffff8117565b>] SyS_fsync+0xb/0x10 [ 835.266640] [<ffffffff817b8422>] system_call_fastpath+0x16/0x1b [ 835.266647] [btrfs_sync_file] inum: (258, 5), return from log_dentry, start: 0, end: 9223372036854775807, ret: 1 [ 865.312082] [do_writepages] inum: 258, nr_to_write: 3072, writepages fn: ffffffff8129ce40 [ 865.312090] CPU: 0 PID: 711 Comm: kworker/u2:5 Not tainted 3.12.0-rc2-latest+ #7 [ 865.312093] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 865.312100] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-2) [ 865.312104] ffff88003c793b60 ffff88003c793a88 ffffffff817a8acf ffff88003fc0eb48 [ 865.312107] ffff88001f478b28 ffff88003c793aa8 ffffffff81105c1f ffff88001f4789e0 [ 865.312110] ffff88003c793b60 ffff88003c793b08 ffffffff8116ea30 ffff880039caf398 [ 865.312113] Call Trace: [ 865.312121] [<ffffffff817a8acf>] dump_stack+0x46/0x58 [ 865.312126] [<ffffffff81105c1f>] do_writepages+0x4f/0x90 [ 865.312131] [<ffffffff8116ea30>] __writeback_single_inode+0xa0/0x260 [ 865.312135] [<ffffffff8116f741>] writeback_sb_inodes+0x2b1/0x430 [ 865.312139] [<ffffffff8116f93e>] __writeback_inodes_wb+0x7e/0xc0 [ 865.312143] [<ffffffff8117000b>] wb_writeback+0x18b/0x2e0 [ 865.312147] [<ffffffff811701e3>] wb_check_old_data_flush+0x83/0xb0 [ 865.312151] [<ffffffff81171aac>] bdi_writeback_workfn+0x1dc/0x440 [ 865.312157] [<ffffffff8105f19c>] process_one_work+0x2bc/0x430 [ 865.312161] [<ffffffff810600c5>] worker_thread+0x215/0x320 [ 865.312165] [<ffffffff8105feb0>] ? manage_workers.isra.18+0x2b0/0x2b0 [ 865.312169] [<ffffffff810666f6>] kthread+0xb6/0xc0 [ 865.312172] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 865.312178] [<ffffffff817b837c>] ret_from_fork+0x7c/0xb0 [ 865.312181] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 865.312188] [__extent_writepage] inum: (258, 5) [ 865.312191] CPU: 0 PID: 711 Comm: kworker/u2:5 Not tainted 3.12.0-rc2-latest+ #7 [ 865.312192] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006 [ 865.312196] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-2) [ 865.312198] ffffea0000b93700 ffff88003c793808 ffffffff817a8acf 0000000000000007 [ 865.312201] ffff88001f478850 ffff88003c793918 ffffffff812b7cf2 ffff88003c793878 [ 865.312204] ffffffff817a36ae ffff88003c793888 ffffffff00000020 ffff88003c793888 [ 865.312207] Call Trace: [ 865.312211] [<ffffffff817a8acf>] dump_stack+0x46/0x58 [ 865.312216] [<ffffffff812b7cf2>] __extent_writepage+0x82/0x770 [ 865.312219] [<ffffffff817a36ae>] ? printk+0x5c/0x5e [ 865.312222] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 865.312226] [<ffffffff810fac71>] ? find_get_pages_tag+0x121/0x160 [ 865.312231] [<ffffffff812b85b2>] extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380 [ 865.312235] [<ffffffff81005877>] ? show_trace_log_lvl+0x57/0x70 [ 865.312239] [<ffffffff812b8a39>] extent_writepages+0x49/0x60 [ 865.312244] [<ffffffff8129f160>] ? btrfs_submit_direct+0x6c0/0x6c0 [ 865.312247] [<ffffffff8129ce63>] btrfs_writepages+0x23/0x30 [ 865.312251] [<ffffffff81105c3e>] do_writepages+0x6e/0x90 [ 865.312255] [<ffffffff8116ea30>] __writeback_single_inode+0xa0/0x260 [ 865.312259] [<ffffffff8116f741>] writeback_sb_inodes+0x2b1/0x430 [ 865.312263] [<ffffffff8116f93e>] __writeback_inodes_wb+0x7e/0xc0 [ 865.312267] [<ffffffff8117000b>] wb_writeback+0x18b/0x2e0 [ 865.312271] [<ffffffff811701e3>] wb_check_old_data_flush+0x83/0xb0 [ 865.312275] [<ffffffff81171aac>] bdi_writeback_workfn+0x1dc/0x440 [ 865.312279] [<ffffffff8105f19c>] process_one_work+0x2bc/0x430 [ 865.312283] [<ffffffff810600c5>] worker_thread+0x215/0x320 [ 865.312287] [<ffffffff8105feb0>] ? manage_workers.isra.18+0x2b0/0x2b0 [ 865.312290] [<ffffffff810666f6>] kthread+0xb6/0xc0 [ 865.312293] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 865.312297] [<ffffffff817b837c>] ret_from_fork+0x7c/0xb0 [ 865.312301] [<ffffffff81066640>] ? flush_kthread_worker+0xa0/0xa0 [ 865.312309] [run_delalloc_range] inum: (258, 5), alloc start: 0, end: 4095 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-10-01 19:40 ` Aastha Mehta @ 2013-10-01 19:42 ` Aastha Mehta 2013-10-01 20:13 ` Aastha Mehta 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-10-01 19:42 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On 1 October 2013 21:40, Aastha Mehta <aasthakm@gmail.com> wrote: > On 1 October 2013 19:34, Josef Bacik <jbacik@fusionio.com> wrote: >> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: >>> On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: >>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: >>> >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: >>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: >>> >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: >>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >>> >> >> >> Thank you very much for the reply. That clarifies a lot of things. >>> >> >> >> >>> >> >> >> I was trying a small test case that opens a file, writes a block of >>> >> >> >> data, calls fsync and then closes the file. If I understand correctly, >>> >> >> >> fsync would return only after all in-memory buffers have been >>> >> >> >> committed to disk. I have added few print statements in the >>> >> >> >> __extent_writepage function, and I notice that the function gets >>> >> >> >> called a bit later after fsync returns. It seems that I am not >>> >> >> >> guaranteed to see the data going to disk by the time fsync returns. >>> >> >> >> >>> >> >> >> Am I doing something wrong, or am I looking at the wrong place for >>> >> >> >> disk write? This happens both with tree logging enabled as well as >>> >> >> >> with notreelog. >>> >> >> >> >>> >> >> > >>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was >>> >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 >>> >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that >>> >> >> > __extent_writepage was how both data and metadata was written out at the time (I >>> >> >> > don't think I changed it until 3.2 or something later) so what you are likely >>> >> >> > seeing is the normal transaction commit after the fsync. In the case of >>> >> >> > notreelog we are likely starting another transaction and you are seeing that >>> >> >> > commit (at the time the transaction kthread would start a transaction even if >>> >> >> > none had been started yet.) Thanks, >>> >> >> > >>> >> >> > Josef >>> >> >> >>> >> >> Is there any special handling for very small file write, less than 4K? As >>> >> >> I understand there is an optimization to inline the first extent in a file if >>> >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did >>> >> >> set the max_inline mount option to 0, but even then it seems there is >>> >> >> some difference in fsync behaviour for writing first extent of less than 4K >>> >> >> size and writing 4K or more. >>> >> >> >>> >> > >>> >> > Yeah if the file is an inline extent then it will be copied into the log >>> >> > directly and the log will be written out, no going through the data write path >>> >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't >>> >> > honoring that then that may be a bug. Thanks, >>> >> > >>> >> > Josef >>> >> >>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. >>> >> Please find attached logs to confirm. >>> >> Also, probably on the older release. >>> >> >>> > >>> > Oooh ok I understand, you have your printk's in the wrong place ;). >>> > do_writepages doesn't necessarily mean you are writing something. If you want >>> > to see if stuff got written to the disk I'd put a printk at run_delalloc_range >>> > and have it spit out the range it is writing out since thats what we think is >>> > actually dirty. Thanks, >>> > >>> > Josef >>> >>> No, but I also placed dump_stack() in the beginning of >>> __extent_writepage. run_delalloc_range is being called only from >>> __extent_writepage, if it were to be called, the dump_stack() at the >>> top of __extent_writepage would have printed as well, no? >>> >> >> Ok I've done the same thing and I'm not seeing what you are seeing. Are you >> using any mount options other than notreelog and max_inline=0? Could you adjust >> your printk to print out the root objectid for the inode as well? It could be >> possible that this is the writeout for the space cache or inode cache. Thanks, >> >> Josef > > I actually printed the stack only when the root objectid is 5. I have > attached another log for writing the first 500 bytes in a file. I also > print the root objectid for the inode in run_delalloc and > __extent_writepage. > > Thanks > Just to clarify, in the latest logs, I allowed printing of debug printk's and stack dump for all root objectid's. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-10-01 19:42 ` Aastha Mehta @ 2013-10-01 20:13 ` Aastha Mehta 2013-10-02 11:52 ` Josef Bacik 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-10-01 20:13 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On 1 October 2013 21:42, Aastha Mehta <aasthakm@gmail.com> wrote: > On 1 October 2013 21:40, Aastha Mehta <aasthakm@gmail.com> wrote: >> On 1 October 2013 19:34, Josef Bacik <jbacik@fusionio.com> wrote: >>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: >>>> On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: >>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: >>>> >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: >>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: >>>> >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: >>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >>>> >> >> >> Thank you very much for the reply. That clarifies a lot of things. >>>> >> >> >> >>>> >> >> >> I was trying a small test case that opens a file, writes a block of >>>> >> >> >> data, calls fsync and then closes the file. If I understand correctly, >>>> >> >> >> fsync would return only after all in-memory buffers have been >>>> >> >> >> committed to disk. I have added few print statements in the >>>> >> >> >> __extent_writepage function, and I notice that the function gets >>>> >> >> >> called a bit later after fsync returns. It seems that I am not >>>> >> >> >> guaranteed to see the data going to disk by the time fsync returns. >>>> >> >> >> >>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place for >>>> >> >> >> disk write? This happens both with tree logging enabled as well as >>>> >> >> >> with notreelog. >>>> >> >> >> >>>> >> >> > >>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was >>>> >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 >>>> >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that >>>> >> >> > __extent_writepage was how both data and metadata was written out at the time (I >>>> >> >> > don't think I changed it until 3.2 or something later) so what you are likely >>>> >> >> > seeing is the normal transaction commit after the fsync. In the case of >>>> >> >> > notreelog we are likely starting another transaction and you are seeing that >>>> >> >> > commit (at the time the transaction kthread would start a transaction even if >>>> >> >> > none had been started yet.) Thanks, >>>> >> >> > >>>> >> >> > Josef >>>> >> >> >>>> >> >> Is there any special handling for very small file write, less than 4K? As >>>> >> >> I understand there is an optimization to inline the first extent in a file if >>>> >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did >>>> >> >> set the max_inline mount option to 0, but even then it seems there is >>>> >> >> some difference in fsync behaviour for writing first extent of less than 4K >>>> >> >> size and writing 4K or more. >>>> >> >> >>>> >> > >>>> >> > Yeah if the file is an inline extent then it will be copied into the log >>>> >> > directly and the log will be written out, no going through the data write path >>>> >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't >>>> >> > honoring that then that may be a bug. Thanks, >>>> >> > >>>> >> > Josef >>>> >> >>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. >>>> >> Please find attached logs to confirm. >>>> >> Also, probably on the older release. >>>> >> >>>> > >>>> > Oooh ok I understand, you have your printk's in the wrong place ;). >>>> > do_writepages doesn't necessarily mean you are writing something. If you want >>>> > to see if stuff got written to the disk I'd put a printk at run_delalloc_range >>>> > and have it spit out the range it is writing out since thats what we think is >>>> > actually dirty. Thanks, >>>> > >>>> > Josef >>>> >>>> No, but I also placed dump_stack() in the beginning of >>>> __extent_writepage. run_delalloc_range is being called only from >>>> __extent_writepage, if it were to be called, the dump_stack() at the >>>> top of __extent_writepage would have printed as well, no? >>>> >>> >>> Ok I've done the same thing and I'm not seeing what you are seeing. Are you >>> using any mount options other than notreelog and max_inline=0? Could you adjust >>> your printk to print out the root objectid for the inode as well? It could be >>> possible that this is the writeout for the space cache or inode cache. Thanks, >>> >>> Josef >> >> I actually printed the stack only when the root objectid is 5. I have >> attached another log for writing the first 500 bytes in a file. I also >> print the root objectid for the inode in run_delalloc and >> __extent_writepage. >> >> Thanks >> > > Just to clarify, in the latest logs, I allowed printing of debug > printk's and stack dump for all root objectid's. Actually, it is the same behaviour when I write anything less than 4K long, no matter what offset, except if I straddle the page boundary. To summarise: 1. write 4K -> write in the fsync path 2. write less than 4K, within a single page -> bdi_writeback by flush worker 3. small write that straddles a page boundary or write 4K+delta -> the first page gets written in the fsync path, the remaining length that straddles the page boundary is written in the bdi_writeback path Please let me know, if I am trying out incorrect cases. Sorry for too many mails. Thanks ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-10-01 20:13 ` Aastha Mehta @ 2013-10-02 11:52 ` Josef Bacik 2013-10-02 20:12 ` Aastha Mehta 0 siblings, 1 reply; 19+ messages in thread From: Josef Bacik @ 2013-10-02 11:52 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote: > On 1 October 2013 21:42, Aastha Mehta <aasthakm@gmail.com> wrote: > > On 1 October 2013 21:40, Aastha Mehta <aasthakm@gmail.com> wrote: > >> On 1 October 2013 19:34, Josef Bacik <jbacik@fusionio.com> wrote: > >>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: > >>>> On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: > >>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: > >>>> >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: > >>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: > >>>> >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > >>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > >>>> >> >> >> Thank you very much for the reply. That clarifies a lot of things. > >>>> >> >> >> > >>>> >> >> >> I was trying a small test case that opens a file, writes a block of > >>>> >> >> >> data, calls fsync and then closes the file. If I understand correctly, > >>>> >> >> >> fsync would return only after all in-memory buffers have been > >>>> >> >> >> committed to disk. I have added few print statements in the > >>>> >> >> >> __extent_writepage function, and I notice that the function gets > >>>> >> >> >> called a bit later after fsync returns. It seems that I am not > >>>> >> >> >> guaranteed to see the data going to disk by the time fsync returns. > >>>> >> >> >> > >>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place for > >>>> >> >> >> disk write? This happens both with tree logging enabled as well as > >>>> >> >> >> with notreelog. > >>>> >> >> >> > >>>> >> >> > > >>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was > >>>> >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > >>>> >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that > >>>> >> >> > __extent_writepage was how both data and metadata was written out at the time (I > >>>> >> >> > don't think I changed it until 3.2 or something later) so what you are likely > >>>> >> >> > seeing is the normal transaction commit after the fsync. In the case of > >>>> >> >> > notreelog we are likely starting another transaction and you are seeing that > >>>> >> >> > commit (at the time the transaction kthread would start a transaction even if > >>>> >> >> > none had been started yet.) Thanks, > >>>> >> >> > > >>>> >> >> > Josef > >>>> >> >> > >>>> >> >> Is there any special handling for very small file write, less than 4K? As > >>>> >> >> I understand there is an optimization to inline the first extent in a file if > >>>> >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did > >>>> >> >> set the max_inline mount option to 0, but even then it seems there is > >>>> >> >> some difference in fsync behaviour for writing first extent of less than 4K > >>>> >> >> size and writing 4K or more. > >>>> >> >> > >>>> >> > > >>>> >> > Yeah if the file is an inline extent then it will be copied into the log > >>>> >> > directly and the log will be written out, no going through the data write path > >>>> >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't > >>>> >> > honoring that then that may be a bug. Thanks, > >>>> >> > > >>>> >> > Josef > >>>> >> > >>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. > >>>> >> Please find attached logs to confirm. > >>>> >> Also, probably on the older release. > >>>> >> > >>>> > > >>>> > Oooh ok I understand, you have your printk's in the wrong place ;). > >>>> > do_writepages doesn't necessarily mean you are writing something. If you want > >>>> > to see if stuff got written to the disk I'd put a printk at run_delalloc_range > >>>> > and have it spit out the range it is writing out since thats what we think is > >>>> > actually dirty. Thanks, > >>>> > > >>>> > Josef > >>>> > >>>> No, but I also placed dump_stack() in the beginning of > >>>> __extent_writepage. run_delalloc_range is being called only from > >>>> __extent_writepage, if it were to be called, the dump_stack() at the > >>>> top of __extent_writepage would have printed as well, no? > >>>> > >>> > >>> Ok I've done the same thing and I'm not seeing what you are seeing. Are you > >>> using any mount options other than notreelog and max_inline=0? Could you adjust > >>> your printk to print out the root objectid for the inode as well? It could be > >>> possible that this is the writeout for the space cache or inode cache. Thanks, > >>> > >>> Josef > >> > >> I actually printed the stack only when the root objectid is 5. I have > >> attached another log for writing the first 500 bytes in a file. I also > >> print the root objectid for the inode in run_delalloc and > >> __extent_writepage. > >> > >> Thanks > >> > > > > Just to clarify, in the latest logs, I allowed printing of debug > > printk's and stack dump for all root objectid's. > > Actually, it is the same behaviour when I write anything less than 4K > long, no matter what offset, except if I straddle the page boundary. > To summarise: > 1. write 4K -> write in the fsync path > 2. write less than 4K, within a single page -> bdi_writeback by flush worker > 3. small write that straddles a page boundary or write 4K+delta -> the > first page gets written in the fsync path, the remaining length that > straddles the page boundary is written in the bdi_writeback path > > Please let me know, if I am trying out incorrect cases. > > Sorry for too many mails. > This has been bugging me so much I was dreaming about it and now here I am writing an email at 4:45 in the morning ;). So I couldn't reproduce earlier with any of these scenarios and then I realized something, I'm doing something like this xfs_io -f -c "pwrite 0 54" -c "fsync" /mnt/btrfs-test/foo and it is working perfectly. But I bet what you are doing is something like this file = fopen("/mnt/btrfs-test/foo"); fwrite(buf, 54, 1, file); fsync(fileno(file)); fclose(file); right? Please say yes :). If this is the case then it is likely that these small writes are getting buffered in the userspace buffering that comes with fwrite, and so when you fsync it is only flushing the data that is actually in the kernel, not what is buffered in userspace. Then when you fclose it flushes what is in the userspace buffers out to the kernel and then later on the background writer comes in and writes out the dirty data. To fix this you want to do fflush() and then fsync(). Hopefully that is what you are doing and I can go back to sleep, thanks, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-10-02 11:52 ` Josef Bacik @ 2013-10-02 20:12 ` Aastha Mehta 2013-10-02 23:28 ` Josef Bacik 0 siblings, 1 reply; 19+ messages in thread From: Aastha Mehta @ 2013-10-02 20:12 UTC (permalink / raw) To: Josef Bacik; +Cc: linux-btrfs On 2 October 2013 13:52, Josef Bacik <jbacik@fusionio.com> wrote: > On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote: >> On 1 October 2013 21:42, Aastha Mehta <aasthakm@gmail.com> wrote: >> > On 1 October 2013 21:40, Aastha Mehta <aasthakm@gmail.com> wrote: >> >> On 1 October 2013 19:34, Josef Bacik <jbacik@fusionio.com> wrote: >> >>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: >> >>>> On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: >> >>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: >> >>>> >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: >> >>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: >> >>>> >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: >> >>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: >> >>>> >> >> >> Thank you very much for the reply. That clarifies a lot of things. >> >>>> >> >> >> >> >>>> >> >> >> I was trying a small test case that opens a file, writes a block of >> >>>> >> >> >> data, calls fsync and then closes the file. If I understand correctly, >> >>>> >> >> >> fsync would return only after all in-memory buffers have been >> >>>> >> >> >> committed to disk. I have added few print statements in the >> >>>> >> >> >> __extent_writepage function, and I notice that the function gets >> >>>> >> >> >> called a bit later after fsync returns. It seems that I am not >> >>>> >> >> >> guaranteed to see the data going to disk by the time fsync returns. >> >>>> >> >> >> >> >>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place for >> >>>> >> >> >> disk write? This happens both with tree logging enabled as well as >> >>>> >> >> >> with notreelog. >> >>>> >> >> >> >> >>>> >> >> > >> >>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was >> >>>> >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 >> >>>> >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that >> >>>> >> >> > __extent_writepage was how both data and metadata was written out at the time (I >> >>>> >> >> > don't think I changed it until 3.2 or something later) so what you are likely >> >>>> >> >> > seeing is the normal transaction commit after the fsync. In the case of >> >>>> >> >> > notreelog we are likely starting another transaction and you are seeing that >> >>>> >> >> > commit (at the time the transaction kthread would start a transaction even if >> >>>> >> >> > none had been started yet.) Thanks, >> >>>> >> >> > >> >>>> >> >> > Josef >> >>>> >> >> >> >>>> >> >> Is there any special handling for very small file write, less than 4K? As >> >>>> >> >> I understand there is an optimization to inline the first extent in a file if >> >>>> >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did >> >>>> >> >> set the max_inline mount option to 0, but even then it seems there is >> >>>> >> >> some difference in fsync behaviour for writing first extent of less than 4K >> >>>> >> >> size and writing 4K or more. >> >>>> >> >> >> >>>> >> > >> >>>> >> > Yeah if the file is an inline extent then it will be copied into the log >> >>>> >> > directly and the log will be written out, no going through the data write path >> >>>> >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't >> >>>> >> > honoring that then that may be a bug. Thanks, >> >>>> >> > >> >>>> >> > Josef >> >>>> >> >> >>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. >> >>>> >> Please find attached logs to confirm. >> >>>> >> Also, probably on the older release. >> >>>> >> >> >>>> > >> >>>> > Oooh ok I understand, you have your printk's in the wrong place ;). >> >>>> > do_writepages doesn't necessarily mean you are writing something. If you want >> >>>> > to see if stuff got written to the disk I'd put a printk at run_delalloc_range >> >>>> > and have it spit out the range it is writing out since thats what we think is >> >>>> > actually dirty. Thanks, >> >>>> > >> >>>> > Josef >> >>>> >> >>>> No, but I also placed dump_stack() in the beginning of >> >>>> __extent_writepage. run_delalloc_range is being called only from >> >>>> __extent_writepage, if it were to be called, the dump_stack() at the >> >>>> top of __extent_writepage would have printed as well, no? >> >>>> >> >>> >> >>> Ok I've done the same thing and I'm not seeing what you are seeing. Are you >> >>> using any mount options other than notreelog and max_inline=0? Could you adjust >> >>> your printk to print out the root objectid for the inode as well? It could be >> >>> possible that this is the writeout for the space cache or inode cache. Thanks, >> >>> >> >>> Josef >> >> >> >> I actually printed the stack only when the root objectid is 5. I have >> >> attached another log for writing the first 500 bytes in a file. I also >> >> print the root objectid for the inode in run_delalloc and >> >> __extent_writepage. >> >> >> >> Thanks >> >> >> > >> > Just to clarify, in the latest logs, I allowed printing of debug >> > printk's and stack dump for all root objectid's. >> >> Actually, it is the same behaviour when I write anything less than 4K >> long, no matter what offset, except if I straddle the page boundary. >> To summarise: >> 1. write 4K -> write in the fsync path >> 2. write less than 4K, within a single page -> bdi_writeback by flush worker >> 3. small write that straddles a page boundary or write 4K+delta -> the >> first page gets written in the fsync path, the remaining length that >> straddles the page boundary is written in the bdi_writeback path >> >> Please let me know, if I am trying out incorrect cases. >> >> Sorry for too many mails. >> > > This has been bugging me so much I was dreaming about it and now here I am > writing an email at 4:45 in the morning ;). So I couldn't reproduce earlier > with any of these scenarios and then I realized something, I'm doing something > like this > > xfs_io -f -c "pwrite 0 54" -c "fsync" /mnt/btrfs-test/foo > > and it is working perfectly. But I bet what you are doing is something like > this > > file = fopen("/mnt/btrfs-test/foo"); > fwrite(buf, 54, 1, file); > fsync(fileno(file)); > fclose(file); > > right? Please say yes :). If this is the case then it is likely that these > small writes are getting buffered in the userspace buffering that comes with > fwrite, and so when you fsync it is only flushing the data that is actually in > the kernel, not what is buffered in userspace. Then when you fclose it flushes > what is in the userspace buffers out to the kernel and then later on the > background writer comes in and writes out the dirty data. To fix this you want > to do fflush() and then fsync(). Hopefully that is what you are doing and I can > go back to sleep, thanks, > > Josef Indeed!! :) I did mention I am using f* version of the POSIX API. Sorry for the confusion. Calling fflush before fsync seems to write everything perfectly. It works even without notreelog option, as it should have. I was under the misconception that fflush and fsync do the same thing. Thanks a lot for your quick help. Regards, -- Aastha Mehta MPI-SWS, Germany E-mail: aasthakm@mpi-sws.org ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Questions regarding logging upon fsync in btrfs 2013-10-02 20:12 ` Aastha Mehta @ 2013-10-02 23:28 ` Josef Bacik 0 siblings, 0 replies; 19+ messages in thread From: Josef Bacik @ 2013-10-02 23:28 UTC (permalink / raw) To: Aastha Mehta; +Cc: Josef Bacik, linux-btrfs On Wed, Oct 02, 2013 at 10:12:20PM +0200, Aastha Mehta wrote: > On 2 October 2013 13:52, Josef Bacik <jbacik@fusionio.com> wrote: > > On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote: > >> On 1 October 2013 21:42, Aastha Mehta <aasthakm@gmail.com> wrote: > >> > On 1 October 2013 21:40, Aastha Mehta <aasthakm@gmail.com> wrote: > >> >> On 1 October 2013 19:34, Josef Bacik <jbacik@fusionio.com> wrote: > >> >>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote: > >> >>>> On 30 September 2013 22:47, Josef Bacik <jbacik@fusionio.com> wrote: > >> >>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote: > >> >>>> >> On 30 September 2013 22:11, Josef Bacik <jbacik@fusionio.com> wrote: > >> >>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote: > >> >>>> >> >> On 29 September 2013 15:12, Josef Bacik <jbacik@fusionio.com> wrote: > >> >>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote: > >> >>>> >> >> >> Thank you very much for the reply. That clarifies a lot of things. > >> >>>> >> >> >> > >> >>>> >> >> >> I was trying a small test case that opens a file, writes a block of > >> >>>> >> >> >> data, calls fsync and then closes the file. If I understand correctly, > >> >>>> >> >> >> fsync would return only after all in-memory buffers have been > >> >>>> >> >> >> committed to disk. I have added few print statements in the > >> >>>> >> >> >> __extent_writepage function, and I notice that the function gets > >> >>>> >> >> >> called a bit later after fsync returns. It seems that I am not > >> >>>> >> >> >> guaranteed to see the data going to disk by the time fsync returns. > >> >>>> >> >> >> > >> >>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place for > >> >>>> >> >> >> disk write? This happens both with tree logging enabled as well as > >> >>>> >> >> >> with notreelog. > >> >>>> >> >> >> > >> >>>> >> >> > > >> >>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was > >> >>>> >> >> > _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 > >> >>>> >> >> > or just build btrfs-next from git. But if I were to make a guess I'd say that > >> >>>> >> >> > __extent_writepage was how both data and metadata was written out at the time (I > >> >>>> >> >> > don't think I changed it until 3.2 or something later) so what you are likely > >> >>>> >> >> > seeing is the normal transaction commit after the fsync. In the case of > >> >>>> >> >> > notreelog we are likely starting another transaction and you are seeing that > >> >>>> >> >> > commit (at the time the transaction kthread would start a transaction even if > >> >>>> >> >> > none had been started yet.) Thanks, > >> >>>> >> >> > > >> >>>> >> >> > Josef > >> >>>> >> >> > >> >>>> >> >> Is there any special handling for very small file write, less than 4K? As > >> >>>> >> >> I understand there is an optimization to inline the first extent in a file if > >> >>>> >> >> it is smaller than 4K, does it affect the writeback on fsync as well? I did > >> >>>> >> >> set the max_inline mount option to 0, but even then it seems there is > >> >>>> >> >> some difference in fsync behaviour for writing first extent of less than 4K > >> >>>> >> >> size and writing 4K or more. > >> >>>> >> >> > >> >>>> >> > > >> >>>> >> > Yeah if the file is an inline extent then it will be copied into the log > >> >>>> >> > directly and the log will be written out, no going through the data write path > >> >>>> >> > at all. Max inline == 0 should make it so we don't inline, so if it isn't > >> >>>> >> > honoring that then that may be a bug. Thanks, > >> >>>> >> > > >> >>>> >> > Josef > >> >>>> >> > >> >>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then. > >> >>>> >> Please find attached logs to confirm. > >> >>>> >> Also, probably on the older release. > >> >>>> >> > >> >>>> > > >> >>>> > Oooh ok I understand, you have your printk's in the wrong place ;). > >> >>>> > do_writepages doesn't necessarily mean you are writing something. If you want > >> >>>> > to see if stuff got written to the disk I'd put a printk at run_delalloc_range > >> >>>> > and have it spit out the range it is writing out since thats what we think is > >> >>>> > actually dirty. Thanks, > >> >>>> > > >> >>>> > Josef > >> >>>> > >> >>>> No, but I also placed dump_stack() in the beginning of > >> >>>> __extent_writepage. run_delalloc_range is being called only from > >> >>>> __extent_writepage, if it were to be called, the dump_stack() at the > >> >>>> top of __extent_writepage would have printed as well, no? > >> >>>> > >> >>> > >> >>> Ok I've done the same thing and I'm not seeing what you are seeing. Are you > >> >>> using any mount options other than notreelog and max_inline=0? Could you adjust > >> >>> your printk to print out the root objectid for the inode as well? It could be > >> >>> possible that this is the writeout for the space cache or inode cache. Thanks, > >> >>> > >> >>> Josef > >> >> > >> >> I actually printed the stack only when the root objectid is 5. I have > >> >> attached another log for writing the first 500 bytes in a file. I also > >> >> print the root objectid for the inode in run_delalloc and > >> >> __extent_writepage. > >> >> > >> >> Thanks > >> >> > >> > > >> > Just to clarify, in the latest logs, I allowed printing of debug > >> > printk's and stack dump for all root objectid's. > >> > >> Actually, it is the same behaviour when I write anything less than 4K > >> long, no matter what offset, except if I straddle the page boundary. > >> To summarise: > >> 1. write 4K -> write in the fsync path > >> 2. write less than 4K, within a single page -> bdi_writeback by flush worker > >> 3. small write that straddles a page boundary or write 4K+delta -> the > >> first page gets written in the fsync path, the remaining length that > >> straddles the page boundary is written in the bdi_writeback path > >> > >> Please let me know, if I am trying out incorrect cases. > >> > >> Sorry for too many mails. > >> > > > > This has been bugging me so much I was dreaming about it and now here I am > > writing an email at 4:45 in the morning ;). So I couldn't reproduce earlier > > with any of these scenarios and then I realized something, I'm doing something > > like this > > > > xfs_io -f -c "pwrite 0 54" -c "fsync" /mnt/btrfs-test/foo > > > > and it is working perfectly. But I bet what you are doing is something like > > this > > > > file = fopen("/mnt/btrfs-test/foo"); > > fwrite(buf, 54, 1, file); > > fsync(fileno(file)); > > fclose(file); > > > > right? Please say yes :). If this is the case then it is likely that these > > small writes are getting buffered in the userspace buffering that comes with > > fwrite, and so when you fsync it is only flushing the data that is actually in > > the kernel, not what is buffered in userspace. Then when you fclose it flushes > > what is in the userspace buffers out to the kernel and then later on the > > background writer comes in and writes out the dirty data. To fix this you want > > to do fflush() and then fsync(). Hopefully that is what you are doing and I can > > go back to sleep, thanks, > > > > Josef > > Indeed!! :) > > I did mention I am using f* version of the POSIX API. Sorry for the > confusion. Calling fflush before fsync seems to write everything > perfectly. It works even without notreelog option, as it should have. > I was under the misconception that fflush and fsync do the same thing. > > Thanks a lot for your quick help. > Yeah sorry all I read was "there is a bug in fsync" and just assumed we broke something ;). Glad I could help, Josef ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2013-10-02 23:29 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-09-28 23:35 Questions regarding logging upon fsync in btrfs Aastha Mehta 2013-09-28 23:46 ` Aastha Mehta 2013-09-29 0:21 ` Hugo Mills 2013-09-29 0:42 ` Josef Bacik 2013-09-29 9:22 ` Aastha Mehta 2013-09-29 13:12 ` Josef Bacik 2013-09-30 19:32 ` Aastha Mehta 2013-09-30 20:11 ` Josef Bacik 2013-09-30 20:30 ` Aastha Mehta 2013-09-30 20:47 ` Josef Bacik 2013-09-30 21:07 ` Aastha Mehta 2013-09-30 21:17 ` Josef Bacik 2013-10-01 17:34 ` Josef Bacik 2013-10-01 19:40 ` Aastha Mehta 2013-10-01 19:42 ` Aastha Mehta 2013-10-01 20:13 ` Aastha Mehta 2013-10-02 11:52 ` Josef Bacik 2013-10-02 20:12 ` Aastha Mehta 2013-10-02 23:28 ` Josef Bacik
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).