* question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
@ 2015-07-12 17:15 Alex Lyakas
  2015-07-13  9:27 ` Filipe David Manana
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Lyakas @ 2015-07-12 17:15 UTC (permalink / raw)
To: linux-btrfs, jbacik

Greetings,

Looking at the code of should_cow_block(), I see:

	if (btrfs_header_generation(buf) == trans->transid &&
	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
	    ...

So if the extent buffer has been written to disk, and is now changed again
in the same transaction, we insist on COW'ing it. Can anybody explain why
COW is needed in this case? The transaction has not committed yet, so what
is the danger of rewriting to the same location on disk? My understanding
was that a tree block needs to be COW'ed at most once in the same
transaction. But I see that this is not the case.

I am asking because I am doing some profiling of btrfs metadata work under
heavy loads, and I see that sometimes btrfs COWs almost twice as many tree
blocks as the total metadata size.

Thanks,
Alex.

^ permalink raw reply [flat|nested] 5+ messages in thread
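For context, a paraphrased sketch of the full check, reconstructed from memory of the 4.x-era fs/btrfs/ctree.c, so treat it as approximate rather than verbatim; the WRITTEN-flag test under discussion is the second condition:

	static inline int should_cow_block(struct btrfs_trans_handle *trans,
					   struct btrfs_root *root,
					   struct extent_buffer *buf)
	{
		/*
		 * COW can be skipped only when the block was already COWed
		 * in this transaction (generation matches), has not yet been
		 * handed to writeback, and is not flagged for relocation or
		 * forced COW.
		 */
		if (btrfs_header_generation(buf) == trans->transid &&
		    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
		    !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
		      btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)) &&
		    !test_bit(BTRFS_ROOT_FORCE_COW, &root->state))
			return 0;	/* safe to modify in place */
		return 1;		/* must COW */
	}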
* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-12 17:15 question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN Alex Lyakas
@ 2015-07-13 9:27 ` Filipe David Manana
  2015-07-13 16:55 ` Alex Lyakas
  0 siblings, 1 reply; 5+ messages in thread
From: Filipe David Manana @ 2015-07-13 9:27 UTC (permalink / raw)
To: Alex Lyakas; +Cc: linux-btrfs@vger.kernel.org, Josef Bacik

On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Greetings,
> Looking at the code of should_cow_block(), I see:
>
> 	if (btrfs_header_generation(buf) == trans->transid &&
> 	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> 	    ...
>
> So if the extent buffer has been written to disk, and is now changed again
> in the same transaction, we insist on COW'ing it. Can anybody explain why
> COW is needed in this case? The transaction has not committed yet, so what
> is the danger of rewriting to the same location on disk? My understanding
> was that a tree block needs to be COW'ed at most once in the same
> transaction. But I see that this is not the case.

That logic is there, as far as I can see, for at least 2 obvious reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree have
the same transaction id/generation, and you can have multiple fsyncs
(log transaction commits) per transaction, so you need to ensure
consistency. If we skipped the COWing in the example below, you would
get an inconsistent log tree at log replay time when the fs is mounted:

transaction N start

fsync inode A start
  creates tree block X
  flush X to disk
  write a new superblock
fsync inode A end

fsync inode B start
  skip COW of X because its generation == current transaction id, and
  modify it in place
  flush X to disk

========== crash ===========

  write a new superblock
fsync inode B end

transaction N commit

2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
written to disk but when we trigger writeback for it. So while the
writeback is ongoing we want to make sure the block's content isn't
concurrently modified (we don't keep the eb write-locked, to allow
concurrent reads during the writeback).

All tree blocks that don't belong to a log tree are normally written
only at the end of a transaction commit. But often, e.g. due to memory
pressure, the VM can call the writepages() callback of the btree inode
to force dirty tree blocks to be written to disk before the transaction
commit.

>
> I am asking because I am doing some profiling of btrfs metadata work under
> heavy loads, and I see that sometimes btrfs COWs almost twice as many tree
> blocks as the total metadata size.
>
> Thanks,
> Alex.

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-13 9:27 ` Filipe David Manana
@ 2015-07-13 16:55 ` Alex Lyakas
  2015-07-13 17:02 ` Chris Mason
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Lyakas @ 2015-07-13 16:55 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs@vger.kernel.org, Josef Bacik

Filipe,
Thanks for the explanation. Those reasons were not so obvious to me.

Would it make sense not to COW the block in case-1, if we are mounted
with "notreelog"? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that the
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and the pages are evicted from the
page cache of the btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock out
the writeback for an eb? I.e., if we know that the eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to. (A rough sketch of this idea follows the quoted message
below.)

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction a total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation == curr_transid, not only during the commit of the
transaction.) This particular run was with "notreelog".

The machine had 8GB of RAM. Linux allows the btree_inode to grow its
page cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
But even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.

On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana <fdmanana@gmail.com> wrote:
> On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
>> Greetings,
>> Looking at the code of should_cow_block(), I see:
>>
>> 	if (btrfs_header_generation(buf) == trans->transid &&
>> 	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>> 	    ...
>>
>> So if the extent buffer has been written to disk, and is now changed again
>> in the same transaction, we insist on COW'ing it. Can anybody explain why
>> COW is needed in this case? The transaction has not committed yet, so what
>> is the danger of rewriting to the same location on disk? My understanding
>> was that a tree block needs to be COW'ed at most once in the same
>> transaction. But I see that this is not the case.
>
> That logic is there, as far as I can see, for at least 2 obvious reasons:
>
> 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
> have the same transaction id/generation, and you can have multiple
> fsyncs (log transaction commits) per transaction, so you need to ensure
> consistency.
> If we skipped the COWing in the example below, you would
> get an inconsistent log tree at log replay time when the fs is
> mounted:
>
> transaction N start
>
> fsync inode A start
>   creates tree block X
>   flush X to disk
>   write a new superblock
> fsync inode A end
>
> fsync inode B start
>   skip COW of X because its generation == current transaction id, and
>   modify it in place
>   flush X to disk
>
> ========== crash ===========
>
>   write a new superblock
> fsync inode B end
>
> transaction N commit
>
> 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
> written to disk but when we trigger writeback for it. So while the
> writeback is ongoing we want to make sure the block's content isn't
> concurrently modified (we don't keep the eb write-locked, to allow
> concurrent reads during the writeback).
>
> All tree blocks that don't belong to a log tree are normally written
> only at the end of a transaction commit. But often, e.g. due to memory
> pressure, the VM can call the writepages() callback of the btree inode
> to force dirty tree blocks to be written to disk before the transaction
> commit.
>
>>
>> I am asking because I am doing some profiling of btrfs metadata work under
>> heavy loads, and I see that sometimes btrfs COWs almost twice as many tree
>> blocks as the total metadata size.
>>
>> Thanks,
>> Alex.
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men."

^ permalink raw reply [flat|nested] 5+ messages in thread
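A hypothetical sketch of the lock-out mechanism Alex proposes above. EXTENT_BUFFER_WRITEBACK is a real eb->bflags bit tracking in-flight IO, but EXTENT_BUFFER_WB_BLOCKED and the helper itself are invented purely for illustration; nothing like this exists in btrfs, and a real implementation would also need the writeback path in extent_io.c to honor the new bit:

	/* Hypothetical sketch; EXTENT_BUFFER_WB_BLOCKED is an invented flag. */
	static bool try_redirty_without_cow(struct extent_buffer *eb)
	{
		/* Writeback already in flight: contents must not change, so COW. */
		if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
			return false;

		/* Lock out writeback from starting. */
		if (test_and_set_bit(EXTENT_BUFFER_WB_BLOCKED, &eb->bflags))
			return false;

		/* Re-check: writeback may have started before we set our bit. */
		if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
			clear_bit(EXTENT_BUFFER_WB_BLOCKED, &eb->bflags);
			return false;
		}

		/*
		 * Safe to redirty in place; the caller clears
		 * EXTENT_BUFFER_WB_BLOCKED when done, letting writeback
		 * proceed with the new contents.
		 */
		return true;
	}

Note that, as Chris points out in the next message, this alone would not address the stale-generation problem of rewriting a block in place.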
* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-13 16:55 ` Alex Lyakas
@ 2015-07-13 17:02 ` Chris Mason
  2015-07-21 14:11 ` Alex Lyakas
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Mason @ 2015-07-13 17:02 UTC (permalink / raw)
To: Alex Lyakas; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Josef Bacik

On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
> Filipe,
> Thanks for the explanation. Those reasons were not so obvious to me.
>
> Would it make sense not to COW the block in case-1, if we are mounted
> with "notreelog"? Or, perhaps, to check that the block does not belong
> to a log tree?

Hi Alex,

The crc rules are the most important: we have to make sure the block
isn't changed while it is in flight. Also, think about something like
this:

transaction writes block A, puts a pointer to it in the btree, generation Y

<hard disk properly completes the IO>

transaction rewrites block A, same generation Y

<hard disk drops the IO on the floor and never does it>

Later on, we try to read block A again. We find it has the correct crc
and the correct generation number, but the contents are actually wrong.

> The second case is more difficult. One problem is that the
> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
> due to memory pressure (this is what I see happening), we complete the
> writeback, release the extent buffer, and the pages are evicted from the
> page cache of the btree_inode. After some time we read the block again
> (because we want to modify it in the same transaction), but its header
> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
> this point it should be safe to avoid COW, we will re-COW.
>
> Would it make sense to have some runtime-only mechanism to lock out
> the writeback for an eb? I.e., if we know that the eb is not under
> writeback, and writeback is locked out from starting, we can redirty
> the block without COW. Then we allow the writeback to start when it
> wants to.
>
> In one of my test runs, btrfs had 6.4GB of metadata (before
> raid-induced overhead), but during a particular transaction a total of
> 10GB of metadata (again, before raid-induced overhead) was written to
> disk. (This is the total of all ebs having
> header->generation == curr_transid, not only during the commit of the
> transaction.) This particular run was with "notreelog".
>
> The machine had 8GB of RAM. Linux allows the btree_inode to grow its
> page cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
> But even though the used amount of metadata is less than that, this
> re-COW'ing of already-COW'ed blocks seems to cause page-cache
> thrashing...

Interesting. We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping memory-pressure
related writeback if there isn't that much dirty metadata. There is a
lot of room to improve those decisions, like preferring to write leaves
over nodes, especially full leaves that are not likely to change again.

-chris

^ permalink raw reply [flat|nested] 5+ messages in thread
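The skip Chris describes looks roughly like this in that era's fs/btrfs/disk-io.c. Paraphrased from memory, so approximate (BTRFS_DIRTY_METADATA_THRESH was on the order of 32MB at the time):

	static int btree_writepages(struct address_space *mapping,
				    struct writeback_control *wbc)
	{
		struct btrfs_fs_info *fs_info;
		int ret;

		if (wbc->sync_mode == WB_SYNC_NONE) {
			/* periodic (kupdate) writeback never touches the btree */
			if (wbc->for_kupdate)
				return 0;

			fs_info = BTRFS_I(mapping->host)->root->fs_info;
			/* this is a bit racy, but that's ok */
			ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
						     BTRFS_DIRTY_METADATA_THRESH);
			if (ret < 0)	/* not enough dirty metadata: skip */
				return 0;
		}
		return btree_write_cache_pages(mapping, wbc);
	}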
* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-13 17:02 ` Chris Mason
@ 2015-07-21 14:11 ` Alex Lyakas
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Lyakas @ 2015-07-21 14:11 UTC (permalink / raw)
To: Chris Mason, Alex Lyakas, Filipe Manana, linux-btrfs@vger.kernel.org
Cc: Shyam Kaushik

Hi Filipe, Chris,

I did more profiling work, and I want to write up some of the things I
noticed. Hopefully it has some value for the community.

# btrfs_transaction.dirty_pages

This "extent_io_tree" includes every eb created by btrfs_init_new_buffer
(this is the only function that adds dirty extents there). But sometimes
such an eb is later marked clean again (clean_tree_block) without being
removed from btrfs_transaction.dirty_pages. So I think
btrfs_write_and_wait_transaction triggers a writeout for such ebs as
well; since they are no longer marked dirty, this probably does nothing.
I don't know whether this can cause any issue.

# btrfs_fs_info.dirty_metadata_bytes

This accounts dirty metadata per filesystem, not per particular
transaction, because we may have transaction X in
btrfs_write_and_wait_transaction while transaction X+1 is creating dirty
ebs. This is probably intended.

# btrfs_init_new_buffer

I was trying to account for the exact number of dirty ebs per
transaction. Obviously, any such dirty eb would always be created by
this function. And I was trying to compare the number of ebs created by
this function vs. the number of ebs written out by write_one_eb.

I am aware that a particular eb can be freed in the same transaction via
btrfs_free_tree_block(), which either pins it or drops it directly back
to the free-space cache. I am also aware that delayed references may
cancel one another, and that run_one_delayed_ref() may pin an eb without
doing any extent-tree update. I am also aware of the VM writeback
happening via btree_writepages, thus requiring some of the ebs to be
re-COWed.

Still, the total amount of metadata written by a particular transaction
is less than the total number of ebs created by btrfs_init_new_buffer,
even taking into account all the cases I mentioned above. I guess that
clean_tree_block() is the culprit, but I am not sure what happens to
such a "cleaned" eb. Does it just remain in the btree_inode page cache,
or is it freed by some path I am unaware of?

# should_cow_block() + BTRFS_HEADER_FLAG_WRITTEN causing a re-COW of a
tree block

Chris, you mention that btrfs should be skipping memory-pressure
writeback if there is not much dirty metadata. I guess you mean
BTRFS_DIRTY_METADATA_THRESH checked in btree_writepages, and also the
skipping of "for_kupdate" writeback.

Grepping through the kernel sources, I see that btree_writepages should
never see the for_reclaim bit set, because for that the btree_inode
would need to provide a "writepage" callback, but it provides only
"writepages". (for_reclaim is set by vmscan.c::pageout and
migrate.c::writeout, which both require "writepage", if I am reading the
code correctly.) So except for explicit writeback requests,
btree_writepages should see only "for_background" writeback, which as I
understand it is mostly controlled by
dirty_background_ratio/dirty_background_bytes (not by memory pressure).

Anyway, I disabled the WB_SYNC_NONE writeback entirely, and indeed was
not seeing any re-COWs. But on the other hand, I had a few OOM panics
when dirty metadata grew to several GBs.
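Supporting the "no writepage callback" observation above, the btree inode's address_space operations in that era's fs/btrfs/disk-io.c look approximately like this (paraphrased; note the absence of a .writepage entry):

	static const struct address_space_operations btree_aops = {
		.readpage	= btree_readpage,
		.writepages	= btree_writepages,	/* no .writepage */
		.releasepage	= btree_releasepage,
		.invalidatepage	= btree_invalidatepage,
	#ifdef CONFIG_MIGRATION
		.migratepage	= btree_migratepage,
	#endif
		.set_page_dirty	= btree_set_page_dirty,
	};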
# global_block_rsv and update_global_block_rsv

I don't fully understand what calc_global_metadata_size() is trying to
do, but I think that once the metadata size is above a couple of GBs,
update_global_block_rsv() will always refill global_block_rsv back to
512MB, as it does:

	block_rsv->size = min_t(u64, num_bytes, 512 * 1024 * 1024);

and then bumps "reserved" up to "size". As a result,
should_end_transaction() will tell us that we need to start a commit
once we have grabbed half of that, i.e., 256MB or more, from
global_block_rsv (a paraphrased sketch of this heuristic follows at the
end of this message). This seems very reasonable, as we really don't
want to have more dirty metadata than that in a single transaction.

However, we might call update_global_block_rsv() on different occasions,
and every such call will bump it back to 512MB. So, for example, when we
create a new data block group, btrfs_make_block_group() will call
update_global_block_rsv(), allowing us to dirty more metadata before we
commit. This seems a bit strange: why are we bumping it back, suddenly
allowing more updates to come in?

# run_delayed_refs by the commit thread

I see that the current spirit is that when somebody detaches from a
transaction via btrfs_end_transaction, we don't want to delay them for
too long. So even if must_run_delayed_refs == 1, we run only a small
batch of delayed refs:

	cur = max_t(unsigned long, cur, 32)

As a result, when we trigger a commit we might have tens of thousands of
delayed refs to be run by the commit thread. The commit thread calls
btrfs_run_delayed_refs several times before it reaches the critical
section (TRANS_STATE_COMMIT_DOING), where we don't allow anybody to join
it. Profiling these run_delayed_refs calls shows that they are slow
because:

- we need to read the extent-tree blocks from disk[1] in order to update
the extent tree (read_extent_buffer_pages), and there is only one thread
reading them sequentially (the commit thread);

- before we get to the critical section of the commit, other threads can
also be running delayed refs, so the commit thread has to compete with
them for tree-block locks (and they appear to hold the locks for a long
time because they, too, are reading tree blocks from disk).

So my question is: shouldn't we be much more aggressive in
__btrfs_end_transaction, running delayed refs several times, checking
trans->delayed_ref_updates after each run, and returning only once this
number is zero or small enough? This way, when we trigger a commit it
will not have many delayed refs to run; it will reach the critical
section very quickly, hopefully pass through it very quickly (getting to
TRANS_STATE_UNBLOCKED), and then we can open a new transaction while the
previous one is doing btrfs_write_and_wait_transaction.

That's what I wanted to ask. Thanks!
Alex.

[1] In my case, btrfs metadata is ~10GB and the machine has 8GB of RAM.
Due to this we need to read a lot of ebs from disk, as they are not in
the page cache. Also keep in mind that every COW of an eb requires a new
slot in the page cache, because we index by the "bytenr" we receive from
the free-space cache, which is a "logical" coordinate by which
EXTENT_ITEMs are sorted in the extent tree.

On Mon, Jul 13, 2015 at 7:02 PM, Chris Mason <clm@fb.com> wrote:
> On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
>> Filipe,
>> Thanks for the explanation. Those reasons were not so obvious to me.
>>
>> Would it make sense not to COW the block in case-1, if we are mounted
>> with "notreelog"? Or, perhaps, to check that the block does not belong
>> to a log tree?
>>
>
> Hi Alex,
>
> The crc rules are the most important: we have to make sure the block
> isn't changed while it is in flight. Also, think about something like
> this:
>
> transaction writes block A, puts a pointer to it in the btree, generation Y
>
> <hard disk properly completes the IO>
>
> transaction rewrites block A, same generation Y
>
> <hard disk drops the IO on the floor and never does it>
>
> Later on, we try to read block A again. We find it has the correct crc
> and the correct generation number, but the contents are actually wrong.
>
>> The second case is more difficult. One problem is that the
>> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
>> due to memory pressure (this is what I see happening), we complete the
>> writeback, release the extent buffer, and the pages are evicted from the
>> page cache of the btree_inode. After some time we read the block again
>> (because we want to modify it in the same transaction), but its header
>> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
>> this point it should be safe to avoid COW, we will re-COW.
>>
>> Would it make sense to have some runtime-only mechanism to lock out
>> the writeback for an eb? I.e., if we know that the eb is not under
>> writeback, and writeback is locked out from starting, we can redirty
>> the block without COW. Then we allow the writeback to start when it
>> wants to.
>>
>> In one of my test runs, btrfs had 6.4GB of metadata (before
>> raid-induced overhead), but during a particular transaction a total of
>> 10GB of metadata (again, before raid-induced overhead) was written to
>> disk. (This is the total of all ebs having
>> header->generation == curr_transid, not only during the commit of the
>> transaction.) This particular run was with "notreelog".
>>
>> The machine had 8GB of RAM. Linux allows the btree_inode to grow its
>> page cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
>> But even though the used amount of metadata is less than that, this
>> re-COW'ing of already-COW'ed blocks seems to cause page-cache
>> thrashing...
>
> Interesting. We've addressed this in the past with changes to the
> writepage(s) callback for the btree, basically skipping memory-pressure
> related writeback if there isn't that much dirty metadata. There is a
> lot of room to improve those decisions, like preferring to write leaves
> over nodes, especially full leaves that are not likely to change again.
>
> -chris

^ permalink raw reply [flat|nested] 5+ messages in thread
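The should_end_transaction() heuristic referred to in the message above looks approximately like this in that era's fs/btrfs/transaction.c. Paraphrased from memory: btrfs_block_rsv_check() with a min_factor of 5 returns non-zero once "reserved" has dropped below 50% of "size", i.e. below ~256MB for a 512MB global reserve:

	static int should_end_transaction(struct btrfs_trans_handle *trans,
					  struct btrfs_root *root)
	{
		if (root->fs_info->global_block_rsv.space_info->full &&
		    btrfs_check_space_for_delayed_refs(trans, root))
			return 1;

		/* non-zero once less than half of the global reserve remains */
		return !!btrfs_block_rsv_check(root,
					       &root->fs_info->global_block_rsv, 5);
	}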
end of thread, other threads: [~2015-07-21 14:11 UTC | newest]

Thread overview: 5+ messages
2015-07-12 17:15 question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN Alex Lyakas
2015-07-13  9:27 ` Filipe David Manana
2015-07-13 16:55   ` Alex Lyakas
2015-07-13 17:02     ` Chris Mason
2015-07-21 14:11       ` Alex Lyakas