linux-btrfs.vger.kernel.org archive mirror
* question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
@ 2015-07-12 17:15 Alex Lyakas
  2015-07-13  9:27 ` Filipe David Manana
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Lyakas @ 2015-07-12 17:15 UTC (permalink / raw)
  To: linux-btrfs, jbacik

Greetings,
Looking at the code of should_cow_block(), I see:

if (btrfs_header_generation(buf) == trans->transid &&
    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
...
So if the extent buffer has already been written to disk and is now changed 
again in the same transaction, we insist on COW'ing it. Can anybody explain 
why COW is needed in this case? The transaction has not committed yet, so 
what is the danger of rewriting to the same location on disk? My 
understanding was that a tree block needs to be COW'ed at most once per 
transaction, but I see that this is not the case.
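
To make sure I am reading the condition correctly, here is a toy
userspace model of it (my own paraphrase, not kernel code; the flag
name is just a stand-in):

#include <stdbool.h>
#include <stdio.h>

#define HEADER_FLAG_WRITTEN 0x1  /* stand-in for BTRFS_HEADER_FLAG_WRITTEN */

struct toy_block {
        unsigned long long generation;
        unsigned long flags;
};

/* Mirrors the quoted condition: COW can be skipped only if the block
 * was allocated in the running transaction AND has not been handed
 * to writeback yet. */
static bool toy_should_cow(const struct toy_block *b,
                           unsigned long long transid)
{
        if (b->generation == transid && !(b->flags & HEADER_FLAG_WRITTEN))
                return false;
        return true;
}

int main(void)
{
        struct toy_block fresh   = { .generation = 100, .flags = 0 };
        struct toy_block flushed = { .generation = 100,
                                     .flags = HEADER_FLAG_WRITTEN };

        printf("fresh block, same transaction:   cow=%d\n",
               toy_should_cow(&fresh, 100));
        printf("flushed block, same transaction: cow=%d\n",
               toy_should_cow(&flushed, 100));
        printf("block from older transaction:    cow=%d\n",
               toy_should_cow(&fresh, 101));
        return 0;
}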

I am asking because I am doing some profiling of btrfs metadata work under 
heavy loads, and I see that the volume of tree blocks btrfs COWs is 
sometimes almost twice the total metadata size.

Thanks,
Alex.



* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-12 17:15 question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN Alex Lyakas
@ 2015-07-13  9:27 ` Filipe David Manana
  2015-07-13 16:55   ` Alex Lyakas
  0 siblings, 1 reply; 5+ messages in thread
From: Filipe David Manana @ 2015-07-13  9:27 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs@vger.kernel.org, Josef Bacik

On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas <alex@zadarastorage.com> wrote:
> Greetings,
> Looking at the code of should_cow_block(), I see:
>
> if (btrfs_header_generation(buf) == trans->transid &&
>    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
> ...
> So if the extent buffer has already been written to disk and is now changed
> again in the same transaction, we insist on COW'ing it. Can anybody explain
> why COW is needed in this case? The transaction has not committed yet, so
> what is the danger of rewriting to the same location on disk? My
> understanding was that a tree block needs to be COW'ed at most once per
> transaction, but I see that this is not the case.

That logic is there, as far as I can see, for at least 2 obvious reasons:

1) fsync/log trees. All extent buffers (tree blocks) of a log tree
have the same transaction id/generation, and you can have multiple
fsyncs (log transaction commits) per transaction, so you need to ensure
consistency. If we skipped the COW in the example below, you would
get an inconsistent log tree at log replay time when the fs is
mounted:

transaction N start

   fsync inode A start
   creates tree block X
   flush X to disk
   write a new superblock
   fsync inode A end

   fsync inode B start
   skip COW of X (its generation == current transaction id), modify it
   in place
   flush X to disk

========== crash ===========

   write a new superblock
   fsync inode B end

transaction N commit
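
A toy userspace model of that timeline (not btrfs code, just to show
the inconsistency that log replay would see):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char block_X[32];
        char version_sb_expects[32];

        /* fsync A: write X, then the superblock pins this version */
        strcpy(block_X, "log-state-after-fsync-A");
        strcpy(version_sb_expects, block_X);

        /* fsync B: COW skipped, X rewritten in place; crash happens
         * before the new superblock is written */
        strcpy(block_X, "log-state-mid-fsync-B");

        /* log replay: the superblock expects A's state, finds B's */
        printf("replay consistent=%d (found \"%s\")\n",
               strcmp(block_X, version_sb_expects) == 0, block_X);
        return 0;
}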

2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
written to disk but instead when we trigger writeback for it. So while
the writeback is ongoing we want to make sure the block's content
isn't concurrently modified (we don't keep the eb write locked to
allow concurrent reads during the writeback).

All tree blocks that don't belong to a log tree are normally written
only at the end of a transaction commit. But often, e.g. due to memory
pressure, the VM can call the writepages() callback of the btree inode
to force dirty tree blocks to be written to disk before the transaction
commit.




-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-13  9:27 ` Filipe David Manana
@ 2015-07-13 16:55   ` Alex Lyakas
  2015-07-13 17:02     ` Chris Mason
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Lyakas @ 2015-07-13 16:55 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs@vger.kernel.org, Josef Bacik

Filipe,
Thanks for the explanation. Those reasons were not so obvious to me.

Would it make sense not to COW the block in case 1 if we are mounted
with "notreelog"? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that the
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and the pages are evicted from
the page cache of the btree_inode. Some time later we read the block
again (because we want to modify it in the same transaction), but its
header is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even
though at this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock out
writeback for an eb? I.e., if we know that an eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW, and then allow writeback to start whenever it
wants.
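
Something along these lines (a rough userspace sketch of the idea with
pthreads; none of this exists in btrfs today, and the names are made
up):

#include <pthread.h>

struct eb_wb_lockout {
        pthread_mutex_t lock;    /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  cond;    /* init with PTHREAD_COND_INITIALIZER */
        int wb_blockers;         /* modifiers currently redirtying the eb */
        int wb_running;          /* writeback in flight */
};

/* Modifier side: try to redirty the block without COW. */
int eb_block_writeback(struct eb_wb_lockout *l)
{
        pthread_mutex_lock(&l->lock);
        if (l->wb_running) {             /* too late, writeback started */
                pthread_mutex_unlock(&l->lock);
                return -1;               /* caller must COW instead */
        }
        l->wb_blockers++;
        pthread_mutex_unlock(&l->lock);
        return 0;
}

void eb_unblock_writeback(struct eb_wb_lockout *l)
{
        pthread_mutex_lock(&l->lock);
        if (--l->wb_blockers == 0)
                pthread_cond_broadcast(&l->cond);
        pthread_mutex_unlock(&l->lock);
}

/* Writeback side: start only when no modifier holds the lockout. */
void eb_writeback_start(struct eb_wb_lockout *l)
{
        pthread_mutex_lock(&l->lock);
        while (l->wb_blockers > 0)
                pthread_cond_wait(&l->cond, &l->lock);
        l->wb_running = 1;
        pthread_mutex_unlock(&l->lock);
}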

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction a total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation == curr_transid, not only what is written during
the commit of the transaction.) This particular run was with
"notreelog".

The machine had 8GB of RAM. Linux allows the btree_inode to grow its
page cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
But even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.




* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-13 16:55   ` Alex Lyakas
@ 2015-07-13 17:02     ` Chris Mason
  2015-07-21 14:11       ` Alex Lyakas
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Mason @ 2015-07-13 17:02 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Filipe Manana, linux-btrfs@vger.kernel.org, Josef Bacik

On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
> Filipe,
> Thanks for the explanation. Those reasons were not so obvious to me.
> 
> Would it make sense not to COW the block in case 1 if we are mounted
> with "notreelog"? Or, perhaps, to check that the block does not belong
> to a log tree?
> 

Hi Alex,

The crc rules are the most important: we have to make sure the block
isn't changed while it is in flight.  Also, think about something like
this:

transaction writes block A, puts a pointer to it in the btree, generation Y

<hard disk properly completes the IO>

transaction rewrites block A, same generation Y

<hard disk drops the IO on the floor and never does it>

Later on, we try to read block A again.  We find it has the correct crc
and the correct generation number, but the contents are actually wrong.
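
Here is a toy userspace model of that failure (not btrfs code; the
checksum is a stand-in):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct disk_block {
        uint64_t generation;
        uint32_t crc;
        char data[16];
};

static uint32_t toy_crc(const char *s)
{
        uint32_t c = 0;
        while (*s)
                c = c * 31 + (uint32_t)*s++;
        return c;
}

static void write_block(struct disk_block *d, uint64_t gen,
                        const char *data, int io_dropped)
{
        if (io_dropped)
                return;          /* the disk silently loses the write */
        d->generation = gen;
        strcpy(d->data, data);
        d->crc = toy_crc(d->data);
}

int main(void)
{
        struct disk_block blk = {0};

        write_block(&blk, 7, "version-1", 0); /* first write completes */
        write_block(&blk, 7, "version-2", 1); /* in-place rewrite, same
                                                 generation, dropped */

        /* the reader checks crc and expected generation: both pass,
         * but the contents are stale */
        printf("crc ok=%d gen ok=%d data=%s (expected version-2)\n",
               blk.crc == toy_crc(blk.data), blk.generation == 7,
               blk.data);
        return 0;
}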

> The second case is more difficult. One problem is that the
> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
> due to memory pressure (this is what I see happening), we complete the
> writeback, release the extent buffer, and the pages are evicted from
> the page cache of the btree_inode. Some time later we read the block
> again (because we want to modify it in the same transaction), but its
> header is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even
> though at this point it should be safe to avoid COW, we will re-COW.
> 
> Would it make sense to have some runtime-only mechanism to lock out
> writeback for an eb? I.e., if we know that an eb is not under
> writeback, and writeback is locked out from starting, we can redirty
> the block without COW, and then allow writeback to start whenever it
> wants.
> 
> In one of my test runs, btrfs had 6.4GB of metadata (before
> raid-induced overhead), but during a particular transaction a total of
> 10GB of metadata (again, before raid-induced overhead) was written to
> disk. (This is the total of all ebs having
> header->generation == curr_transid, not only what is written during
> the commit of the transaction.) This particular run was with
> "notreelog".
> 
> The machine had 8GB of RAM. Linux allows the btree_inode to grow its
> page cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
> But even though the used amount of metadata is less than that, this
> re-COW'ing of already-COW'ed blocks seems to cause page-cache
> thrashing...

Interesting.  We've addressed this in the past with changes to the
writepage(s) callback for the btree, basically skipping memory-pressure
related writeback if there isn't that much dirty metadata.  There is a
lot of room to improve those decisions, like preferring to write leaves
over nodes, especially full leaves that are not likely to change again.

-chris


* Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
  2015-07-13 17:02     ` Chris Mason
@ 2015-07-21 14:11       ` Alex Lyakas
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Lyakas @ 2015-07-21 14:11 UTC (permalink / raw)
  To: Chris Mason, Alex Lyakas, Filipe Manana,
	linux-btrfs@vger.kernel.org
  Cc: Shyam Kaushik

Hi Filipe, Chris,

I did more profiling work, and I want to write up some of the things I
noticed. Hopefully it has some value for the community.

# btrfs_transaction.dirty_pages:
This "extent_io_tree" includes every eb created by
btrfs_init_new_buffer (this is the only function that adds dirty
extents there). But sometimes such an eb is later marked clean again
(clean_tree_block) without being removed from
btrfs_transaction.dirty_pages. So I think
btrfs_write_and_wait_transaction triggers a writeout for such ebs as
well, although since they are no longer marked dirty, this probably
won't do anything. I don't know if this may cause any issue.

# btrfs_fs_info.dirty_metadata_bytes
This accounts dirty metadata per filesystem, not per transaction,
because we may have transaction X in btrfs_write_and_wait_transaction
while transaction X+1 is creating dirty ebs. This is probably
intended.

# btrfs_init_new_buffer
I was trying to account for the exact amount of dirty ebs per
transaction. Obviously, any such dirty eb is always created by this
function. So I was comparing the number of ebs created by this
function vs the number of ebs written out by write_one_eb.
I am aware that a particular eb can be freed in the same transaction
via btrfs_free_tree_block(), either pinning it or directly dropping it
back to the free-space cache. I am also aware that delayed references
may cancel one another, and that run_one_delayed_ref() may pin the eb
without doing any extent tree update. I am also aware of VM writeback
happening via btree_writepages, thus requiring a re-COW of some of the
ebs.
Still, the total amount of metadata written by a particular transaction
is less than the total amount of ebs created by btrfs_init_new_buffer,
even after taking into account all the cases I mentioned above. I
guess that clean_tree_block() is the culprit, but I am not sure what
happens to such a "cleaned" eb. Does it just remain in the btree_inode
page cache? Or is it freed by some path I am unaware of?

# should_cow_block() + BTRFS_HEADER_FLAG_WRITTEN causing a re-COW of a
tree block
Chris, you mention that btrfs should be skipping memory-pressure
writeback if there is not much dirty metadata. I guess you mean the
BTRFS_DIRTY_METADATA_THRESH check in btree_writepages, and also the
skipping of "for_kupdate" writeback. Grepping through the kernel
sources, I see that btree_writepages should never see the for_reclaim
bit set, because for that the btree_inode would need to provide a
"writepage" callback, and it only provides "writepages". (for_reclaim
is set by vmscan.c::pageout and migrate.c::writeout, which both
require "writepage", if I am seeing correctly.)
So except for explicit writeback requests, btree_writepages should see
only "for_background" writeback, which is mostly controlled by
dirty_background_ratio/dirty_background_bytes as I understand it (not
by memory pressure).
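
For reference, the skip logic I mean looks roughly like this (quoted
from memory of recent sources, so details may differ between
versions):

static int btree_writepages(struct address_space *mapping,
                            struct writeback_control *wbc)
{
        struct btrfs_fs_info *fs_info;
        int ret;

        if (wbc->sync_mode == WB_SYNC_NONE) {
                if (wbc->for_kupdate)
                        return 0;

                fs_info = BTRFS_I(mapping->host)->root->fs_info;
                /* this is a bit racy, but that's ok */
                ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
                                             BTRFS_DIRTY_METADATA_THRESH);
                if (ret < 0)
                        return 0;
        }
        return btree_write_cache_pages(mapping, wbc);
}
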
Anyway, I disabled WB_SYNC_NONE writeback entirely, and indeed I was
not seeing any re-COWs. But on the other hand, I hit a few OOM panics
when dirty metadata grew to several GBs.

# global_block_rsv and update_global_block_rsv
I don't fully understand what calc_global_metadata_size() is trying to
do, but I think that once the metadata size is above a couple of GBs,
update_global_block_rsv will always refill global_block_rsv back to
512MB, as it does:
block_rsv->size = min_t(u64, num_bytes, 512 * 1024 * 1024);
and then bumps "reserved" up to "size".
As a result, should_end_transaction() will tell us that we need to
start a commit once we grab half of that, i.e., 256MB or more, from
global_block_rsv. This seems very reasonable, as we really don't want
more dirty metadata than that in a single transaction.
However, we may call update_global_block_rsv() on various occasions,
and every such call bumps the rsv back to 512MB. So, for example, when
we create a new data block group, btrfs_make_block_group will call
update_global_block_rsv, allowing us to dirty more metadata before we
commit.
This seems a bit strange: why are we bumping it back and suddenly
allowing more updates to come in?
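
To spell out the arithmetic I mean (a toy userspace calculation; the
3GB metadata size is just a hypothetical input):

#include <stdint.h>
#include <stdio.h>

#define SZ_512M (512ULL * 1024 * 1024)

int main(void)
{
        /* hypothetical calc_global_metadata_size() result */
        uint64_t num_bytes = 3ULL * 1024 * 1024 * 1024;

        /* the min_t clamp in update_global_block_rsv */
        uint64_t size = num_bytes < SZ_512M ? num_bytes : SZ_512M;

        /* bytes grabbed from the rsv so far in this transaction */
        uint64_t used = 300ULL * 1024 * 1024;

        /* should_end_transaction-style check: commit once more than
         * half of the global rsv has been consumed */
        printf("rsv size=%llu used=%llu should_end=%d\n",
               (unsigned long long)size, (unsigned long long)used,
               used * 2 >= size);
        return 0;
}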

# run_delayed_refs by the commit thread
I see that the current spirit is that when somebody detaches from a
transaction with btrfs_end_transaction, we don't want to delay it for
too long. So even if must_run_delayed_refs == 1, we run only a limited
batch of delayed refs:
cur = max_t(unsigned long, cur, 32)
As a result, by the time we trigger a commit we might have tens of
thousands of delayed refs left to be run by the commit thread. The
commit thread calls btrfs_run_delayed_refs several times before it
reaches the critical section (TRANS_STATE_COMMIT_DOING), where we
don't allow anybody to join it.

Profiling these run_delayed_refs calls shows that they are slow
because:
- we need to read the extent tree blocks from disk[1] in order to
update the extent tree (read_extent_buffer_pages), and there is only
one thread reading them sequentially (the commit thread);
- before we get to the critical section of the commit, other threads
can also be running delayed refs, so the commit thread has to compete
with them for tree-block locks (and they hold the locks for a while,
because they too are reading tree blocks from disk, as it seems).

So my question is: shouldn't we be much more aggressive in
__btrfs_end_transaction, running delayed refs several times, checking
trans->delayed_ref_updates after each run, and returning only when
this number is zero or small enough?
This way, when we trigger a commit, it will not have many delayed refs
to run; it will reach the critical section very quickly, hopefully
pass it very quickly (reaching TRANS_STATE_UNBLOCKED), and then we can
open a new transaction while the previous one is doing
btrfs_write_and_wait_transaction.
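
In rough pseudo-code, something like this (a sketch of the idea
against the btrfs internals; not compiled or tested, and the threshold
is made up):

/* hypothetical: drain delayed refs at transaction end until few remain */
#define END_TRANS_REFS_LOW_WATER 32

static void drain_delayed_refs(struct btrfs_trans_handle *trans,
                               struct btrfs_root *root)
{
        while (trans->delayed_ref_updates > END_TRANS_REFS_LOW_WATER) {
                unsigned long cur = trans->delayed_ref_updates;

                trans->delayed_ref_updates = 0;
                btrfs_run_delayed_refs(trans, root, cur);
                /* running refs can queue more refs; loop re-checks */
        }
}
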
That's what I wanted to ask.

Thanks!
Alex.


[1] In my case, btrfs metadata is ~10GB and the machine has 8GB of
RAM. Because of this, we need to read a lot of ebs from disk, as they
are not in the page cache. Also keep in mind that every COW of an eb
requires a new slot in the page cache, because we index by the
"bytenr" received from the free-space cache, which is a "logical"
coordinate by which EXTENT_ITEMs are sorted in the extent tree.




