All of lore.kernel.org
 help / color / mirror / Atom feed
* PLEASE TEST: Everybody who is seeing weird and long hangs
@ 2011-08-01 15:21 Josef Bacik
  2011-08-01 15:45 ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Bacik @ 2011-08-01 15:21 UTC (permalink / raw)
  To: linux-btrfs

Hello,

We've seen a lot of reports of people having these constant long pauses
when doing things like sync or such.  The stack traces usually all look
the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
and one is btrfs-submit-# stuck in get_request_wait.  I had originally
thought this was due to the new plugging stuff, but I think it just
makes the problem happen more quickly as we've seen that 2.6.38 which we
thought was ok will still have the problem happen if given enough time.

I _think_ this is because of the way we write out metadata in the
transaction commit phase.  We're doing write_on_page for every dirty
page in the btree during the commit.  This sucks because basically we
end up with one bio per page, which makes us blow out our nr_requests
constantly, which is why btrfs-submit-# is always stuck in
get_request_wait.  What we need to do instead is use filemap_fdatawrite
which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
we will get less bios and this problem will go away.  Please try this
very hastily put together patch if you are experiencing this problem and
let me know if it fixes it for you.  Thanks,

Josef

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index eb55863..86217a4 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -577,15 +577,17 @@ int btrfs_end_transaction_dmeta(struct
btrfs_trans_handle *trans,
 int btrfs_write_marked_extents(struct btrfs_root *root,
 			       struct extent_io_tree *dirty_pages, int mark)
 {
-	int ret;
-	int err = 0;
-	int werr = 0;
-	struct page *page;
+//	int ret;
+//	int err = 0;
+//	int werr = 0;
+//	struct page *page;
 	struct inode *btree_inode = root->fs_info->btree_inode;
-	u64 start = 0;
-	u64 end;
-	unsigned long index;
+//	u64 start = 0;
+//	u64 end;
+//	unsigned long index;

+	return filemap_fdatawrite(btree_inode->i_mapping);
+	/*
 	while (1) {
 		ret = find_first_extent_bit(dirty_pages, start, &start, &end,
 					    mark);
@@ -624,7 +626,8 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
 	}
 	if (err)
 		werr = err;
-	return werr;
+	*/
+//	return werr;
 }

 /*

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 15:21 PLEASE TEST: Everybody who is seeing weird and long hangs Josef Bacik
@ 2011-08-01 15:45 ` Chris Mason
  2011-08-01 16:03   ` Josef Bacik
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2011-08-01 15:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
> Hello,
> 
> We've seen a lot of reports of people having these constant long pauses
> when doing things like sync or such.  The stack traces usually all look
> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
> and one is btrfs-submit-# stuck in get_request_wait.  I had originally
> thought this was due to the new plugging stuff, but I think it just
> makes the problem happen more quickly as we've seen that 2.6.38 which we
> thought was ok will still have the problem happen if given enough time.
> 
> I _think_ this is because of the way we write out metadata in the
> transaction commit phase.  We're doing write_on_page for every dirty
> page in the btree during the commit.  This sucks because basically we
> end up with one bio per page, which makes us blow out our nr_requests
> constantly, which is why btrfs-submit-# is always stuck in
> get_request_wait.  What we need to do instead is use filemap_fdatawrite
> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
> we will get less bios and this problem will go away.  Please try this
> very hastily put together patch if you are experiencing this problem and
> let me know if it fixes it for you.  Thanks,

I'm definitely curious to hear if this helps, but I think it might cause
a different set of problems.  It writes everything that is dirty on the
btree, which includes a lot of things we've cow'd in the current
transaction and marked dirty.  They will have to go through COW again
if someone wants to modify them again.

The btrfs writepage code does this:

        ret = __extent_writepage(page, wbc, &epd);

        extent_write_cache_pages(tree, mapping, &wbc_writepages,
                                 __extent_writepage, &epd, flush_write_bio);
        flush_epd_write_bio(&epd);

So during the commit phase we'll be grabbing adjacent pages already.  My
bet is that our problem will go away if we remove this extra IO
completely.

-chris

> 
> Josef
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index eb55863..86217a4 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -577,15 +577,17 @@ int btrfs_end_transaction_dmeta(struct
> btrfs_trans_handle *trans,
>  int btrfs_write_marked_extents(struct btrfs_root *root,
>                     struct extent_io_tree *dirty_pages, int mark)
>  {
> -    int ret;
> -    int err = 0;
> -    int werr = 0;
> -    struct page *page;
> +//    int ret;
> +//    int err = 0;
> +//    int werr = 0;
> +//    struct page *page;
>      struct inode *btree_inode = root->fs_info->btree_inode;
> -    u64 start = 0;
> -    u64 end;
> -    unsigned long index;
> +//    u64 start = 0;
> +//    u64 end;
> +//    unsigned long index;
> 
> +    return filemap_fdatawrite(btree_inode->i_mapping);
> +    /*
>      while (1) {
>          ret = find_first_extent_bit(dirty_pages, start, &start, &end,
>                          mark);
> @@ -624,7 +626,8 @@ int btrfs_write_marked_extents(struct btrfs_root *root,
>      }
>      if (err)
>          werr = err;
> -    return werr;
> +    */
> +//    return werr;
>  }
> 
>  /*

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 15:45 ` Chris Mason
@ 2011-08-01 16:03   ` Josef Bacik
  2011-08-01 17:54     ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Bacik @ 2011-08-01 16:03 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 08/01/2011 11:45 AM, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
>> Hello,
>>
>> We've seen a lot of reports of people having these constant long pauses
>> when doing things like sync or such.  The stack traces usually all look
>> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
>> and one is btrfs-submit-# stuck in get_request_wait.  I had originally
>> thought this was due to the new plugging stuff, but I think it just
>> makes the problem happen more quickly as we've seen that 2.6.38 which we
>> thought was ok will still have the problem happen if given enough time.
>>
>> I _think_ this is because of the way we write out metadata in the
>> transaction commit phase.  We're doing write_on_page for every dirty
>> page in the btree during the commit.  This sucks because basically we
>> end up with one bio per page, which makes us blow out our nr_requests
>> constantly, which is why btrfs-submit-# is always stuck in
>> get_request_wait.  What we need to do instead is use filemap_fdatawrite
>> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
>> we will get less bios and this problem will go away.  Please try this
>> very hastily put together patch if you are experiencing this problem and
>> let me know if it fixes it for you.  Thanks,
> 
> I'm definitely curious to hear if this helps, but I think it might cause
> a different set of problems.  It writes everything that is dirty on the
> btree, which includes a lot of things we've cow'd in the current
> transaction and marked dirty.  They will have to go through COW again
> if someone wants to modify them again.
> 

But this is happening in the commit after we've done all of our work, we
shouldn't be dirtying anything else at this point right?

> The btrfs writepage code does this:
> 
>         ret = __extent_writepage(page, wbc, &epd);
> 
>         extent_write_cache_pages(tree, mapping, &wbc_writepages,
>                                  __extent_writepage, &epd, flush_write_bio);
>         flush_epd_write_bio(&epd);
> 

Yeah but nr_to_write is 1, so after the __extent_writepage it will be 0
and extent_write_cache_pages will just return since there's nothing to
write, so we'll still end up with 1 page at a time being written out.
Thanks,

Josef

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 16:03   ` Josef Bacik
@ 2011-08-01 17:54     ` Chris Mason
  2011-08-01 18:01       ` Josef Bacik
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2011-08-01 17:54 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400:
> On 08/01/2011 11:45 AM, Chris Mason wrote:
> > Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
> >> Hello,
> >>
> >> We've seen a lot of reports of people having these constant long pauses
> >> when doing things like sync or such.  The stack traces usually all look
> >> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
> >> and one is btrfs-submit-# stuck in get_request_wait.  I had originally
> >> thought this was due to the new plugging stuff, but I think it just
> >> makes the problem happen more quickly as we've seen that 2.6.38 which we
> >> thought was ok will still have the problem happen if given enough time.
> >>
> >> I _think_ this is because of the way we write out metadata in the
> >> transaction commit phase.  We're doing write_on_page for every dirty
> >> page in the btree during the commit.  This sucks because basically we
> >> end up with one bio per page, which makes us blow out our nr_requests
> >> constantly, which is why btrfs-submit-# is always stuck in
> >> get_request_wait.  What we need to do instead is use filemap_fdatawrite
> >> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
> >> we will get less bios and this problem will go away.  Please try this
> >> very hastily put together patch if you are experiencing this problem and
> >> let me know if it fixes it for you.  Thanks,
> > 
> > I'm definitely curious to hear if this helps, but I think it might cause
> > a different set of problems.  It writes everything that is dirty on the
> > btree, which includes a lot of things we've cow'd in the current
> > transaction and marked dirty.  They will have to go through COW again
> > if someone wants to modify them again.
> > 
> 
> But this is happening in the commit after we've done all of our work, we
> shouldn't be dirtying anything else at this point right?

The commit code is setup to unblock people before we start the IO:

       trans->transaction->blocked = 0;
        spin_lock(&root->fs_info->trans_lock);
        root->fs_info->running_transaction = NULL;
        root->fs_info->trans_no_join = 0;
        spin_unlock(&root->fs_info->trans_lock);
        mutex_unlock(&root->fs_info->reloc_mutex);
        
        wake_up(&root->fs_info->transaction_wait);

        ret = btrfs_write_and_wait_transaction(trans, root);

So, we should have concurrent FS mods for a new transaction while we are
writing out this old transaction.

> 
> > The btrfs writepage code does this:
> > 
> >         ret = __extent_writepage(page, wbc, &epd);
> > 
> >         extent_write_cache_pages(tree, mapping, &wbc_writepages,
> >                                  __extent_writepage, &epd, flush_write_bio);
> >         flush_epd_write_bio(&epd);
> > 
> 
> Yeah but nr_to_write is 1, so after the __extent_writepage it will be 0
> and extent_write_cache_pages will just return since there's nothing to
> write, so we'll still end up with 1 page at a time being written out.
> Thanks,

We bump nr_to_write to 64:

	struct writeback_control wbc_writepages = {
		.sync_mode	= wbc->sync_mode,
		.older_than_this = NULL,
		.nr_to_write	= 64,
		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
		.range_end	= (loff_t)-1,
	};

	ret = __extent_writepage(page, wbc, &epd);

	extent_write_cache_pages(tree, mapping, &wbc_writepages,
				 __extent_writepage, &epd, flush_write_bio);
	flush_epd_write_bio(&epd);

-chris

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 17:54     ` Chris Mason
@ 2011-08-01 18:01       ` Josef Bacik
  2011-08-01 18:21         ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: Josef Bacik @ 2011-08-01 18:01 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 08/01/2011 01:54 PM, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400:
>> On 08/01/2011 11:45 AM, Chris Mason wrote:
>>> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
>>>> Hello,
>>>>
>>>> We've seen a lot of reports of people having these constant long pauses
>>>> when doing things like sync or such.  The stack traces usually all look
>>>> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
>>>> and one is btrfs-submit-# stuck in get_request_wait.  I had originally
>>>> thought this was due to the new plugging stuff, but I think it just
>>>> makes the problem happen more quickly as we've seen that 2.6.38 which we
>>>> thought was ok will still have the problem happen if given enough time.
>>>>
>>>> I _think_ this is because of the way we write out metadata in the
>>>> transaction commit phase.  We're doing write_on_page for every dirty
>>>> page in the btree during the commit.  This sucks because basically we
>>>> end up with one bio per page, which makes us blow out our nr_requests
>>>> constantly, which is why btrfs-submit-# is always stuck in
>>>> get_request_wait.  What we need to do instead is use filemap_fdatawrite
>>>> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
>>>> we will get less bios and this problem will go away.  Please try this
>>>> very hastily put together patch if you are experiencing this problem and
>>>> let me know if it fixes it for you.  Thanks,
>>>
>>> I'm definitely curious to hear if this helps, but I think it might cause
>>> a different set of problems.  It writes everything that is dirty on the
>>> btree, which includes a lot of things we've cow'd in the current
>>> transaction and marked dirty.  They will have to go through COW again
>>> if someone wants to modify them again.
>>>
>>
>> But this is happening in the commit after we've done all of our work, we
>> shouldn't be dirtying anything else at this point right?
> 
> The commit code is setup to unblock people before we start the IO:
> 
>        trans->transaction->blocked = 0;
>         spin_lock(&root->fs_info->trans_lock);
>         root->fs_info->running_transaction = NULL;
>         root->fs_info->trans_no_join = 0;
>         spin_unlock(&root->fs_info->trans_lock);
>         mutex_unlock(&root->fs_info->reloc_mutex);
>         
>         wake_up(&root->fs_info->transaction_wait);
> 
>         ret = btrfs_write_and_wait_transaction(trans, root);
> 
> So, we should have concurrent FS mods for a new transaction while we are
> writing out this old transaction.
> 

Ah right, but then this brings up another question, we shouldn't cow
them again since we would have set the new transid.  And isn't this kind
of bad, since somebody could come in and dirty a piece of metadata
before we have a chance to write it out for this transaction, so we end
up writing out the new data instead of what we are trying to commit?

And also the writepages() thing would get around this problem since we
are SYNC_ALL which now tags all dirty pages as TOWRITE and then writes
those pages instead of writing all dirty pages.  So anything being
dirtied once we started writepages would be fine.

So this really could explain why this is sucking for people, we are just
walking through and writing everything that's dirty, and then doing the
same thing in wait_marked_extents() again, so we could be writing out
things that aren't in the transaction that we committed, which would
mean we're writing way more than we need to.

>>
>>> The btrfs writepage code does this:
>>>
>>>         ret = __extent_writepage(page, wbc, &epd);
>>>
>>>         extent_write_cache_pages(tree, mapping, &wbc_writepages,
>>>                                  __extent_writepage, &epd, flush_write_bio);
>>>         flush_epd_write_bio(&epd);
>>>
>>
>> Yeah but nr_to_write is 1, so after the __extent_writepage it will be 0
>> and extent_write_cache_pages will just return since there's nothing to
>> write, so we'll still end up with 1 page at a time being written out.
>> Thanks,
> 
> We bump nr_to_write to 64:
> 
> 	struct writeback_control wbc_writepages = {
> 		.sync_mode	= wbc->sync_mode,
> 		.older_than_this = NULL,
> 		.nr_to_write	= 64,
> 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
> 		.range_end	= (loff_t)-1,
> 	};
> 
> 	ret = __extent_writepage(page, wbc, &epd);
> 
> 	extent_write_cache_pages(tree, mapping, &wbc_writepages,
> 				 __extent_writepage, &epd, flush_write_bio);
> 	flush_epd_write_bio(&epd);
> 

Oops I missed that, thanks,

Josef

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 18:01       ` Josef Bacik
@ 2011-08-01 18:21         ` Chris Mason
  2011-08-01 23:28           ` cwillu
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Mason @ 2011-08-01 18:21 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

Excerpts from Josef Bacik's message of 2011-08-01 14:01:35 -0400:
> On 08/01/2011 01:54 PM, Chris Mason wrote:
> > Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400:
> >> On 08/01/2011 11:45 AM, Chris Mason wrote:
> >>> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
> >>>> Hello,
> >>>>
> >>>> We've seen a lot of reports of people having these constant long pauses
> >>>> when doing things like sync or such.  The stack traces usually all look
> >>>> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
> >>>> and one is btrfs-submit-# stuck in get_request_wait.  I had originally
> >>>> thought this was due to the new plugging stuff, but I think it just
> >>>> makes the problem happen more quickly as we've seen that 2.6.38 which we
> >>>> thought was ok will still have the problem happen if given enough time.
> >>>>
> >>>> I _think_ this is because of the way we write out metadata in the
> >>>> transaction commit phase.  We're doing write_on_page for every dirty
> >>>> page in the btree during the commit.  This sucks because basically we
> >>>> end up with one bio per page, which makes us blow out our nr_requests
> >>>> constantly, which is why btrfs-submit-# is always stuck in
> >>>> get_request_wait.  What we need to do instead is use filemap_fdatawrite
> >>>> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
> >>>> we will get less bios and this problem will go away.  Please try this
> >>>> very hastily put together patch if you are experiencing this problem and
> >>>> let me know if it fixes it for you.  Thanks,
> >>>
> >>> I'm definitely curious to hear if this helps, but I think it might cause
> >>> a different set of problems.  It writes everything that is dirty on the
> >>> btree, which includes a lot of things we've cow'd in the current
> >>> transaction and marked dirty.  They will have to go through COW again
> >>> if someone wants to modify them again.
> >>>
> >>
> >> But this is happening in the commit after we've done all of our work, we
> >> shouldn't be dirtying anything else at this point right?
> > 
> > The commit code is setup to unblock people before we start the IO:
> > 
> >        trans->transaction->blocked = 0;
> >         spin_lock(&root->fs_info->trans_lock);
> >         root->fs_info->running_transaction = NULL;
> >         root->fs_info->trans_no_join = 0;
> >         spin_unlock(&root->fs_info->trans_lock);
> >         mutex_unlock(&root->fs_info->reloc_mutex);
> >         
> >         wake_up(&root->fs_info->transaction_wait);
> > 
> >         ret = btrfs_write_and_wait_transaction(trans, root);
> > 
> > So, we should have concurrent FS mods for a new transaction while we are
> > writing out this old transaction.
> > 
> 
> Ah right, but then this brings up another question, we shouldn't cow
> them again since we would have set the new transid.  And isn't this kind
> of bad, since somebody could come in and dirty a piece of metadata
> before we have a chance to write it out for this transaction, so we end
> up writing out the new data instead of what we are trying to commit?

I think we're mixing together different ideas here.  If we're doing a
commit on transaction N, we allow N+1 to start while we're doing the
btrfs_write_and_wait_transaction().  N+1 might allocate and dirty a new
block, which btrfs_write_and_wait_transaction might start IO on.

Strictly speaking this isn't a problem.  It doesn't break any rules of
COW because we're allowed to write metadata at any time.  But, once we
do write it, we must COW it again if we want to change it.  So, anything
that btrfs_write_and_wait_transaction() catches from transaction N+1 is
likely to make more work for us because future mods will have to
allocate a new block.  Basically it's wasted IO.

But, it's also free IO, assuming it was contiguous.  The problem is that
write_cache_pages isn't actually making sure it was contiguous, so we
end up doing many more writes than we could have.

-chris

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 18:21         ` Chris Mason
@ 2011-08-01 23:28           ` cwillu
  2011-08-02  0:09             ` Chris Mason
  0 siblings, 1 reply; 8+ messages in thread
From: cwillu @ 2011-08-01 23:28 UTC (permalink / raw)
  To: Chris Mason; +Cc: Josef Bacik, linux-btrfs

On Mon, Aug 1, 2011 at 12:21 PM, Chris Mason <chris.mason@oracle.com> w=
rote:
> Excerpts from Josef Bacik's message of 2011-08-01 14:01:35 -0400:
>> On 08/01/2011 01:54 PM, Chris Mason wrote:
>> > Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400:
>> >> On 08/01/2011 11:45 AM, Chris Mason wrote:
>> >>> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400=
:
>> >>>> Hello,
>> >>>>
>> >>>> We've seen a lot of reports of people having these constant lon=
g pauses
>> >>>> when doing things like sync or such. =C2=A0The stack traces usu=
ally all look
>> >>>> the same, one is btrfs-transaction stuck in btrfs_wait_marked_e=
xtents
>> >>>> and one is btrfs-submit-# stuck in get_request_wait. =C2=A0I ha=
d originally
>> >>>> thought this was due to the new plugging stuff, but I think it =
just
>> >>>> makes the problem happen more quickly as we've seen that 2.6.38=
 which we
>> >>>> thought was ok will still have the problem happen if given enou=
gh time.
>> >>>>
>> >>>> I _think_ this is because of the way we write out metadata in t=
he
>> >>>> transaction commit phase. =C2=A0We're doing write_on_page for e=
very dirty
>> >>>> page in the btree during the commit. =C2=A0This sucks because b=
asically we
>> >>>> end up with one bio per page, which makes us blow out our nr_re=
quests
>> >>>> constantly, which is why btrfs-submit-# is always stuck in
>> >>>> get_request_wait. =C2=A0What we need to do instead is use filem=
ap_fdatawrite
>> >>>> which will do a WB_SYNC_ALL but will do it via writepages, so h=
opefully
>> >>>> we will get less bios and this problem will go away. =C2=A0Plea=
se try this
>> >>>> very hastily put together patch if you are experiencing this pr=
oblem and
>> >>>> let me know if it fixes it for you. =C2=A0Thanks,
>> >>>
>> >>> I'm definitely curious to hear if this helps, but I think it mig=
ht cause
>> >>> a different set of problems. =C2=A0It writes everything that is =
dirty on the
>> >>> btree, which includes a lot of things we've cow'd in the current
>> >>> transaction and marked dirty. =C2=A0They will have to go through=
 COW again
>> >>> if someone wants to modify them again.
>> >>>
>> >>
>> >> But this is happening in the commit after we've done all of our w=
ork, we
>> >> shouldn't be dirtying anything else at this point right?
>> >
>> > The commit code is setup to unblock people before we start the IO:
>> >
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0trans->transaction->blocked =3D 0;
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 spin_lock(&root->fs_info->trans_lock);
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 root->fs_info->running_transaction =3D=
 NULL;
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 root->fs_info->trans_no_join =3D 0;
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 spin_unlock(&root->fs_info->trans_lock=
);
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 mutex_unlock(&root->fs_info->reloc_mut=
ex);
>> >
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 wake_up(&root->fs_info->transaction_wa=
it);
>> >
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 ret =3D btrfs_write_and_wait_transacti=
on(trans, root);
>> >
>> > So, we should have concurrent FS mods for a new transaction while =
we are
>> > writing out this old transaction.
>> >
>>
>> Ah right, but then this brings up another question, we shouldn't cow
>> them again since we would have set the new transid. =C2=A0And isn't =
this kind
>> of bad, since somebody could come in and dirty a piece of metadata
>> before we have a chance to write it out for this transaction, so we =
end
>> up writing out the new data instead of what we are trying to commit?
>
> I think we're mixing together different ideas here. =C2=A0If we're do=
ing a
> commit on transaction N, we allow N+1 to start while we're doing the
> btrfs_write_and_wait_transaction(). =C2=A0N+1 might allocate and dirt=
y a new
> block, which btrfs_write_and_wait_transaction might start IO on.
>
> Strictly speaking this isn't a problem. =C2=A0It doesn't break any ru=
les of
> COW because we're allowed to write metadata at any time. =C2=A0But, o=
nce we
> do write it, we must COW it again if we want to change it. =C2=A0So, =
anything
> that btrfs_write_and_wait_transaction() catches from transaction N+1 =
is
> likely to make more work for us because future mods will have to
> allocate a new block. =C2=A0Basically it's wasted IO.
>
> But, it's also free IO, assuming it was contiguous. =C2=A0The problem=
 is that
> write_cache_pages isn't actually making sure it was contiguous, so we
> end up doing many more writes than we could have.
>
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs=
" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at =C2=A0http://vger.kernel.org/majordomo-info.ht=
ml
>

=46irst user ("youagree") reported back on irc:

<youagree> guys, just came to report its much worse with josef's patch
<youagree> now i can hardly start anything, it's slowed down most of th=
e time
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: PLEASE TEST: Everybody who is seeing weird and long hangs
  2011-08-01 23:28           ` cwillu
@ 2011-08-02  0:09             ` Chris Mason
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Mason @ 2011-08-02  0:09 UTC (permalink / raw)
  To: cwillu; +Cc: Josef Bacik, linux-btrfs

Excerpts from cwillu's message of 2011-08-01 19:28:35 -0400:
> On Mon, Aug 1, 2011 at 12:21 PM, Chris Mason <chris.mason@oracle.com>=
 wrote:
> > Excerpts from Josef Bacik's message of 2011-08-01 14:01:35 -0400:
> >> On 08/01/2011 01:54 PM, Chris Mason wrote:
> >> > Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400=
:
> >> >> On 08/01/2011 11:45 AM, Chris Mason wrote:
> >> >>> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -04=
00:
> >> >>>> Hello,
> >> >>>>
> >> >>>> We've seen a lot of reports of people having these constant l=
ong pauses
> >> >>>> when doing things like sync or such. =C2=A0The stack traces u=
sually all look
> >> >>>> the same, one is btrfs-transaction stuck in btrfs_wait_marked=
_extents
> >> >>>> and one is btrfs-submit-# stuck in get_request_wait. =C2=A0I =
had originally
> >> >>>> thought this was due to the new plugging stuff, but I think i=
t just
> >> >>>> makes the problem happen more quickly as we've seen that 2.6.=
38 which we
> >> >>>> thought was ok will still have the problem happen if given en=
ough time.
> >> >>>>
> >> >>>> I _think_ this is because of the way we write out metadata in=
 the
> >> >>>> transaction commit phase. =C2=A0We're doing write_on_page for=
 every dirty
> >> >>>> page in the btree during the commit. =C2=A0This sucks because=
 basically we
> >> >>>> end up with one bio per page, which makes us blow out our nr_=
requests
> >> >>>> constantly, which is why btrfs-submit-# is always stuck in
> >> >>>> get_request_wait. =C2=A0What we need to do instead is use fil=
emap_fdatawrite
> >> >>>> which will do a WB_SYNC_ALL but will do it via writepages, so=
 hopefully
> >> >>>> we will get less bios and this problem will go away. =C2=A0Pl=
ease try this
> >> >>>> very hastily put together patch if you are experiencing this =
problem and
> >> >>>> let me know if it fixes it for you. =C2=A0Thanks,
> >> >>>
> >> >>> I'm definitely curious to hear if this helps, but I think it m=
ight cause
> >> >>> a different set of problems. =C2=A0It writes everything that i=
s dirty on the
> >> >>> btree, which includes a lot of things we've cow'd in the curre=
nt
> >> >>> transaction and marked dirty. =C2=A0They will have to go throu=
gh COW again
> >> >>> if someone wants to modify them again.
> >> >>>
> >> >>
> >> >> But this is happening in the commit after we've done all of our=
 work, we
> >> >> shouldn't be dirtying anything else at this point right?
> >> >
> >> > The commit code is setup to unblock people before we start the I=
O:
> >> >
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0trans->transaction->blocked =3D 0;
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 spin_lock(&root->fs_info->trans_lock=
);
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 root->fs_info->running_transaction =3D=
 NULL;
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 root->fs_info->trans_no_join =3D 0;
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 spin_unlock(&root->fs_info->trans_lo=
ck);
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 mutex_unlock(&root->fs_info->reloc_m=
utex);
> >> >
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 wake_up(&root->fs_info->transaction_=
wait);
> >> >
> >> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 ret =3D btrfs_write_and_wait_transac=
tion(trans, root);
> >> >
> >> > So, we should have concurrent FS mods for a new transaction whil=
e we are
> >> > writing out this old transaction.
> >> >
> >>
> >> Ah right, but then this brings up another question, we shouldn't c=
ow
> >> them again since we would have set the new transid. =C2=A0And isn'=
t this kind
> >> of bad, since somebody could come in and dirty a piece of metadata
> >> before we have a chance to write it out for this transaction, so w=
e end
> >> up writing out the new data instead of what we are trying to commi=
t?
> >
> > I think we're mixing together different ideas here. =C2=A0If we're =
doing a
> > commit on transaction N, we allow N+1 to start while we're doing th=
e
> > btrfs_write_and_wait_transaction(). =C2=A0N+1 might allocate and di=
rty a new
> > block, which btrfs_write_and_wait_transaction might start IO on.
> >
> > Strictly speaking this isn't a problem. =C2=A0It doesn't break any =
rules of
> > COW because we're allowed to write metadata at any time. =C2=A0But,=
 once we
> > do write it, we must COW it again if we want to change it. =C2=A0So=
, anything
> > that btrfs_write_and_wait_transaction() catches from transaction N+=
1 is
> > likely to make more work for us because future mods will have to
> > allocate a new block. =C2=A0Basically it's wasted IO.
> >
> > But, it's also free IO, assuming it was contiguous. =C2=A0The probl=
em is that
> > write_cache_pages isn't actually making sure it was contiguous, so =
we
> > end up doing many more writes than we could have.
>=20
> First user ("youagree") reported back on irc:
>=20
> <youagree> guys, just came to report its much worse with josef's patc=
h
> <youagree> now i can hardly start anything, it's slowed down most of =
the time

Josef's filemap_fdatawrite patch?  He sent a second one to the list tha=
t
gets rid of the extra IO done by the current code.  That's the one we
hope will fix things.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" =
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-08-02  0:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-01 15:21 PLEASE TEST: Everybody who is seeing weird and long hangs Josef Bacik
2011-08-01 15:45 ` Chris Mason
2011-08-01 16:03   ` Josef Bacik
2011-08-01 17:54     ` Chris Mason
2011-08-01 18:01       ` Josef Bacik
2011-08-01 18:21         ` Chris Mason
2011-08-01 23:28           ` cwillu
2011-08-02  0:09             ` Chris Mason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.