linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alex Lyakas <alex.btrfs@zadarastorage.com>
To: miaox@cn.fujitsu.com
Cc: Linux Btrfs <linux-btrfs@vger.kernel.org>,
	Josef Bacik <jbacik@fusionio.com>
Subject: Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Date: Mon, 25 Mar 2013 11:11:51 +0200	[thread overview]
Message-ID: <CAOcd+r1i+neaAb+3o6CSVboKK_P0WPrSoiPLGC8t0QeqFM-uAQ@mail.gmail.com> (raw)
In-Reply-To: <514FAD96.7040002@cn.fujitsu.com>

Hi Miao,

On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> I am seeing another issue. Your fix prevents from TRANS_START to get
>> in the way of a committing transaction. But it does not prevent from
>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>> following loop:
>>
>> do {
>>     // attempt to do some useful stuff and/or sleep
>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>                (should_grow && cur_trans->num_joined != joined));
>>
>> What I see is basically that new writers join the transaction, while
>> btrfs_commit_transaction() does this loop. I see
>> cur_trans->num_writers decreasing, but then it increases, then
>> decreases etc. This can go for several seconds during heavy IO load.
>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>> a transaction over and over, thus delaying transaction commit. The IO
>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>
>> Do you observe such behavior? Do you believe it's problematic?
>
> I know this behavior, there is no problem with it, the latter code
> will prevent from TRANS_JOIN.
>
> 1672         spin_lock(&root->fs_info->trans_lock);
> 1673         root->fs_info->trans_no_join = 1;
> 1674         spin_unlock(&root->fs_info->trans_lock);
> 1675         wait_event(cur_trans->writer_wait,
> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>
Yes, this code prevents anybody from joining, but before
btrfs_commit_transaction() gets to this code, it may spend sometimes
10 seconds (in my tests) in the do-while loop, while new writers come
and go. Basically, it is not deterministic when the do-while loop will
exit, it depends on the IO pattern.

> And if we block the TRANS_JOIN at the place you point out, the deadlock
> will happen because we need deal with the ordered operations which will
> use TRANS_JOIN here.
>
> (I am dealing with the problem you said above by adding a new type of
> TRANS_* now)

Thanks.
Alex.


>
> Thanks
> Miao
>
>> Thanks,
>> Alex.
>>
>>
>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>> Hi Miao,
>>>> can you please explain your solution a bit more.
>>>>
>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <miaox@cn.fujitsu.com> wrote:
>>>>> Now btrfs_commit_transaction() does this
>>>>>
>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>
>>>>> which async flushes all inodes on the ordered operations list, it introduced
>>>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>>>> workers waited for each other.
>>>>> (See the following URL to get the detail
>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>
>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>>>> the above problem. In this way, there is another benefit: there is no new
>>>>> transaction handle to block the transaction which is on the way of commit,
>>>>> once we set ->in_commit.
>>>>>
>>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>>> ---
>>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>> index bc2f2d1..71b7e2e 100644
>>>>> --- a/fs/btrfs/transaction.c
>>>>> +++ b/fs/btrfs/transaction.c
>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>>         root->commit_root = btrfs_root_node(root);
>>>>>  }
>>>>>
>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>> +                                      int type)
>>>>> +{
>>>>> +       return !(trans->in_commit &&
>>>>> +                type != TRANS_JOIN &&
>>>>> +                type != TRANS_JOIN_NOLOCK);
>>>>> +}
>>>>> +
>>>>>  /*
>>>>>   * either allocate a new transaction or hop into the existing one
>>>>>   */
>>>>> @@ -86,6 +94,10 @@ loop:
>>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>>                         return cur_trans->aborted;
>>>>>                 }
>>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>>> +                       return -EBUSY;
>>>>> +               }
>>>>>                 atomic_inc(&cur_trans->use_count);
>>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>>                 cur_trans->num_joined++;
>>>>> @@ -360,8 +372,11 @@ again:
>>>>>
>>>>>         do {
>>>>>                 ret = join_transaction(root, type);
>>>>> -               if (ret == -EBUSY)
>>>>> +               if (ret == -EBUSY) {
>>>>>                         wait_current_trans(root);
>>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>>> +                               ret = -ENOENT;
>>>>> +               }
>>>>
>>>> So I understand that instead of incrementing num_writes and joining
>>>> the current transaction, you do not join and wait for the current
>>>> transaction to unblock.
>>>
>>> More specifically,TRANS_START、TRANS_USERSPACE and TRANS_ATTACH can not
>>> join and just wait for the current transaction to unblock if ->in_commit
>>> is set.
>>>
>>>> Which task in Josef's example
>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>>> task 1, task 2 or task 3 is the one that will not join the
>>>> transaction, but instead wait?
>>>
>>> Task1 will not join the transaction, in this way, async inode flush
>>> won't run, and then task3 won't do anything.
>>>
>>> Before applying the patch:
>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>> (Task1)                                 (Task2)                         (Task3)         -- the name in Josef's example
>>> btrfs_start_transaction()
>>>  |->may_wait_transaction()
>>>  |  (return 0)
>>>  |                                      btrfs_commit_transaction()
>>>  |                                       |->set ->in_commit and
>>>  |                                       |  blocked to 1
>>>  |                                       |->wait writers to be 1
>>>  |                                       |  (writers is 1)
>>>  |->join_transaction()                   |
>>>  |  (writers is 2)                       |
>>>  |->btrfs_commit_transaction()           |
>>>      |                                   |->set trans_no_join to 1
>>>      |                                   |  (close join transaction)
>>>      |->btrfs_run_ordered_operations     |
>>>         (Those ordered operations        |
>>>          are added when releasing        |
>>>          file)                           |
>>>          |->async inode flush()          |
>>>          |->wait_flush_comlete()         |
>>>                                          |                              work_loop()
>>>                                          |                               |->run_work()
>>>                                          |                                   |->btrfs_join_transaction()
>>>                                          |                                       |->wait_current_trans()
>>>                                          |->wait writers to be 1
>>>
>>> This three tasks waited for each other.
>>>
>>> After applying this patch:
>>> Start/Attach_Trans_Task                 Commit_Task                     Flush_Worker
>>> (Task1)                                 (Task2)                         (Task3)
>>> btrfs_start_transaction()
>>>  |->may_wait_transaction()
>>>  |  (return 0)
>>>  |                                      btrfs_commit_transaction()
>>>  |                                       |->set ->in_commit and
>>>  |                                       |  blocked to 1
>>>  |                                       |->wait writers to be 1
>>>  |                                       |  (writers is 1)
>>>  |->join_transaction() fail              |
>>>  |  (return -EBUSY, writers is still 1)  |
>>>  |->wait_current_trans()                 |
>>>                                          |->set trans_no_join to 1
>>>                                          |  (close join transaction)
>>>                                          |->wait writers to be 1
>>>                                          |->continue committing
>>>                                                                         (Task3 does nothing)
>>>> Also, I think I don't fully understand Josef's example. What is
>>>> preventing from async flushing to complete?
>>>> Is task 3 waiting because trans_no_join is set?
>>>> Is task 3 the one that actually does the delalloc flush?
>>>
>>> See above.
>>>
>>> Thanks
>>> Miao
>>>
>>>>
>>>> Thanks,
>>>> Alex.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>         } while (ret == -EBUSY);
>>>>>
>>>>>         if (ret < 0) {
>>>>> --
>>>>> 1.6.5.2
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>

  reply	other threads:[~2013-03-25  9:11 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-20  9:16 [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit Miao Xie
2013-02-24 19:49 ` Alex Lyakas
2013-02-25 10:20   ` Miao Xie
2013-03-02 21:15     ` Alex Lyakas
2013-03-24 11:13     ` Alex Lyakas
2013-03-25  1:51       ` Miao Xie
2013-03-25  9:11         ` Alex Lyakas [this message]
2013-04-10 18:45           ` Alex Lyakas
2013-04-11  2:19             ` Miao Xie
     [not found]             ` <518B56F1.40909@cn.fujitsu.com>
2013-06-12 20:11               ` Alex Lyakas
2013-06-13  3:08                 ` Miao Xie
2013-06-16 10:38                   ` Alex Lyakas
2013-06-17  1:51                     ` Miao Xie
2013-06-26 17:53                       ` Alex Lyakas
2013-07-04  2:28                         ` Miao Xie

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAOcd+r1i+neaAb+3o6CSVboKK_P0WPrSoiPLGC8t0QeqFM-uAQ@mail.gmail.com \
    --to=alex.btrfs@zadarastorage.com \
    --cc=jbacik@fusionio.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=miaox@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).