linux-btrfs.vger.kernel.org archive mirror
From: Liu Bo <bo.li.liu@oracle.com>
To: Miao Xie <miaox@cn.fujitsu.com>
Cc: Chris Mason <clm@fb.com>,
	Martin Steigerwald <Martin@lichtvoll.de>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix compressed write corruption on enospc
Date: Tue, 12 Aug 2014 15:51:05 +0800	[thread overview]
Message-ID: <20140812075104.GD9244@localhost.localdomain> (raw)
In-Reply-To: <53E98237.3070108@cn.fujitsu.com>

On Tue, Aug 12, 2014 at 10:55:51AM +0800, Miao Xie wrote:
> On Sun, 10 Aug 2014 22:55:44 +0800, Liu Bo wrote:
> >>>>> This part of the trace is relatively new because Liu Bo's patch made us
> >>>>> redirty the pages, making it more likely that we'd try to write them
> >>>>> during commit.
> >>>>>
> >>>>> But, at the end of the day we have a fundamental deadlock with
> >>>>> committing a transaction while holding a locked page from an ordered file.
> >>>>>
> >>>>> For now, I'm ripping out the strict ordered file and going back to a
> >>>>> best-effort filemap_flush like ext4 is using.
> >>>>
> >>>> I think I've figured out the deadlock; it's obviously a race and really
> >>>> hard to reproduce :-(
> >>>>
> >>>> So it turns out to be related to workqueues -- now a kthread can process
> >>>> work_structs queued on different workqueues, so we can explain the deadlock
> >>>> as follows:
> >>>>
> >>>> (1)
> >>>> The "btrfs-delalloc" workqueue gets a compressed extent to process, with all of
> >>>> its pages locked during this; it runs into reading the free space cache inode
> >>>> and then waits on lock_page().
> >>>>
> >>>> (2)
> >>>> Reading that free space cache inode reaches the bio submit path, which has an
> >>>> indirect, two-stage endio for it: with the first endio we come to end_workqueue_bio()
> >>>> and queue a work on the "btrfs-endio-meta" workqueue, and that work will run the
> >>>> real endio() for us, but ONLY once it is processed.
> >>>>
> >>>> So the problem is that a kthread can serve several workqueues, which means
> >>>> works from the "btrfs-endio-meta" workqueue and works from the "btrfs-flush_delalloc"
> >>>> workqueue can sit on the same kthread's processing list.  When the
> >>>> "btrfs-flush_delalloc" work waits for the compressed page and the "btrfs-endio-meta"
> >>>> work comes after it, it hangs.
> >>>
> >>> I don't think that is right. All the btrfs workqueues have the WQ_MEM_RECLAIM flag,
> >>> which means each btrfs workqueue has its own rescuer thread. So the problem you
> >>> described should not happen.
> >>
> >> Right, I traded some emails with Tejun about this and spent a few days
> >> trying to prove the workqueues were doing the wrong thing.  They end up
> >> spawning another worker thread for the new work, so it doesn't get
> >> queued up behind the existing thread.
> >>
> >> If both work items went to the same workqueue, you'd definitely be right.
> >>
> >> I've got a patch to change the flush-delalloc code so we don't do the
> >> file writes during commit.  It seems like the only choice right now.
> > 
> > Not the only choice any more ;)
> > 
> > It turns out to be related to async_cow's ordered list: say we have two
> > async_cow works on the wq->ordered_list, and the first work (named A) finishes its
> > ->ordered_func() and ->ordered_free(), and then the second work (B) starts B's
> > ->ordered_func(), which gets to read the free space cache inode, where it queues a
> > work on @endio_meta_workers -- but this work happens to land at the same memory
> > address as A's work.
> > 
> > So now the situation is,
> > 
> > (1) in the kthread's worker_thread() loop, work A is actually running its job,
> > (2) however, work A has already freed its memory, yet the kthread keeps
> >     referring to that address: worker->current_work is still A's address.
> > (3) B's readahead for the free space cache inode happens to queue a work whose
> >     memory address is exactly A's old address, so the queued work matches the
> >     address recorded in that worker's ->current_work.
> > (4) since in btrfs every work uses normal_work_helper(),
> >     worker->current_func is the same for all of them.
> > (5) worker_thread()
> >            ->process_one_work()
> >                     ->find_worker_executing_work()
> >                     (finds a collision: the new work appears to be already
> >                      running on the worker that still records A's address,
> >                      so it is queued behind that busy worker)
> > 
> > Then we saw the hang.
> 
> Here is my understanding of what you said:
> The same worker dealt with work A and work B, and a 3rd work, which was introduced
> by work B and has the same virtual memory address as work A, was also inserted into
> that worker's work list. But work B was waiting for the 3rd work at that time, so a
> deadlock happened. Am I right?
> 
> If I'm right, I think what you said is impossible. Before we deal with work B, we
> should already have invoked spin_unlock_irq(&pool->lock), which implies a memory
> barrier: all changes made before the unlock must be visible once it completes, so
> the address in current_work should be the address of work B. When we insert the 3rd
> work introduced by work B, we should not find the address of work A in the
> current_work of work B's worker.
> 
> I cannot reproduce the problem on my machine, so I haven't verified whether what I
> said is right or not. Please correct me if I am wrong.

Hmm, my fault... the problem is complex, so I didn't describe it 100% precisely in
my reply above.

I've sent a patch for this which should have the full story you're after :-)

-liubo


Thread overview: 24+ messages
2014-07-24 14:48 [PATCH] Btrfs: fix compressed write corruption on enospc Liu Bo
2014-07-24 14:55 ` Chris Mason
2014-07-25  1:00   ` Liu Bo
2014-07-25  9:58     ` Martin Steigerwald
2014-07-25  1:53 ` Wang Shilong
2014-07-25  2:08   ` Liu Bo
2014-07-25  2:11     ` Wang Shilong
2014-07-25  9:54 ` Martin Steigerwald
2014-08-04 12:50   ` Martin Steigerwald
2014-08-04 12:52     ` Martin Steigerwald
2014-08-06 10:21     ` Martin Steigerwald
2014-08-06 10:29       ` Hugo Mills
2014-08-06 12:28         ` Martin Steigerwald
2014-08-06 13:35       ` Chris Mason
2014-08-06 14:43         ` Martin Steigerwald
2014-08-06 15:18           ` Chris Mason
2014-08-07  0:52             ` Chris Mason
2014-08-07  7:50               ` Liu Bo
2014-08-07  8:20                 ` Miao Xie
2014-08-07 14:02                   ` Chris Mason
2014-08-10 14:55                     ` Liu Bo
2014-08-11 20:35                       ` Chris Mason
2014-08-12  2:55                       ` Miao Xie
2014-08-12  7:51                         ` Liu Bo [this message]
