From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([222.73.24.84]:4668 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1756424Ab3LWCeU (ORCPT ); Sun, 22 Dec 2013 21:34:20 -0500 Message-ID: <52B7A158.8000503@cn.fujitsu.com> Date: Mon, 23 Dec 2013 10:35:04 +0800 From: Qu Wenruo MIME-Version: 1.0 To: Josef Bacik , linux-btrfs@vger.kernel.org Subject: Re: [PATCH v4 00/18] Replace btrfs_workers with kernel workqueue based btrfs_workqueue References: <1387272719-11889-1-git-send-email-quwenruo@cn.fujitsu.com> <52B3105A.5000102@fb.com> <52B3B4BC.70204@cn.fujitsu.com> <52B44688.5070603@fb.com> In-Reply-To: <52B44688.5070603@fb.com> Content-Type: text/plain; charset=UTF-8; format=flowed Sender: linux-btrfs-owner@vger.kernel.org List-ID: On fri, 20 Dec 2013 05:30:48 -0800, Josef Bacik wrote: > > On 12/19/2013 07:08 PM, Qu Wenruo wrote: >> I'm sorry but I failed to reproduce the problem. >> Btrfs/012 in xfstests has been run for serveral hours but nothing >> happened. >> >> Would you please give me some more details about the environment or >> the panic backtrace? >> > > Ok so it wasn't that test, it was just ./check -g auto. It would > sometimes die at btrfs/012, other times it would make it as far as > generic/083 before it keeled over. I bisected it down to > > btrfs: Replace fs_info->workers with btrfs_workqueue > > which of course is just the first patch that you start using the new > code which isn't helpful. My dmesg from one of my runs is here > > http://ur1.ca/g867b > > All of the initial panics looked exactly the same. I'm just going to > kick the series out for now while you track down the problem. Thanks, > > Josef > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Thanks for your dmesg a lot. It's quite sad that I still failed to reproduce the same bug on all my test environment during the weekend. But according to the dmesg, which is expired now though, I made the following assumption: There is a possibility that out of order execution made the "work->wq = wq" sentence executed behind the "queue_work()" call, and the "normal_work_helper" is executed in another CPU instantly, which caused the problem. ------ static inline void __btrfs_queue_work(struct __btrfs_workqueue *wq, struct btrfs_work *work) { unsigned long flags; work->wq = wq; /* There is no memory barrier ensure the work->wq = wq is executed before queue_work */ thresh_queue_hook(wq); if (work->ordered_func) { spin_lock_irqsave(&wq->list_lock, flags); list_add_tail(&work->ordered_list, &wq->ordered_list); spin_unlock_irqrestore(&wq->list_lock, flags); } queue_work(wq->normal_wq, &work->normal_work); } ------ The memory order problem involves compiler or even CPU hardware (AMD CPUS seems to have a weaker memory order than Intel), so to simulate the bug, I used the following codes tried to reproduce the bug, and the result is pretty much the same as your dmesg. ------ static inline void __btrfs_queue_work(struct __btrfs_workqueue *wq, struct btrfs_work *work) { unsigned long flags; int in_order = get_random_int() % 1000 if (in_order) work->wq = wq; thresh_queue_hook(wq); if (work->ordered_func) { spin_lock_irqsave(&wq->list_lock, flags); list_add_tail(&work->ordered_list, &wq->ordered_list); spin_unlock_irqrestore(&wq->list_lock, flags); } queue_work(wq->normal_wq, &work->normal_work); /* Try make the codes out of order since the Intel CPU seems obey the memory order quite well */ if (!in_order) work->wq = wq; } ------ The fix is as easy as just adding a smp_wmb() behind the "work->wq = wq" sentence, but since I failed to reproduce the bug, I can't confirm whether the fix is good or not. I know it's impolite but would you please first try the smp_wmb() first to confirm the fix? If the fix is good I'll send the formal V5 patchset. Thanks Qu