From: Felipe Balbi
To: Peter Zijlstra
Cc: Alan Stern, "Paul E. McKenney", Ingo Molnar, USB list, Kernel development list
Subject: Re: Memory barrier needed with wake_up_process()?
In-Reply-To: <20160903121915.GC2794@worktop>
References: <20160902191857.GL10153@twins.programming.kicks-ass.net>
 <20160902221658.GO10153@twins.programming.kicks-ass.net>
 <8737lh79mm.fsf@linux.intel.com>
 <20160903121915.GC2794@worktop>
Date: Sat, 03 Sep 2016 16:51:07 +0300
Message-ID: <8760qdks6s.fsf@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

Peter Zijlstra writes:
> On Sat, Sep 03, 2016 at 09:58:09AM +0300, Felipe Balbi wrote:
>
>> > What arch are you seeing this on?
>>
>> x86. Skylake to be exact.
>
> So it _cannot_ be the thing Alan mentioned. By the simple fact that
> spin_lock() is a full barrier on x86 (every LOCK prefixed instruction
> is).

I still have this working even after 15 hours of runtime on a test case
that was failing consistently within few minutes. At a minimum smp_mb()
has some side effect which is hiding the actual problem.

>> The following change survived through the night:
>>
>> diff --git a/drivers/usb/gadget/function/f_mass_storage.c b/drivers/usb/gadget/function/f_mass_storage.c
>> index 8f3659b65f53..d31581dd5ce5 100644
>> --- a/drivers/usb/gadget/function/f_mass_storage.c
>> +++ b/drivers/usb/gadget/function/f_mass_storage.c
>> @@ -395,7 +395,7 @@ static int fsg_set_halt(struct fsg_dev *fsg, struct usb_ep *ep)
>>  /* Caller must hold fsg->lock */
>>  static void wakeup_thread(struct fsg_common *common)
>>  {
>> -	smp_wmb();	/* ensure the write of bh->state is complete */
>> +	smp_mb();	/* ensure the write of bh->state is complete */
>>  	/* Tell the main thread that something has happened */
>>  	common->thread_wakeup_needed = 1;
>>  	if (common->thread_task)
>> @@ -626,7 +626,7 @@ static int sleep_thread(struct fsg_common *common, bool can_freeze)
>>  	}
>>  	__set_current_state(TASK_RUNNING);
>>  	common->thread_wakeup_needed = 0;
>> -	smp_rmb();	/* ensure the latest bh->state is visible */
>> +	smp_mb();	/* ensure the latest bh->state is visible */
>>  	return rc;
>>  }
>
> Sorry, but that is horrible code. A barrier cannot ensure writes are
> 'complete', at best they can ensure order between writes (or reads
> etc..).

not arguing ;-)

> Also, looking at that thing, that common->thread_wakeup_needed variable
> is 100% redundant. All sleep_thread() invocations are inside a loop of
> sorts and basically wait for other conditions to become true.
>
> For example:
>
>	while (bh->state != BUF_STATE_EMPTY) {
>		rc = sleep_thread(common, false);
>		if (rc)
>			return rc;
>	}

right

> All you care about there is bh->state, _not_
> common->thread_wakeup_needed.
>
> That said, I cannot spot an obvious fail,

okay, but a fail does exist. Any hints on what extra information I could
capture to help figuring this one out?

> but the code can certainly use help.

Sure, that can be done for v4.9 (if I have time) or v4.10 merge window.
Meanwhile, we're trying to find a minimal fix for the -rc which can also
be backported to stable, right?

--
balbi
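
For reference, Peter's point that the waiter should loop on the condition
it actually cares about (bh->state), with no separate "wakeup needed"
flag, can be sketched in user space. This is only an analogy under stated
assumptions, not the f_mass_storage.c code and not a proposed fix: it uses
POSIX threads, where the mutex and condition variable supply the ordering
and the lost-wakeup protection that the kernel code has to get from its
own sleep/wake primitives. The names struct buf, producer(),
wait_for_full() and run_handoff() are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

enum buf_state { BUF_STATE_EMPTY, BUF_STATE_FULL };

struct buf {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	enum buf_state state;	/* the only condition the waiter tests */
	int payload;		/* data that must be visible once state is FULL */
};

/* Waker side: publish the data, flip the state, then wake the waiter. */
static void *producer(void *arg)
{
	struct buf *bh = arg;

	pthread_mutex_lock(&bh->lock);
	bh->payload = 42;
	bh->state = BUF_STATE_FULL;
	pthread_cond_signal(&bh->cond);
	pthread_mutex_unlock(&bh->lock);
	return NULL;
}

/* Waiter side: loop on the real condition, re-checking after every wakeup. */
static int wait_for_full(struct buf *bh)
{
	int val;

	pthread_mutex_lock(&bh->lock);
	while (bh->state != BUF_STATE_FULL)
		pthread_cond_wait(&bh->cond, &bh->lock);
	val = bh->payload;
	pthread_mutex_unlock(&bh->lock);
	return val;
}

/* Hypothetical driver: start the producer, wait, return the observed payload. */
static int run_handoff(void)
{
	struct buf bh = { .state = BUF_STATE_EMPTY, .payload = 0 };
	pthread_t tid;
	int val = -1;

	pthread_mutex_init(&bh.lock, NULL);
	pthread_cond_init(&bh.cond, NULL);
	if (pthread_create(&tid, NULL, producer, &bh) == 0) {
		val = wait_for_full(&bh);
		pthread_join(tid, NULL);
	}
	pthread_cond_destroy(&bh.cond);
	pthread_mutex_destroy(&bh.lock);
	return val;
}
```

Because the state change and the check both happen under the same lock,
a wakeup cannot slip in between "test the condition" and "go to sleep",
and a spurious wakeup just goes around the loop again. Whether the kernel
thread in question has an equivalent guarantee is exactly what is being
debated above.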