Message-ID: <51AE1B81.20900@candelatech.com>
Date: Tue, 04 Jun 2013 09:53:21 -0700
From: Ben Greear
To: Joe Lawrence
CC: Rusty Russell, Linux Kernel Mailing List, stable@vger.kernel.org
Subject: Re: Please add to stable: module: don't unlink the module until we've removed all exposure.

On 06/04/2013 07:07 AM, Joe Lawrence wrote:
> On Tue, 04 Jun 2013 15:26:28 +0930
> Rusty Russell wrote:
>
>> Do you have a backtrace of the 3.9.4 crash?  You can add "CFLAGS_module.o
>> = -O0" to get a clearer backtrace if you want...
>
> Hi Rusty,
>
> See my 3.9 stack traces below, which may or may not be what Ben had
> been seeing.  If you like, I can try a similar loop as the one you were
> testing in the other email.

My stack traces are similar.  I had better luck reproducing the problem
once I enabled lots of debugging (slub memory poisoning, lockdep, object
debugging, etc).

I'm using Fedora 17 on a 2-core Core i7 (4 CPU threads total) for most of
this testing.  We reproduced on a dual-core Atom system as well (32-bit
Fedora 14 and Fedora 17).  Relatively standard hardware as far as I know.

I'll run the insmod/rmmod stress test on my patched systems and see if I
can reproduce with the patch in the title applied.

Rusty: I'm also seeing lockups related to migration on stock 3.9.4+ (with
and without the 'don't unlink the module...' patch).  That one is much
harder to reproduce, but that code appears to be mostly called during
module load/unload, so it's possible it is related.
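For anyone wanting to try Rusty's suggestion above: it uses kbuild's
per-object compiler-flags convention, and since module.o is built from
kernel/module.c, the line would go in kernel/Makefile.  A sketch, assuming
a typical kbuild setup:

```make
# kernel/Makefile: build module.o without optimization so backtraces
# map more directly to source lines (debug-only, not for production)
CFLAGS_module.o = -O0
```

Rebuild and reboot into the resulting kernel before re-running the test.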
The first traces are from a system with local patches applied, but a later
post by me has traces from a clean upstream kernel.

Further debugging showed that this could be a race: it seems that all
migration/ threads think they are done with their state machine, but the
atomic thread counter sits at 1, so no progress is ever made.

http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg443471.html

Thanks,
Ben

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com