Subject: Re: workqueue: race in mod_delayed_work_on?
To: Tejun Heo
Cc: "linux-kernel@vger.kernel.org", Sasha Levin
From: Konstantin Khlebnikov
Date: Tue, 10 May 2016 20:20:18 +0300
Message-ID: <57321852.80908@yandex-team.ru>
In-Reply-To: <20160510163625.GM7110@mtj.duckdns.org>
References: <57319A12.2020403@yandex-team.ru> <57320C18.3000909@yandex-team.ru> <20160510163625.GM7110@mtj.duckdns.org>

On 10.05.2016 19:36, Tejun Heo wrote:
> Hello,
>
> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
>>
>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
>
> The timer usages are gated behind PENDING bit, so whether add_timer()
> is equal to mod_timer() shouldn't matter.

Hmm... this looks a little bit more complicated than one bit.
>
>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
>> (workqueue: make sure delayed work run in local cpu) into 3.18.25
>>
>> I don't see reason why that commit could break delayed work,
>> most likely it highlighted some other problem.
>
> What are you running? Can you reproduce the issue on upstream kernel?
>

This is a slightly patched 3.18.y. It looks like this started when we
upgraded the kernel to 3.18.25 and somebody loaded the ib_addr module
(IP in InfiniBand or something), which is actually unused because these
machines have no InfiniBand at all. But this code is sometimes poked
from Ethernet ARP, so it crashes somewhere from time to time.

I'll try to stress-test this piece.

--
Konstantin