From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754302AbcEOQJN (ORCPT ); Sun, 15 May 2016 12:09:13 -0400 Received: from forward-corp1o.mail.yandex.net ([37.140.190.172]:59659 "EHLO forward-corp1o.mail.yandex.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752457AbcEOQJM (ORCPT ); Sun, 15 May 2016 12:09:12 -0400 Authentication-Results: smtpcorp1m.mail.yandex.net; dkim=pass header.i=@yandex-team.ru Subject: Re: workqueue: race in mod_delayed_work_on? To: Tejun Heo References: <57319A12.2020403@yandex-team.ru> <57320C18.3000909@yandex-team.ru> <20160510163625.GM7110@mtj.duckdns.org> <57321852.80908@yandex-team.ru> <57347FE9.4070709@yandex-team.ru> <5735DB7C.2040201@yandex-team.ru> Cc: "linux-kernel@vger.kernel.org" , Sasha Levin From: Konstantin Khlebnikov Message-ID: <57389F01.309@yandex-team.ru> Date: Sun, 15 May 2016 19:08:33 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.2 MIME-Version: 1.0 In-Reply-To: <5735DB7C.2040201@yandex-team.ru> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 13.05.2016 16:49, Konstantin Khlebnikov wrote: > On 12.05.2016 16:06, Konstantin Khlebnikov wrote: >> On 10.05.2016 20:20, Konstantin Khlebnikov wrote: >>> On 10.05.2016 19:36, Tejun Heo wrote: >>>> Hello, >>>> >>>> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote: >>>>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote: >>>>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c >>>>> >>>>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer >>>> >>>> The timer usages are gated behind PENDING bit, so whether add_timer() >>>> is equal to mod_timer() shouldn't matter. >>> >>> Hmm... this looks little bit more complicated than one bit. >> >> Yep, problem was here - both timer and work can be active at the same time. > > Nope, this is impossible. This will be a bug itself. > >> >> So try_to_grab_pending can return success for two coRecentncurrent callers: >> first get del_timer, second removes work from workqueue. After that >> both call add timer and one of them either catch BUG_ON or corrupt timer list. >> >> I see two possible fixes: always remove timer and work in try_to_grab_pending >> but this must be carefully synchronized. This will make it slower for sure. >> Or always use mod_timer in __queue_delayed_work() - both callers will modify timer, >> but here is no mod_timer_on(). >> >>> >>>> >>>>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76 >>>>> (workqueue: make sure delayed work run in local cpu) into 3.18.25 >>>>> >>>>> I don't see reason why that commit could break delayed work, >>>>> most likely it highlighted some other problem. Indeed, my problem was caused by that backported commit because branch 3.18.y has no your fix for add_timer_on() -- 22b886dd1018093920c4250dee2a9a3cb7cff7b8 ("timers: Use proper base migration in add_timer_on()"). Somehow I've missed that. >>>> >>>> What are you running? Can you reproduce the issue on upstream kernel? >>>> >>> >>> This is slight patched 3.18.y. Looks like this started when we upgraded kernel to 3.18.25 and >>> somebody have loaded module ib_addr (ip in infiniband or something) which actually unused >>> because these machines have no infiniband at all. But this code is poked from ethernet arp >>> sometimes. So, it crashes somewhere from time to time. I'll try to stresstest this piece. >>> >> > -- Konstantin