From: Jay Vosburgh
Subject: Re: [RFC] bonding: fix workqueue re-arming races
Date: Wed, 01 Sep 2010 17:54:00 -0700
Message-ID: <3617.1283388840@death>
In-reply-to: <20100901205656.GA14982@smudla-wifi.bakulak.kosire.czf>
References: <20100831170752.GA9743@midget.suse.cz> <20136.1283288063@death> <20100901131626.GA12447@midget.suse.cz> <24764.1283361274@death> <20100901183113.GA25227@midget.suse.cz> <12656.1283371238@death> <20100901205656.GA14982@smudla-wifi.bakulak.kosire.czf>
To: Jiri Bohac
Cc: bonding-devel@lists.sourceforge.net, markine@google.com, jarkao2@gmail.com, chavey@google.com, netdev@vger.kernel.org

Jiri Bohac wrote:

>On Wed, Sep 01, 2010 at 01:00:38PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac wrote:
>> >I don't think this patch opens new races.  The current race
>> >scenario is:
>> >
>> >1) schedule_delayed_work(foo)
>> >2) foo's timer expires and foo is queued on bond->wq
>> >   (possibly, foo starts to run and either gets preempted or
>> >   sleeps on rtnl)
>> >3) bond_close() sets kill_timers=1 and calls
>> >   cancel_delayed_work(), which accomplishes nothing
>> >4) bond_open() sets kill_timers=0
>> >5) bond_open() calls schedule_delayed_work(bar)
>> >6) foo may run the "commit" work that should not be run
>> >7) foo re-arms
>> >8) if (foo == bar) -> BUG  /* bond->mode did not change */

	In looking at bond_open a bit, I wonder if a contributing
factor to the crash (if that's what happens) is that INIT_DELAYED_WORK
is being done in bond_open on a work item that's already queued or
running (left over from the kill_timers sentinel being missed).

	Calling queue_delayed_work when the work item is already queued
is ok, I believe, so I don't think that part is an issue (or at least
not a panic-worthy one).

	I suspect that the INIT_DELAYED_WORK calls will have to move to
bond_create if any of the work items end up being able to run beyond
the end of close (which seems likely; I'm running out of ideas).  A
rough sketch of what I mean is below.

>> >With this patch, it is:
>> >
>> >1) schedule_delayed_work(foo)
>> >2) foo's timer expires and foo is queued on bond->wq
>> >3) foo may have queued foo_commit on bond->wq_rtnl
>> >4) bond_close() cancels foo
>> >5) bond_open()
>> >6) foo_commit may run and it should not be run
>> >
>> >The patch avoids the problem of 7) and 8)
>>
>> 	But the "with patch" #6 is a bigger window; it doesn't require
>> step 5; the foo_commit, et al, can always run after bond_close (because
>> nothing ever cancels the foo_commit except for unregistration).  That's
>> the part that makes me nervous.
>
>We can always call cancel_work(foo_commit) in bond_close.  This
>should make the race window the same size it is now.
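	To be concrete about the INIT_DELAYED_WORK point above, the
rearrangement I have in mind would look roughly like this (a sketch,
not a patch; I'm using mii_work / bond_mii_monitor as the example, but
the same would apply to the other monitor work items):

	/* bond_create(): initialize each work item exactly once, at
	 * device creation time, so that a late-running instance left
	 * over from a previous bond_close() can never be
	 * re-initialized while it is still queued or running.
	 */
	INIT_DELAYED_WORK(&bond->mii_work, bond_mii_monitor);

	/* bond_open(): only queue, never re-INIT.  Calling
	 * queue_delayed_work() on an already-pending item is harmless
	 * (it returns without queueing); re-INITing one is not.
	 */
	if (bond->params.miimon)
		queue_delayed_work(bond->wq, &bond->mii_work, 0);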
>I didn't do that because I was thinking we'd avoid the race
>somehow completely.  Perhaps we can do cancel_work() now and solve
>it cleanly later.
>
>> 	The current race, as I understand it, requires a "close then
>> open" sequence with little delay between the two.
>
>Yeah, not sure how small the delay has to be.  With releasing
>bond->lock, acquiring rtnl and re-acquiring bond->lock in most of
>the work items, it may be pretty long.  Putting an extra check for
>kill_timers after bond->lock is re-acquired will make the window
>much smaller ... just in case this is the way we want to "fix"
>race conditions ;-)
>
>> >I think the race in 6) remains the same.  It is now easier to fix.
>> >This could even be done with a flag (similar to kill_timers),
>> >which would be set each time the "commit" work is queued on wq_rtnl and
>> >cleared by bond_close().  This should avoid the races completely,
>> >I think.  The trick is that, unlike kill_timers, bond_open() would
>> >not touch this flag.
>>
>> 	I'm chewing on whether or not it's feasible to introduce some
>> kind of generation count into bond_open/close, so that, e.g., at
>> bond_close, the generation is incremented.  Each time any of the work
>> items is queued, the current generation is stashed somewhere private to
>> that work item (in struct bonding, probably).  Then, when it runs, it
>> compares the current generation to the stored one.  If they don't match,
>> then the work item does nothing.
>
>I thought about the generation count as well before I did this
>patch.  I don't think you can put the counter in struct bonding --
>because that would be overwritten with the new value if the work
>item is re-scheduled by bond_open.
>
>I think you would have to create a new dynamic structure on each
>work schedule and pass it to the work item in the "data" pointer.
>The structure would contain the counter and the bond pointer.  It
>would be freed by the work item.  I did not like this very much.

	(A sketch of roughly what I think you mean is in the P.S.
below.)

>> >[BTW, this is https://bugzilla.novell.com/show_bug.cgi?id=602969
>> >, Novell BZ account needed]
>>
>> 	My BZ account is unworthy to access that bug; can you provide
>> any information as to how they're hitting the problem?  Presumably
>> they're doing something that's doing a fast down/up cycle on the bond,
>> but anything else?
>
>They are doing "rcnetwork restart", which will do the
>close->open.  Perhaps all the contention on the rtnl (lots of work
>with other network interfaces) makes the race window longer.  I
>couldn't reproduce this.

	What actually happens when the failure occurs?  Is there a
stack dump?

>> 	I'm wondering if there's any utility in the "generation count"
>> idea I mention above.  It's still a sentinel, but if that can be worked
>> out to reliably stop the work items after close, then maybe it's the
>> least bad option.
>
>Not without the dynamic allocation, I think.
>
>How about the "kill_timers" on top of this patch (see my
>previous e-mail) -- a flag that would be set when queuing the
>"commit" work and cleared by bond_close()?
>
>While this cannot stop the re-arming race it is trying to stop
>now, it should be able to stop the "commit" work items (where it
>does not matter if you try to queue them on the workqueue for a
>second time, since it is not a delayed work).
>
>--
>Jiri Bohac
>SUSE Labs, SUSE CZ

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
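P.S.  For completeness, the dynamically allocated generation scheme we
both backed away from would look roughly like this, on top of your
patch since it uses bond->wq_rtnl (purely a sketch, untested;
bond_gen_work, open_gen, bond_queue_commit and bond_foo_commit are
names I made up, and locking and error handling are elided):

	struct bond_gen_work {
		struct work_struct work;
		struct bonding *bond;
		unsigned int gen;	/* bond->open_gen at queue time */
	};

	/* Work side: a stale generation means a close happened after
	 * this instance was queued, so do nothing.  The carrier is
	 * freed here, by the work item, as you said.
	 */
	static void bond_foo_commit(struct work_struct *work)
	{
		struct bond_gen_work *gw =
			container_of(work, struct bond_gen_work, work);

		if (gw->gen == gw->bond->open_gen) {
			/* ... acquire rtnl and do the commit ... */
		}
		kfree(gw);
	}

	/* Queue side: allocate a fresh carrier per queueing, stamped
	 * with the current generation; bond_close() does open_gen++
	 * and never has to care what is still in flight.
	 */
	static int bond_queue_commit(struct bonding *bond)
	{
		struct bond_gen_work *gw = kmalloc(sizeof(*gw), GFP_ATOMIC);

		if (!gw)
			return -ENOMEM;
		gw->bond = bond;
		gw->gen = bond->open_gen;
		INIT_WORK(&gw->work, bond_foo_commit);
		queue_work(bond->wq_rtnl, &gw->work);
		return 0;
	}

	The comparison is still racy, of course -- open_gen can change
between the check and the commit -- so the work item would have to
re-check under rtnl and bond->lock, which is the same game kill_timers
plays today.  That, plus the allocation, is why I'm not in love with
it either.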