From mboxrd@z Thu Jan 1 00:00:00 1970 From: Chris Metcalf Subject: Re: [PATCH v15 04/13] task_isolation: add initial support Date: Mon, 12 Sep 2016 15:25:04 -0400 Message-ID: <84b47c50-6f1b-dcdb-c90e-471d1b4adbac@mellanox.com> References: <1471382376-5443-1-git-send-email-cmetcalf@mellanox.com> <1471382376-5443-5-git-send-email-cmetcalf@mellanox.com> <20160829163352.GV10153@twins.programming.kicks-ass.net> <20160830075854.GZ10153@twins.programming.kicks-ass.net> <3f84f736-ed7f-adff-d5f0-4f7db664208f@mellanox.com> <440e20d1-441a-3228-6b37-6e71e9fce47c@mellanox.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: Sender: owner-linux-mm@kvack.org To: Andy Lutomirski Cc: "linux-doc@vger.kernel.org" , Thomas Gleixner , Christoph Lameter , Michal Hocko , Gilad Ben Yossef , Andrew Morton , Linux API , Viresh Kumar , Ingo Molnar , Steven Rostedt , Tejun Heo , Will Deacon , Rik van Riel , Frederic Weisbecker , "Paul E. McKenney" , "linux-mm@kvack.org" , "linux-kernel@vger.kernel.org" , Catalin Marinas , Peter Zijlstra List-Id: linux-api@vger.kernel.org On 9/12/2016 1:41 PM, Andy Lutomirski wrote: > On Sep 9, 2016 1:40 PM, "Chris Metcalf" wrote: >> On 9/2/2016 1:28 PM, Andy Lutomirski wrote: >>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" wrote: >>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote: >>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf wrote: >>>>>> So to pop up a level, what is your actual concern about the existing >>>>>> "do it in a loop" model? The macrology currently in use means there >>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software >>>>>> maintenance cost seems low since the idioms used for task isolation >>>>>> in the loop are generally familiar to people reading that code. >>>>> My concern is that it's not obvious to readers of the code that the >>>>> loop ever terminates. It really ought to, but it's doing something >>>>> very odd. Normally we can loop because we get scheduled out, but >>>>> actually blocking in the return-to-userspace path, especially blocking >>>>> on a condition that doesn't have a wakeup associated with it, is odd. >>>> >>>> True, although, comments :-) >>>> >>>> Regardless, though, this doesn't seem at all weird to me in the >>>> context of the vmstat and lru stuff, though. It's exactly parallel to >>>> the fact that we loop around on checking need_resched and signal, and >>>> in some cases you could imagine multiple loops around when we schedule >>>> out and get a signal, so loop around again, and then another >>>> reschedule event happens during signal processing so we go around >>>> again, etc. Eventually it settles down. It's the same with the >>>> vmstat/lru stuff. >>> Only kind of. >>> >>> When we say, effectively, while (need_resched()) schedule();, we're >>> not waiting for an event or condition per se. We're runnable (in the >>> sense that userspace wants to run and we're not blocked on anything) >>> the entire time -- we're simply yielding to some other thread that is >>> also runnable. So if that loop runs forever, it either means that >>> we're at low priority and we genuinely shouldn't be running or that >>> there's a scheduler bug. >>> >>> If, on the other hand, we say while (not quiesced) schedule(); (or >>> equivalent), we're saying that we're *not* really ready to run and >>> that we're waiting for some condition to change. The condition in >>> question is fairly complicated and won't wake us when we are ready. I >>> can also imagine the scheduler getting rather confused, since, as far >>> as the scheduler knows, we are runnable and we are supposed to be >>> running. >> >> So, how about a code structure like this? >> >> In the main return-to-userspace loop where we check TIF flags, >> we keep the notion of our TIF_TASK_ISOLATION flag that causes >> us to invoke a task_isolation_prepare() routine. This routine >> does the following things: >> >> 1. As you suggested, set a new TIF bit (or equivalent) that says the >> system should no longer create deferred work on this core, and then >> flush any necessary already-deferred work (currently, the LRU cache >> and the vmstat stuff). We never have to go flush the deferred work >> again during this task's return to userspace. Note that this bit can >> only be set on a core marked for task isolation, so it can't be used >> for denial of service type attacks on normal cores that are trying to >> multitask normal Linux processes. > I think it can't be a TIF flag unless you can do the whole mess with > preemption off because, if you get preempted, other tasks on the CPU > won't see the flag. You could do it with a percpu flag, I think. Yes, a percpu flag - you're right. I think it will make sense for this to be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc. >> 2. Check if the dyntick is stopped, and if not, wait on a completion >> that will be set when it does stop. This means we may schedule out at >> this point, but when we return, the deferred work stuff is still safe >> since your bit is still set, and in principle the dyn tick is >> stopped. >> >> Then, after we disable interrupts and re-read the thread-info flags, >> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still >> set that would keep us in the loop. This will always end up happening >> on each return to userspace, since the only thing that actually clears >> the bit is a prctl() call. When that happens we know we are about to >> return to userspace, so we call task_isolation_ready(), which now has >> two things to do: > Would it perhaps be more straightforward to do the stuff before the > loop and not check TIF_TASK_ISOLATION in the loop? We can certainly play around with just not looping in this case, but in particular I can imagine an isolated task entering the kernel and then doing something that requires scheduling a kernel task. We'd clearly like that other task to run before the isolated task returns to userspace. But then, that other task might do something to re-enable the dyntick. That's why we'd like to recheck that dyntick is off in the loop after each potential call to schedule(). >> 1. We check that the dyntick is in fact stopped, since it's possible >> that a race condition led to it being somehow restarted by an interrupt. >> If it is not stopped, we go around the loop again so we can go back in >> to the completion discussed above and wait some more. This may merit >> a WARN_ON or other notice since it seems like people aren't convinced >> there are things that could restart it, but honestly the dyntick stuff >> is complex enough that I think a belt-and-suspenders kind of test here >> at the last minute is just defensive programming. > Seems reasonable. But maybe this could go after the loop and, if the > dyntick is back, it could be treated like any other kernel bug that > interrupts an isolated task? That would preserve more of the existing > code structure. Well, we can certainly try it that way. If I move it out and my testing doesn't trigger the bug, that's at least an initial sign that it might be OK. But I worry/suspect that it will trip up at some point in some use case and we'll have to fix it at that point. > If that works, it could go in user_enter(). Presumably with trace_user_enter() and vtime_user_enter() in __context_tracking_enter()? -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org