From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754764AbaEEHcI (ORCPT ); Mon, 5 May 2014 03:32:08 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:55675 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1754635AbaEEHcH convert rfc822-to-8bit (ORCPT ); Mon, 5 May 2014 03:32:07 -0400 X-IronPort-AV: E=Sophos;i="4.97,985,1389715200"; d="scan'208";a="30072229" Message-ID: <53673074.1040406@cn.fujitsu.com> Date: Mon, 5 May 2014 15:32:20 +0900 From: Dongsheng Yang User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130612 Thunderbird/17.0.6 MIME-Version: 1.0 To: Peter Zijlstra , , CC: , , , Subject: Re: [PATCH 4/8] sched/core: Skip wakeup when task is already running. References: <51238bf1648b1f4c66d3547a49cf949d1679d068.1397562542.git.yangds.fnst@cn.fujitsu.com> <20140415135326.GV11096@twins.programming.kicks-ass.net> <534E59FC.2090001@cn.fujitsu.com> <535658DB.2090801@cn.fujitsu.com> <20140422181820.GT26782@laptop.programming.kicks-ass.net> In-Reply-To: <20140422181820.GT26782@laptop.programming.kicks-ass.net> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8BIT X-Originating-IP: [10.167.226.49] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi all, Thanx for your reply, and sorry for the late. On 04/23/2014 03:18 AM, Peter Zijlstra wrote: > On Tue, Apr 22, 2014 at 10:10:52AM -0700, bsegall@google.com wrote: >> This is all expected behavior, and the somewhat less than useful trace >> events are expected. A task setting p->state to TASK_RUNNING without >> locks is fine if and only p == current. The standard deschedule loop is >> basically: >> >> while (1) { >> set_current_state(TASK_(UN)INTERRUPTIBLE); >> if (should_still_sleep) >> schedule(); >> } >> set_current_state(TASK_RUNNING); >> >> Which can produce this in a race. >> >> The only problem this causes is a wasted check_preempt_curr call in the >> racing case, and a somewhat inaccurate sched:sched_wakeup trace event. >> Note that even if you did recheck in ttwu_do_wakeup you could still race >> and get an "inaccurate" trace event. Yes, even I recheck it in ttwu_do_wakeup(), I could still race. Now I can not catch up a good idea to avoid this race. But I think it is not too expensive. About the event of sched:sched_wakeup we can get the bug information as I mentioned at the first mail of this thread: [root@yds-PC linux]# perf sched latency|tail bash:25186 | 0.410 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s bash:25174 | 0.407 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s bash:25268 | 0.367 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s bash:25279 | 0.358 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s bash:25238 | 0.286 ms | 1 | avg: 0.000 ms | max: 0.000 ms | max at: 0.000000 s ----------------------------------------------------------------------------------------- TOTAL: | 1344.539 ms | 9604 | --------------------------------------------------- *INFO: 0.557% state machine bugs (51 out of 9161)* It is caused by the issue we discussed in this thread, that ttwu_do_wakeup() could be called to wakeup a task which is on run queue. There are two solutions in my mind: * Add a new trace event, such as sched:sched_enqueue. Then we can trace the event of it and get the timestamp when a task enqueue and start waiting cpu. In this way, we can calculate the latency time with (sched_in_time - sched_enqueue_time) in `perf sched latency`. * Move the current trace point from ttwu_do_wakeup() to ttwu_activate(). Currently the sched:sched_wakeup can tell user very little. When we get a sched:sched_wakeup: a) We can not say a task is inserted into run queue, it is also used for task which is on_rq and only change the task->state to TASK_RUNNING. b) We can not say the task->state is changed from {UN}INTERRUPTABLE to RUNNING, sometimes task->state is already changed to RUNNING by other cpu. I prefer the second one, anyway, current sched_wakeup tells user none. >> Heck, even if the ttwu is >> _necessary_ because p is currently trying to take rq->lock to >> deschedule, you won't get a matching sched_switch event, because the >> ttwu is running before schedule is. >> >> You could sorta fix this I guess by tracking every write to p->state >> with trace events, but that would be a somewhat different change, and >> might be considered too expensive for all I know (and the trace events >> could /still/ be resolved in a different order across cpus compared to >> p->state's memory). > Ah, you're saying that a second task could try a spurious wakeup between > set_current_state() and schedule(). Yes, that'll trigger this indeed. > > > . >