From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <juergen.gross@ts.fujitsu.com>
Subject: Re: [PATCH v2] Fix scheduler crash after s3 resume
Date: Fri, 25 Jan 2013 11:23:40 +0100
Message-ID: <51025D2C.3040005@ts.fujitsu.com>
References: <5100070F.7010808@citrix.com>
	<5100D229.4030906@ts.fujitsu.com>	<510144A3.9060302@citrix.com>	<5101630D02000078000B93AD@nat28.tlf.novell.com>	<51016065.3080902@citrix.com>	<510175E802000078000B94A1@nat28.tlf.novell.com>	<51024B56.20706@citrix.com>	<5102603302000078000B985C@nat28.tlf.novell.com>	<5102541D.1070408@citrix.com>
	<5102694B02000078000B98B7@nat28.tlf.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <5102694B02000078000B98B7@nat28.tlf.novell.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich <JBeulich@suse.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, Tomasz Wroblewski <tomasz.wroblewski@citrix.com>, "Keir (Xen.org)" <keir@xen.org>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

On 25.01.2013 11:15, Jan Beulich wrote:
>>>> On 25.01.13 at 10:45, Tomasz Wroblewski <tomasz.wroblewski@citrix.com> wrote:
>
>>> I think I had already raised the question of the placement of
>>> this rcu_barrier() here, and the lack of a counterpart in the
>>> suspend portion of the path. Keir? Or should
>>> rcu_barrier_action() avoid calling process_pending_softirqs()
>>> while still resuming, and instead call __do_softirq() with all but
>>> RCU_SOFTIRQ masked (perhaps through a suitable wrapper,
>>> or alternatively by open-coding its effect)?
>>>
>> Though I recall these vcpu_wake crashes also happen from entry points
>> in enter_state other than rcu_barrier, so I don't think removing that
>> helps much. I was just unable to get a proper log of them today because
>> most of them were cut in half. Will try a bit more.
>
> In which case making __do_softirq() itself honor being in the
> suspend/resume path might still be an option.
>
>> My belief is that as long as vcpu_migrate is not called in
>> cpu_disable_scheduler, vcpu->processor will continue to point to an
>> offline cpu, which will crash if vcpu_wake is called for that vcpu.
>> If vcpu_migrate is called, then vcpu_wake will still be called with some
>> frequency, but since vcpu->processor will point to an online cpu, it
>> won't crash. So avoiding the wakes here completely is likely not the
>> goal, just the offline ones.
>
> But you neglect the fact that waking vCPU-s at this point is
> unnecessary anyway (they have nowhere to run on).

What about adding a global scheduler_disable() in freeze_domains() and a
scheduler_enable() in thaw_domains(), which would switch scheduler locking to
a global lock (or disable it altogether)? This should solve all the problems
discussed here without any complex changes to current behavior.
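
As a rough illustration of the idea (a toy, single-threaded model, not
actual Xen code): only the names scheduler_disable() and scheduler_enable()
come from the proposal above. The lock type, the per-CPU array, the
selection helper and toy_vcpu_wake() are hypothetical stand-ins, and the
sketch assumes the crashes come from touching per-CPU scheduler state of a
CPU that is offline during suspend/resume.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical toy lock: counts acquisitions instead of really locking. */
struct toy_lock {
    bool held;
    unsigned long acquisitions;
};

struct toy_lock percpu_sched_lock[NR_CPUS]; /* normal operation */
struct toy_lock global_sched_lock;          /* used while "disabled" */
bool sched_globally_locked;                 /* true between disable/enable */

/* Pick the lock guarding scheduler state for @cpu (hypothetical helper). */
static struct toy_lock *sched_lock_for(unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    /* While disabled, no per-CPU state is consulted at all. */
    return sched_globally_locked ? &global_sched_lock
                                 : &percpu_sched_lock[cpu];
}

static void toy_lock_acquire(struct toy_lock *l)
{
    l->held = true;
    l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
    l->held = false;
}

/* Would be called from freeze_domains() under the proposal. */
void scheduler_disable(void)
{
    sched_globally_locked = true;
}

/* Would be called from thaw_domains() under the proposal. */
void scheduler_enable(void)
{
    sched_globally_locked = false;
}

/* Hypothetical stand-in for a wake path that needs the scheduler lock. */
void toy_vcpu_wake(unsigned int cpu)
{
    struct toy_lock *l = sched_lock_for(cpu);

    toy_lock_acquire(l);
    /* ... scheduler bookkeeping would go here ... */
    toy_lock_release(l);
}
```

In this model, a wake issued between scheduler_disable() and
scheduler_enable() only ever takes global_sched_lock, so it never depends
on the per-CPU entry of the CPU the vcpu was last assigned to. Whether that
is sufficient in the real scheduler is exactly the open question.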


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html