* POD: soft lockups in dom0 kernel
@ 2013-12-05 13:55 Dietmar Hahn
2013-12-06 10:00 ` Jan Beulich
2014-01-16 11:10 ` Jan Beulich
0 siblings, 2 replies; 12+ messages in thread
From: Dietmar Hahn @ 2013-12-05 13:55 UTC (permalink / raw)
To: xen-devel
Hi,
When creating a bigger (> 50 GB) HVM guest with maxmem > memory, we get
soft lockups from time to time.
kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
I tracked this down to the call of xc_domain_set_pod_target() and further to
p2m_pod_set_mem_target().
Unfortunately I can check this only with xen-4.2.2, as I don't have a machine
with enough memory for current hypervisors, but the code seems to be nearly
the same.
My suggestion would be to do the 'pod set target' in the function
xc_domain_set_pod_target() in chunks of maybe 1GB, to give the dom0 scheduler
a chance to run.
As this is not performance critical, it should not be a problem.
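For illustration, a rough sketch of how such chunking could look on the
libxc side (the helper name, the chunk size and the assumption that the
caller already knows the current target in pages are all made up for this
example):

#include <stdint.h>
#include <xenctrl.h>

#define POD_CHUNK_PAGES (1UL << 18) /* 1GB worth of 4kB pages */

/* Illustrative only: step the PoD target towards its final value in
 * ~1GB increments, so that each XENMEM_set_pod_target hypercall stays
 * reasonably short. */
static int set_pod_target_chunked(xc_interface *xch, uint32_t domid,
                                  uint64_t cur_pages, uint64_t target_pages)
{
    uint64_t tot, cache, entries;
    int rc = 0;

    while ( rc == 0 && cur_pages != target_pages )
    {
        if ( target_pages > cur_pages )
            cur_pages += (target_pages - cur_pages > POD_CHUNK_PAGES)
                         ? POD_CHUNK_PAGES : target_pages - cur_pages;
        else
            cur_pages -= (cur_pages - target_pages > POD_CHUNK_PAGES)
                         ? POD_CHUNK_PAGES : cur_pages - target_pages;

        rc = xc_domain_set_pod_target(xch, domid, cur_pages,
                                      &tot, &cache, &entries);
    }

    return rc;
}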
I can reproduce this with SLES11-SP3 with Linux 3.0.101 and xen-4.2.2.
# cat dummy
name = "DummyOS"
memory = 10000
maxmem = 12000
builder='hvm'
# echo 1 > /proc/sys/kernel/watchdog_thresh
# xm create -c dummy
This leads to a kernel message:
kernel: [ 5019.958089] BUG: soft lockup - CPU#4 stuck for 3s! [xend:20854]
Any comments are welcome.
Thanks.
Dietmar.
--
Company details: http://ts.fujitsu.com/imprint.html
* Re: POD: soft lockups in dom0 kernel
2013-12-05 13:55 POD: soft lockups in dom0 kernel Dietmar Hahn
@ 2013-12-06 10:00 ` Jan Beulich
2013-12-06 11:07 ` David Vrabel
2014-01-16 11:10 ` Jan Beulich
1 sibling, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2013-12-06 10:00 UTC (permalink / raw)
To: Dietmar Hahn; +Cc: xen-devel, Boris Ostrovsky, David Vrabel
>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
> softlockups from time to time.
>
> kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>
> I tracked this down to the call of xc_domain_set_pod_target() and further
> p2m_pod_set_mem_target().
>
> Unfortunately I can this check only with xen-4.2.2 as I don't have a machine
> with enough memory for current hypervisors. But it seems the code is nearly
> the same.
>
> My suggestion would be to do the 'pod set target' in the function
> xc_domain_set_pod_target() in chunks of maybe 1GB to give the dom0 scheduler
> a chance to run.
> As this is not performance critical it should not be a problem.
This is a broader problem: There are more long running hypercalls
than just the one setting the POD target. While a kernel built with
CONFIG_PREEMPT ought to have no issue with this (as the
hypervisor internal preemption will always exit back to the guest,
thus allowing interrupts to be processed) as long as such
hypercalls aren't invoked with preemption disabled, non-
preemptable kernels (the suggested default for servers) have -
afaict - no way to deal with this.
However, as long as interrupts and softirqs can get serviced by
the kernel (which they can as long as they weren't disabled upon
invocation of the hypercall), that may also be a mostly cosmetic
problem (in that the soft lockup is being reported) as long as no
real time like guarantees are required (which if they were would
be sort of contradictory to the kernel being non-preemptable),
i.e. other tasks may get starved for some time, but OS health
shouldn't be impacted.
Hence I wonder whether it wouldn't make sense to simply
suppress the soft lockup detection at least across privcmd
invoked hypercalls - Cc-ing upstream Linux maintainers to see if
they have an opinion or thoughts towards a proper solution.
Jan
* Re: POD: soft lockups in dom0 kernel
2013-12-06 10:00 ` Jan Beulich
@ 2013-12-06 11:07 ` David Vrabel
2013-12-06 11:30 ` Jan Beulich
0 siblings, 1 reply; 12+ messages in thread
From: David Vrabel @ 2013-12-06 11:07 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, Boris Ostrovsky, Dietmar Hahn
On 06/12/13 10:00, Jan Beulich wrote:
>>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
>> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
>> softlockups from time to time.
>>
>> kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>>
>> I tracked this down to the call of xc_domain_set_pod_target() and further
>> p2m_pod_set_mem_target().
>>
>> Unfortunately I can this check only with xen-4.2.2 as I don't have a machine
>> with enough memory for current hypervisors. But it seems the code is nearly
>> the same.
>>
>> My suggestion would be to do the 'pod set target' in the function
>> xc_domain_set_pod_target() in chunks of maybe 1GB to give the dom0 scheduler
>> a chance to run.
>> As this is not performance critical it should not be a problem.
>
> This is a broader problem: There are more long running hypercalls
> than just the one setting the POD target. While a kernel built with
> CONFIG_PREEMPT ought to have no issue with this (as the
> hypervisor internal preemption will always exit back to the guest,
> thus allowing interrupts to be processed) as long as such
> hypercalls aren't invoked with preemption disabled, non-
> preemptable kernels (the suggested default for servers) have -
> afaict - no way to deal with this.
>
> However, as long as interrupts and softirqs can get serviced by
> the kernel (which they can as long as they weren't disabled upon
> invocation of the hypercall), that may also be a mostly cosmetic
> problem (in that the soft lockup is being reported) as long as no
> real time like guarantees are required (which if they were would
> be sort of contradictory to the kernel being non-preemptable),
> i.e. other tasks may get starved for some time, but OS health
> shouldn't be impacted.
>
> Hence I wonder whether it wouldn't make sense to simply
> suppress the soft lockup detection at least across privcmd
> invoked hypercalls - Cc-ing upstream Linux maintainers to see if
> they have an opinion or thoughts towards a proper solution.
We do not want to disable the soft lockup detection here as it has found
a bug. We can't have tasks that are unschedulable for minutes as it
would only take a handful of such tasks to hose the system.
We should put an explicit preemption point in. This will fix it for the
CONFIG_PREEMPT_VOLUNTARY case, which I think is the most common
configuration. Or perhaps this should even be a cond_resched() call to
fix it for the fully non-preemptible case as well.
David
* Re: POD: soft lockups in dom0 kernel
2013-12-06 11:07 ` David Vrabel
@ 2013-12-06 11:30 ` Jan Beulich
2013-12-06 12:00 ` David Vrabel
0 siblings, 1 reply; 12+ messages in thread
From: Jan Beulich @ 2013-12-06 11:30 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Boris Ostrovsky, Dietmar Hahn
>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
> We do not want to disable the soft lockup detection here as it has found
> a bug. We can't have tasks that are unschedulable for minutes as it
> would only take a handful of such tasks to hose the system.
My understanding is that the soft lockup detection is what its name
says - a mechanism to find cases where the kernel software locked
up. Yet that's not the case with long running hypercalls.
> We should put an explicit preemption point in. This will fix it for the
> CONFIG_PREEMPT_VOLUNTARY case which I think is the most common
> configuration. Or perhaps this should even be a cond_reched() call to
> fix it for fully non-preemptible as well.
How do you imagine doing this? When the hypervisor preempts a
hypercall, all the kernel gets to see is that it drops back into the
hypercall page, such that the next thing to happen would be
re-execution of the hypercall. You can't call anything at that point,
all that can get run here are interrupts (i.e. event upcalls). Or do
you suggest to call cond_resched() from within
__xen_evtchn_do_upcall()?
And even if you do - how certain is it that what gets its continuation
deferred won't interfere with other things the kernel wants to do
(since if you'd be doing it that way, you'd cover all hypercalls at
once, not just those coming through privcmd, and hence you could
end up with partially completed multicalls or other forms of batching,
plus you'd need to deal with possibly active lazy modes).
Jan
* Re: POD: soft lockups in dom0 kernel
2013-12-06 11:30 ` Jan Beulich
@ 2013-12-06 12:00 ` David Vrabel
2013-12-06 13:52 ` Dietmar Hahn
2013-12-06 14:50 ` Boris Ostrovsky
0 siblings, 2 replies; 12+ messages in thread
From: David Vrabel @ 2013-12-06 12:00 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, Boris Ostrovsky, Dietmar Hahn
On 06/12/13 11:30, Jan Beulich wrote:
>>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
>> We do not want to disable the soft lockup detection here as it has found
>> a bug. We can't have tasks that are unschedulable for minutes as it
>> would only take a handful of such tasks to hose the system.
>
> My understanding is that the soft lockup detection is what its name
> says - a mechanism to find cases where the kernel software locked
> up. Yet that's not the case with long running hypercalls.
Well ok, it's not a lockup in the kernel but it's still a task that
cannot be descheduled for minutes of wallclock time. This is still a
bug that needs to be fixed.
>> We should put an explicit preemption point in. This will fix it for the
>> CONFIG_PREEMPT_VOLUNTARY case which I think is the most common
>> configuration. Or perhaps this should even be a cond_reched() call to
>> fix it for fully non-preemptible as well.
>
> How do you imagine to do this? When the hypervisor preempts a
> hypercall, all the kernel gets to see is that it drops back into the
> hypercall page, such that the next thing to happen would be
> re-execution of the hypercall. You can't call anything at that point,
> all that can get run here are interrupts (i.e. event upcalls). Or do
> you suggest to call cond_resched() from within
> __xen_evtchn_do_upcall()?
I've not looked at how.
> And even if you do - how certain is it that what gets its continuation
> deferred won't interfere with other things the kernel wants to do
> (since if you'd be doing it that way, you'd cover all hypercalls at
> once, not just those coming through privcmd, and hence you could
> end up with partially completed multicalls or other forms of batching,
> plus you'd need to deal with possibly active lazy modes).
I would only do this for hypercalls issued by the privcmd driver.
David
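As a rough illustration of that direction (the per-CPU flag is invented for
this sketch, the rest loosely mirrors mainline's privcmd hypercall ioctl, and
the hard part - actually rescheduling from the interrupt-return path while
the preempted hypercall is pending - still needs entry-code support; the
draft patch later in this thread does essentially this for the SLE11 tree):

#include <linux/percpu.h>
#include <linux/types.h>
#include <linux/uaccess.h>
#include <xen/privcmd.h>
#include <asm/xen/hypercall.h>

/* Set while this CPU is executing a hypercall on behalf of privcmd; the
 * IRQ-exit/upcall-return path could then reschedule when it finds this
 * flag set, need_resched() true and preemption not otherwise disabled. */
static DEFINE_PER_CPU(bool, privcmd_hcall_preemptible);

static long privcmd_ioctl_hypercall(void __user *udata)
{
	struct privcmd_hypercall hypercall;
	long ret;

	if (copy_from_user(&hypercall, udata, sizeof(hypercall)))
		return -EFAULT;

	this_cpu_write(privcmd_hcall_preemptible, true);
	ret = privcmd_call(hypercall.op,
			   hypercall.arg[0], hypercall.arg[1],
			   hypercall.arg[2], hypercall.arg[3],
			   hypercall.arg[4]);
	this_cpu_write(privcmd_hcall_preemptible, false);

	return ret;
}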
* Re: POD: soft lockups in dom0 kernel
2013-12-06 12:00 ` David Vrabel
@ 2013-12-06 13:52 ` Dietmar Hahn
2013-12-06 14:58 ` David Vrabel
2013-12-06 14:50 ` Boris Ostrovsky
1 sibling, 1 reply; 12+ messages in thread
From: Dietmar Hahn @ 2013-12-06 13:52 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Boris Ostrovsky, Jan Beulich
Am Freitag 06 Dezember 2013, 12:00:02 schrieb David Vrabel:
> On 06/12/13 11:30, Jan Beulich wrote:
> >>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
> >> We do not want to disable the soft lockup detection here as it has found
> >> a bug. We can't have tasks that are unschedulable for minutes as it
> >> would only take a handful of such tasks to hose the system.
> >
> > My understanding is that the soft lockup detection is what its name
> > says - a mechanism to find cases where the kernel software locked
> > up. Yet that's not the case with long running hypercalls.
>
> Well ok, it's not a lockup in the kernel but it's still a task that
> cannot be descheduled for minutes of wallclock time. This is still a
> bug that needs to be fixed.
>
> >> We should put an explicit preemption point in. This will fix it for the
> >> CONFIG_PREEMPT_VOLUNTARY case which I think is the most common
> >> configuration. Or perhaps this should even be a cond_reched() call to
> >> fix it for fully non-preemptible as well.
> >
> > How do you imagine to do this? When the hypervisor preempts a
> > hypercall, all the kernel gets to see is that it drops back into the
> > hypercall page, such that the next thing to happen would be
> > re-execution of the hypercall. You can't call anything at that point,
> > all that can get run here are interrupts (i.e. event upcalls). Or do
> > you suggest to call cond_resched() from within
> > __xen_evtchn_do_upcall()?
>
> I've not looked at how.
>
> > And even if you do - how certain is it that what gets its continuation
> > deferred won't interfere with other things the kernel wants to do
> > (since if you'd be doing it that way, you'd cover all hypercalls at
> > once, not just those coming through privcmd, and hence you could
> > end up with partially completed multicalls or other forms of batching,
> > plus you'd need to deal with possibly active lazy modes).
>
> I would only do this for hypercalls issued by the privcmd driver.
But I also got soft lockups when unmapping a bigger chunk of guest memory
(our BS2000 OS) in the dom0 kernel via vunmap(). This ends up calling
HYPERVISOR_update_va_mapping() and may take a very long time.
From a kernel module I found no way to split the virtual address area so
that schedule() could be called in between, because the kernel functions
needed for that are not exported for use by modules. The only possible
solution was to turn off the soft lockup detection.
Dietmar.
>
> David
>
--
Company details: http://ts.fujitsu.com/imprint.html
* Re: POD: soft lockups in dom0 kernel
2013-12-06 12:00 ` David Vrabel
2013-12-06 13:52 ` Dietmar Hahn
@ 2013-12-06 14:50 ` Boris Ostrovsky
1 sibling, 0 replies; 12+ messages in thread
From: Boris Ostrovsky @ 2013-12-06 14:50 UTC (permalink / raw)
To: David Vrabel; +Cc: xen-devel, Dietmar Hahn, Jan Beulich
On 12/06/2013 07:00 AM, David Vrabel wrote:
> On 06/12/13 11:30, Jan Beulich wrote:
>>>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
>>> We do not want to disable the soft lockup detection here as it has found
>>> a bug. We can't have tasks that are unschedulable for minutes as it
>>> would only take a handful of such tasks to hose the system.
>> My understanding is that the soft lockup detection is what its name
>> says - a mechanism to find cases where the kernel software locked
>> up. Yet that's not the case with long running hypercalls.
> Well ok, it's not a lockup in the kernel but it's still a task that
> cannot be descheduled for minutes of wallclock time. This is still a
> bug that needs to be fixed.
>
>>> We should put an explicit preemption point in. This will fix it for the
>>> CONFIG_PREEMPT_VOLUNTARY case which I think is the most common
>>> configuration. Or perhaps this should even be a cond_reched() call to
>>> fix it for fully non-preemptible as well.
>> How do you imagine to do this? When the hypervisor preempts a
>> hypercall, all the kernel gets to see is that it drops back into the
>> hypercall page, such that the next thing to happen would be
>> re-execution of the hypercall. You can't call anything at that point,
>> all that can get run here are interrupts (i.e. event upcalls). Or do
>> you suggest to call cond_resched() from within
>> __xen_evtchn_do_upcall()?
> I've not looked at how.
KVM has a hook (kvm_check_and_clear_guest_paused()) into the watchdog code
to prevent it from reporting false positives (for a different reason,
though). If we claim that the soft lockup mechanism is only meant to detect
Linux kernel problems and not long-running hypervisor code, then perhaps we
can make this hook a bit more generic.
We would still need to think about what may happen if we are stuck in
the hypervisor for an abnormally long time. Maybe this Xen hook can still
return false when such cases are detected.
-boris
>
>> And even if you do - how certain is it that what gets its continuation
>> deferred won't interfere with other things the kernel wants to do
>> (since if you'd be doing it that way, you'd cover all hypercalls at
>> once, not just those coming through privcmd, and hence you could
>> end up with partially completed multicalls or other forms of batching,
>> plus you'd need to deal with possibly active lazy modes).
> I would only do this for hypercalls issued by the privcmd driver.
>
> David
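As an illustration of the generalisation Boris describes, one possible
shape (the hook name is invented here; today kernel/watchdog.c only knows
the KVM-specific check):

#include <linux/kvm_para.h>
#include <linux/types.h>

/* Hypothetical generic hook, called by the watchdog right before it would
 * report a soft lockup.  The default keeps today's behaviour for KVM
 * guests; a Xen implementation could override it and return true while
 * the interrupted context on this CPU is a (hypervisor-preempted) privcmd
 * hypercall, and false if we appear to be stuck in the hypervisor for an
 * abnormally long time. */
bool __weak hypervisor_check_and_clear_stall(void)
{
	return kvm_check_and_clear_guest_paused();
}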
* Re: POD: soft lockups in dom0 kernel
2013-12-06 13:52 ` Dietmar Hahn
@ 2013-12-06 14:58 ` David Vrabel
0 siblings, 0 replies; 12+ messages in thread
From: David Vrabel @ 2013-12-06 14:58 UTC (permalink / raw)
To: Dietmar Hahn; +Cc: xen-devel, Boris Ostrovsky, Jan Beulich
On 06/12/13 13:52, Dietmar Hahn wrote:
> Am Freitag 06 Dezember 2013, 12:00:02 schrieb David Vrabel:
>> On 06/12/13 11:30, Jan Beulich wrote:
>>>>>> On 06.12.13 at 12:07, David Vrabel <david.vrabel@citrix.com> wrote:
>>>> We do not want to disable the soft lockup detection here as it has found
>>>> a bug. We can't have tasks that are unschedulable for minutes as it
>>>> would only take a handful of such tasks to hose the system.
>>>
>>> My understanding is that the soft lockup detection is what its name
>>> says - a mechanism to find cases where the kernel software locked
>>> up. Yet that's not the case with long running hypercalls.
>>
>> Well ok, it's not a lockup in the kernel but it's still a task that
>> cannot be descheduled for minutes of wallclock time. This is still a
>> bug that needs to be fixed.
>>
>>>> We should put an explicit preemption point in. This will fix it for the
>>>> CONFIG_PREEMPT_VOLUNTARY case which I think is the most common
>>>> configuration. Or perhaps this should even be a cond_reched() call to
>>>> fix it for fully non-preemptible as well.
>>>
>>> How do you imagine to do this? When the hypervisor preempts a
>>> hypercall, all the kernel gets to see is that it drops back into the
>>> hypercall page, such that the next thing to happen would be
>>> re-execution of the hypercall. You can't call anything at that point,
>>> all that can get run here are interrupts (i.e. event upcalls). Or do
>>> you suggest to call cond_resched() from within
>>> __xen_evtchn_do_upcall()?
>>
>> I've not looked at how.
>>
>>> And even if you do - how certain is it that what gets its continuation
>>> deferred won't interfere with other things the kernel wants to do
>>> (since if you'd be doing it that way, you'd cover all hypercalls at
>>> once, not just those coming through privcmd, and hence you could
>>> end up with partially completed multicalls or other forms of batching,
>>> plus you'd need to deal with possibly active lazy modes).
>>
>> I would only do this for hypercalls issued by the privcmd driver.
>
> But I also got soft lockups when unmapping a bigger chunk of guest memory
> (our BS2000 OS) in the dom0 kernel via vunmap(). This calls in the end
> HYPERVISOR_update_va_mapping() and may take a very long time.
> From a kernel module I found no solution to split the virtual address area to
> be able to call schedule(). Because all needed kernel functions are not
> exported to be usable in modules. The only possible solution was to turn of
> the soft lockup detection.
vunmap() does a hypercall per page, since it calls ptep_get_and_clear(),
so there are no long-running hypercalls here.
zap_pmd_range() (which is used for munmap()) already has appropriate
cond_resched() calls after every zap_pte_range(), so I think a
cond_resched() call needs to be added into vunmap_pmd_range() as well.
David
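For reference, a sketch of the change David suggests, against a 3.0-era
mm/vmalloc.c (the surrounding function body is reproduced from memory and
the change assumes vunmap_pmd_range() is only reached from sleepable
context):

static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
{
	pmd_t *pmd;
	unsigned long next;

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none_or_clear_bad(pmd))
			continue;
		vunmap_pte_range(pmd, addr, next);
		/* added: preemption point per PMD, matching what
		 * zap_pmd_range() already does for munmap() */
		cond_resched();
	} while (pmd++, addr = next, addr != end);
}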
* Re: POD: soft lockups in dom0 kernel
2013-12-05 13:55 POD: soft lockups in dom0 kernel Dietmar Hahn
2013-12-06 10:00 ` Jan Beulich
@ 2014-01-16 11:10 ` Jan Beulich
2014-01-20 14:39 ` Andrew Cooper
2014-01-29 14:12 ` Dietmar Hahn
1 sibling, 2 replies; 12+ messages in thread
From: Jan Beulich @ 2014-01-16 11:10 UTC (permalink / raw)
To: Dietmar Hahn; +Cc: xen-devel
[-- Attachment #1: Type: text/plain, Size: 1132 bytes --]
>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
> softlockups from time to time.
>
> kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>
> I tracked this down to the call of xc_domain_set_pod_target() and further
> p2m_pod_set_mem_target().
>
> Unfortunately I can this check only with xen-4.2.2 as I don't have a machine
> with enough memory for current hypervisors. But it seems the code is nearly
> the same.
While I still haven't seen a formal report of this against SLE11 yet,
attached is a draft patch against the SP3 code base adding manual
preemption to the hypercall path of privcmd. This is only lightly
tested, and therefore still has a little bit of debugging code left in
there. Mind giving this a try (perhaps together with the patch
David had sent for the other issue - there may still be a need for
further preemption points in the IOCTL_PRIVCMD_MMAP*
handling, but without knowing for sure whether that matters to
you I didn't want to add that right away)?
Jan
[-- Attachment #2: xen-privcmd-hcall-preemption.patch --]
[-- Type: text/plain, Size: 6210 bytes --]
--- sle11sp3.orig/arch/x86/include/mach-xen/asm/hypervisor.h 2012-10-19 16:11:54.000000000 +0200
+++ sle11sp3/arch/x86/include/mach-xen/asm/hypervisor.h 2014-01-15 13:02:20.000000000 +0100
@@ -235,6 +235,9 @@ static inline int gnttab_post_map_adjust
#ifdef CONFIG_XEN
#define is_running_on_xen() 1
extern char hypercall_page[PAGE_SIZE];
+#define in_hypercall(regs) (!user_mode_vm(regs) && \
+ (regs)->ip >= (unsigned long)hypercall_page && \
+ (regs)->ip < (unsigned long)hypercall_page + PAGE_SIZE)
#else
extern char *hypercall_stubs;
#define is_running_on_xen() (!!hypercall_stubs)
--- sle11sp3.orig/arch/x86/kernel/entry_32-xen.S 2012-10-19 16:10:09.000000000 +0200
+++ sle11sp3/arch/x86/kernel/entry_32-xen.S 2014-01-16 10:49:07.000000000 +0100
@@ -980,6 +980,20 @@ ENTRY(hypervisor_callback)
call evtchn_do_upcall
add $4,%esp
CFI_ADJUST_CFA_OFFSET -4
+#ifndef CONFIG_PREEMPT
+ test %al,%al
+ jz ret_from_intr
+ GET_THREAD_INFO(%edx)
+ cmpl $0,TI_preempt_count(%edx)
+ jnz ret_from_intr
+ testl $_TIF_NEED_RESCHED,TI_flags(%edx)
+ jz ret_from_intr
+ testl $X86_EFLAGS_IF,PT_EFLAGS(%esp)
+ jz ret_from_intr
+ movb $0,PER_CPU_VAR(privcmd_hcall)
+ call preempt_schedule_irq
+ movb $1,PER_CPU_VAR(privcmd_hcall)
+#endif
jmp ret_from_intr
CFI_ENDPROC
--- sle11sp3.orig/arch/x86/kernel/entry_64-xen.S 2011-10-06 13:06:38.000000000 +0200
+++ sle11sp3/arch/x86/kernel/entry_64-xen.S 2014-01-16 10:52:27.000000000 +0100
@@ -982,6 +982,20 @@ ENTRY(do_hypervisor_callback) # do_hyp
popq %rsp
CFI_DEF_CFA_REGISTER rsp
decl PER_CPU_VAR(irq_count)
+#ifndef CONFIG_PREEMPT
+ test %al,%al
+ jz error_exit
+ GET_THREAD_INFO(%rdx)
+ cmpl $0,TI_preempt_count(%rdx)
+ jnz error_exit
+ bt $TIF_NEED_RESCHED,TI_flags(%rdx)
+ jnc error_exit
+ bt $9,EFLAGS-ARGOFFSET(%rsp)
+ jnc error_exit
+ movb $0,PER_CPU_VAR(privcmd_hcall)
+ call preempt_schedule_irq
+ movb $1,PER_CPU_VAR(privcmd_hcall)
+#endif
jmp error_exit
CFI_ENDPROC
END(do_hypervisor_callback)
--- sle11sp3.orig/drivers/xen/core/evtchn.c 2013-02-05 17:47:43.000000000 +0100
+++ sle11sp3/drivers/xen/core/evtchn.c 2014-01-15 13:42:02.000000000 +0100
@@ -379,7 +379,14 @@ static DEFINE_PER_CPU(unsigned int, curr
#endif
/* NB. Interrupts are disabled on entry. */
-asmlinkage void __irq_entry evtchn_do_upcall(struct pt_regs *regs)
+asmlinkage
+#ifdef CONFIG_PREEMPT
+void
+#define return(x) return
+#else
+bool
+#endif
+__irq_entry evtchn_do_upcall(struct pt_regs *regs)
{
unsigned long l1, l2;
unsigned long masked_l1, masked_l2;
@@ -393,7 +400,7 @@ asmlinkage void __irq_entry evtchn_do_up
__this_cpu_or(upcall_state, UPC_NESTED_LATCH);
/* Avoid a callback storm when we reenable delivery. */
vcpu_info_write(evtchn_upcall_pending, 0);
- return;
+ return(false);
}
old_regs = set_irq_regs(regs);
@@ -511,6 +518,9 @@ asmlinkage void __irq_entry evtchn_do_up
irq_exit();
xen_spin_irq_exit();
set_irq_regs(old_regs);
+
+ return(__this_cpu_read(privcmd_hcall) && in_hypercall(regs));
+#undef return
}
static int find_unbound_irq(unsigned int node, struct irq_cfg **pcfg,
--- sle11sp3.orig/drivers/xen/privcmd/privcmd.c 2012-12-12 12:05:51.000000000 +0100
+++ sle11sp3/drivers/xen/privcmd/privcmd.c 2014-01-16 10:01:23.000000000 +0100
@@ -23,6 +23,18 @@
#include <xen/interface/xen.h>
#include <xen/xen_proc.h>
#include <xen/features.h>
+#include <xen/evtchn.h>
+
+#ifndef CONFIG_PREEMPT
+DEFINE_PER_CPU(bool, privcmd_hcall);
+#endif
+
+static inline void _privcmd_hcall(bool state)
+{
+#ifndef CONFIG_PREEMPT
+ this_cpu_write(privcmd_hcall, state);
+#endif
+}
static struct proc_dir_entry *privcmd_intf;
static struct proc_dir_entry *capabilities_intf;
@@ -97,6 +109,7 @@ static long privcmd_ioctl(struct file *f
ret = -ENOSYS;
if (hypercall.op >= (PAGE_SIZE >> 5))
break;
+ _privcmd_hcall(true);
ret = _hypercall(long, (unsigned int)hypercall.op,
(unsigned long)hypercall.arg[0],
(unsigned long)hypercall.arg[1],
@@ -104,8 +117,10 @@ static long privcmd_ioctl(struct file *f
(unsigned long)hypercall.arg[3],
(unsigned long)hypercall.arg[4]);
#else
+ _privcmd_hcall(true);
ret = privcmd_hypercall(&hypercall);
#endif
+ _privcmd_hcall(false);
}
break;
--- sle11sp3.orig/include/xen/evtchn.h 2011-12-09 15:38:45.000000000 +0100
+++ sle11sp3/include/xen/evtchn.h 2014-01-15 14:32:14.000000000 +0100
@@ -143,7 +143,13 @@ void irq_resume(void);
#endif
/* Entry point for notifications into Linux subsystems. */
-asmlinkage void evtchn_do_upcall(struct pt_regs *regs);
+asmlinkage
+#ifdef CONFIG_PREEMPT
+void
+#else
+bool
+#endif
+evtchn_do_upcall(struct pt_regs *regs);
/* Mark a PIRQ as unavailable for dynamic allocation. */
void evtchn_register_pirq(int irq);
@@ -221,6 +227,8 @@ void notify_remote_via_ipi(unsigned int
void clear_ipi_evtchn(void);
#endif
+DECLARE_PER_CPU(bool, privcmd_hcall);
+
#if defined(CONFIG_XEN_SPINLOCK_ACQUIRE_NESTING) \
&& CONFIG_XEN_SPINLOCK_ACQUIRE_NESTING
void xen_spin_irq_enter(void);
--- sle11sp3.orig/kernel/sched.c 2014-01-10 14:11:39.000000000 +0100
+++ sle11sp3/kernel/sched.c 2014-01-16 11:05:05.000000000 +0100
@@ -4690,6 +4690,9 @@ asmlinkage void __sched notrace preempt_
}
EXPORT_SYMBOL(preempt_schedule);
+#endif
+#if defined(CONFIG_PREEMPT) || defined(CONFIG_XEN)
+
/*
* this is the entry point to schedule() from kernel preemption
* off of irq context.
@@ -4699,6 +4702,14 @@ EXPORT_SYMBOL(preempt_schedule);
asmlinkage void __sched preempt_schedule_irq(void)
{
struct thread_info *ti = current_thread_info();
+#ifdef CONFIG_XEN//temp
+static DEFINE_PER_CPU(unsigned long, cnt);
+static DEFINE_PER_CPU(unsigned long, thr);
+if(__this_cpu_inc_return(cnt) > __this_cpu_read(thr)) {
+ __this_cpu_or(thr, __this_cpu_read(cnt));
+ printk("psi[%02u] %08x:%d #%lx\n", raw_smp_processor_id(), ti->preempt_count, need_resched(), __this_cpu_read(cnt));
+}
+#endif
/* Catch callers which need to be fixed */
BUG_ON(ti->preempt_count || !irqs_disabled());
* Re: POD: soft lockups in dom0 kernel
2014-01-16 11:10 ` Jan Beulich
@ 2014-01-20 14:39 ` Andrew Cooper
2014-01-20 15:16 ` Jan Beulich
2014-01-29 14:12 ` Dietmar Hahn
1 sibling, 1 reply; 12+ messages in thread
From: Andrew Cooper @ 2014-01-20 14:39 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, David Vrabel, Dietmar Hahn
On 16/01/14 11:10, Jan Beulich wrote:
>>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
>> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
>> softlockups from time to time.
>>
>> kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>>
>> I tracked this down to the call of xc_domain_set_pod_target() and further
>> p2m_pod_set_mem_target().
>>
>> Unfortunately I can this check only with xen-4.2.2 as I don't have a machine
>> with enough memory for current hypervisors. But it seems the code is nearly
>> the same.
> While I still didn't see a formal report of this against SLE11 yet,
> attached a draft patch against the SP3 code base adding manual
> preemption to the hypercall path of privcmd. This is only lightly
> tested, and therefore has a little bit of debugging code still left in
> there. Mind giving this an try (perhaps together with the patch
> David had sent for the other issue - there may still be a need for
> further preemption points in the IOCTL_PRIVCMD_MMAP*
> handling, but without knowing for sure whether that matters to
> you I didn't want to add this right away)?
>
> Jan
>
With my Xen 4.4-rc2 testing, these soft lockups are becoming more of a
problem, especially with construction/migration of 128GB guests.
I have been looking at doing a similar patch against mainline.
Having talked it through with David, it seems more sensible to have a
second hypercall page, at which point in_hypercall() becomes
in_preemptable_hypercall().
Any task (even a kernel task) could use the preemptable
page, rather than the main hypercall page, and the asm code wouldn't need
to care whether the task was in privcmd.
This would avoid having to maintain extra state to identify whether the
hypercall was preemptable, and would avoid modification to
evtchn_do_upcall().
I shall see about hacking up a patch to this effect.
~Andrew
* Re: POD: soft lockups in dom0 kernel
2014-01-20 14:39 ` Andrew Cooper
@ 2014-01-20 15:16 ` Jan Beulich
0 siblings, 0 replies; 12+ messages in thread
From: Jan Beulich @ 2014-01-20 15:16 UTC (permalink / raw)
To: Andrew Cooper; +Cc: xen-devel, David Vrabel, Dietmar Hahn
>>> On 20.01.14 at 15:39, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 16/01/14 11:10, Jan Beulich wrote:
>>>>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
>>> when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
>>> softlockups from time to time.
>>>
>>> kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
>>>
>>> I tracked this down to the call of xc_domain_set_pod_target() and further
>>> p2m_pod_set_mem_target().
>>>
>>> Unfortunately I can this check only with xen-4.2.2 as I don't have a machine
>>> with enough memory for current hypervisors. But it seems the code is nearly
>>> the same.
>> While I still didn't see a formal report of this against SLE11 yet,
>> attached a draft patch against the SP3 code base adding manual
>> preemption to the hypercall path of privcmd. This is only lightly
>> tested, and therefore has a little bit of debugging code still left in
>> there. Mind giving this an try (perhaps together with the patch
>> David had sent for the other issue - there may still be a need for
>> further preemption points in the IOCTL_PRIVCMD_MMAP*
>> handling, but without knowing for sure whether that matters to
>> you I didn't want to add this right away)?
>>
>> Jan
>>
>
> With my 4.4-rc2 testing, these softlockups are becoming more of a
> problem, especially with construction/migration of 128GB guests.
>
> I have been looking at doing a similar patch against mainline.
>
> Having talked it through with David, it seems more sensible to have a
> second hypercall page, at which point in_hypercall() becomes
> in_preemptable_hypercall().
>
> Any task (which could even be kernel tasks) could use the preemptable
> page, rather than the main hypercall page, and the asm code doesn't need
> to care whether the task was in privcmd.
Of course this can be generalized, but I don't think a second
hypercall page is the answer here: You'd then also need a
second set of hypercall wrappers (_hypercall0() etc as well as
HYPERVISOR_*()), and generic library routines would need to
have a way to know which one to call.
Therefore I think having a per-CPU state flag (which gets cleared/
restored during interrupt handling, or - like my patch does -
honored only when outside of atomic context) is still the more
reasonable approach.
> This would avoid having to maintain extra state to identify whether the
> hypercall was preemptable, and would avoid modification to
> evtchn_do_upcall().
I'd be curious to see how you avoid modifying evtchn_do_upcall()
(other than by adding what I added there at the assembly call
site) - I especially don't see where your in_preemptable_hypercall()
would get invoked.
Jan
* Re: POD: soft lockups in dom0 kernel
2014-01-16 11:10 ` Jan Beulich
2014-01-20 14:39 ` Andrew Cooper
@ 2014-01-29 14:12 ` Dietmar Hahn
1 sibling, 0 replies; 12+ messages in thread
From: Dietmar Hahn @ 2014-01-29 14:12 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel
Hi,
sorry for the delay.
Am Donnerstag 16 Januar 2014, 11:10:38 schrieb Jan Beulich:
> >>> On 05.12.13 at 14:55, Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> wrote:
> > when creating a bigger (> 50 GB) HVM guest with maxmem > memory we get
> > softlockups from time to time.
> >
> > kernel: [ 802.084335] BUG: soft lockup - CPU#1 stuck for 22s! [xend:31351]
> >
> > I tracked this down to the call of xc_domain_set_pod_target() and further
> > p2m_pod_set_mem_target().
> >
> > Unfortunately I can this check only with xen-4.2.2 as I don't have a machine
> > with enough memory for current hypervisors. But it seems the code is nearly
> > the same.
>
> While I still didn't see a formal report of this against SLE11 yet,
> attached a draft patch against the SP3 code base adding manual
> preemption to the hypercall path of privcmd. This is only lightly
> tested, and therefore has a little bit of debugging code still left in
> there. Mind giving this an try (perhaps together with the patch
> David had sent for the other issue - there may still be a need for
> further preemption points in the IOCTL_PRIVCMD_MMAP*
> handling, but without knowing for sure whether that matters to
> you I didn't want to add this right away)?
>
> Jan
Today I did some tests with the patch. As the debug part didn't compile, I
changed the per-CPU variables to local variables.
OK, it works! I tried several times to start a domU with
memory=100GB and maxmem=230GB and never got a soft lockup.
The following messages appeared in /var/log/messages on the first start:
Jan 29 14:14:45 gut1 kernel: [ 178.976373] psi[03] 00000000:1 #2
Jan 29 14:14:46 gut1 kernel: [ 179.008774] psi[03] 00000000:1 #4
Jan 29 14:14:46 gut1 kernel: [ 179.073048] psi[03] 00000000:1 #8
Jan 29 14:14:46 gut1 kernel: [ 179.219272] psi[03] 00000000:1 #10
Jan 29 14:14:47 gut1 kernel: [ 180.220803] psi[03] 00000000:1 #20
Jan 29 14:14:48 gut1 kernel: [ 181.844153] psi[03] 00000000:1 #40
Jan 29 14:14:51 gut1 kernel: [ 184.769331] psi[03] 00000000:1 #80
Jan 29 14:14:56 gut1 kernel: [ 189.169159] psi[03] 00000000:1 #100
Jan 29 14:14:57 gut1 kernel: [ 190.178545] psi[03] 00000000:1 #200
Jan 29 14:15:03 gut1 kernel: [ 196.256353] psi[00] 00000000:1 #1
Jan 29 14:15:03 gut1 kernel: [ 196.260928] psi[00] 00000000:1 #2
Jan 29 14:15:03 gut1 kernel: [ 196.497156] psi[00] 00000000:1 #4
Jan 29 14:15:03 gut1 kernel: [ 196.552303] psi[00] 00000000:1 #8
Jan 29 14:15:04 gut1 kernel: [ 197.035527] psi[00] 00000000:1 #10
Jan 29 14:15:04 gut1 kernel: [ 197.060626] psi[01] 00000000:1 #1
Jan 29 14:15:04 gut1 kernel: [ 197.064101] psi[01] 00000000:1 #2
Jan 29 14:15:04 gut1 kernel: [ 197.096719] psi[01] 00000000:1 #4
Jan 29 14:15:04 gut1 kernel: [ 197.148756] psi[01] 00000000:1 #8
Jan 29 14:15:04 gut1 kernel: [ 197.517184] psi[01] 00000000:1 #10
Jan 29 14:15:05 gut1 kernel: [ 198.153211] psi[01] 00000000:1 #20
Jan 29 14:15:06 gut1 kernel: [ 199.162541] psi[02] 00000000:1 #1
Jan 29 14:15:06 gut1 kernel: [ 199.164895] psi[02] 00000000:1 #2
Jan 29 14:15:06 gut1 kernel: [ 199.169576] psi[02] 00000000:1 #4
Jan 29 14:15:06 gut1 kernel: [ 199.178073] psi[02] 00000000:1 #8
Jan 29 14:15:06 gut1 kernel: [ 199.195693] psi[02] 00000000:1 #10
Jan 29 14:15:06 gut1 kernel: [ 199.335857] psi[02] 00000000:1 #20
Jan 29 14:15:06 gut1 kernel: [ 199.805027] psi[02] 00000000:1 #40
Jan 29 14:15:07 gut1 kernel: [ 200.753118] psi[00] 00000000:1 #20
Jan 29 14:15:08 gut1 kernel: [ 201.524368] psi[01] 00000000:1 #40
Jan 29 14:15:09 gut1 kernel: [ 202.692159] psi[01] 00000000:1 #80
Jan 29 14:15:11 gut1 kernel: [ 204.968433] psi[01] 00000000:1 #100
Jan 29 14:15:16 gut1 kernel: [ 209.712892] psi[01] 00000000:1 #200
Jan 29 14:15:32 gut1 kernel: [ 225.940798] psi[01] 00000000:1 #400
Jan 29 14:15:38 gut1 kernel: [ 231.360556] psi[00] 00000000:1 #40
And on the second start:
Jan 29 14:49:19 gut1 kernel: [ 2250.788926] psi[02] 00000000:1 #80
Jan 29 14:49:26 gut1 kernel: [ 2257.360767] psi[02] 00000000:1 #100
Jan 29 14:49:37 gut1 kernel: [ 2268.912916] psi[02] 00000000:1 #200
Jan 29 14:50:09 gut1 kernel: [ 2300.804211] psi[01] 00000000:1 #800
Thanks.
Dietmar.
--
Company details: http://ts.fujitsu.com/imprint.html