[Xenomai-core] x86: Endless minor faults

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Xenomai-core] x86: Endless minor faults
@ 2009-06-29 17:35 Jan Kiszka
  2009-06-29 18:09 ` Philippe Gerum
                   ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-06-29 17:35 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix; +Cc: xenomai-core

Hi all,

seen such loops before? This particular trace is from a 2.6.29.3 kernel
with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
2.6.29.5/2.3-03:

:|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
:|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
:|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
:    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
:|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
:|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
:    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
:    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
:    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
:    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
:    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
:    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
:|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
:|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
:    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
:    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
:    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
:    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
:    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
:    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
:    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
:    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
:    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
:    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
:    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
:    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
:    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
:|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
:|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
:|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)

and again and again...

We are looping over a minor fault here (according to /proc/PID/stat),
the context is a Xenomai task in secondary mode. As the task no longer
processes signals in this state, the whole system is more or less
broken. Tomorrow I will try to find out the faulting address with an
instrumented kernel, but maybe you already have some ideas.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka
@ 2009-06-29 18:09 ` Philippe Gerum
  2009-06-29 18:20   ` Jan Kiszka
  2009-06-30  8:32 ` Jan Kiszka
  2009-07-02 17:14 ` Jan Kiszka
  2 siblings, 1 reply; 45+ messages in thread
From: Philippe Gerum @ 2009-06-29 18:09 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

On Mon, 2009-06-29 at 19:35 +0200, Jan Kiszka wrote:
> Hi all,
> 
> seen such loops before? This particular trace is from a 2.6.29.3 kernel
> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
> 2.6.29.5/2.3-03:

Do you mean 2.6.29.5/2.4-03?

-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-29 18:09 ` Philippe Gerum
@ 2009-06-29 18:20   ` Jan Kiszka
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-06-29 18:20 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

Philippe Gerum wrote:
> On Mon, 2009-06-29 at 19:35 +0200, Jan Kiszka wrote:
>> Hi all,
>>
>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>> 2.6.29.5/2.3-03:
> 
> Do you mean 2.6.29.5/2.4-03?

No, in fact I meant 2.4-01 (the one of rc2).

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka
  2009-06-29 18:09 ` Philippe Gerum
@ 2009-06-30  8:32 ` Jan Kiszka
  2009-06-30  8:36   ` Gilles Chanteperdrix
  2009-06-30  8:42   ` Gilles Chanteperdrix
  2009-07-02 17:14 ` Jan Kiszka
  2 siblings, 2 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-06-30  8:32 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Hi all,
> 
> seen such loops before? This particular trace is from a 2.6.29.3 kernel
> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
> 2.6.29.5/2.3-03:
> 
> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> 
> and again and again...
> 
> We are looping over a minor fault here (according to /proc/PID/stat),
> the context is a Xenomai task in secondary mode. As the task no longer
> processes signals in this state, the whole system is more or less
> broken. Tomorrow I will try to find out the faulting address with an
> instrumented kernel, but maybe you already have some ideas.

The fault is apparently triggered by __xn_put_user(XNRELAX,
thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
invalid region ATM. The questions are now: Who corrupted this, user
space on init (not that likely) or kernel space later on (unpleasant
thought)? Moreover: Why can't we recover from a fault on u_mode?

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  8:32 ` Jan Kiszka
@ 2009-06-30  8:36   ` Gilles Chanteperdrix
  2009-06-30  8:44     ` Jan Kiszka
  2009-06-30  8:42   ` Gilles Chanteperdrix
  1 sibling, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-06-30  8:36 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Hi all,
>>
>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>> 2.6.29.5/2.3-03:
>>
>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>
>> and again and again...
>>
>> We are looping over a minor fault here (according to /proc/PID/stat),
>> the context is a Xenomai task in secondary mode. As the task no longer
>> processes signals in this state, the whole system is more or less
>> broken. Tomorrow I will try to find out the faulting address with an
>> instrumented kernel, but maybe you already have some ideas.
> 
> The fault is apparently triggered by __xn_put_user(XNRELAX,
> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
> invalid region ATM. The questions are now: Who corrupted this, user
> space on init (not that likely) or kernel space later on (unpleasant
> thought)? Moreover: Why can't we recover from a fault on u_mode?

Can not this be a cleanup issue? The dying thread would have freed its
u_mode before the final relax which really kills it.

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  8:32 ` Jan Kiszka
  2009-06-30  8:36   ` Gilles Chanteperdrix
@ 2009-06-30  8:42   ` Gilles Chanteperdrix
  2009-06-30  8:56     ` Jan Kiszka
  2009-06-30  9:20     ` Philippe Gerum
  1 sibling, 2 replies; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-06-30  8:42 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Hi all,
>>
>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>> 2.6.29.5/2.3-03:
>>
>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>
>> and again and again...
>>
>> We are looping over a minor fault here (according to /proc/PID/stat),
>> the context is a Xenomai task in secondary mode. As the task no longer
>> processes signals in this state, the whole system is more or less
>> broken. Tomorrow I will try to find out the faulting address with an
>> instrumented kernel, but maybe you already have some ideas.
> 
> The fault is apparently triggered by __xn_put_user(XNRELAX,
> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
> invalid region ATM. The questions are now: Who corrupted this, user
> space on init (not that likely) or kernel space later on (unpleasant
> thought)? Moreover: Why can't we recover from a fault on u_mode?

I already investigated such an issue, and my conclusion was that there
are some places in the code where we can not cope with a fault.
xnshadow_relax being such a place, because, if relax faults, then what
will the fault handler do? Call relax again. Fortunately, mlockall and
the nocow stuff fixes this.

Another way to implement the u_mode thing would be to use the shared
heaps we use for fast mutexes.

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  8:36   ` Gilles Chanteperdrix
@ 2009-06-30  8:44     ` Jan Kiszka
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-06-30  8:44 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Hi all,
>>>
>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>>> 2.6.29.5/2.3-03:
>>>
>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>
>>> and again and again...
>>>
>>> We are looping over a minor fault here (according to /proc/PID/stat),
>>> the context is a Xenomai task in secondary mode. As the task no longer
>>> processes signals in this state, the whole system is more or less
>>> broken. Tomorrow I will try to find out the faulting address with an
>>> instrumented kernel, but maybe you already have some ideas.
>> The fault is apparently triggered by __xn_put_user(XNRELAX,
>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
>> invalid region ATM. The questions are now: Who corrupted this, user
>> space on init (not that likely) or kernel space later on (unpleasant
>> thought)? Moreover: Why can't we recover from a fault on u_mode?
> 
> Can not this be a cleanup issue? The dying thread would have freed its
> u_mode before the final relax which really kills it.
> 

u_mode is 0x018fbf60, and that's typically an unused range for any
process on x86_64. It is definitely unused for the process in question
according to /proc/PID/maps.

That said, you scenario may be problematic too, but I don't think we see
it here.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  8:42   ` Gilles Chanteperdrix
@ 2009-06-30  8:56     ` Jan Kiszka
  2009-06-30  9:20     ` Philippe Gerum
  1 sibling, 0 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-06-30  8:56 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Hi all,
>>>
>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>>> 2.6.29.5/2.3-03:
>>>
>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>
>>> and again and again...
>>>
>>> We are looping over a minor fault here (according to /proc/PID/stat),
>>> the context is a Xenomai task in secondary mode. As the task no longer
>>> processes signals in this state, the whole system is more or less
>>> broken. Tomorrow I will try to find out the faulting address with an
>>> instrumented kernel, but maybe you already have some ideas.
>> The fault is apparently triggered by __xn_put_user(XNRELAX,
>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
>> invalid region ATM. The questions are now: Who corrupted this, user
>> space on init (not that likely) or kernel space later on (unpleasant
>> thought)? Moreover: Why can't we recover from a fault on u_mode?
> 
> I already investigated such an issue, and my conclusion was that there
> are some places in the code where we can not cope with a fault.
> xnshadow_relax being such a place, because, if relax faults, then what
> will the fault handler do? Call relax again. Fortunately, mlockall and
> the nocow stuff fixes this.

That was the assumption I built u_mode setting on. Granted, I neglected
the case where we fail due to corrupt addresses.

> 
> Another way to implement the u_mode thing would be to use the shared
> heaps we use for fast mutexes.

This is probably the way to go for a robust u_mode interface. A must-fix
for 2.5-final. Nevertheless, I will do another run to find out where
this address comes from. As often, we may see a chain of bugs...

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  8:42   ` Gilles Chanteperdrix
  2009-06-30  8:56     ` Jan Kiszka
@ 2009-06-30  9:20     ` Philippe Gerum
  2009-06-30  9:21       ` Gilles Chanteperdrix
  1 sibling, 1 reply; 45+ messages in thread
From: Philippe Gerum @ 2009-06-30  9:20 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
> > Jan Kiszka wrote:
> >> Hi all,
> >>
> >> seen such loops before? This particular trace is from a 2.6.29.3 kernel
> >> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
> >> 2.6.29.5/2.3-03:
> >>
> >> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> >> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
> >> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
> >> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
> >> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
> >> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
> >> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
> >> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
> >> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> >> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
> >> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> >> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> >> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> >> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> >> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
> >> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> >> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
> >> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
> >> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
> >> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
> >> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> >> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
> >> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
> >> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> >> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
> >> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> >> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> >> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> >> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> >> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> >>
> >> and again and again...
> >>
> >> We are looping over a minor fault here (according to /proc/PID/stat),
> >> the context is a Xenomai task in secondary mode. As the task no longer
> >> processes signals in this state, the whole system is more or less
> >> broken. Tomorrow I will try to find out the faulting address with an
> >> instrumented kernel, but maybe you already have some ideas.
> > 
> > The fault is apparently triggered by __xn_put_user(XNRELAX,
> > thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
> > invalid region ATM. The questions are now: Who corrupted this, user
> > space on init (not that likely) or kernel space later on (unpleasant
> > thought)? Moreover: Why can't we recover from a fault on u_mode?
> 
> I already investigated such an issue, and my conclusion was that there
> are some places in the code where we can not cope with a fault.
> xnshadow_relax being such a place, because, if relax faults, then what
> will the fault handler do? Call relax again. Fortunately, mlockall and
> the nocow stuff fixes this.


xnshadow_relax() faulting before the current thread bears the XNRELAX
bit would mean that a creepy issue involving ondemand PTEs in _kernel_
space must have caused this. Having the init_mm mappings known from all
processes seems more relevant to this issue than anything nocow and/or
mlockall could ever do to fix it.

> 
> Another way to implement the u_mode thing would be to use the shared
> heaps we use for fast mutexes.
> 
-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  9:20     ` Philippe Gerum
@ 2009-06-30  9:21       ` Gilles Chanteperdrix
  2009-06-30  9:25         ` Philippe Gerum
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-06-30  9:21 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Hi all,
>>>>
>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>>>> 2.6.29.5/2.3-03:
>>>>
>>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>
>>>> and again and again...
>>>>
>>>> We are looping over a minor fault here (according to /proc/PID/stat),
>>>> the context is a Xenomai task in secondary mode. As the task no longer
>>>> processes signals in this state, the whole system is more or less
>>>> broken. Tomorrow I will try to find out the faulting address with an
>>>> instrumented kernel, but maybe you already have some ideas.
>>> The fault is apparently triggered by __xn_put_user(XNRELAX,
>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
>>> invalid region ATM. The questions are now: Who corrupted this, user
>>> space on init (not that likely) or kernel space later on (unpleasant
>>> thought)? Moreover: Why can't we recover from a fault on u_mode?
>> I already investigated such an issue, and my conclusion was that there
>> are some places in the code where we can not cope with a fault.
>> xnshadow_relax being such a place, because, if relax faults, then what
>> will the fault handler do? Call relax again. Fortunately, mlockall and
>> the nocow stuff fixes this.
> 
> 
> xnshadow_relax() faulting before the current thread bears the XNRELAX
> bit would mean that a creepy issue involving ondemand PTEs in _kernel_
> space must have caused this. Having the init_mm mappings known from all
> processes seems more relevant to this issue than anything nocow and/or
> mlockall could ever do to fix it.

u_mode is a user-space address.

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  9:21       ` Gilles Chanteperdrix
@ 2009-06-30  9:25         ` Philippe Gerum
  2009-06-30  9:26           ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Philippe Gerum @ 2009-06-30  9:25 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
> > On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
> >> Jan Kiszka wrote:
> >>> Jan Kiszka wrote:
> >>>> Hi all,
> >>>>
> >>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
> >>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
> >>>> 2.6.29.5/2.3-03:
> >>>>
> >>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> >>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
> >>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
> >>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
> >>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
> >>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
> >>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
> >>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
> >>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> >>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
> >>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> >>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> >>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> >>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> >>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
> >>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> >>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
> >>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
> >>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
> >>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
> >>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> >>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
> >>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
> >>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> >>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
> >>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> >>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> >>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> >>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> >>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> >>>>
> >>>> and again and again...
> >>>>
> >>>> We are looping over a minor fault here (according to /proc/PID/stat),
> >>>> the context is a Xenomai task in secondary mode. As the task no longer
> >>>> processes signals in this state, the whole system is more or less
> >>>> broken. Tomorrow I will try to find out the faulting address with an
> >>>> instrumented kernel, but maybe you already have some ideas.
> >>> The fault is apparently triggered by __xn_put_user(XNRELAX,
> >>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
> >>> invalid region ATM. The questions are now: Who corrupted this, user
> >>> space on init (not that likely) or kernel space later on (unpleasant
> >>> thought)? Moreover: Why can't we recover from a fault on u_mode?
> >> I already investigated such an issue, and my conclusion was that there
> >> are some places in the code where we can not cope with a fault.
> >> xnshadow_relax being such a place, because, if relax faults, then what
> >> will the fault handler do? Call relax again. Fortunately, mlockall and
> >> the nocow stuff fixes this.
> > 
> > 
> > xnshadow_relax() faulting before the current thread bears the XNRELAX
> > bit would mean that a creepy issue involving ondemand PTEs in _kernel_
> > space must have caused this. Having the init_mm mappings known from all
> > processes seems more relevant to this issue than anything nocow and/or
> > mlockall could ever do to fix it.
> 
> u_mode is a user-space address.
> 

Why do you think xnshadow_relax() would be called for an already relaxed
thread?

-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  9:25         ` Philippe Gerum
@ 2009-06-30  9:26           ` Gilles Chanteperdrix
  2009-06-30  9:27             ` Philippe Gerum
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-06-30  9:26 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>>>>>> 2.6.29.5/2.3-03:
>>>>>>
>>>>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>>>>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>>>>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>>>>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>>>>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>>>>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>>>>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>>>>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>>>>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>>>>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>>>>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>>>>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>>>>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>>>>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>>>>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>>>>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>>>>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>>>
>>>>>> and again and again...
>>>>>>
>>>>>> We are looping over a minor fault here (according to /proc/PID/stat),
>>>>>> the context is a Xenomai task in secondary mode. As the task no longer
>>>>>> processes signals in this state, the whole system is more or less
>>>>>> broken. Tomorrow I will try to find out the faulting address with an
>>>>>> instrumented kernel, but maybe you already have some ideas.
>>>>> The fault is apparently triggered by __xn_put_user(XNRELAX,
>>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
>>>>> invalid region ATM. The questions are now: Who corrupted this, user
>>>>> space on init (not that likely) or kernel space later on (unpleasant
>>>>> thought)? Moreover: Why can't we recover from a fault on u_mode?
>>>> I already investigated such an issue, and my conclusion was that there
>>>> are some places in the code where we can not cope with a fault.
>>>> xnshadow_relax being such a place, because, if relax faults, then what
>>>> will the fault handler do? Call relax again. Fortunately, mlockall and
>>>> the nocow stuff fixes this.
>>>
>>> xnshadow_relax() faulting before the current thread bears the XNRELAX
>>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_
>>> space must have caused this. Having the init_mm mappings known from all
>>> processes seems more relevant to this issue than anything nocow and/or
>>> mlockall could ever do to fix it.
>> u_mode is a user-space address.
>>
> 
> Why do you think xnshadow_relax() would be called for an already relaxed
> thread?

Because the fault happens before it has finished relaxing ?

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  9:26           ` Gilles Chanteperdrix
@ 2009-06-30  9:27             ` Philippe Gerum
  2009-06-30  9:34               ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Philippe Gerum @ 2009-06-30  9:27 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core

On Tue, 2009-06-30 at 11:26 +0200, Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
> > On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote:
> >> Philippe Gerum wrote:
> >>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
> >>>> Jan Kiszka wrote:
> >>>>> Jan Kiszka wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
> >>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
> >>>>>> 2.6.29.5/2.3-03:
> >>>>>>
> >>>>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> >>>>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
> >>>>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
> >>>>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
> >>>>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
> >>>>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
> >>>>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
> >>>>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
> >>>>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> >>>>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
> >>>>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> >>>>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> >>>>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> >>>>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> >>>>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
> >>>>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> >>>>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
> >>>>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
> >>>>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
> >>>>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
> >>>>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
> >>>>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
> >>>>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
> >>>>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
> >>>>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
> >>>>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
> >>>>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
> >>>>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
> >>>>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
> >>>>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
> >>>>>>
> >>>>>> and again and again...
> >>>>>>
> >>>>>> We are looping over a minor fault here (according to /proc/PID/stat),
> >>>>>> the context is a Xenomai task in secondary mode. As the task no longer
> >>>>>> processes signals in this state, the whole system is more or less
> >>>>>> broken. Tomorrow I will try to find out the faulting address with an
> >>>>>> instrumented kernel, but maybe you already have some ideas.
> >>>>> The fault is apparently triggered by __xn_put_user(XNRELAX,
> >>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
> >>>>> invalid region ATM. The questions are now: Who corrupted this, user
> >>>>> space on init (not that likely) or kernel space later on (unpleasant
> >>>>> thought)? Moreover: Why can't we recover from a fault on u_mode?
> >>>> I already investigated such an issue, and my conclusion was that there
> >>>> are some places in the code where we can not cope with a fault.
> >>>> xnshadow_relax being such a place, because, if relax faults, then what
> >>>> will the fault handler do? Call relax again. Fortunately, mlockall and
> >>>> the nocow stuff fixes this.
> >>>
> >>> xnshadow_relax() faulting before the current thread bears the XNRELAX
> >>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_
> >>> space must have caused this. Having the init_mm mappings known from all
> >>> processes seems more relevant to this issue than anything nocow and/or
> >>> mlockall could ever do to fix it.
> >> u_mode is a user-space address.
> >>
> > 
> > Why do you think xnshadow_relax() would be called for an already relaxed
> > thread?
> 
> Because the fault happens before it has finished relaxing ?
> 

Well, no. Have a second look at the code.

-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  9:27             ` Philippe Gerum
@ 2009-06-30  9:34               ` Gilles Chanteperdrix
  2009-06-30 16:11                 ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-06-30  9:34 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> On Tue, 2009-06-30 at 11:26 +0200, Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>> On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote:
>>>> Philippe Gerum wrote:
>>>>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>>>>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>>>>>>>> 2.6.29.5/2.3-03:
>>>>>>>>
>>>>>>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>>>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>>>>>>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>>>>>>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>>>>>>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>>>>>>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>>>>>>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>>>>>>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>>>>>>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>>>>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>>>>>>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>>>>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>>>>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>>>>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>>>>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>>>>>>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>>>>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>>>>>>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>>>>>>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>>>>>>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>>>>>>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>>>>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>>>>>>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>>>>>>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>>>>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>>>>>>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>>>>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>>>>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>>>>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>>>>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>>>>>
>>>>>>>> and again and again...
>>>>>>>>
>>>>>>>> We are looping over a minor fault here (according to /proc/PID/stat),
>>>>>>>> the context is a Xenomai task in secondary mode. As the task no longer
>>>>>>>> processes signals in this state, the whole system is more or less
>>>>>>>> broken. Tomorrow I will try to find out the faulting address with an
>>>>>>>> instrumented kernel, but maybe you already have some ideas.
>>>>>>> The fault is apparently triggered by __xn_put_user(XNRELAX,
>>>>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
>>>>>>> invalid region ATM. The questions are now: Who corrupted this, user
>>>>>>> space on init (not that likely) or kernel space later on (unpleasant
>>>>>>> thought)? Moreover: Why can't we recover from a fault on u_mode?
>>>>>> I already investigated such an issue, and my conclusion was that there
>>>>>> are some places in the code where we can not cope with a fault.
>>>>>> xnshadow_relax being such a place, because, if relax faults, then what
>>>>>> will the fault handler do? Call relax again. Fortunately, mlockall and
>>>>>> the nocow stuff fixes this.
>>>>> xnshadow_relax() faulting before the current thread bears the XNRELAX
>>>>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_
>>>>> space must have caused this. Having the init_mm mappings known from all
>>>>> processes seems more relevant to this issue than anything nocow and/or
>>>>> mlockall could ever do to fix it.
>>>> u_mode is a user-space address.
>>>>
>>> Why do you think xnshadow_relax() would be called for an already relaxed
>>> thread?
>> Because the fault happens before it has finished relaxing ?
>>
> 
> Well, no. Have a second look at the code.

Ok, you are right then, in my case the faults were probably due to vmalloc.

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30  9:34               ` Gilles Chanteperdrix
@ 2009-06-30 16:11                 ` Jan Kiszka
  2009-07-01 11:56                   ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-06-30 16:11 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>> On Tue, 2009-06-30 at 11:26 +0200, Gilles Chanteperdrix wrote:
>>> Philippe Gerum wrote:
>>>> On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote:
>>>>> Philippe Gerum wrote:
>>>>>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel
>>>>>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
>>>>>>>>> 2.6.29.5/2.3-03:
>>>>>>>>>
>>>>>>>>> :|   +func                -653    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>>>>>> :|   +func                -653    0.096  ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
>>>>>>>>> :|   #end     0x80000000  -653    0.069  do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
>>>>>>>>> :    #func                -653    0.078  __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
>>>>>>>>> :|   #begin   0x80000000  -653    0.068  __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
>>>>>>>>> :|   +end     0x80000000  -653    0.069  __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
>>>>>>>>> :    +func                -653    0.060  down_read_trylock+0x4 (do_page_fault+0x424)
>>>>>>>>> :    +func                -653    0.068  _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
>>>>>>>>> :    +func                -653    0.108  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>>>>>>> :    #func                -652    0.066  _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
>>>>>>>>> :    #func                -652    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>>>>>>> :    #func                -652    0.074  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>>>>>>> :|   #begin   0x80000000  -652    0.066  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>>>>>>> :|   +end     0x80000000  -652    0.069  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>>>>>>> :    +func                -652    0.096  find_vma+0x4 (do_page_fault+0x465)
>>>>>>>>> :    +func                -652    0.150  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>>>>>>> :    +func                -652    0.098  handle_mm_fault+0x11 (do_page_fault+0x537)
>>>>>>>>> :    +func                -652    0.090  _spin_lock+0x4 (handle_mm_fault+0x680)
>>>>>>>>> :    +func                -652    0.063  ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
>>>>>>>>> :    +func                -652    0.282  flush_tlb_page+0xd (handle_mm_fault+0x6e7)
>>>>>>>>> :    +func                -651    0.162  ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
>>>>>>>>> :    +func                -651    0.062  up_read+0x4 (do_page_fault+0x5a9)
>>>>>>>>> :    +func                -651    0.072  _spin_lock_irqsave+0x9 (__up_read+0x1c)
>>>>>>>>> :    +func                -651    0.117  ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
>>>>>>>>> :    #func                -651    0.074  _spin_unlock_irqrestore+0x4 (__up_read+0x92)
>>>>>>>>> :    #func                -651    0.069  __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
>>>>>>>>> :    #func                -651    0.060  __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
>>>>>>>>> :|   #begin   0x80000000  -651    0.056  __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
>>>>>>>>> :|   +end     0x80000000  -651    0.420  __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
>>>>>>>>> :|   +func                -650    0.084  __ipipe_handle_exception+0x11 (page_fault+0x26)
>>>>>>>>>
>>>>>>>>> and again and again...
>>>>>>>>>
>>>>>>>>> We are looping over a minor fault here (according to /proc/PID/stat),
>>>>>>>>> the context is a Xenomai task in secondary mode. As the task no longer
>>>>>>>>> processes signals in this state, the whole system is more or less
>>>>>>>>> broken. Tomorrow I will try to find out the faulting address with an
>>>>>>>>> instrumented kernel, but maybe you already have some ideas.
>>>>>>>> The fault is apparently triggered by __xn_put_user(XNRELAX,
>>>>>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an
>>>>>>>> invalid region ATM. The questions are now: Who corrupted this, user
>>>>>>>> space on init (not that likely) or kernel space later on (unpleasant
>>>>>>>> thought)? Moreover: Why can't we recover from a fault on u_mode?
>>>>>>> I already investigated such an issue, and my conclusion was that there
>>>>>>> are some places in the code where we can not cope with a fault.
>>>>>>> xnshadow_relax being such a place, because, if relax faults, then what
>>>>>>> will the fault handler do? Call relax again. Fortunately, mlockall and
>>>>>>> the nocow stuff fixes this.
>>>>>> xnshadow_relax() faulting before the current thread bears the XNRELAX
>>>>>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_
>>>>>> space must have caused this. Having the init_mm mappings known from all
>>>>>> processes seems more relevant to this issue than anything nocow and/or
>>>>>> mlockall could ever do to fix it.
>>>>> u_mode is a user-space address.
>>>>>
>>>> Why do you think xnshadow_relax() would be called for an already relaxed
>>>> thread?
>>> Because the fault happens before it has finished relaxing ?
>>>
>> Well, no. Have a second look at the code.
> 
> Ok, you are right then, in my case the faults were probably due to vmalloc.

In our case, it must be some special fault path, too. I tried injecting
a non-NULL but invalid address intentionally. But the __xn_put_user just
swallowed it without complaints.

It's still unclear what goes on precisely, we are still digging, but the
test system that can produce this is highly contended.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-30 16:11                 ` Jan Kiszka
@ 2009-07-01 11:56                   ` Jan Kiszka
  2009-07-01 12:05                     ` Jan Kiszka
  2009-07-01 12:24                     ` Gilles Chanteperdrix
  0 siblings, 2 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 11:56 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> It's still unclear what goes on precisely, we are still digging, but the
> test system that can produce this is highly contended.

Short update: Further instrumentation revealed that cr3 differs from
active_mm->pgd while we are looping over that fault, ie. the kernel
tries to fixup the wrong mm. And that means we have some open race
window between updating cr3 and active_mm somewhere (isn't switch_mm run
in a preemptible manner now?).

As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
checking if it makes a difference. Digging deeper into the code in the
meanwhile...

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 11:56                   ` Jan Kiszka
@ 2009-07-01 12:05                     ` Jan Kiszka
  2009-07-01 12:24                     ` Gilles Chanteperdrix
  1 sibling, 0 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 12:05 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> It's still unclear what goes on precisely, we are still digging, but the
>> test system that can produce this is highly contended.
> 
> Short update: Further instrumentation revealed that cr3 differs from
> active_mm->pgd while we are looping over that fault, ie. the kernel
> tries to fixup the wrong mm. And that means we have some open race
> window between updating cr3 and active_mm somewhere (isn't switch_mm run
> in a preemptible manner now?).
> 
> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
> checking if it makes a difference. Digging deeper into the code in the
> meanwhile...

CONFIG_IPIPE_DELAYED_ATOMICSW is nonsense. And we don't do switch_mm
without irq protection on x86, at least not from the nucleus, right?
Maybe a race due some Linux user running it unprotectedly?

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 11:56                   ` Jan Kiszka
  2009-07-01 12:05                     ` Jan Kiszka
@ 2009-07-01 12:24                     ` Gilles Chanteperdrix
  2009-07-01 12:39                       ` Jan Kiszka
  1 sibling, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-01 12:24 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> It's still unclear what goes on precisely, we are still digging, but the
>> test system that can produce this is highly contended.
> 
> Short update: Further instrumentation revealed that cr3 differs from
> active_mm->pgd while we are looping over that fault, ie. the kernel
> tries to fixup the wrong mm. And that means we have some open race
> window between updating cr3 and active_mm somewhere (isn't switch_mm run
> in a preemptible manner now?).

Maybe the rsp is wrong and leads you to the wrong active_mm ?

> 
> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
> checking if it makes a difference. Digging deeper into the code in the
> meanwhile...

As you have found out in the mean time, we do not use unlocked context
switches on x86.

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 12:24                     ` Gilles Chanteperdrix
@ 2009-07-01 12:39                       ` Jan Kiszka
  2009-07-01 12:41                         ` Gilles Chanteperdrix
  2009-07-01 12:41                         ` Jan Kiszka
  0 siblings, 2 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 12:39 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> It's still unclear what goes on precisely, we are still digging, but the
>>> test system that can produce this is highly contended.
>> Short update: Further instrumentation revealed that cr3 differs from
>> active_mm->pgd while we are looping over that fault, ie. the kernel
>> tries to fixup the wrong mm. And that means we have some open race
>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>> in a preemptible manner now?).
> 
> Maybe the rsp is wrong and leads you to the wrong active_mm ?
> 
>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>> checking if it makes a difference. Digging deeper into the code in the
>> meanwhile...
> 
> As you have found out in the mean time, we do not use unlocked context
> switches on x86.
> 

Yes.

The last question I asked myself (but couldn't answer yet due to other
activity) was: Where are the local_irq_disable/enable_hw around
switch_mm for its Linux callers?

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 12:39                       ` Jan Kiszka
@ 2009-07-01 12:41                         ` Gilles Chanteperdrix
  2009-07-01 12:41                         ` Jan Kiszka
  1 sibling, 0 replies; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-01 12:41 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>> test system that can produce this is highly contended.
>>> Short update: Further instrumentation revealed that cr3 differs from
>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>> tries to fixup the wrong mm. And that means we have some open race
>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>> in a preemptible manner now?).
>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>
>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>> checking if it makes a difference. Digging deeper into the code in the
>>> meanwhile...
>> As you have found out in the mean time, we do not use unlocked context
>> switches on x86.
>>
> 
> Yes.
> 
> The last question I asked myself (but couldn't answer yet due to other
> activity) was: Where are the local_irq_disable/enable_hw around
> switch_mm for its Linux callers?

prepare_arch_switch and task_hijacked

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 12:39                       ` Jan Kiszka
  2009-07-01 12:41                         ` Gilles Chanteperdrix
@ 2009-07-01 12:41                         ` Jan Kiszka
  2009-07-01 15:51                           ` Gilles Chanteperdrix
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 12:41 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>> test system that can produce this is highly contended.
>>> Short update: Further instrumentation revealed that cr3 differs from
>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>> tries to fixup the wrong mm. And that means we have some open race
>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>> in a preemptible manner now?).
>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>
>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>> checking if it makes a difference. Digging deeper into the code in the
>>> meanwhile...
>> As you have found out in the mean time, we do not use unlocked context
>> switches on x86.
>>
> 
> Yes.
> 
> The last question I asked myself (but couldn't answer yet due to other
> activity) was: Where are the local_irq_disable/enable_hw around
> switch_mm for its Linux callers?

Ha, that's the point: only activate_mm is protected, but we have more
spots in 2.6.29 and maybe other kernels, too!

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 12:41                         ` Jan Kiszka
@ 2009-07-01 15:51                           ` Gilles Chanteperdrix
  2009-07-01 16:01                             ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-01 15:51 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>> test system that can produce this is highly contended.
>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>> tries to fixup the wrong mm. And that means we have some open race
>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>> in a preemptible manner now?).
>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>
>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>> checking if it makes a difference. Digging deeper into the code in the
>>>> meanwhile...
>>> As you have found out in the mean time, we do not use unlocked context
>>> switches on x86.
>>>
>> Yes.
>>
>> The last question I asked myself (but couldn't answer yet due to other
>> activity) was: Where are the local_irq_disable/enable_hw around
>> switch_mm for its Linux callers?
> 
> Ha, that's the point: only activate_mm is protected, but we have more
> spots in 2.6.29 and maybe other kernels, too!

Ok, I do not see where switch_mm is called with IRQs off. What I found,
however, is that leave_mm sets the cr3 and just clears
active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
cr3 and active_mm. I do not know what could happen if Xenomai could
interrupt leave_mm between the cpu_clear and the write_cr3. From what I
understand, switch_mm called by Xenomai upon return to root would re-set
the bit, and re-set cr3, which would be set to the kernel cr3 right
after that, but this would result in the active_mm.cpu_vm_mask bit being
set instead of cleared as expected. So, maybe an irqs off section is
missing in leave_mm.

-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 15:51                           ` Gilles Chanteperdrix
@ 2009-07-01 16:01                             ` Jan Kiszka
  2009-07-01 16:04                               ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 16:01 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>> test system that can produce this is highly contended.
>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>> in a preemptible manner now?).
>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>
>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>> meanwhile...
>>>> As you have found out in the mean time, we do not use unlocked context
>>>> switches on x86.
>>>>
>>> Yes.
>>>
>>> The last question I asked myself (but couldn't answer yet due to other
>>> activity) was: Where are the local_irq_disable/enable_hw around
>>> switch_mm for its Linux callers?
>> Ha, that's the point: only activate_mm is protected, but we have more
>> spots in 2.6.29 and maybe other kernels, too!
> 
> Ok, I do not see where switch_mm is called with IRQs off. What I found,

We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
Both need protection (I pushed IRQ disabling into switch_mm), but that
is not enough according to current tests. It seems to reduce to
probability of corruption, though.

> however, is that leave_mm sets the cr3 and just clears
> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
> cr3 and active_mm. I do not know what could happen if Xenomai could
> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
> understand, switch_mm called by Xenomai upon return to root would re-set
> the bit, and re-set cr3, which would be set to the kernel cr3 right
> after that, but this would result in the active_mm.cpu_vm_mask bit being
> set instead of cleared as expected. So, maybe an irqs off section is
> missing in leave_mm.

leave_mm is already protected by its caller smp_invalidate_interrupt -
but now I'm parsing context_switch /wrt to lazy tlb.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 16:01                             ` Jan Kiszka
@ 2009-07-01 16:04                               ` Jan Kiszka
  2009-07-01 17:56                                 ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 16:04 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>> test system that can produce this is highly contended.
>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>> in a preemptible manner now?).
>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>
>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>> meanwhile...
>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>> switches on x86.
>>>>>
>>>> Yes.
>>>>
>>>> The last question I asked myself (but couldn't answer yet due to other
>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>> switch_mm for its Linux callers?
>>> Ha, that's the point: only activate_mm is protected, but we have more
>>> spots in 2.6.29 and maybe other kernels, too!
>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
> 
> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
> Both need protection (I pushed IRQ disabling into switch_mm), but that
> is not enough according to current tests. It seems to reduce to
> probability of corruption, though.
> 
>> however, is that leave_mm sets the cr3 and just clears
>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>> cr3 and active_mm. I do not know what could happen if Xenomai could
>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>> understand, switch_mm called by Xenomai upon return to root would re-set
>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>> set instead of cleared as expected. So, maybe an irqs off section is
>> missing in leave_mm.
> 
> leave_mm is already protected by its caller smp_invalidate_interrupt -
> but now I'm parsing context_switch /wrt to lazy tlb.
> 

Hmm... lazy tlb: This means a new task is switched in and has active_mm
!= mm. But do_page_fault reads task->mm... Just thoughts, no clear
picture yet.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 16:04                               ` Jan Kiszka
@ 2009-07-01 17:56                                 ` Jan Kiszka
  2009-07-01 18:15                                   ` Philippe Gerum
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 17:56 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Jan Kiszka wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>> test system that can produce this is highly contended.
>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>> in a preemptible manner now?).
>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>
>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>> meanwhile...
>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>> switches on x86.
>>>>>>
>>>>> Yes.
>>>>>
>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>> switch_mm for its Linux callers?
>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>> spots in 2.6.29 and maybe other kernels, too!
>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>> is not enough according to current tests. It seems to reduce to
>> probability of corruption, though.
>>
>>> however, is that leave_mm sets the cr3 and just clears
>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>> set instead of cleared as expected. So, maybe an irqs off section is
>>> missing in leave_mm.
>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>> but now I'm parsing context_switch /wrt to lazy tlb.
>>
> 
> Hmm... lazy tlb: This means a new task is switched in and has active_mm
> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
> picture yet.
> 

Looking closer at the call sites of switch_mm, I think our the problem
is mostly related to use_mm from fs/aio.c (customer is using aio
heavily). But other callers need protection, too. We are going to test
this patch tomorrow:

diff --git a/fs/aio.c b/fs/aio.c
index 76da125..d90fca3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
 {
 	struct mm_struct *active_mm;
 	struct task_struct *tsk = current;
+	unsigned long flags;
 
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	atomic_inc(&mm->mm_count);
+	local_irq_save_hw_cond(flags);
 	tsk->mm = mm;
 	tsk->active_mm = mm;
 	switch_mm(active_mm, mm, tsk);
+	local_irq_restore_hw_cond(flags);
 	task_unlock(tsk);
 
 	mmdrop(active_mm);
diff --git a/kernel/sched.c b/kernel/sched.c
index aa8f86c..8c545a4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
+		unsigned long flags;
+		local_irq_save_hw_cond(flags);
 		switch_mm(oldmm, mm, next);
+		local_irq_restore_hw_cond(flags);
+	}
 
 	if (unlikely(!prev->mm)) {
 		prev->active_mm = NULL;
@@ -6406,8 +6410,12 @@ void idle_task_exit(void)
 
 	BUG_ON(cpu_online(smp_processor_id()));
 
-	if (mm != &init_mm)
+	if (mm != &init_mm) {
+		unsigned long flags;
+		local_irq_save_hw_cond(flags);
 		switch_mm(mm, &init_mm, current);
+		local_irq_restore_hw_cond(flags);
+	}
 	mmdrop(mm);
 }
 

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 17:56                                 ` Jan Kiszka
@ 2009-07-01 18:15                                   ` Philippe Gerum
  2009-07-01 18:27                                     ` Philippe Gerum
  2009-07-01 18:56                                     ` Jan Kiszka
  0 siblings, 2 replies; 45+ messages in thread
From: Philippe Gerum @ 2009-07-01 18:15 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
> Jan Kiszka wrote:
> > Jan Kiszka wrote:
> >> Gilles Chanteperdrix wrote:
> >>> Jan Kiszka wrote:
> >>>> Jan Kiszka wrote:
> >>>>> Gilles Chanteperdrix wrote:
> >>>>>> Jan Kiszka wrote:
> >>>>>>> Jan Kiszka wrote:
> >>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
> >>>>>>>> test system that can produce this is highly contended.
> >>>>>>> Short update: Further instrumentation revealed that cr3 differs from
> >>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
> >>>>>>> tries to fixup the wrong mm. And that means we have some open race
> >>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
> >>>>>>> in a preemptible manner now?).
> >>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
> >>>>>>
> >>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
> >>>>>>> checking if it makes a difference. Digging deeper into the code in the
> >>>>>>> meanwhile...
> >>>>>> As you have found out in the mean time, we do not use unlocked context
> >>>>>> switches on x86.
> >>>>>>
> >>>>> Yes.
> >>>>>
> >>>>> The last question I asked myself (but couldn't answer yet due to other
> >>>>> activity) was: Where are the local_irq_disable/enable_hw around
> >>>>> switch_mm for its Linux callers?
> >>>> Ha, that's the point: only activate_mm is protected, but we have more
> >>>> spots in 2.6.29 and maybe other kernels, too!
> >>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
> >> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
> >> Both need protection (I pushed IRQ disabling into switch_mm), but that
> >> is not enough according to current tests. It seems to reduce to
> >> probability of corruption, though.
> >>
> >>> however, is that leave_mm sets the cr3 and just clears
> >>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
> >>> cr3 and active_mm. I do not know what could happen if Xenomai could
> >>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
> >>> understand, switch_mm called by Xenomai upon return to root would re-set
> >>> the bit, and re-set cr3, which would be set to the kernel cr3 right
> >>> after that, but this would result in the active_mm.cpu_vm_mask bit being
> >>> set instead of cleared as expected. So, maybe an irqs off section is
> >>> missing in leave_mm.
> >> leave_mm is already protected by its caller smp_invalidate_interrupt -
> >> but now I'm parsing context_switch /wrt to lazy tlb.
> >>
> > 
> > Hmm... lazy tlb: This means a new task is switched in and has active_mm
> > != mm. But do_page_fault reads task->mm... Just thoughts, no clear
> > picture yet.
> > 
> 
> Looking closer at the call sites of switch_mm, I think our the problem
> is mostly related to use_mm from fs/aio.c (customer is using aio
> heavily). But other callers need protection, too. We are going to test
> this patch tomorrow:
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 76da125..d90fca3 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>  {
>  	struct mm_struct *active_mm;
>  	struct task_struct *tsk = current;
> +	unsigned long flags;
>  
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;
>  	atomic_inc(&mm->mm_count);
> +	local_irq_save_hw_cond(flags);
>  	tsk->mm = mm;
>  	tsk->active_mm = mm;
>  	switch_mm(active_mm, mm, tsk);
> +	local_irq_restore_hw_cond(flags);
>  	task_unlock(tsk);
>  
>  	mmdrop(active_mm);
> diff --git a/kernel/sched.c b/kernel/sched.c
> index aa8f86c..8c545a4 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		next->active_mm = oldmm;
>  		atomic_inc(&oldmm->mm_count);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
> +		unsigned long flags;
> +		local_irq_save_hw_cond(flags);
>  		switch_mm(oldmm, mm, next);
> +		local_irq_restore_hw_cond(flags);
> +	}
>  
>  	if (unlikely(!prev->mm)) {
>  		prev->active_mm = NULL;
> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>  
>  	BUG_ON(cpu_online(smp_processor_id()));
>  
> -	if (mm != &init_mm)
> +	if (mm != &init_mm) {
> +		unsigned long flags;
> +		local_irq_save_hw_cond(flags);
>  		switch_mm(mm, &init_mm, current);
> +		local_irq_restore_hw_cond(flags);
> +	}
>  	mmdrop(mm);
>  }

Please fix the callee instead of ironing the call sites. This would
avoid further issues as upstream emits additional switch_mm calls over
time, and make ironing activate_mm useless.

>  
> 
> Jan
> 
-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 18:15                                   ` Philippe Gerum
@ 2009-07-01 18:27                                     ` Philippe Gerum
  2009-07-01 18:58                                       ` Jan Kiszka
  2009-07-02  2:05                                       ` Gilles Chanteperdrix
  2009-07-01 18:56                                     ` Jan Kiszka
  1 sibling, 2 replies; 45+ messages in thread
From: Philippe Gerum @ 2009-07-01 18:27 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
> > Jan Kiszka wrote:
> > > Jan Kiszka wrote:
> > >> Gilles Chanteperdrix wrote:
> > >>> Jan Kiszka wrote:
> > >>>> Jan Kiszka wrote:
> > >>>>> Gilles Chanteperdrix wrote:
> > >>>>>> Jan Kiszka wrote:
> > >>>>>>> Jan Kiszka wrote:
> > >>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
> > >>>>>>>> test system that can produce this is highly contended.
> > >>>>>>> Short update: Further instrumentation revealed that cr3 differs from
> > >>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
> > >>>>>>> tries to fixup the wrong mm. And that means we have some open race
> > >>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
> > >>>>>>> in a preemptible manner now?).
> > >>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
> > >>>>>>
> > >>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
> > >>>>>>> checking if it makes a difference. Digging deeper into the code in the
> > >>>>>>> meanwhile...
> > >>>>>> As you have found out in the mean time, we do not use unlocked context
> > >>>>>> switches on x86.
> > >>>>>>
> > >>>>> Yes.
> > >>>>>
> > >>>>> The last question I asked myself (but couldn't answer yet due to other
> > >>>>> activity) was: Where are the local_irq_disable/enable_hw around
> > >>>>> switch_mm for its Linux callers?
> > >>>> Ha, that's the point: only activate_mm is protected, but we have more
> > >>>> spots in 2.6.29 and maybe other kernels, too!
> > >>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
> > >> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
> > >> Both need protection (I pushed IRQ disabling into switch_mm), but that
> > >> is not enough according to current tests. It seems to reduce to
> > >> probability of corruption, though.
> > >>
> > >>> however, is that leave_mm sets the cr3 and just clears
> > >>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
> > >>> cr3 and active_mm. I do not know what could happen if Xenomai could
> > >>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
> > >>> understand, switch_mm called by Xenomai upon return to root would re-set
> > >>> the bit, and re-set cr3, which would be set to the kernel cr3 right
> > >>> after that, but this would result in the active_mm.cpu_vm_mask bit being
> > >>> set instead of cleared as expected. So, maybe an irqs off section is
> > >>> missing in leave_mm.
> > >> leave_mm is already protected by its caller smp_invalidate_interrupt -
> > >> but now I'm parsing context_switch /wrt to lazy tlb.
> > >>
> > > 
> > > Hmm... lazy tlb: This means a new task is switched in and has active_mm
> > > != mm. But do_page_fault reads task->mm... Just thoughts, no clear
> > > picture yet.
> > > 
> > 
> > Looking closer at the call sites of switch_mm, I think our the problem
> > is mostly related to use_mm from fs/aio.c (customer is using aio
> > heavily). But other callers need protection, too. We are going to test
> > this patch tomorrow:
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 76da125..d90fca3 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
> >  {
> >  	struct mm_struct *active_mm;
> >  	struct task_struct *tsk = current;
> > +	unsigned long flags;
> >  
> >  	task_lock(tsk);
> >  	active_mm = tsk->active_mm;
> >  	atomic_inc(&mm->mm_count);
> > +	local_irq_save_hw_cond(flags);
> >  	tsk->mm = mm;
> >  	tsk->active_mm = mm;
> >  	switch_mm(active_mm, mm, tsk);
> > +	local_irq_restore_hw_cond(flags);
> >  	task_unlock(tsk);
> >  
> >  	mmdrop(active_mm);
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index aa8f86c..8c545a4 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> > +		unsigned long flags;
> > +		local_irq_save_hw_cond(flags);
> >  		switch_mm(oldmm, mm, next);
> > +		local_irq_restore_hw_cond(flags);
> > +	}
> >  
> >  	if (unlikely(!prev->mm)) {
> >  		prev->active_mm = NULL;
> > @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
> >  
> >  	BUG_ON(cpu_online(smp_processor_id()));
> >  
> > -	if (mm != &init_mm)
> > +	if (mm != &init_mm) {
> > +		unsigned long flags;
> > +		local_irq_save_hw_cond(flags);
> >  		switch_mm(mm, &init_mm, current);
> > +		local_irq_restore_hw_cond(flags);
> > +	}
> >  	mmdrop(mm);
> >  }
> 
> Please fix the callee instead of ironing the call sites. This would
> avoid further issues as upstream emits additional switch_mm calls over
> time, and make ironing activate_mm useless.
> 

Btw, this how the powerpc port works in locked switch mode, and this
particular bug does not apply in unlocked switch mode anyway; this is
why we don't have that problem on this arch. The ARM port always works
in unlocked switch mode IIRC for latency reasons, so this should be a
non-issue here as well. Gilles, would you confirm this?

> >  
> > 
> > Jan
> > 
-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 18:15                                   ` Philippe Gerum
  2009-07-01 18:27                                     ` Philippe Gerum
@ 2009-07-01 18:56                                     ` Jan Kiszka
  2009-07-02  7:11                                       ` Philippe Gerum
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 18:56 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 4968 bytes --]

Philippe Gerum wrote:
> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>>>> in a preemptible manner now?).
>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>
>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>>>> meanwhile...
>>>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>>>> switches on x86.
>>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>> switch_mm for its Linux callers?
>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>> is not enough according to current tests. It seems to reduce to
>>>> probability of corruption, though.
>>>>
>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>> missing in leave_mm.
>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>> but now I'm parsing context_switch /wrt to lazy tlb.
>>>>
>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>> picture yet.
>>>
>> Looking closer at the call sites of switch_mm, I think our the problem
>> is mostly related to use_mm from fs/aio.c (customer is using aio
>> heavily). But other callers need protection, too. We are going to test
>> this patch tomorrow:
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 76da125..d90fca3 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>  {
>>  	struct mm_struct *active_mm;
>>  	struct task_struct *tsk = current;
>> +	unsigned long flags;
>>  
>>  	task_lock(tsk);
>>  	active_mm = tsk->active_mm;
>>  	atomic_inc(&mm->mm_count);
>> +	local_irq_save_hw_cond(flags);
>>  	tsk->mm = mm;
>>  	tsk->active_mm = mm;
>>  	switch_mm(active_mm, mm, tsk);
>> +	local_irq_restore_hw_cond(flags);
>>  	task_unlock(tsk);
>>  
>>  	mmdrop(active_mm);
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index aa8f86c..8c545a4 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  		next->active_mm = oldmm;
>>  		atomic_inc(&oldmm->mm_count);
>>  		enter_lazy_tlb(oldmm, next);
>> -	} else
>> +	} else {
>> +		unsigned long flags;
>> +		local_irq_save_hw_cond(flags);
>>  		switch_mm(oldmm, mm, next);
>> +		local_irq_restore_hw_cond(flags);
>> +	}
>>  
>>  	if (unlikely(!prev->mm)) {
>>  		prev->active_mm = NULL;
>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>  
>>  	BUG_ON(cpu_online(smp_processor_id()));
>>  
>> -	if (mm != &init_mm)
>> +	if (mm != &init_mm) {
>> +		unsigned long flags;
>> +		local_irq_save_hw_cond(flags);
>>  		switch_mm(mm, &init_mm, current);
>> +		local_irq_restore_hw_cond(flags);
>> +	}
>>  	mmdrop(mm);
>>  }
> 
> Please fix the callee instead of ironing the call sites. This would
> avoid further issues as upstream emits additional switch_mm calls over
> time, and make ironing activate_mm useless.

This was my first idea, too, but it didn't work, see use_mm(). I don't
think we will get around reviewing switch_mm users on kernel updates.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 18:27                                     ` Philippe Gerum
@ 2009-07-01 18:58                                       ` Jan Kiszka
  2009-07-01 19:14                                         ` Jan Kiszka
  2009-07-02  2:05                                       ` Gilles Chanteperdrix
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 18:58 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 5541 bytes --]

Philippe Gerum wrote:
> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>>>>> in a preemptible manner now?).
>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>>
>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>>>>> meanwhile...
>>>>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>>>>> switches on x86.
>>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>> switch_mm for its Linux callers?
>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>>> is not enough according to current tests. It seems to reduce to
>>>>> probability of corruption, though.
>>>>>
>>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>>> missing in leave_mm.
>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>> but now I'm parsing context_switch /wrt to lazy tlb.
>>>>>
>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>> picture yet.
>>>>
>>> Looking closer at the call sites of switch_mm, I think our the problem
>>> is mostly related to use_mm from fs/aio.c (customer is using aio
>>> heavily). But other callers need protection, too. We are going to test
>>> this patch tomorrow:
>>>
>>> diff --git a/fs/aio.c b/fs/aio.c
>>> index 76da125..d90fca3 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>  {
>>>  	struct mm_struct *active_mm;
>>>  	struct task_struct *tsk = current;
>>> +	unsigned long flags;
>>>  
>>>  	task_lock(tsk);
>>>  	active_mm = tsk->active_mm;
>>>  	atomic_inc(&mm->mm_count);
>>> +	local_irq_save_hw_cond(flags);
>>>  	tsk->mm = mm;
>>>  	tsk->active_mm = mm;
>>>  	switch_mm(active_mm, mm, tsk);
>>> +	local_irq_restore_hw_cond(flags);
>>>  	task_unlock(tsk);
>>>  
>>>  	mmdrop(active_mm);
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index aa8f86c..8c545a4 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>  		next->active_mm = oldmm;
>>>  		atomic_inc(&oldmm->mm_count);
>>>  		enter_lazy_tlb(oldmm, next);
>>> -	} else
>>> +	} else {
>>> +		unsigned long flags;
>>> +		local_irq_save_hw_cond(flags);
>>>  		switch_mm(oldmm, mm, next);
>>> +		local_irq_restore_hw_cond(flags);
>>> +	}
>>>  
>>>  	if (unlikely(!prev->mm)) {
>>>  		prev->active_mm = NULL;
>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>  
>>>  	BUG_ON(cpu_online(smp_processor_id()));
>>>  
>>> -	if (mm != &init_mm)
>>> +	if (mm != &init_mm) {
>>> +		unsigned long flags;
>>> +		local_irq_save_hw_cond(flags);
>>>  		switch_mm(mm, &init_mm, current);
>>> +		local_irq_restore_hw_cond(flags);
>>> +	}
>>>  	mmdrop(mm);
>>>  }
>> Please fix the callee instead of ironing the call sites. This would
>> avoid further issues as upstream emits additional switch_mm calls over
>> time, and make ironing activate_mm useless.
>>
> 
> Btw, this how the powerpc port works in locked switch mode, and this
> particular bug does not apply in unlocked switch mode anyway; this is
> why we don't have that problem on this arch. The ARM port always works
> in unlocked switch mode IIRC for latency reasons, so this should be a
> non-issue here as well. Gilles, would you confirm this?

The aio issue (if it turns out to be one, /me still needs to understand
the concrete patch) should bite any arch, but we may solve it
differently where switch_mm is actually reentrant.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 18:58                                       ` Jan Kiszka
@ 2009-07-01 19:14                                         ` Jan Kiszka
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-07-01 19:14 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 6193 bytes --]

Jan Kiszka wrote:
> Philippe Gerum wrote:
>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>>>>>> in a preemptible manner now?).
>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>>>
>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>>>>>> meanwhile...
>>>>>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>>>>>> switches on x86.
>>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>>> switch_mm for its Linux callers?
>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>>>> is not enough according to current tests. It seems to reduce to
>>>>>> probability of corruption, though.
>>>>>>
>>>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>>>> missing in leave_mm.
>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>>> but now I'm parsing context_switch /wrt to lazy tlb.
>>>>>>
>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>>> picture yet.
>>>>>
>>>> Looking closer at the call sites of switch_mm, I think our the problem
>>>> is mostly related to use_mm from fs/aio.c (customer is using aio
>>>> heavily). But other callers need protection, too. We are going to test
>>>> this patch tomorrow:
>>>>
>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>> index 76da125..d90fca3 100644
>>>> --- a/fs/aio.c
>>>> +++ b/fs/aio.c
>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>  {
>>>>  	struct mm_struct *active_mm;
>>>>  	struct task_struct *tsk = current;
>>>> +	unsigned long flags;
>>>>  
>>>>  	task_lock(tsk);
>>>>  	active_mm = tsk->active_mm;
>>>>  	atomic_inc(&mm->mm_count);
>>>> +	local_irq_save_hw_cond(flags);
>>>>  	tsk->mm = mm;
>>>>  	tsk->active_mm = mm;
>>>>  	switch_mm(active_mm, mm, tsk);
>>>> +	local_irq_restore_hw_cond(flags);
>>>>  	task_unlock(tsk);
>>>>  
>>>>  	mmdrop(active_mm);
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index aa8f86c..8c545a4 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>>  		next->active_mm = oldmm;
>>>>  		atomic_inc(&oldmm->mm_count);
>>>>  		enter_lazy_tlb(oldmm, next);
>>>> -	} else
>>>> +	} else {
>>>> +		unsigned long flags;
>>>> +		local_irq_save_hw_cond(flags);
>>>>  		switch_mm(oldmm, mm, next);
>>>> +		local_irq_restore_hw_cond(flags);
>>>> +	}
>>>>  
>>>>  	if (unlikely(!prev->mm)) {
>>>>  		prev->active_mm = NULL;
>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>>  
>>>>  	BUG_ON(cpu_online(smp_processor_id()));
>>>>  
>>>> -	if (mm != &init_mm)
>>>> +	if (mm != &init_mm) {
>>>> +		unsigned long flags;
>>>> +		local_irq_save_hw_cond(flags);
>>>>  		switch_mm(mm, &init_mm, current);
>>>> +		local_irq_restore_hw_cond(flags);
>>>> +	}
>>>>  	mmdrop(mm);
>>>>  }
>>> Please fix the callee instead of ironing the call sites. This would
>>> avoid further issues as upstream emits additional switch_mm calls over
>>> time, and make ironing activate_mm useless.
>>>
>> Btw, this how the powerpc port works in locked switch mode, and this
>> particular bug does not apply in unlocked switch mode anyway; this is
>> why we don't have that problem on this arch. The ARM port always works
>> in unlocked switch mode IIRC for latency reasons, so this should be a
>> non-issue here as well. Gilles, would you confirm this?
> 
> The aio issue (if it turns out to be one, /me still needs to understand
> the concrete patch) should bite any arch, but we may solve it
> differently where switch_mm is actually reentrant.

Took a beer to get my mind free, here is the potential race:

-> aio thread runs with mm == init_mm
-> aio thread calls into use_mm
-> preemption by Xenomai right after tsk->active_mm = mm
-> Xenomai switches from the aio thread to an RT thread with identical
   mm (definitely possible in our scenario)
-> switch_mm will not update cr3 as prev == next
-> next fault of the RT thread (e.g. in __xn_put_user...) will attempt
   to fix the right mm, but the CPU will continue to fail over init_mm

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 18:27                                     ` Philippe Gerum
  2009-07-01 18:58                                       ` Jan Kiszka
@ 2009-07-02  2:05                                       ` Gilles Chanteperdrix
  2009-07-02  6:24                                         ` Jan Kiszka
  1 sibling, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-02  2:05 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core

Philippe Gerum wrote:
> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>> Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>>>>> in a preemptible manner now?).
>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>>
>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>>>>> meanwhile...
>>>>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>>>>> switches on x86.
>>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>> switch_mm for its Linux callers?
>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>>> is not enough according to current tests. It seems to reduce to
>>>>> probability of corruption, though.
>>>>>
>>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>>> missing in leave_mm.
>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>> but now I'm parsing context_switch /wrt to lazy tlb.
>>>>>
>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>> picture yet.
>>>>
>>> Looking closer at the call sites of switch_mm, I think our the problem
>>> is mostly related to use_mm from fs/aio.c (customer is using aio
>>> heavily). But other callers need protection, too. We are going to test
>>> this patch tomorrow:
>>>
>>> diff --git a/fs/aio.c b/fs/aio.c
>>> index 76da125..d90fca3 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>  {
>>>  	struct mm_struct *active_mm;
>>>  	struct task_struct *tsk = current;
>>> +	unsigned long flags;
>>>  
>>>  	task_lock(tsk);
>>>  	active_mm = tsk->active_mm;
>>>  	atomic_inc(&mm->mm_count);
>>> +	local_irq_save_hw_cond(flags);
>>>  	tsk->mm = mm;
>>>  	tsk->active_mm = mm;
>>>  	switch_mm(active_mm, mm, tsk);
>>> +	local_irq_restore_hw_cond(flags);
>>>  	task_unlock(tsk);
>>>  
>>>  	mmdrop(active_mm);
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index aa8f86c..8c545a4 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>  		next->active_mm = oldmm;
>>>  		atomic_inc(&oldmm->mm_count);
>>>  		enter_lazy_tlb(oldmm, next);
>>> -	} else
>>> +	} else {
>>> +		unsigned long flags;
>>> +		local_irq_save_hw_cond(flags);
>>>  		switch_mm(oldmm, mm, next);
>>> +		local_irq_restore_hw_cond(flags);
>>> +	}
>>>  
>>>  	if (unlikely(!prev->mm)) {
>>>  		prev->active_mm = NULL;
>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>  
>>>  	BUG_ON(cpu_online(smp_processor_id()));
>>>  
>>> -	if (mm != &init_mm)
>>> +	if (mm != &init_mm) {
>>> +		unsigned long flags;
>>> +		local_irq_save_hw_cond(flags);
>>>  		switch_mm(mm, &init_mm, current);
>>> +		local_irq_restore_hw_cond(flags);
>>> +	}
>>>  	mmdrop(mm);
>>>  }
>> Please fix the callee instead of ironing the call sites. This would
>> avoid further issues as upstream emits additional switch_mm calls over
>> time, and make ironing activate_mm useless.
>>
> 
> Btw, this how the powerpc port works in locked switch mode, and this
> particular bug does not apply in unlocked switch mode anyway; this is
> why we don't have that problem on this arch. The ARM port always works
> in unlocked switch mode IIRC for latency reasons, so this should be a
> non-issue here as well. Gilles, would you confirm this?

Yes, on arm switch_mm always has irqs on. We handle the case where
switch_mm was interrupted in the middle of its operations by restarting
the mm switch after xenomai switches back to Linux.

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-02  2:05                                       ` Gilles Chanteperdrix
@ 2009-07-02  6:24                                         ` Jan Kiszka
  2009-07-02  6:59                                           ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-02  6:24 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 6309 bytes --]

Gilles Chanteperdrix wrote:
> Philippe Gerum wrote:
>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>>> Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>>>>>> in a preemptible manner now?).
>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>>>
>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>>>>>> meanwhile...
>>>>>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>>>>>> switches on x86.
>>>>>>>>>>
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>>> switch_mm for its Linux callers?
>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>>>> is not enough according to current tests. It seems to reduce to
>>>>>> probability of corruption, though.
>>>>>>
>>>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>>>> missing in leave_mm.
>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>>> but now I'm parsing context_switch /wrt to lazy tlb.
>>>>>>
>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>>> picture yet.
>>>>>
>>>> Looking closer at the call sites of switch_mm, I think our the problem
>>>> is mostly related to use_mm from fs/aio.c (customer is using aio
>>>> heavily). But other callers need protection, too. We are going to test
>>>> this patch tomorrow:
>>>>
>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>> index 76da125..d90fca3 100644
>>>> --- a/fs/aio.c
>>>> +++ b/fs/aio.c
>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>  {
>>>>  	struct mm_struct *active_mm;
>>>>  	struct task_struct *tsk = current;
>>>> +	unsigned long flags;
>>>>  
>>>>  	task_lock(tsk);
>>>>  	active_mm = tsk->active_mm;
>>>>  	atomic_inc(&mm->mm_count);
>>>> +	local_irq_save_hw_cond(flags);
>>>>  	tsk->mm = mm;
>>>>  	tsk->active_mm = mm;
>>>>  	switch_mm(active_mm, mm, tsk);
>>>> +	local_irq_restore_hw_cond(flags);
>>>>  	task_unlock(tsk);
>>>>  
>>>>  	mmdrop(active_mm);
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index aa8f86c..8c545a4 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>>  		next->active_mm = oldmm;
>>>>  		atomic_inc(&oldmm->mm_count);
>>>>  		enter_lazy_tlb(oldmm, next);
>>>> -	} else
>>>> +	} else {
>>>> +		unsigned long flags;
>>>> +		local_irq_save_hw_cond(flags);
>>>>  		switch_mm(oldmm, mm, next);
>>>> +		local_irq_restore_hw_cond(flags);
>>>> +	}
>>>>  
>>>>  	if (unlikely(!prev->mm)) {
>>>>  		prev->active_mm = NULL;
>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>>  
>>>>  	BUG_ON(cpu_online(smp_processor_id()));
>>>>  
>>>> -	if (mm != &init_mm)
>>>> +	if (mm != &init_mm) {
>>>> +		unsigned long flags;
>>>> +		local_irq_save_hw_cond(flags);
>>>>  		switch_mm(mm, &init_mm, current);
>>>> +		local_irq_restore_hw_cond(flags);
>>>> +	}
>>>>  	mmdrop(mm);
>>>>  }
>>> Please fix the callee instead of ironing the call sites. This would
>>> avoid further issues as upstream emits additional switch_mm calls over
>>> time, and make ironing activate_mm useless.
>>>
>> Btw, this how the powerpc port works in locked switch mode, and this
>> particular bug does not apply in unlocked switch mode anyway; this is
>> why we don't have that problem on this arch. The ARM port always works
>> in unlocked switch mode IIRC for latency reasons, so this should be a
>> non-issue here as well. Gilles, would you confirm this?
> 
> Yes, on arm switch_mm always has irqs on. We handle the case where
> switch_mm was interrupted in the middle of its operations by restarting
> the mm switch after xenomai switches back to Linux.
> 

On archs with non-atomic switch_mm(), use_mm() will require a different
strategy. I'm thinking about something like

use_mm():
	set_some_flag();
	barrier();
	current->mm = new_mm;
	current->active_mm = new_mm;
	switch_mm(old_active_mm, new_mm, current);
	clear_some_flag();

and switch_mm():
	...
	if (likely(prev != next) || some_flag_set()) {
		clear_some_flag();
		...

ie. enforce mm switch if we may have interrupted use_mm at an unpleasant
time. I just don't know yet where to attach that some_flag to. Should be
the current task, but can we always access it from switch_mm?

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-02  6:24                                         ` Jan Kiszka
@ 2009-07-02  6:59                                           ` Gilles Chanteperdrix
  2009-07-02  7:16                                             ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-02  6:59 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Philippe Gerum wrote:
>>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote:
>>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>> Jan Kiszka wrote:
>>>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
>>>>>>>>>>>>> test system that can produce this is highly contended.
>>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
>>>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
>>>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
>>>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
>>>>>>>>>>>> in a preemptible manner now?).
>>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
>>>>>>>>>>>
>>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
>>>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
>>>>>>>>>>>> meanwhile...
>>>>>>>>>>> As you have found out in the mean time, we do not use unlocked context
>>>>>>>>>>> switches on x86.
>>>>>>>>>>>
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>> The last question I asked myself (but couldn't answer yet due to other
>>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
>>>>>>>>>> switch_mm for its Linux callers?
>>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more
>>>>>>>>> spots in 2.6.29 and maybe other kernels, too!
>>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
>>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
>>>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
>>>>>>> is not enough according to current tests. It seems to reduce to
>>>>>>> probability of corruption, though.
>>>>>>>
>>>>>>>> however, is that leave_mm sets the cr3 and just clears
>>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
>>>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
>>>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
>>>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set
>>>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
>>>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
>>>>>>>> set instead of cleared as expected. So, maybe an irqs off section is
>>>>>>>> missing in leave_mm.
>>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
>>>>>>> but now I'm parsing context_switch /wrt to lazy tlb.
>>>>>>>
>>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
>>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
>>>>>> picture yet.
>>>>>>
>>>>> Looking closer at the call sites of switch_mm, I think our the problem
>>>>> is mostly related to use_mm from fs/aio.c (customer is using aio
>>>>> heavily). But other callers need protection, too. We are going to test
>>>>> this patch tomorrow:
>>>>>
>>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>>> index 76da125..d90fca3 100644
>>>>> --- a/fs/aio.c
>>>>> +++ b/fs/aio.c
>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>>  {
>>>>>  	struct mm_struct *active_mm;
>>>>>  	struct task_struct *tsk = current;
>>>>> +	unsigned long flags;
>>>>>  
>>>>>  	task_lock(tsk);
>>>>>  	active_mm = tsk->active_mm;
>>>>>  	atomic_inc(&mm->mm_count);
>>>>> +	local_irq_save_hw_cond(flags);
>>>>>  	tsk->mm = mm;
>>>>>  	tsk->active_mm = mm;
>>>>>  	switch_mm(active_mm, mm, tsk);
>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>  	task_unlock(tsk);
>>>>>  
>>>>>  	mmdrop(active_mm);
>>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>>> index aa8f86c..8c545a4 100644
>>>>> --- a/kernel/sched.c
>>>>> +++ b/kernel/sched.c
>>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>>>>  		next->active_mm = oldmm;
>>>>>  		atomic_inc(&oldmm->mm_count);
>>>>>  		enter_lazy_tlb(oldmm, next);
>>>>> -	} else
>>>>> +	} else {
>>>>> +		unsigned long flags;
>>>>> +		local_irq_save_hw_cond(flags);
>>>>>  		switch_mm(oldmm, mm, next);
>>>>> +		local_irq_restore_hw_cond(flags);
>>>>> +	}
>>>>>  
>>>>>  	if (unlikely(!prev->mm)) {
>>>>>  		prev->active_mm = NULL;
>>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
>>>>>  
>>>>>  	BUG_ON(cpu_online(smp_processor_id()));
>>>>>  
>>>>> -	if (mm != &init_mm)
>>>>> +	if (mm != &init_mm) {
>>>>> +		unsigned long flags;
>>>>> +		local_irq_save_hw_cond(flags);
>>>>>  		switch_mm(mm, &init_mm, current);
>>>>> +		local_irq_restore_hw_cond(flags);
>>>>> +	}
>>>>>  	mmdrop(mm);
>>>>>  }
>>>> Please fix the callee instead of ironing the call sites. This would
>>>> avoid further issues as upstream emits additional switch_mm calls over
>>>> time, and make ironing activate_mm useless.
>>>>
>>> Btw, this how the powerpc port works in locked switch mode, and this
>>> particular bug does not apply in unlocked switch mode anyway; this is
>>> why we don't have that problem on this arch. The ARM port always works
>>> in unlocked switch mode IIRC for latency reasons, so this should be a
>>> non-issue here as well. Gilles, would you confirm this?
>> Yes, on arm switch_mm always has irqs on. We handle the case where
>> switch_mm was interrupted in the middle of its operations by restarting
>> the mm switch after xenomai switches back to Linux.
>>
> 
> On archs with non-atomic switch_mm(), use_mm() will require a different
> strategy. I'm thinking about something like
> 
> use_mm():
> 	set_some_flag();
> 	barrier();
> 	current->mm = new_mm;
> 	current->active_mm = new_mm;
> 	switch_mm(old_active_mm, new_mm, current);
> 	clear_some_flag();
> 
> and switch_mm():
> 	...
> 	if (likely(prev != next) || some_flag_set()) {
> 		clear_some_flag();
> 		...
> 
> ie. enforce mm switch if we may have interrupted use_mm at an unpleasant
> time. I just don't know yet where to attach that some_flag to. Should be
> the current task, but can we always access it from switch_mm?

These mechanisms are already in place. All you have to do is:

use_mm()
	ipipe_active_mm = NULL;
	barrier();
	current->mm = new_mm;
	current->active_mm = new_mm;
	switch_mm(old_active_mm, new_mm, current);

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-01 18:56                                     ` Jan Kiszka
@ 2009-07-02  7:11                                       ` Philippe Gerum
  0 siblings, 0 replies; 45+ messages in thread
From: Philippe Gerum @ 2009-07-02  7:11 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

On Wed, 2009-07-01 at 20:56 +0200, Jan Kiszka wrote:
> Philippe Gerum wrote:
> > On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote:
> >> Jan Kiszka wrote:
> >>> Jan Kiszka wrote:
> >>>> Gilles Chanteperdrix wrote:
> >>>>> Jan Kiszka wrote:
> >>>>>> Jan Kiszka wrote:
> >>>>>>> Gilles Chanteperdrix wrote:
> >>>>>>>> Jan Kiszka wrote:
> >>>>>>>>> Jan Kiszka wrote:
> >>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the
> >>>>>>>>>> test system that can produce this is highly contended.
> >>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from
> >>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel
> >>>>>>>>> tries to fixup the wrong mm. And that means we have some open race
> >>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run
> >>>>>>>>> in a preemptible manner now?).
> >>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ?
> >>>>>>>>
> >>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now
> >>>>>>>>> checking if it makes a difference. Digging deeper into the code in the
> >>>>>>>>> meanwhile...
> >>>>>>>> As you have found out in the mean time, we do not use unlocked context
> >>>>>>>> switches on x86.
> >>>>>>>>
> >>>>>>> Yes.
> >>>>>>>
> >>>>>>> The last question I asked myself (but couldn't answer yet due to other
> >>>>>>> activity) was: Where are the local_irq_disable/enable_hw around
> >>>>>>> switch_mm for its Linux callers?
> >>>>>> Ha, that's the point: only activate_mm is protected, but we have more
> >>>>>> spots in 2.6.29 and maybe other kernels, too!
> >>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found,
> >>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c.
> >>>> Both need protection (I pushed IRQ disabling into switch_mm), but that
> >>>> is not enough according to current tests. It seems to reduce to
> >>>> probability of corruption, though.
> >>>>
> >>>>> however, is that leave_mm sets the cr3 and just clears
> >>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between
> >>>>> cr3 and active_mm. I do not know what could happen if Xenomai could
> >>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I
> >>>>> understand, switch_mm called by Xenomai upon return to root would re-set
> >>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right
> >>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being
> >>>>> set instead of cleared as expected. So, maybe an irqs off section is
> >>>>> missing in leave_mm.
> >>>> leave_mm is already protected by its caller smp_invalidate_interrupt -
> >>>> but now I'm parsing context_switch /wrt to lazy tlb.
> >>>>
> >>> Hmm... lazy tlb: This means a new task is switched in and has active_mm
> >>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear
> >>> picture yet.
> >>>
> >> Looking closer at the call sites of switch_mm, I think our the problem
> >> is mostly related to use_mm from fs/aio.c (customer is using aio
> >> heavily). But other callers need protection, too. We are going to test
> >> this patch tomorrow:
> >>
> >> diff --git a/fs/aio.c b/fs/aio.c
> >> index 76da125..d90fca3 100644
> >> --- a/fs/aio.c
> >> +++ b/fs/aio.c
> >> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
> >>  {
> >>  	struct mm_struct *active_mm;
> >>  	struct task_struct *tsk = current;
> >> +	unsigned long flags;
> >>  
> >>  	task_lock(tsk);
> >>  	active_mm = tsk->active_mm;
> >>  	atomic_inc(&mm->mm_count);
> >> +	local_irq_save_hw_cond(flags);
> >>  	tsk->mm = mm;
> >>  	tsk->active_mm = mm;
> >>  	switch_mm(active_mm, mm, tsk);
> >> +	local_irq_restore_hw_cond(flags);
> >>  	task_unlock(tsk);
> >>  
> >>  	mmdrop(active_mm);
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index aa8f86c..8c545a4 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >>  		next->active_mm = oldmm;
> >>  		atomic_inc(&oldmm->mm_count);
> >>  		enter_lazy_tlb(oldmm, next);
> >> -	} else
> >> +	} else {
> >> +		unsigned long flags;
> >> +		local_irq_save_hw_cond(flags);
> >>  		switch_mm(oldmm, mm, next);
> >> +		local_irq_restore_hw_cond(flags);
> >> +	}
> >>  
> >>  	if (unlikely(!prev->mm)) {
> >>  		prev->active_mm = NULL;
> >> @@ -6406,8 +6410,12 @@ void idle_task_exit(void)
> >>  
> >>  	BUG_ON(cpu_online(smp_processor_id()));
> >>  
> >> -	if (mm != &init_mm)
> >> +	if (mm != &init_mm) {
> >> +		unsigned long flags;
> >> +		local_irq_save_hw_cond(flags);
> >>  		switch_mm(mm, &init_mm, current);
> >> +		local_irq_restore_hw_cond(flags);
> >> +	}
> >>  	mmdrop(mm);
> >>  }
> > 
> > Please fix the callee instead of ironing the call sites. This would
> > avoid further issues as upstream emits additional switch_mm calls over
> > time, and make ironing activate_mm useless.
> 
> This was my first idea, too, but it didn't work, see use_mm(). I don't
> think we will get around reviewing switch_mm users on kernel updates.
> 

At the very least, you won't have to patch separately the spots that may
work this way. Avoiding uselessly invasive changes is not a substitute
for proper reviewing in any case.

> Jan
> 
-- 
Philippe.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-02  6:59                                           ` Gilles Chanteperdrix
@ 2009-07-02  7:16                                             ` Jan Kiszka
  2009-07-02  7:44                                               ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-02  7:16 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 1149 bytes --]

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> On archs with non-atomic switch_mm(), use_mm() will require a different
>> strategy. I'm thinking about something like
>>
>> use_mm():
>> 	set_some_flag();
>> 	barrier();
>> 	current->mm = new_mm;
>> 	current->active_mm = new_mm;
>> 	switch_mm(old_active_mm, new_mm, current);
>> 	clear_some_flag();
>>
>> and switch_mm():
>> 	...
>> 	if (likely(prev != next) || some_flag_set()) {
>> 		clear_some_flag();
>> 		...
>>
>> ie. enforce mm switch if we may have interrupted use_mm at an unpleasant
>> time. I just don't know yet where to attach that some_flag to. Should be
>> the current task, but can we always access it from switch_mm?
> 
> These mechanisms are already in place. All you have to do is:
> 
> use_mm()
> 	ipipe_active_mm = NULL;
> 	barrier();
> 	current->mm = new_mm;
> 	current->active_mm = new_mm;
> 	switch_mm(old_active_mm, new_mm, current);
> 

As far as I understand, ipipe_active_mm has different semantics on ARM.
Specifically, it has no relation to the initial "next != prev" check.
The test we need is likely orthogonal to this.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-02  7:16                                             ` Jan Kiszka
@ 2009-07-02  7:44                                               ` Gilles Chanteperdrix
  0 siblings, 0 replies; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-02  7:44 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> On archs with non-atomic switch_mm(), use_mm() will require a different
>>> strategy. I'm thinking about something like
>>>
>>> use_mm():
>>> 	set_some_flag();
>>> 	barrier();
>>> 	current->mm = new_mm;
>>> 	current->active_mm = new_mm;
>>> 	switch_mm(old_active_mm, new_mm, current);
>>> 	clear_some_flag();
>>>
>>> and switch_mm():
>>> 	...
>>> 	if (likely(prev != next) || some_flag_set()) {
>>> 		clear_some_flag();
>>> 		...
>>>
>>> ie. enforce mm switch if we may have interrupted use_mm at an
>>> unpleasant
>>> time. I just don't know yet where to attach that some_flag to. Should
>>> be
>>> the current task, but can we always access it from switch_mm?
>>
>> These mechanisms are already in place. All you have to do is:
>>
>> use_mm()
>> 	ipipe_active_mm = NULL;
>> 	barrier();
>> 	current->mm = new_mm;
>> 	current->active_mm = new_mm;
>> 	switch_mm(old_active_mm, new_mm, current);
>>
>
> As far as I understand, ipipe_active_mm has different semantics on ARM.
> Specifically, it has no relation to the initial "next != prev" check.
> The test we need is likely orthogonal to this.

Well except that prev is ipipe_active_mm, so if NULL next != prev is
always true. ipipe_active_mm set to NULL means that the mm state is
currently unknown, what it triggers is that if Xenomai switches at this
moment, it will forcibly reload the mm state, and when switching back to
Linux it will not touch the mm state, but mark the MMSWITCHINT flag in the
current thread state, switch_mm will detect the flag and restart the mm
switch. This mechanism is well tested and we would have had problems
before if it was not working (the mm switch on ARM does a full cache flush
which stands for about 100us, so the chances to be preempted at this
moment are in fact pretty high).

So, I still think that setting ipipe_active_mm to NULL at the beginning of
use_mm is all what is needed, else I do not understand what you are
talking about.

-- 
                    Gilles.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka
  2009-06-29 18:09 ` Philippe Gerum
  2009-06-30  8:32 ` Jan Kiszka
@ 2009-07-02 17:14 ` Jan Kiszka
  2009-07-03 14:54   ` Gilles Chanteperdrix
  2 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-02 17:14 UTC (permalink / raw)
  To: Philippe Gerum, Gilles Chanteperdrix; +Cc: xenomai-core

Hi again,

this is now basically the patch which seems to stabilized x86 /wrt mmu
switches again:

There were 3 race windows between setting active_mm of the current task
and actually switching it (that's a noarch issue), there were several
calls into switch_mm without proper hard interrupt protection (on archs
that have no preemptible switch_mm, like x86) and there was a race in
x86's leave_mm (as Gilles already remarked earlier in this thread -
while I was looking at an old tree where smp_invalidate_interrupt took
care of this).

The patch is thought as a basis for further discussions about

 o how to solve all the issues for all archs technically (ideally
   without the need to patch noarch parts in an arch-specific way...)

 o if anyone thinks there could be more spots like these (I've checked
   the code only for x86 so far)

Jan

-------->

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index dddfb84..d261b77 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -29,12 +29,17 @@ void destroy_context(struct mm_struct *mm);
 
 #define activate_mm(prev, next)			\
 do {						\
- 	unsigned long flags;			\
 	paravirt_activate_mm((prev), (next));	\
- 	local_irq_save_hw_cond(flags);		\
-	switch_mm((prev), (next), NULL);	\
- 	local_irq_restore_hw_cond(flags);	\
+	__switch_mm((prev), (next), NULL);	\
 } while (0);
 
+static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+			     struct task_struct *tsk)
+{
+	unsigned long flags;
+	local_irq_save_hw_cond(flags);
+	__switch_mm(prev, next, tsk);
+	local_irq_restore_hw_cond(flags);
+}
 
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/mmu_context_32.h b/arch/x86/include/asm/mmu_context_32.h
index 7e98ce1..e51cd09 100644
--- a/arch/x86/include/asm/mmu_context_32.h
+++ b/arch/x86/include/asm/mmu_context_32.h
@@ -9,12 +9,13 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 #endif
 }
 
-static inline void switch_mm(struct mm_struct *prev,
-			     struct mm_struct *next,
-			     struct task_struct *tsk)
+static inline void __switch_mm(struct mm_struct *prev,
+			       struct mm_struct *next,
+			       struct task_struct *tsk)
 {
 	int cpu = smp_processor_id();
 
+	WARN_ON_ONCE(!irqs_disabled_hw());
 	if (likely(prev != next)) {
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
diff --git a/arch/x86/include/asm/mmu_context_64.h b/arch/x86/include/asm/mmu_context_64.h
index 677d36e..0118200 100644
--- a/arch/x86/include/asm/mmu_context_64.h
+++ b/arch/x86/include/asm/mmu_context_64.h
@@ -11,10 +11,12 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 #endif
 }
 
-static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
-			     struct task_struct *tsk)
+static inline void __switch_mm(struct mm_struct *prev, struct mm_struct *next,
+			       struct task_struct *tsk)
 {
 	unsigned cpu = smp_processor_id();
+
+	WARN_ON_ONCE(!irqs_disabled_hw());
 	if (likely(prev != next)) {
 		/* stop flush ipis for the previous mm */
 		cpu_clear(cpu, prev->cpu_vm_mask);
diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
index 08028a7..f68f2a6 100644
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -34,9 +34,13 @@ static DEFINE_SPINLOCK(tlbstate_lock);
  */
 void leave_mm(int cpu)
 {
+	unsigned long flags;
+
 	BUG_ON(x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK);
+	local_irq_save_hw_cond(flags);
 	cpu_clear(cpu, x86_read_percpu(cpu_tlbstate.active_mm)->cpu_vm_mask);
 	load_cr3(swapper_pg_dir);
+	local_irq_restore_hw_cond(flags);
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index ce54e12..9829990 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -62,10 +62,14 @@ static DEFINE_PER_CPU(union smp_flush_state, flush_state);
  */
 void leave_mm(int cpu)
 {
+	unsigned long flags;
+
 	if (read_pda(mmu_state) == TLBSTATE_OK)
 		BUG();
+	local_irq_save_hw_cond(flags);
 	cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
 	load_cr3(swapper_pg_dir);
+	local_irq_restore_hw_cond(flags);
 }
 EXPORT_SYMBOL_GPL(leave_mm);
 
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..0286f0f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
 {
 	struct mm_struct *active_mm;
 	struct task_struct *tsk = current;
+	unsigned long flags;
 
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	atomic_inc(&mm->mm_count);
 	tsk->mm = mm;
+	local_irq_save_hw_cond(flags);
 	tsk->active_mm = mm;
-	switch_mm(active_mm, mm, tsk);
+	__switch_mm(active_mm, mm, tsk);
+	local_irq_restore_hw_cond(flags);
 	task_unlock(tsk);
 
 	mmdrop(active_mm);
diff --git a/fs/exec.c b/fs/exec.c
index 3b36c69..06591ac 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct * old_mm, *active_mm;
+	unsigned long flags;
 
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
@@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	tsk->mm = mm;
+	local_irq_save_hw_cond(flags);
 	tsk->active_mm = mm;
 	activate_mm(active_mm, mm);
+	local_irq_restore_hw_cond(flags);
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
 	if (old_mm) {
diff --git a/kernel/fork.c b/kernel/fork.c
index 01a836b..cf3b68a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
 		}
 
 		if (new_mm) {
+			unsigned long flags;
 			mm = current->mm;
 			active_mm = current->active_mm;
 			current->mm = new_mm;
+			local_irq_save_hw_cond(flags);
 			current->active_mm = new_mm;
 			activate_mm(active_mm, new_mm);
+			local_irq_restore_hw_cond(flags);
 			new_mm = mm;
 		}
  

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-02 17:14 ` Jan Kiszka
@ 2009-07-03 14:54   ` Gilles Chanteperdrix
  2009-07-03 15:06     ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-03 14:54 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Hi again,
> 
> this is now basically the patch which seems to stabilized x86 /wrt mmu
> switches again:
> 
> There were 3 race windows between setting active_mm of the current task
> and actually switching it (that's a noarch issue), there were several
> calls into switch_mm without proper hard interrupt protection (on archs
> that have no preemptible switch_mm, like x86) and there was a race in
> x86's leave_mm (as Gilles already remarked earlier in this thread -
> while I was looking at an old tree where smp_invalidate_interrupt took
> care of this).
> 
> The patch is thought as a basis for further discussions about
> 
>  o how to solve all the issues for all archs technically (ideally
>    without the need to patch noarch parts in an arch-specific way...)
> 
>  o if anyone thinks there could be more spots like these (I've checked
>    the code only for x86 so far)
> diff --git a/fs/aio.c b/fs/aio.c
> index 76da125..0286f0f 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>  {
>  	struct mm_struct *active_mm;
>  	struct task_struct *tsk = current;
> +	unsigned long flags;
>  
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;
>  	atomic_inc(&mm->mm_count);
>  	tsk->mm = mm;
> +	local_irq_save_hw_cond(flags);
>  	tsk->active_mm = mm;
> -	switch_mm(active_mm, mm, tsk);
> +	__switch_mm(active_mm, mm, tsk);
> +	local_irq_restore_hw_cond(flags);
>  	task_unlock(tsk);
>  
>  	mmdrop(active_mm);
> diff --git a/fs/exec.c b/fs/exec.c
> index 3b36c69..06591ac 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>  {
>  	struct task_struct *tsk;
>  	struct mm_struct * old_mm, *active_mm;
> +	unsigned long flags;
>  
>  	/* Notify parent that we're no longer interested in the old VM */
>  	tsk = current;
> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>  	task_lock(tsk);
>  	active_mm = tsk->active_mm;
>  	tsk->mm = mm;
> +	local_irq_save_hw_cond(flags);
>  	tsk->active_mm = mm;
>  	activate_mm(active_mm, mm);
> +	local_irq_restore_hw_cond(flags);
>  	task_unlock(tsk);
>  	arch_pick_mmap_layout(mm);
>  	if (old_mm) {
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 01a836b..cf3b68a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>  		}
>  
>  		if (new_mm) {
> +			unsigned long flags;
>  			mm = current->mm;
>  			active_mm = current->active_mm;
>  			current->mm = new_mm;
> +			local_irq_save_hw_cond(flags);
>  			current->active_mm = new_mm;
>  			activate_mm(active_mm, new_mm);
> +			local_irq_restore_hw_cond(flags);
>  			new_mm = mm;
>  		}

Ok. These are patches of the noarch code which are dependent on x86
implementation.

I think we need something like

mm_change_enter(flags);
mm_change_leave(flags);

which would resolve to ipipe_active_mm = NULL on architectures with
unlocked context switches, and to local_irq_save_hw/local_irq_restore on
x86?

>   
> 


-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-03 14:54   ` Gilles Chanteperdrix
@ 2009-07-03 15:06     ` Jan Kiszka
  2009-07-04 16:39       ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-03 15:06 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Hi again,
>>
>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>> switches again:
>>
>> There were 3 race windows between setting active_mm of the current task
>> and actually switching it (that's a noarch issue), there were several
>> calls into switch_mm without proper hard interrupt protection (on archs
>> that have no preemptible switch_mm, like x86) and there was a race in
>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>> while I was looking at an old tree where smp_invalidate_interrupt took
>> care of this).
>>
>> The patch is thought as a basis for further discussions about
>>
>>  o how to solve all the issues for all archs technically (ideally
>>    without the need to patch noarch parts in an arch-specific way...)
>>
>>  o if anyone thinks there could be more spots like these (I've checked
>>    the code only for x86 so far)
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 76da125..0286f0f 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>  {
>>  	struct mm_struct *active_mm;
>>  	struct task_struct *tsk = current;
>> +	unsigned long flags;
>>  
>>  	task_lock(tsk);
>>  	active_mm = tsk->active_mm;
>>  	atomic_inc(&mm->mm_count);
>>  	tsk->mm = mm;
>> +	local_irq_save_hw_cond(flags);
>>  	tsk->active_mm = mm;
>> -	switch_mm(active_mm, mm, tsk);
>> +	__switch_mm(active_mm, mm, tsk);
>> +	local_irq_restore_hw_cond(flags);
>>  	task_unlock(tsk);
>>  
>>  	mmdrop(active_mm);
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 3b36c69..06591ac 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>  {
>>  	struct task_struct *tsk;
>>  	struct mm_struct * old_mm, *active_mm;
>> +	unsigned long flags;
>>  
>>  	/* Notify parent that we're no longer interested in the old VM */
>>  	tsk = current;
>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>  	task_lock(tsk);
>>  	active_mm = tsk->active_mm;
>>  	tsk->mm = mm;
>> +	local_irq_save_hw_cond(flags);
>>  	tsk->active_mm = mm;
>>  	activate_mm(active_mm, mm);
>> +	local_irq_restore_hw_cond(flags);
>>  	task_unlock(tsk);
>>  	arch_pick_mmap_layout(mm);
>>  	if (old_mm) {
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 01a836b..cf3b68a 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>  		}
>>  
>>  		if (new_mm) {
>> +			unsigned long flags;
>>  			mm = current->mm;
>>  			active_mm = current->active_mm;
>>  			current->mm = new_mm;
>> +			local_irq_save_hw_cond(flags);
>>  			current->active_mm = new_mm;
>>  			activate_mm(active_mm, new_mm);
>> +			local_irq_restore_hw_cond(flags);
>>  			new_mm = mm;
>>  		}
> 
> Ok. These are patches of the noarch code which are dependent on x86
> implementation.
> 
> I think we need something like
> 
> mm_change_enter(flags);
> mm_change_leave(flags);
> 
> which would resolve to ipipe_active_mm = NULL on architectures with
> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
> x86?

+ we have to change switch_mm in the lockless case, and maybe also its
caller (if I look at the arm code), to use ipipe_active_mm in order to
decide if to switch or not - see my explanation of the possible race in
aio.c.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-03 15:06     ` Jan Kiszka
@ 2009-07-04 16:39       ` Gilles Chanteperdrix
  2009-07-05 12:01         ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-04 16:39 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Hi again,
>>>
>>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>>> switches again:
>>>
>>> There were 3 race windows between setting active_mm of the current task
>>> and actually switching it (that's a noarch issue), there were several
>>> calls into switch_mm without proper hard interrupt protection (on archs
>>> that have no preemptible switch_mm, like x86) and there was a race in
>>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>>> while I was looking at an old tree where smp_invalidate_interrupt took
>>> care of this).
>>>
>>> The patch is thought as a basis for further discussions about
>>>
>>>  o how to solve all the issues for all archs technically (ideally
>>>    without the need to patch noarch parts in an arch-specific way...)
>>>
>>>  o if anyone thinks there could be more spots like these (I've checked
>>>    the code only for x86 so far)
>>> diff --git a/fs/aio.c b/fs/aio.c
>>> index 76da125..0286f0f 100644
>>> --- a/fs/aio.c
>>> +++ b/fs/aio.c
>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>  {
>>>  	struct mm_struct *active_mm;
>>>  	struct task_struct *tsk = current;
>>> +	unsigned long flags;
>>>  
>>>  	task_lock(tsk);
>>>  	active_mm = tsk->active_mm;
>>>  	atomic_inc(&mm->mm_count);
>>>  	tsk->mm = mm;
>>> +	local_irq_save_hw_cond(flags);
>>>  	tsk->active_mm = mm;
>>> -	switch_mm(active_mm, mm, tsk);
>>> +	__switch_mm(active_mm, mm, tsk);
>>> +	local_irq_restore_hw_cond(flags);
>>>  	task_unlock(tsk);
>>>  
>>>  	mmdrop(active_mm);
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index 3b36c69..06591ac 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>  {
>>>  	struct task_struct *tsk;
>>>  	struct mm_struct * old_mm, *active_mm;
>>> +	unsigned long flags;
>>>  
>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>  	tsk = current;
>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>  	task_lock(tsk);
>>>  	active_mm = tsk->active_mm;
>>>  	tsk->mm = mm;
>>> +	local_irq_save_hw_cond(flags);
>>>  	tsk->active_mm = mm;
>>>  	activate_mm(active_mm, mm);
>>> +	local_irq_restore_hw_cond(flags);
>>>  	task_unlock(tsk);
>>>  	arch_pick_mmap_layout(mm);
>>>  	if (old_mm) {
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 01a836b..cf3b68a 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>>  		}
>>>  
>>>  		if (new_mm) {
>>> +			unsigned long flags;
>>>  			mm = current->mm;
>>>  			active_mm = current->active_mm;
>>>  			current->mm = new_mm;
>>> +			local_irq_save_hw_cond(flags);
>>>  			current->active_mm = new_mm;
>>>  			activate_mm(active_mm, new_mm);
>>> +			local_irq_restore_hw_cond(flags);
>>>  			new_mm = mm;
>>>  		}
>> Ok. These are patches of the noarch code which are dependent on x86
>> implementation.
>>
>> I think we need something like
>>
>> mm_change_enter(flags);
>> mm_change_leave(flags);
>>
>> which would resolve to ipipe_active_mm = NULL on architectures with
>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
>> x86?
> 
> + we have to change switch_mm in the lockless case, and maybe also its
> caller (if I look at the arm code), to use ipipe_active_mm in order to
> decide if to switch or not - see my explanation of the possible race in
> aio.c.

This is already the way it is currently working. the "prev" passed to
switch_mm is always ipipe_active_mm.

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-04 16:39       ` Gilles Chanteperdrix
@ 2009-07-05 12:01         ` Jan Kiszka
  2009-07-05 14:56           ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-05 12:01 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 4249 bytes --]

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Hi again,
>>>>
>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>>>> switches again:
>>>>
>>>> There were 3 race windows between setting active_mm of the current task
>>>> and actually switching it (that's a noarch issue), there were several
>>>> calls into switch_mm without proper hard interrupt protection (on archs
>>>> that have no preemptible switch_mm, like x86) and there was a race in
>>>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>>>> while I was looking at an old tree where smp_invalidate_interrupt took
>>>> care of this).
>>>>
>>>> The patch is thought as a basis for further discussions about
>>>>
>>>>  o how to solve all the issues for all archs technically (ideally
>>>>    without the need to patch noarch parts in an arch-specific way...)
>>>>
>>>>  o if anyone thinks there could be more spots like these (I've checked
>>>>    the code only for x86 so far)
>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>> index 76da125..0286f0f 100644
>>>> --- a/fs/aio.c
>>>> +++ b/fs/aio.c
>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>  {
>>>>  	struct mm_struct *active_mm;
>>>>  	struct task_struct *tsk = current;
>>>> +	unsigned long flags;
>>>>  
>>>>  	task_lock(tsk);
>>>>  	active_mm = tsk->active_mm;
>>>>  	atomic_inc(&mm->mm_count);
>>>>  	tsk->mm = mm;
>>>> +	local_irq_save_hw_cond(flags);
>>>>  	tsk->active_mm = mm;
>>>> -	switch_mm(active_mm, mm, tsk);
>>>> +	__switch_mm(active_mm, mm, tsk);
>>>> +	local_irq_restore_hw_cond(flags);
>>>>  	task_unlock(tsk);
>>>>  
>>>>  	mmdrop(active_mm);
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index 3b36c69..06591ac 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>  {
>>>>  	struct task_struct *tsk;
>>>>  	struct mm_struct * old_mm, *active_mm;
>>>> +	unsigned long flags;
>>>>  
>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>  	tsk = current;
>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>  	task_lock(tsk);
>>>>  	active_mm = tsk->active_mm;
>>>>  	tsk->mm = mm;
>>>> +	local_irq_save_hw_cond(flags);
>>>>  	tsk->active_mm = mm;
>>>>  	activate_mm(active_mm, mm);
>>>> +	local_irq_restore_hw_cond(flags);
>>>>  	task_unlock(tsk);
>>>>  	arch_pick_mmap_layout(mm);
>>>>  	if (old_mm) {
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 01a836b..cf3b68a 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>>>  		}
>>>>  
>>>>  		if (new_mm) {
>>>> +			unsigned long flags;
>>>>  			mm = current->mm;
>>>>  			active_mm = current->active_mm;
>>>>  			current->mm = new_mm;
>>>> +			local_irq_save_hw_cond(flags);
>>>>  			current->active_mm = new_mm;
>>>>  			activate_mm(active_mm, new_mm);
>>>> +			local_irq_restore_hw_cond(flags);
>>>>  			new_mm = mm;
>>>>  		}
>>> Ok. These are patches of the noarch code which are dependent on x86
>>> implementation.
>>>
>>> I think we need something like
>>>
>>> mm_change_enter(flags);
>>> mm_change_leave(flags);
>>>
>>> which would resolve to ipipe_active_mm = NULL on architectures with
>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
>>> x86?
>> + we have to change switch_mm in the lockless case, and maybe also its
>> caller (if I look at the arm code), to use ipipe_active_mm in order to
>> decide if to switch or not - see my explanation of the possible race in
>> aio.c.
> 
> This is already the way it is currently working. the "prev" passed to
> switch_mm is always ipipe_active_mm.
> 

Ah, I missed the difference in arm's xnarch_leave_root. Now I got it.

OK, but this needs to be thought through in details again, specifically
also the enter-root scenarios. I have no safe feeling about it yet,
partly because switch_mm fiddles with ipipe_active_mm as well, but more
generally because we used to have such critical races in this area for a
"fairly" long time.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-05 12:01         ` Jan Kiszka
@ 2009-07-05 14:56           ` Gilles Chanteperdrix
  2009-07-05 17:12             ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-05 14:56 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Hi again,
>>>>>
>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>>>>> switches again:
>>>>>
>>>>> There were 3 race windows between setting active_mm of the current task
>>>>> and actually switching it (that's a noarch issue), there were several
>>>>> calls into switch_mm without proper hard interrupt protection (on archs
>>>>> that have no preemptible switch_mm, like x86) and there was a race in
>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>>>>> while I was looking at an old tree where smp_invalidate_interrupt took
>>>>> care of this).
>>>>>
>>>>> The patch is thought as a basis for further discussions about
>>>>>
>>>>>  o how to solve all the issues for all archs technically (ideally
>>>>>    without the need to patch noarch parts in an arch-specific way...)
>>>>>
>>>>>  o if anyone thinks there could be more spots like these (I've checked
>>>>>    the code only for x86 so far)
>>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>>> index 76da125..0286f0f 100644
>>>>> --- a/fs/aio.c
>>>>> +++ b/fs/aio.c
>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>>  {
>>>>>  	struct mm_struct *active_mm;
>>>>>  	struct task_struct *tsk = current;
>>>>> +	unsigned long flags;
>>>>>  
>>>>>  	task_lock(tsk);
>>>>>  	active_mm = tsk->active_mm;
>>>>>  	atomic_inc(&mm->mm_count);
>>>>>  	tsk->mm = mm;
>>>>> +	local_irq_save_hw_cond(flags);
>>>>>  	tsk->active_mm = mm;
>>>>> -	switch_mm(active_mm, mm, tsk);
>>>>> +	__switch_mm(active_mm, mm, tsk);
>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>  	task_unlock(tsk);
>>>>>  
>>>>>  	mmdrop(active_mm);
>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>> index 3b36c69..06591ac 100644
>>>>> --- a/fs/exec.c
>>>>> +++ b/fs/exec.c
>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>  {
>>>>>  	struct task_struct *tsk;
>>>>>  	struct mm_struct * old_mm, *active_mm;
>>>>> +	unsigned long flags;
>>>>>  
>>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>>  	tsk = current;
>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>  	task_lock(tsk);
>>>>>  	active_mm = tsk->active_mm;
>>>>>  	tsk->mm = mm;
>>>>> +	local_irq_save_hw_cond(flags);
>>>>>  	tsk->active_mm = mm;
>>>>>  	activate_mm(active_mm, mm);
>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>  	task_unlock(tsk);
>>>>>  	arch_pick_mmap_layout(mm);
>>>>>  	if (old_mm) {
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index 01a836b..cf3b68a 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>>>>  		}
>>>>>  
>>>>>  		if (new_mm) {
>>>>> +			unsigned long flags;
>>>>>  			mm = current->mm;
>>>>>  			active_mm = current->active_mm;
>>>>>  			current->mm = new_mm;
>>>>> +			local_irq_save_hw_cond(flags);
>>>>>  			current->active_mm = new_mm;
>>>>>  			activate_mm(active_mm, new_mm);
>>>>> +			local_irq_restore_hw_cond(flags);
>>>>>  			new_mm = mm;
>>>>>  		}
>>>> Ok. These are patches of the noarch code which are dependent on x86
>>>> implementation.
>>>>
>>>> I think we need something like
>>>>
>>>> mm_change_enter(flags);
>>>> mm_change_leave(flags);
>>>>
>>>> which would resolve to ipipe_active_mm = NULL on architectures with
>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
>>>> x86?
>>> + we have to change switch_mm in the lockless case, and maybe also its
>>> caller (if I look at the arm code), to use ipipe_active_mm in order to
>>> decide if to switch or not - see my explanation of the possible race in
>>> aio.c.
>> This is already the way it is currently working. the "prev" passed to
>> switch_mm is always ipipe_active_mm.
>>
> 
> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it.
> 
> OK, but this needs to be thought through in details again, specifically
> also the enter-root scenarios. I have no safe feeling about it yet,
> partly because switch_mm fiddles with ipipe_active_mm as well, but more
> generally because we used to have such critical races in this area for a
> "fairly" long time.

When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit
is set, in which case the switch_mm preempted by Xenomai is restarted.

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-05 14:56           ` Gilles Chanteperdrix
@ 2009-07-05 17:12             ` Jan Kiszka
  2009-07-06  7:54               ` Gilles Chanteperdrix
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kiszka @ 2009-07-05 17:12 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 5555 bytes --]

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>>>>>> switches again:
>>>>>>
>>>>>> There were 3 race windows between setting active_mm of the current task
>>>>>> and actually switching it (that's a noarch issue), there were several
>>>>>> calls into switch_mm without proper hard interrupt protection (on archs
>>>>>> that have no preemptible switch_mm, like x86) and there was a race in
>>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>>>>>> while I was looking at an old tree where smp_invalidate_interrupt took
>>>>>> care of this).
>>>>>>
>>>>>> The patch is thought as a basis for further discussions about
>>>>>>
>>>>>>  o how to solve all the issues for all archs technically (ideally
>>>>>>    without the need to patch noarch parts in an arch-specific way...)
>>>>>>
>>>>>>  o if anyone thinks there could be more spots like these (I've checked
>>>>>>    the code only for x86 so far)
>>>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>>>> index 76da125..0286f0f 100644
>>>>>> --- a/fs/aio.c
>>>>>> +++ b/fs/aio.c
>>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>>>  {
>>>>>>  	struct mm_struct *active_mm;
>>>>>>  	struct task_struct *tsk = current;
>>>>>> +	unsigned long flags;
>>>>>>  
>>>>>>  	task_lock(tsk);
>>>>>>  	active_mm = tsk->active_mm;
>>>>>>  	atomic_inc(&mm->mm_count);
>>>>>>  	tsk->mm = mm;
>>>>>> +	local_irq_save_hw_cond(flags);
>>>>>>  	tsk->active_mm = mm;
>>>>>> -	switch_mm(active_mm, mm, tsk);
>>>>>> +	__switch_mm(active_mm, mm, tsk);
>>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>>  	task_unlock(tsk);
>>>>>>  
>>>>>>  	mmdrop(active_mm);
>>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>>> index 3b36c69..06591ac 100644
>>>>>> --- a/fs/exec.c
>>>>>> +++ b/fs/exec.c
>>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>  {
>>>>>>  	struct task_struct *tsk;
>>>>>>  	struct mm_struct * old_mm, *active_mm;
>>>>>> +	unsigned long flags;
>>>>>>  
>>>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>>>  	tsk = current;
>>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>  	task_lock(tsk);
>>>>>>  	active_mm = tsk->active_mm;
>>>>>>  	tsk->mm = mm;
>>>>>> +	local_irq_save_hw_cond(flags);
>>>>>>  	tsk->active_mm = mm;
>>>>>>  	activate_mm(active_mm, mm);
>>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>>  	task_unlock(tsk);
>>>>>>  	arch_pick_mmap_layout(mm);
>>>>>>  	if (old_mm) {
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index 01a836b..cf3b68a 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>>>>>  		}
>>>>>>  
>>>>>>  		if (new_mm) {
>>>>>> +			unsigned long flags;
>>>>>>  			mm = current->mm;
>>>>>>  			active_mm = current->active_mm;
>>>>>>  			current->mm = new_mm;
>>>>>> +			local_irq_save_hw_cond(flags);
>>>>>>  			current->active_mm = new_mm;
>>>>>>  			activate_mm(active_mm, new_mm);
>>>>>> +			local_irq_restore_hw_cond(flags);
>>>>>>  			new_mm = mm;
>>>>>>  		}
>>>>> Ok. These are patches of the noarch code which are dependent on x86
>>>>> implementation.
>>>>>
>>>>> I think we need something like
>>>>>
>>>>> mm_change_enter(flags);
>>>>> mm_change_leave(flags);
>>>>>
>>>>> which would resolve to ipipe_active_mm = NULL on architectures with
>>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
>>>>> x86?
>>>> + we have to change switch_mm in the lockless case, and maybe also its
>>>> caller (if I look at the arm code), to use ipipe_active_mm in order to
>>>> decide if to switch or not - see my explanation of the possible race in
>>>> aio.c.
>>> This is already the way it is currently working. the "prev" passed to
>>> switch_mm is always ipipe_active_mm.
>>>
>> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it.
>>
>> OK, but this needs to be thought through in details again, specifically
>> also the enter-root scenarios. I have no safe feeling about it yet,
>> partly because switch_mm fiddles with ipipe_active_mm as well, but more
>> generally because we used to have such critical races in this area for a
>> "fairly" long time.
> 
> When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit
> is set, in which case the switch_mm preempted by Xenomai is restarted.
> 

Right. But now we are also dealing with preemptions outside that loop.
And we are dealing with archs that have atomic switch_mm, thus no such
loops.

The goal should be to find a solution against the races outside
switch_mm that is applicable to all archs. But the question is still
unanswered to me if we can simply reuse existing ipipe_active_mm without
disturbing its current use in non-atomic switch_mm.

Moreover, part of the problems on x86 was the non-atomic update of
cpu_vm_mask vs. actually switching the hardware. How do you deal with
cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection?
My impression is that only the (costly) cpu_switch_mm should be allowed
to be preemptible to have a safe state around it. I'm also asking as we
would have to address this on x86 & co. once switch_mm could be called
with undefined prev_mm.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-05 17:12             ` Jan Kiszka
@ 2009-07-06  7:54               ` Gilles Chanteperdrix
  2009-07-07 18:45                 ` Jan Kiszka
  0 siblings, 1 reply; 45+ messages in thread
From: Gilles Chanteperdrix @ 2009-07-06  7:54 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: xenomai-core

Jan Kiszka wrote:
> Gilles Chanteperdrix wrote:
>> Jan Kiszka wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Hi again,
>>>>>>>
>>>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>>>>>>> switches again:
>>>>>>>
>>>>>>> There were 3 race windows between setting active_mm of the current task
>>>>>>> and actually switching it (that's a noarch issue), there were several
>>>>>>> calls into switch_mm without proper hard interrupt protection (on archs
>>>>>>> that have no preemptible switch_mm, like x86) and there was a race in
>>>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>>>>>>> while I was looking at an old tree where smp_invalidate_interrupt took
>>>>>>> care of this).
>>>>>>>
>>>>>>> The patch is thought as a basis for further discussions about
>>>>>>>
>>>>>>>  o how to solve all the issues for all archs technically (ideally
>>>>>>>    without the need to patch noarch parts in an arch-specific way...)
>>>>>>>
>>>>>>>  o if anyone thinks there could be more spots like these (I've checked
>>>>>>>    the code only for x86 so far)
>>>>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>>>>> index 76da125..0286f0f 100644
>>>>>>> --- a/fs/aio.c
>>>>>>> +++ b/fs/aio.c
>>>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>>>>  {
>>>>>>>  	struct mm_struct *active_mm;
>>>>>>>  	struct task_struct *tsk = current;
>>>>>>> +	unsigned long flags;
>>>>>>>  
>>>>>>>  	task_lock(tsk);
>>>>>>>  	active_mm = tsk->active_mm;
>>>>>>>  	atomic_inc(&mm->mm_count);
>>>>>>>  	tsk->mm = mm;
>>>>>>> +	local_irq_save_hw_cond(flags);
>>>>>>>  	tsk->active_mm = mm;
>>>>>>> -	switch_mm(active_mm, mm, tsk);
>>>>>>> +	__switch_mm(active_mm, mm, tsk);
>>>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>>>  	task_unlock(tsk);
>>>>>>>  
>>>>>>>  	mmdrop(active_mm);
>>>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>>>> index 3b36c69..06591ac 100644
>>>>>>> --- a/fs/exec.c
>>>>>>> +++ b/fs/exec.c
>>>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>>  {
>>>>>>>  	struct task_struct *tsk;
>>>>>>>  	struct mm_struct * old_mm, *active_mm;
>>>>>>> +	unsigned long flags;
>>>>>>>  
>>>>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>>>>  	tsk = current;
>>>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>>  	task_lock(tsk);
>>>>>>>  	active_mm = tsk->active_mm;
>>>>>>>  	tsk->mm = mm;
>>>>>>> +	local_irq_save_hw_cond(flags);
>>>>>>>  	tsk->active_mm = mm;
>>>>>>>  	activate_mm(active_mm, mm);
>>>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>>>  	task_unlock(tsk);
>>>>>>>  	arch_pick_mmap_layout(mm);
>>>>>>>  	if (old_mm) {
>>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>>> index 01a836b..cf3b68a 100644
>>>>>>> --- a/kernel/fork.c
>>>>>>> +++ b/kernel/fork.c
>>>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>>>>>>  		}
>>>>>>>  
>>>>>>>  		if (new_mm) {
>>>>>>> +			unsigned long flags;
>>>>>>>  			mm = current->mm;
>>>>>>>  			active_mm = current->active_mm;
>>>>>>>  			current->mm = new_mm;
>>>>>>> +			local_irq_save_hw_cond(flags);
>>>>>>>  			current->active_mm = new_mm;
>>>>>>>  			activate_mm(active_mm, new_mm);
>>>>>>> +			local_irq_restore_hw_cond(flags);
>>>>>>>  			new_mm = mm;
>>>>>>>  		}
>>>>>> Ok. These are patches of the noarch code which are dependent on x86
>>>>>> implementation.
>>>>>>
>>>>>> I think we need something like
>>>>>>
>>>>>> mm_change_enter(flags);
>>>>>> mm_change_leave(flags);
>>>>>>
>>>>>> which would resolve to ipipe_active_mm = NULL on architectures with
>>>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
>>>>>> x86?
>>>>> + we have to change switch_mm in the lockless case, and maybe also its
>>>>> caller (if I look at the arm code), to use ipipe_active_mm in order to
>>>>> decide if to switch or not - see my explanation of the possible race in
>>>>> aio.c.
>>>> This is already the way it is currently working. the "prev" passed to
>>>> switch_mm is always ipipe_active_mm.
>>>>
>>> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it.
>>>
>>> OK, but this needs to be thought through in details again, specifically
>>> also the enter-root scenarios. I have no safe feeling about it yet,
>>> partly because switch_mm fiddles with ipipe_active_mm as well, but more
>>> generally because we used to have such critical races in this area for a
>>> "fairly" long time.
>> When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit
>> is set, in which case the switch_mm preempted by Xenomai is restarted.
>>
> 
> Right. But now we are also dealing with preemptions outside that loop.
> And we are dealing with archs that have atomic switch_mm, thus no such
> loops.
> 
> The goal should be to find a solution against the races outside
> switch_mm that is applicable to all archs. But the question is still
> unanswered to me if we can simply reuse existing ipipe_active_mm without
> disturbing its current use in non-atomic switch_mm.
> 
> Moreover, part of the problems on x86 was the non-atomic update of
> cpu_vm_mask vs. actually switching the hardware. How do you deal with
> cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection?
> My impression is that only the (costly) cpu_switch_mm should be allowed
> to be preemptible to have a safe state around it. I'm also asking as we
> would have to address this on x86 & co. once switch_mm could be called
> with undefined prev_mm.

when ipipe_active_mm is set to NULL, switch_mm does not touch the
current mm cpu_vm_mask, since it does not know what the current mm is.
However, we have a problem when switching back to root: we do not clear
the previous mm cpu_vm_mask since we do not call switch_mm in
xnarch_switch_to, we should fix this. So, I think we can extend the zone
where ipipe_active_mm is set to NULL, to protect this area from the
effects of preemption by Xenomai.

So, IMO, what we need is something like ipipe_mm_switch_protect which
sets ipipe_active_mm to NULL on ARM, and results in local_irq_save_hw on
x86.

> 
> Jan
> 


-- 
                                          Gilles



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Xenomai-core] x86: Endless minor faults
  2009-07-06  7:54               ` Gilles Chanteperdrix
@ 2009-07-07 18:45                 ` Jan Kiszka
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-07-07 18:45 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai-core

[-- Attachment #1: Type: text/plain, Size: 6695 bytes --]

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Jan Kiszka wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Gilles Chanteperdrix wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Hi again,
>>>>>>>>
>>>>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu
>>>>>>>> switches again:
>>>>>>>>
>>>>>>>> There were 3 race windows between setting active_mm of the current task
>>>>>>>> and actually switching it (that's a noarch issue), there were several
>>>>>>>> calls into switch_mm without proper hard interrupt protection (on archs
>>>>>>>> that have no preemptible switch_mm, like x86) and there was a race in
>>>>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread -
>>>>>>>> while I was looking at an old tree where smp_invalidate_interrupt took
>>>>>>>> care of this).
>>>>>>>>
>>>>>>>> The patch is thought as a basis for further discussions about
>>>>>>>>
>>>>>>>>  o how to solve all the issues for all archs technically (ideally
>>>>>>>>    without the need to patch noarch parts in an arch-specific way...)
>>>>>>>>
>>>>>>>>  o if anyone thinks there could be more spots like these (I've checked
>>>>>>>>    the code only for x86 so far)
>>>>>>>> diff --git a/fs/aio.c b/fs/aio.c
>>>>>>>> index 76da125..0286f0f 100644
>>>>>>>> --- a/fs/aio.c
>>>>>>>> +++ b/fs/aio.c
>>>>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm)
>>>>>>>>  {
>>>>>>>>  	struct mm_struct *active_mm;
>>>>>>>>  	struct task_struct *tsk = current;
>>>>>>>> +	unsigned long flags;
>>>>>>>>  
>>>>>>>>  	task_lock(tsk);
>>>>>>>>  	active_mm = tsk->active_mm;
>>>>>>>>  	atomic_inc(&mm->mm_count);
>>>>>>>>  	tsk->mm = mm;
>>>>>>>> +	local_irq_save_hw_cond(flags);
>>>>>>>>  	tsk->active_mm = mm;
>>>>>>>> -	switch_mm(active_mm, mm, tsk);
>>>>>>>> +	__switch_mm(active_mm, mm, tsk);
>>>>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>>>>  	task_unlock(tsk);
>>>>>>>>  
>>>>>>>>  	mmdrop(active_mm);
>>>>>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>>>>>> index 3b36c69..06591ac 100644
>>>>>>>> --- a/fs/exec.c
>>>>>>>> +++ b/fs/exec.c
>>>>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>>>  {
>>>>>>>>  	struct task_struct *tsk;
>>>>>>>>  	struct mm_struct * old_mm, *active_mm;
>>>>>>>> +	unsigned long flags;
>>>>>>>>  
>>>>>>>>  	/* Notify parent that we're no longer interested in the old VM */
>>>>>>>>  	tsk = current;
>>>>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>>>>  	task_lock(tsk);
>>>>>>>>  	active_mm = tsk->active_mm;
>>>>>>>>  	tsk->mm = mm;
>>>>>>>> +	local_irq_save_hw_cond(flags);
>>>>>>>>  	tsk->active_mm = mm;
>>>>>>>>  	activate_mm(active_mm, mm);
>>>>>>>> +	local_irq_restore_hw_cond(flags);
>>>>>>>>  	task_unlock(tsk);
>>>>>>>>  	arch_pick_mmap_layout(mm);
>>>>>>>>  	if (old_mm) {
>>>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>>>> index 01a836b..cf3b68a 100644
>>>>>>>> --- a/kernel/fork.c
>>>>>>>> +++ b/kernel/fork.c
>>>>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
>>>>>>>>  		}
>>>>>>>>  
>>>>>>>>  		if (new_mm) {
>>>>>>>> +			unsigned long flags;
>>>>>>>>  			mm = current->mm;
>>>>>>>>  			active_mm = current->active_mm;
>>>>>>>>  			current->mm = new_mm;
>>>>>>>> +			local_irq_save_hw_cond(flags);
>>>>>>>>  			current->active_mm = new_mm;
>>>>>>>>  			activate_mm(active_mm, new_mm);
>>>>>>>> +			local_irq_restore_hw_cond(flags);
>>>>>>>>  			new_mm = mm;
>>>>>>>>  		}
>>>>>>> Ok. These are patches of the noarch code which are dependent on x86
>>>>>>> implementation.
>>>>>>>
>>>>>>> I think we need something like
>>>>>>>
>>>>>>> mm_change_enter(flags);
>>>>>>> mm_change_leave(flags);
>>>>>>>
>>>>>>> which would resolve to ipipe_active_mm = NULL on architectures with
>>>>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on
>>>>>>> x86?
>>>>>> + we have to change switch_mm in the lockless case, and maybe also its
>>>>>> caller (if I look at the arm code), to use ipipe_active_mm in order to
>>>>>> decide if to switch or not - see my explanation of the possible race in
>>>>>> aio.c.
>>>>> This is already the way it is currently working. the "prev" passed to
>>>>> switch_mm is always ipipe_active_mm.
>>>>>
>>>> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it.
>>>>
>>>> OK, but this needs to be thought through in details again, specifically
>>>> also the enter-root scenarios. I have no safe feeling about it yet,
>>>> partly because switch_mm fiddles with ipipe_active_mm as well, but more
>>>> generally because we used to have such critical races in this area for a
>>>> "fairly" long time.
>>> When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit
>>> is set, in which case the switch_mm preempted by Xenomai is restarted.
>>>
>> Right. But now we are also dealing with preemptions outside that loop.
>> And we are dealing with archs that have atomic switch_mm, thus no such
>> loops.
>>
>> The goal should be to find a solution against the races outside
>> switch_mm that is applicable to all archs. But the question is still
>> unanswered to me if we can simply reuse existing ipipe_active_mm without
>> disturbing its current use in non-atomic switch_mm.
>>
>> Moreover, part of the problems on x86 was the non-atomic update of
>> cpu_vm_mask vs. actually switching the hardware. How do you deal with
>> cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection?
>> My impression is that only the (costly) cpu_switch_mm should be allowed
>> to be preemptible to have a safe state around it. I'm also asking as we
>> would have to address this on x86 & co. once switch_mm could be called
>> with undefined prev_mm.
> 
> when ipipe_active_mm is set to NULL, switch_mm does not touch the
> current mm cpu_vm_mask, since it does not know what the current mm is.
> However, we have a problem when switching back to root: we do not clear
> the previous mm cpu_vm_mask since we do not call switch_mm in
> xnarch_switch_to, we should fix this. So, I think we can extend the zone
> where ipipe_active_mm is set to NULL, to protect this area from the
> effects of preemption by Xenomai.
> 
> So, IMO, what we need is something like ipipe_mm_switch_protect which
> sets ipipe_active_mm to NULL on ARM, and results in local_irq_save_hw on
> x86.

That sounds like a good step forward. I will split up my patch into a
non-arch part that instruments the code and an x86 part that installs
the required protection for that arch.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 257 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2009-07-07 18:45 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka
2009-06-29 18:09 ` Philippe Gerum
2009-06-29 18:20   ` Jan Kiszka
2009-06-30  8:32 ` Jan Kiszka
2009-06-30  8:36   ` Gilles Chanteperdrix
2009-06-30  8:44     ` Jan Kiszka
2009-06-30  8:42   ` Gilles Chanteperdrix
2009-06-30  8:56     ` Jan Kiszka
2009-06-30  9:20     ` Philippe Gerum
2009-06-30  9:21       ` Gilles Chanteperdrix
2009-06-30  9:25         ` Philippe Gerum
2009-06-30  9:26           ` Gilles Chanteperdrix
2009-06-30  9:27             ` Philippe Gerum
2009-06-30  9:34               ` Gilles Chanteperdrix
2009-06-30 16:11                 ` Jan Kiszka
2009-07-01 11:56                   ` Jan Kiszka
2009-07-01 12:05                     ` Jan Kiszka
2009-07-01 12:24                     ` Gilles Chanteperdrix
2009-07-01 12:39                       ` Jan Kiszka
2009-07-01 12:41                         ` Gilles Chanteperdrix
2009-07-01 12:41                         ` Jan Kiszka
2009-07-01 15:51                           ` Gilles Chanteperdrix
2009-07-01 16:01                             ` Jan Kiszka
2009-07-01 16:04                               ` Jan Kiszka
2009-07-01 17:56                                 ` Jan Kiszka
2009-07-01 18:15                                   ` Philippe Gerum
2009-07-01 18:27                                     ` Philippe Gerum
2009-07-01 18:58                                       ` Jan Kiszka
2009-07-01 19:14                                         ` Jan Kiszka
2009-07-02  2:05                                       ` Gilles Chanteperdrix
2009-07-02  6:24                                         ` Jan Kiszka
2009-07-02  6:59                                           ` Gilles Chanteperdrix
2009-07-02  7:16                                             ` Jan Kiszka
2009-07-02  7:44                                               ` Gilles Chanteperdrix
2009-07-01 18:56                                     ` Jan Kiszka
2009-07-02  7:11                                       ` Philippe Gerum
2009-07-02 17:14 ` Jan Kiszka
2009-07-03 14:54   ` Gilles Chanteperdrix
2009-07-03 15:06     ` Jan Kiszka
2009-07-04 16:39       ` Gilles Chanteperdrix
2009-07-05 12:01         ` Jan Kiszka
2009-07-05 14:56           ` Gilles Chanteperdrix
2009-07-05 17:12             ` Jan Kiszka
2009-07-06  7:54               ` Gilles Chanteperdrix
2009-07-07 18:45                 ` Jan Kiszka

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.