* [Xenomai-core] x86: Endless minor faults
@ 2009-06-29 17:35 Jan Kiszka
2009-06-29 18:09 ` Philippe Gerum
` (2 more replies)
0 siblings, 3 replies; 45+ messages in thread
From: Jan Kiszka @ 2009-06-29 17:35 UTC (permalink / raw)
To: Philippe Gerum, Gilles Chanteperdrix; +Cc: xenomai-core
Hi all,
seen such loops before? This particular trace is from a 2.6.29.3 kernel
with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with
2.6.29.5/2.3-03:
:| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26)
:| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71)
:| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff)
: #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb)
:| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb)
:| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb)
: +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424)
: +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16)
: +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
: #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f)
: #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
: #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
:| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
:| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
: +func -652 0.096 find_vma+0x4 (do_page_fault+0x465)
: +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
: +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537)
: +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680)
: +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1)
: +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7)
: +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1)
: +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9)
: +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c)
: +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d)
: #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92)
: #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21)
: #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c)
:| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c)
:| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c)
:| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26)
and again and again...
We are looping over a minor fault here (according to /proc/PID/stat),
the context is a Xenomai task in secondary mode. As the task no longer
processes signals in this state, the whole system is more or less
broken. Tomorrow I will try to find out the faulting address with an
instrumented kernel, but maybe you already have some ideas.
Jan
--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [Xenomai-core] x86: Endless minor faults 2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka @ 2009-06-29 18:09 ` Philippe Gerum 2009-06-29 18:20 ` Jan Kiszka 2009-06-30 8:32 ` Jan Kiszka 2009-07-02 17:14 ` Jan Kiszka 2 siblings, 1 reply; 45+ messages in thread From: Philippe Gerum @ 2009-06-29 18:09 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core On Mon, 2009-06-29 at 19:35 +0200, Jan Kiszka wrote: > Hi all, > > seen such loops before? This particular trace is from a 2.6.29.3 kernel > with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with > 2.6.29.5/2.3-03: Do you mean 2.6.29.5/2.4-03? -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-29 18:09 ` Philippe Gerum @ 2009-06-29 18:20 ` Jan Kiszka 0 siblings, 0 replies; 45+ messages in thread From: Jan Kiszka @ 2009-06-29 18:20 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core Philippe Gerum wrote: > On Mon, 2009-06-29 at 19:35 +0200, Jan Kiszka wrote: >> Hi all, >> >> seen such loops before? This particular trace is from a 2.6.29.3 kernel >> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >> 2.6.29.5/2.3-03: > > Do you mean 2.6.29.5/2.4-03? No, in fact I meant 2.4-01 (the one of rc2). Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka 2009-06-29 18:09 ` Philippe Gerum @ 2009-06-30 8:32 ` Jan Kiszka 2009-06-30 8:36 ` Gilles Chanteperdrix 2009-06-30 8:42 ` Gilles Chanteperdrix 2009-07-02 17:14 ` Jan Kiszka 2 siblings, 2 replies; 45+ messages in thread From: Jan Kiszka @ 2009-06-30 8:32 UTC (permalink / raw) To: Philippe Gerum, Gilles Chanteperdrix; +Cc: xenomai-core Jan Kiszka wrote: > Hi all, > > seen such loops before? This particular trace is from a 2.6.29.3 kernel > with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with > 2.6.29.5/2.3-03: > > :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) > :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) > : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) > :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) > :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) > : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) > : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) > : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) > : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) > : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) > : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) > : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) > : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) > : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) > : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) > : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) > : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > > and again and again... > > We are looping over a minor fault here (according to /proc/PID/stat), > the context is a Xenomai task in secondary mode. As the task no longer > processes signals in this state, the whole system is more or less > broken. Tomorrow I will try to find out the faulting address with an > instrumented kernel, but maybe you already have some ideas. The fault is apparently triggered by __xn_put_user(XNRELAX, thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an invalid region ATM. The questions are now: Who corrupted this, user space on init (not that likely) or kernel space later on (unpleasant thought)? Moreover: Why can't we recover from a fault on u_mode? Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 8:32 ` Jan Kiszka @ 2009-06-30 8:36 ` Gilles Chanteperdrix 2009-06-30 8:44 ` Jan Kiszka 2009-06-30 8:42 ` Gilles Chanteperdrix 1 sibling, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-06-30 8:36 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Jan Kiszka wrote: >> Hi all, >> >> seen such loops before? This particular trace is from a 2.6.29.3 kernel >> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >> 2.6.29.5/2.3-03: >> >> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >> >> and again and again... >> >> We are looping over a minor fault here (according to /proc/PID/stat), >> the context is a Xenomai task in secondary mode. As the task no longer >> processes signals in this state, the whole system is more or less >> broken. Tomorrow I will try to find out the faulting address with an >> instrumented kernel, but maybe you already have some ideas. > > The fault is apparently triggered by __xn_put_user(XNRELAX, > thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an > invalid region ATM. The questions are now: Who corrupted this, user > space on init (not that likely) or kernel space later on (unpleasant > thought)? Moreover: Why can't we recover from a fault on u_mode? Can not this be a cleanup issue? The dying thread would have freed its u_mode before the final relax which really kills it. -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 8:36 ` Gilles Chanteperdrix @ 2009-06-30 8:44 ` Jan Kiszka 0 siblings, 0 replies; 45+ messages in thread From: Jan Kiszka @ 2009-06-30 8:44 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Jan Kiszka wrote: >>> Hi all, >>> >>> seen such loops before? This particular trace is from a 2.6.29.3 kernel >>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >>> 2.6.29.5/2.3-03: >>> >>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>> >>> and again and again... >>> >>> We are looping over a minor fault here (according to /proc/PID/stat), >>> the context is a Xenomai task in secondary mode. As the task no longer >>> processes signals in this state, the whole system is more or less >>> broken. Tomorrow I will try to find out the faulting address with an >>> instrumented kernel, but maybe you already have some ideas. >> The fault is apparently triggered by __xn_put_user(XNRELAX, >> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an >> invalid region ATM. The questions are now: Who corrupted this, user >> space on init (not that likely) or kernel space later on (unpleasant >> thought)? Moreover: Why can't we recover from a fault on u_mode? > > Can not this be a cleanup issue? The dying thread would have freed its > u_mode before the final relax which really kills it. > u_mode is 0x018fbf60, and that's typically an unused range for any process on x86_64. It is definitely unused for the process in question according to /proc/PID/maps. That said, you scenario may be problematic too, but I don't think we see it here. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 8:32 ` Jan Kiszka 2009-06-30 8:36 ` Gilles Chanteperdrix @ 2009-06-30 8:42 ` Gilles Chanteperdrix 2009-06-30 8:56 ` Jan Kiszka 2009-06-30 9:20 ` Philippe Gerum 1 sibling, 2 replies; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-06-30 8:42 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Jan Kiszka wrote: >> Hi all, >> >> seen such loops before? This particular trace is from a 2.6.29.3 kernel >> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >> 2.6.29.5/2.3-03: >> >> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >> >> and again and again... >> >> We are looping over a minor fault here (according to /proc/PID/stat), >> the context is a Xenomai task in secondary mode. As the task no longer >> processes signals in this state, the whole system is more or less >> broken. Tomorrow I will try to find out the faulting address with an >> instrumented kernel, but maybe you already have some ideas. > > The fault is apparently triggered by __xn_put_user(XNRELAX, > thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an > invalid region ATM. The questions are now: Who corrupted this, user > space on init (not that likely) or kernel space later on (unpleasant > thought)? Moreover: Why can't we recover from a fault on u_mode? I already investigated such an issue, and my conclusion was that there are some places in the code where we can not cope with a fault. xnshadow_relax being such a place, because, if relax faults, then what will the fault handler do? Call relax again. Fortunately, mlockall and the nocow stuff fixes this. Another way to implement the u_mode thing would be to use the shared heaps we use for fast mutexes. -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 8:42 ` Gilles Chanteperdrix @ 2009-06-30 8:56 ` Jan Kiszka 2009-06-30 9:20 ` Philippe Gerum 1 sibling, 0 replies; 45+ messages in thread From: Jan Kiszka @ 2009-06-30 8:56 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Jan Kiszka wrote: >>> Hi all, >>> >>> seen such loops before? This particular trace is from a 2.6.29.3 kernel >>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >>> 2.6.29.5/2.3-03: >>> >>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>> >>> and again and again... >>> >>> We are looping over a minor fault here (according to /proc/PID/stat), >>> the context is a Xenomai task in secondary mode. As the task no longer >>> processes signals in this state, the whole system is more or less >>> broken. Tomorrow I will try to find out the faulting address with an >>> instrumented kernel, but maybe you already have some ideas. >> The fault is apparently triggered by __xn_put_user(XNRELAX, >> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an >> invalid region ATM. The questions are now: Who corrupted this, user >> space on init (not that likely) or kernel space later on (unpleasant >> thought)? Moreover: Why can't we recover from a fault on u_mode? > > I already investigated such an issue, and my conclusion was that there > are some places in the code where we can not cope with a fault. > xnshadow_relax being such a place, because, if relax faults, then what > will the fault handler do? Call relax again. Fortunately, mlockall and > the nocow stuff fixes this. That was the assumption I built u_mode setting on. Granted, I neglected the case where we fail due to corrupt addresses. > > Another way to implement the u_mode thing would be to use the shared > heaps we use for fast mutexes. This is probably the way to go for a robust u_mode interface. A must-fix for 2.5-final. Nevertheless, I will do another run to find out where this address comes from. As often, we may see a chain of bugs... Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 8:42 ` Gilles Chanteperdrix 2009-06-30 8:56 ` Jan Kiszka @ 2009-06-30 9:20 ` Philippe Gerum 2009-06-30 9:21 ` Gilles Chanteperdrix 1 sibling, 1 reply; 45+ messages in thread From: Philippe Gerum @ 2009-06-30 9:20 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: > Jan Kiszka wrote: > > Jan Kiszka wrote: > >> Hi all, > >> > >> seen such loops before? This particular trace is from a 2.6.29.3 kernel > >> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with > >> 2.6.29.5/2.3-03: > >> > >> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > >> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) > >> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) > >> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) > >> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) > >> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) > >> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) > >> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) > >> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > >> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) > >> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > >> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > >> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > >> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > >> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) > >> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > >> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) > >> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) > >> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) > >> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) > >> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > >> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) > >> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) > >> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > >> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) > >> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > >> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > >> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > >> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > >> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > >> > >> and again and again... > >> > >> We are looping over a minor fault here (according to /proc/PID/stat), > >> the context is a Xenomai task in secondary mode. As the task no longer > >> processes signals in this state, the whole system is more or less > >> broken. Tomorrow I will try to find out the faulting address with an > >> instrumented kernel, but maybe you already have some ideas. > > > > The fault is apparently triggered by __xn_put_user(XNRELAX, > > thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an > > invalid region ATM. The questions are now: Who corrupted this, user > > space on init (not that likely) or kernel space later on (unpleasant > > thought)? Moreover: Why can't we recover from a fault on u_mode? > > I already investigated such an issue, and my conclusion was that there > are some places in the code where we can not cope with a fault. > xnshadow_relax being such a place, because, if relax faults, then what > will the fault handler do? Call relax again. Fortunately, mlockall and > the nocow stuff fixes this. xnshadow_relax() faulting before the current thread bears the XNRELAX bit would mean that a creepy issue involving ondemand PTEs in _kernel_ space must have caused this. Having the init_mm mappings known from all processes seems more relevant to this issue than anything nocow and/or mlockall could ever do to fix it. > > Another way to implement the u_mode thing would be to use the shared > heaps we use for fast mutexes. > -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 9:20 ` Philippe Gerum @ 2009-06-30 9:21 ` Gilles Chanteperdrix 2009-06-30 9:25 ` Philippe Gerum 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-06-30 9:21 UTC (permalink / raw) To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core Philippe Gerum wrote: > On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> Hi all, >>>> >>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel >>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >>>> 2.6.29.5/2.3-03: >>>> >>>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >>>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >>>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >>>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >>>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >>>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >>>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >>>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >>>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >>>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >>>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >>>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >>>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >>>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >>>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >>>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >>>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>> >>>> and again and again... >>>> >>>> We are looping over a minor fault here (according to /proc/PID/stat), >>>> the context is a Xenomai task in secondary mode. As the task no longer >>>> processes signals in this state, the whole system is more or less >>>> broken. Tomorrow I will try to find out the faulting address with an >>>> instrumented kernel, but maybe you already have some ideas. >>> The fault is apparently triggered by __xn_put_user(XNRELAX, >>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an >>> invalid region ATM. The questions are now: Who corrupted this, user >>> space on init (not that likely) or kernel space later on (unpleasant >>> thought)? Moreover: Why can't we recover from a fault on u_mode? >> I already investigated such an issue, and my conclusion was that there >> are some places in the code where we can not cope with a fault. >> xnshadow_relax being such a place, because, if relax faults, then what >> will the fault handler do? Call relax again. Fortunately, mlockall and >> the nocow stuff fixes this. > > > xnshadow_relax() faulting before the current thread bears the XNRELAX > bit would mean that a creepy issue involving ondemand PTEs in _kernel_ > space must have caused this. Having the init_mm mappings known from all > processes seems more relevant to this issue than anything nocow and/or > mlockall could ever do to fix it. u_mode is a user-space address. -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 9:21 ` Gilles Chanteperdrix @ 2009-06-30 9:25 ` Philippe Gerum 2009-06-30 9:26 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Philippe Gerum @ 2009-06-30 9:25 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote: > Philippe Gerum wrote: > > On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: > >> Jan Kiszka wrote: > >>> Jan Kiszka wrote: > >>>> Hi all, > >>>> > >>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel > >>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with > >>>> 2.6.29.5/2.3-03: > >>>> > >>>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > >>>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) > >>>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) > >>>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) > >>>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) > >>>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) > >>>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) > >>>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) > >>>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > >>>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) > >>>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > >>>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > >>>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > >>>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > >>>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) > >>>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > >>>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) > >>>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) > >>>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) > >>>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) > >>>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > >>>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) > >>>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) > >>>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > >>>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) > >>>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > >>>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > >>>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > >>>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > >>>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > >>>> > >>>> and again and again... > >>>> > >>>> We are looping over a minor fault here (according to /proc/PID/stat), > >>>> the context is a Xenomai task in secondary mode. As the task no longer > >>>> processes signals in this state, the whole system is more or less > >>>> broken. Tomorrow I will try to find out the faulting address with an > >>>> instrumented kernel, but maybe you already have some ideas. > >>> The fault is apparently triggered by __xn_put_user(XNRELAX, > >>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an > >>> invalid region ATM. The questions are now: Who corrupted this, user > >>> space on init (not that likely) or kernel space later on (unpleasant > >>> thought)? Moreover: Why can't we recover from a fault on u_mode? > >> I already investigated such an issue, and my conclusion was that there > >> are some places in the code where we can not cope with a fault. > >> xnshadow_relax being such a place, because, if relax faults, then what > >> will the fault handler do? Call relax again. Fortunately, mlockall and > >> the nocow stuff fixes this. > > > > > > xnshadow_relax() faulting before the current thread bears the XNRELAX > > bit would mean that a creepy issue involving ondemand PTEs in _kernel_ > > space must have caused this. Having the init_mm mappings known from all > > processes seems more relevant to this issue than anything nocow and/or > > mlockall could ever do to fix it. > > u_mode is a user-space address. > Why do you think xnshadow_relax() would be called for an already relaxed thread? -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 9:25 ` Philippe Gerum @ 2009-06-30 9:26 ` Gilles Chanteperdrix 2009-06-30 9:27 ` Philippe Gerum 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-06-30 9:26 UTC (permalink / raw) To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core Philippe Gerum wrote: > On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote: >> Philippe Gerum wrote: >>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: >>>> Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> Hi all, >>>>>> >>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel >>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >>>>>> 2.6.29.5/2.3-03: >>>>>> >>>>>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>>>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >>>>>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >>>>>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >>>>>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >>>>>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >>>>>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >>>>>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >>>>>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>>>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >>>>>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>>>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>>>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>>>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>>>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >>>>>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>>>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >>>>>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >>>>>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >>>>>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >>>>>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>>>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >>>>>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >>>>>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>>>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >>>>>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>>>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>>>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>>>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>>>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>>>> >>>>>> and again and again... >>>>>> >>>>>> We are looping over a minor fault here (according to /proc/PID/stat), >>>>>> the context is a Xenomai task in secondary mode. As the task no longer >>>>>> processes signals in this state, the whole system is more or less >>>>>> broken. Tomorrow I will try to find out the faulting address with an >>>>>> instrumented kernel, but maybe you already have some ideas. >>>>> The fault is apparently triggered by __xn_put_user(XNRELAX, >>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an >>>>> invalid region ATM. The questions are now: Who corrupted this, user >>>>> space on init (not that likely) or kernel space later on (unpleasant >>>>> thought)? Moreover: Why can't we recover from a fault on u_mode? >>>> I already investigated such an issue, and my conclusion was that there >>>> are some places in the code where we can not cope with a fault. >>>> xnshadow_relax being such a place, because, if relax faults, then what >>>> will the fault handler do? Call relax again. Fortunately, mlockall and >>>> the nocow stuff fixes this. >>> >>> xnshadow_relax() faulting before the current thread bears the XNRELAX >>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_ >>> space must have caused this. Having the init_mm mappings known from all >>> processes seems more relevant to this issue than anything nocow and/or >>> mlockall could ever do to fix it. >> u_mode is a user-space address. >> > > Why do you think xnshadow_relax() would be called for an already relaxed > thread? Because the fault happens before it has finished relaxing ? -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 9:26 ` Gilles Chanteperdrix @ 2009-06-30 9:27 ` Philippe Gerum 2009-06-30 9:34 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Philippe Gerum @ 2009-06-30 9:27 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai-core On Tue, 2009-06-30 at 11:26 +0200, Gilles Chanteperdrix wrote: > Philippe Gerum wrote: > > On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote: > >> Philippe Gerum wrote: > >>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: > >>>> Jan Kiszka wrote: > >>>>> Jan Kiszka wrote: > >>>>>> Hi all, > >>>>>> > >>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel > >>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with > >>>>>> 2.6.29.5/2.3-03: > >>>>>> > >>>>>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > >>>>>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) > >>>>>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) > >>>>>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) > >>>>>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) > >>>>>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) > >>>>>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) > >>>>>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) > >>>>>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > >>>>>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) > >>>>>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > >>>>>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > >>>>>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > >>>>>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > >>>>>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) > >>>>>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > >>>>>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) > >>>>>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) > >>>>>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) > >>>>>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) > >>>>>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) > >>>>>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) > >>>>>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) > >>>>>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) > >>>>>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) > >>>>>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) > >>>>>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) > >>>>>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) > >>>>>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) > >>>>>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) > >>>>>> > >>>>>> and again and again... > >>>>>> > >>>>>> We are looping over a minor fault here (according to /proc/PID/stat), > >>>>>> the context is a Xenomai task in secondary mode. As the task no longer > >>>>>> processes signals in this state, the whole system is more or less > >>>>>> broken. Tomorrow I will try to find out the faulting address with an > >>>>>> instrumented kernel, but maybe you already have some ideas. > >>>>> The fault is apparently triggered by __xn_put_user(XNRELAX, > >>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an > >>>>> invalid region ATM. The questions are now: Who corrupted this, user > >>>>> space on init (not that likely) or kernel space later on (unpleasant > >>>>> thought)? Moreover: Why can't we recover from a fault on u_mode? > >>>> I already investigated such an issue, and my conclusion was that there > >>>> are some places in the code where we can not cope with a fault. > >>>> xnshadow_relax being such a place, because, if relax faults, then what > >>>> will the fault handler do? Call relax again. Fortunately, mlockall and > >>>> the nocow stuff fixes this. > >>> > >>> xnshadow_relax() faulting before the current thread bears the XNRELAX > >>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_ > >>> space must have caused this. Having the init_mm mappings known from all > >>> processes seems more relevant to this issue than anything nocow and/or > >>> mlockall could ever do to fix it. > >> u_mode is a user-space address. > >> > > > > Why do you think xnshadow_relax() would be called for an already relaxed > > thread? > > Because the fault happens before it has finished relaxing ? > Well, no. Have a second look at the code. -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 9:27 ` Philippe Gerum @ 2009-06-30 9:34 ` Gilles Chanteperdrix 2009-06-30 16:11 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-06-30 9:34 UTC (permalink / raw) To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core Philippe Gerum wrote: > On Tue, 2009-06-30 at 11:26 +0200, Gilles Chanteperdrix wrote: >> Philippe Gerum wrote: >>> On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote: >>>> Philippe Gerum wrote: >>>>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Hi all, >>>>>>>> >>>>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel >>>>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >>>>>>>> 2.6.29.5/2.3-03: >>>>>>>> >>>>>>>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>>>>>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >>>>>>>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >>>>>>>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >>>>>>>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >>>>>>>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >>>>>>>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >>>>>>>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >>>>>>>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>>>>>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >>>>>>>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>>>>>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>>>>>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>>>>>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>>>>>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >>>>>>>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>>>>>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >>>>>>>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >>>>>>>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >>>>>>>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >>>>>>>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>>>>>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >>>>>>>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >>>>>>>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>>>>>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >>>>>>>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>>>>>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>>>>>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>>>>>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>>>>>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>>>>>> >>>>>>>> and again and again... >>>>>>>> >>>>>>>> We are looping over a minor fault here (according to /proc/PID/stat), >>>>>>>> the context is a Xenomai task in secondary mode. As the task no longer >>>>>>>> processes signals in this state, the whole system is more or less >>>>>>>> broken. Tomorrow I will try to find out the faulting address with an >>>>>>>> instrumented kernel, but maybe you already have some ideas. >>>>>>> The fault is apparently triggered by __xn_put_user(XNRELAX, >>>>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an >>>>>>> invalid region ATM. The questions are now: Who corrupted this, user >>>>>>> space on init (not that likely) or kernel space later on (unpleasant >>>>>>> thought)? Moreover: Why can't we recover from a fault on u_mode? >>>>>> I already investigated such an issue, and my conclusion was that there >>>>>> are some places in the code where we can not cope with a fault. >>>>>> xnshadow_relax being such a place, because, if relax faults, then what >>>>>> will the fault handler do? Call relax again. Fortunately, mlockall and >>>>>> the nocow stuff fixes this. >>>>> xnshadow_relax() faulting before the current thread bears the XNRELAX >>>>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_ >>>>> space must have caused this. Having the init_mm mappings known from all >>>>> processes seems more relevant to this issue than anything nocow and/or >>>>> mlockall could ever do to fix it. >>>> u_mode is a user-space address. >>>> >>> Why do you think xnshadow_relax() would be called for an already relaxed >>> thread? >> Because the fault happens before it has finished relaxing ? >> > > Well, no. Have a second look at the code. Ok, you are right then, in my case the faults were probably due to vmalloc. -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 9:34 ` Gilles Chanteperdrix @ 2009-06-30 16:11 ` Jan Kiszka 2009-07-01 11:56 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-06-30 16:11 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Gilles Chanteperdrix wrote: > Philippe Gerum wrote: >> On Tue, 2009-06-30 at 11:26 +0200, Gilles Chanteperdrix wrote: >>> Philippe Gerum wrote: >>>> On Tue, 2009-06-30 at 11:21 +0200, Gilles Chanteperdrix wrote: >>>>> Philippe Gerum wrote: >>>>>> On Tue, 2009-06-30 at 10:42 +0200, Gilles Chanteperdrix wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> seen such loops before? This particular trace is from a 2.6.29.3 kernel >>>>>>>>> with ipipe-2.3-01 (SMP/PREEMPT_VOLUNTARY), but the same happens with >>>>>>>>> 2.6.29.5/2.3-03: >>>>>>>>> >>>>>>>>> :| +func -653 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>>>>>>> :| +func -653 0.096 ipipe_check_context+0xd (__ipipe_handle_exception+0x71) >>>>>>>>> :| #end 0x80000000 -653 0.069 do_page_fault+0x33 (__ipipe_handle_exception+0x1ff) >>>>>>>>> : #func -653 0.078 __ipipe_unstall_root+0x9 (do_page_fault+0x3cb) >>>>>>>>> :| #begin 0x80000000 -653 0.068 __ipipe_unstall_root+0x34 (do_page_fault+0x3cb) >>>>>>>>> :| +end 0x80000000 -653 0.069 __ipipe_unstall_root+0x59 (do_page_fault+0x3cb) >>>>>>>>> : +func -653 0.060 down_read_trylock+0x4 (do_page_fault+0x424) >>>>>>>>> : +func -653 0.068 _spin_lock_irqsave+0x9 (__down_read_trylock+0x16) >>>>>>>>> : +func -653 0.108 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>>>>>>> : #func -652 0.066 _spin_unlock_irqrestore+0x4 (__down_read_trylock+0x3f) >>>>>>>>> : #func -652 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>>>>>>> : #func -652 0.074 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>>>>>>> :| #begin 0x80000000 -652 0.066 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>>>>>>> :| +end 0x80000000 -652 0.069 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>>>>>>> : +func -652 0.096 find_vma+0x4 (do_page_fault+0x465) >>>>>>>>> : +func -652 0.150 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>>>>>>> : +func -652 0.098 handle_mm_fault+0x11 (do_page_fault+0x537) >>>>>>>>> : +func -652 0.090 _spin_lock+0x4 (handle_mm_fault+0x680) >>>>>>>>> : +func -652 0.063 ptep_set_access_flags+0x9 (handle_mm_fault+0x6d1) >>>>>>>>> : +func -652 0.282 flush_tlb_page+0xd (handle_mm_fault+0x6e7) >>>>>>>>> : +func -651 0.162 ltt_run_filter_default+0x4 (_ltt_specialized_trace+0xc1) >>>>>>>>> : +func -651 0.062 up_read+0x4 (do_page_fault+0x5a9) >>>>>>>>> : +func -651 0.072 _spin_lock_irqsave+0x9 (__up_read+0x1c) >>>>>>>>> : +func -651 0.117 ipipe_check_context+0xd (_spin_lock_irqsave+0x1d) >>>>>>>>> : #func -651 0.074 _spin_unlock_irqrestore+0x4 (__up_read+0x92) >>>>>>>>> : #func -651 0.069 __ipipe_restore_root+0x4 (_spin_unlock_irqrestore+0x21) >>>>>>>>> : #func -651 0.060 __ipipe_unstall_root+0x9 (__ipipe_restore_root+0x2c) >>>>>>>>> :| #begin 0x80000000 -651 0.056 __ipipe_unstall_root+0x34 (__ipipe_restore_root+0x2c) >>>>>>>>> :| +end 0x80000000 -651 0.420 __ipipe_unstall_root+0x59 (__ipipe_restore_root+0x2c) >>>>>>>>> :| +func -650 0.084 __ipipe_handle_exception+0x11 (page_fault+0x26) >>>>>>>>> >>>>>>>>> and again and again... >>>>>>>>> >>>>>>>>> We are looping over a minor fault here (according to /proc/PID/stat), >>>>>>>>> the context is a Xenomai task in secondary mode. As the task no longer >>>>>>>>> processes signals in this state, the whole system is more or less >>>>>>>>> broken. Tomorrow I will try to find out the faulting address with an >>>>>>>>> instrumented kernel, but maybe you already have some ideas. >>>>>>>> The fault is apparently triggered by __xn_put_user(XNRELAX, >>>>>>>> thread->u_mode) in xnshadow_relax. thread->u_mode is pointing to an >>>>>>>> invalid region ATM. The questions are now: Who corrupted this, user >>>>>>>> space on init (not that likely) or kernel space later on (unpleasant >>>>>>>> thought)? Moreover: Why can't we recover from a fault on u_mode? >>>>>>> I already investigated such an issue, and my conclusion was that there >>>>>>> are some places in the code where we can not cope with a fault. >>>>>>> xnshadow_relax being such a place, because, if relax faults, then what >>>>>>> will the fault handler do? Call relax again. Fortunately, mlockall and >>>>>>> the nocow stuff fixes this. >>>>>> xnshadow_relax() faulting before the current thread bears the XNRELAX >>>>>> bit would mean that a creepy issue involving ondemand PTEs in _kernel_ >>>>>> space must have caused this. Having the init_mm mappings known from all >>>>>> processes seems more relevant to this issue than anything nocow and/or >>>>>> mlockall could ever do to fix it. >>>>> u_mode is a user-space address. >>>>> >>>> Why do you think xnshadow_relax() would be called for an already relaxed >>>> thread? >>> Because the fault happens before it has finished relaxing ? >>> >> Well, no. Have a second look at the code. > > Ok, you are right then, in my case the faults were probably due to vmalloc. In our case, it must be some special fault path, too. I tried injecting a non-NULL but invalid address intentionally. But the __xn_put_user just swallowed it without complaints. It's still unclear what goes on precisely, we are still digging, but the test system that can produce this is highly contended. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-30 16:11 ` Jan Kiszka @ 2009-07-01 11:56 ` Jan Kiszka 2009-07-01 12:05 ` Jan Kiszka 2009-07-01 12:24 ` Gilles Chanteperdrix 0 siblings, 2 replies; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 11:56 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Jan Kiszka wrote: > It's still unclear what goes on precisely, we are still digging, but the > test system that can produce this is highly contended. Short update: Further instrumentation revealed that cr3 differs from active_mm->pgd while we are looping over that fault, ie. the kernel tries to fixup the wrong mm. And that means we have some open race window between updating cr3 and active_mm somewhere (isn't switch_mm run in a preemptible manner now?). As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now checking if it makes a difference. Digging deeper into the code in the meanwhile... Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 11:56 ` Jan Kiszka @ 2009-07-01 12:05 ` Jan Kiszka 2009-07-01 12:24 ` Gilles Chanteperdrix 1 sibling, 0 replies; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 12:05 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Jan Kiszka wrote: > Jan Kiszka wrote: >> It's still unclear what goes on precisely, we are still digging, but the >> test system that can produce this is highly contended. > > Short update: Further instrumentation revealed that cr3 differs from > active_mm->pgd while we are looping over that fault, ie. the kernel > tries to fixup the wrong mm. And that means we have some open race > window between updating cr3 and active_mm somewhere (isn't switch_mm run > in a preemptible manner now?). > > As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now > checking if it makes a difference. Digging deeper into the code in the > meanwhile... CONFIG_IPIPE_DELAYED_ATOMICSW is nonsense. And we don't do switch_mm without irq protection on x86, at least not from the nucleus, right? Maybe a race due some Linux user running it unprotectedly? Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 11:56 ` Jan Kiszka 2009-07-01 12:05 ` Jan Kiszka @ 2009-07-01 12:24 ` Gilles Chanteperdrix 2009-07-01 12:39 ` Jan Kiszka 1 sibling, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-01 12:24 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Jan Kiszka wrote: >> It's still unclear what goes on precisely, we are still digging, but the >> test system that can produce this is highly contended. > > Short update: Further instrumentation revealed that cr3 differs from > active_mm->pgd while we are looping over that fault, ie. the kernel > tries to fixup the wrong mm. And that means we have some open race > window between updating cr3 and active_mm somewhere (isn't switch_mm run > in a preemptible manner now?). Maybe the rsp is wrong and leads you to the wrong active_mm ? > > As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now > checking if it makes a difference. Digging deeper into the code in the > meanwhile... As you have found out in the mean time, we do not use unlocked context switches on x86. -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 12:24 ` Gilles Chanteperdrix @ 2009-07-01 12:39 ` Jan Kiszka 2009-07-01 12:41 ` Gilles Chanteperdrix 2009-07-01 12:41 ` Jan Kiszka 0 siblings, 2 replies; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 12:39 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Jan Kiszka wrote: >>> It's still unclear what goes on precisely, we are still digging, but the >>> test system that can produce this is highly contended. >> Short update: Further instrumentation revealed that cr3 differs from >> active_mm->pgd while we are looping over that fault, ie. the kernel >> tries to fixup the wrong mm. And that means we have some open race >> window between updating cr3 and active_mm somewhere (isn't switch_mm run >> in a preemptible manner now?). > > Maybe the rsp is wrong and leads you to the wrong active_mm ? > >> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >> checking if it makes a difference. Digging deeper into the code in the >> meanwhile... > > As you have found out in the mean time, we do not use unlocked context > switches on x86. > Yes. The last question I asked myself (but couldn't answer yet due to other activity) was: Where are the local_irq_disable/enable_hw around switch_mm for its Linux callers? Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 12:39 ` Jan Kiszka @ 2009-07-01 12:41 ` Gilles Chanteperdrix 2009-07-01 12:41 ` Jan Kiszka 1 sibling, 0 replies; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-01 12:41 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> It's still unclear what goes on precisely, we are still digging, but the >>>> test system that can produce this is highly contended. >>> Short update: Further instrumentation revealed that cr3 differs from >>> active_mm->pgd while we are looping over that fault, ie. the kernel >>> tries to fixup the wrong mm. And that means we have some open race >>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>> in a preemptible manner now?). >> Maybe the rsp is wrong and leads you to the wrong active_mm ? >> >>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>> checking if it makes a difference. Digging deeper into the code in the >>> meanwhile... >> As you have found out in the mean time, we do not use unlocked context >> switches on x86. >> > > Yes. > > The last question I asked myself (but couldn't answer yet due to other > activity) was: Where are the local_irq_disable/enable_hw around > switch_mm for its Linux callers? prepare_arch_switch and task_hijacked -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 12:39 ` Jan Kiszka 2009-07-01 12:41 ` Gilles Chanteperdrix @ 2009-07-01 12:41 ` Jan Kiszka 2009-07-01 15:51 ` Gilles Chanteperdrix 1 sibling, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 12:41 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> It's still unclear what goes on precisely, we are still digging, but the >>>> test system that can produce this is highly contended. >>> Short update: Further instrumentation revealed that cr3 differs from >>> active_mm->pgd while we are looping over that fault, ie. the kernel >>> tries to fixup the wrong mm. And that means we have some open race >>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>> in a preemptible manner now?). >> Maybe the rsp is wrong and leads you to the wrong active_mm ? >> >>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>> checking if it makes a difference. Digging deeper into the code in the >>> meanwhile... >> As you have found out in the mean time, we do not use unlocked context >> switches on x86. >> > > Yes. > > The last question I asked myself (but couldn't answer yet due to other > activity) was: Where are the local_irq_disable/enable_hw around > switch_mm for its Linux callers? Ha, that's the point: only activate_mm is protected, but we have more spots in 2.6.29 and maybe other kernels, too! Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 12:41 ` Jan Kiszka @ 2009-07-01 15:51 ` Gilles Chanteperdrix 2009-07-01 16:01 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-01 15:51 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Jan Kiszka wrote: >> Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>> test system that can produce this is highly contended. >>>> Short update: Further instrumentation revealed that cr3 differs from >>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>> tries to fixup the wrong mm. And that means we have some open race >>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>> in a preemptible manner now?). >>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>> >>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>> checking if it makes a difference. Digging deeper into the code in the >>>> meanwhile... >>> As you have found out in the mean time, we do not use unlocked context >>> switches on x86. >>> >> Yes. >> >> The last question I asked myself (but couldn't answer yet due to other >> activity) was: Where are the local_irq_disable/enable_hw around >> switch_mm for its Linux callers? > > Ha, that's the point: only activate_mm is protected, but we have more > spots in 2.6.29 and maybe other kernels, too! Ok, I do not see where switch_mm is called with IRQs off. What I found, however, is that leave_mm sets the cr3 and just clears active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between cr3 and active_mm. I do not know what could happen if Xenomai could interrupt leave_mm between the cpu_clear and the write_cr3. From what I understand, switch_mm called by Xenomai upon return to root would re-set the bit, and re-set cr3, which would be set to the kernel cr3 right after that, but this would result in the active_mm.cpu_vm_mask bit being set instead of cleared as expected. So, maybe an irqs off section is missing in leave_mm. -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 15:51 ` Gilles Chanteperdrix @ 2009-07-01 16:01 ` Jan Kiszka 2009-07-01 16:04 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 16:01 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Jan Kiszka wrote: >>> Gilles Chanteperdrix wrote: >>>> Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>> test system that can produce this is highly contended. >>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>> tries to fixup the wrong mm. And that means we have some open race >>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>> in a preemptible manner now?). >>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>> >>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>> checking if it makes a difference. Digging deeper into the code in the >>>>> meanwhile... >>>> As you have found out in the mean time, we do not use unlocked context >>>> switches on x86. >>>> >>> Yes. >>> >>> The last question I asked myself (but couldn't answer yet due to other >>> activity) was: Where are the local_irq_disable/enable_hw around >>> switch_mm for its Linux callers? >> Ha, that's the point: only activate_mm is protected, but we have more >> spots in 2.6.29 and maybe other kernels, too! > > Ok, I do not see where switch_mm is called with IRQs off. What I found, We have two direct callers of switch_mm in sched.c and one in fs/aio.c. Both need protection (I pushed IRQ disabling into switch_mm), but that is not enough according to current tests. It seems to reduce to probability of corruption, though. > however, is that leave_mm sets the cr3 and just clears > active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between > cr3 and active_mm. I do not know what could happen if Xenomai could > interrupt leave_mm between the cpu_clear and the write_cr3. From what I > understand, switch_mm called by Xenomai upon return to root would re-set > the bit, and re-set cr3, which would be set to the kernel cr3 right > after that, but this would result in the active_mm.cpu_vm_mask bit being > set instead of cleared as expected. So, maybe an irqs off section is > missing in leave_mm. leave_mm is already protected by its caller smp_invalidate_interrupt - but now I'm parsing context_switch /wrt to lazy tlb. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 16:01 ` Jan Kiszka @ 2009-07-01 16:04 ` Jan Kiszka 2009-07-01 17:56 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 16:04 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> Gilles Chanteperdrix wrote: >>>>> Jan Kiszka wrote: >>>>>> Jan Kiszka wrote: >>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>> test system that can produce this is highly contended. >>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>> in a preemptible manner now?). >>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>> >>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>> meanwhile... >>>>> As you have found out in the mean time, we do not use unlocked context >>>>> switches on x86. >>>>> >>>> Yes. >>>> >>>> The last question I asked myself (but couldn't answer yet due to other >>>> activity) was: Where are the local_irq_disable/enable_hw around >>>> switch_mm for its Linux callers? >>> Ha, that's the point: only activate_mm is protected, but we have more >>> spots in 2.6.29 and maybe other kernels, too! >> Ok, I do not see where switch_mm is called with IRQs off. What I found, > > We have two direct callers of switch_mm in sched.c and one in fs/aio.c. > Both need protection (I pushed IRQ disabling into switch_mm), but that > is not enough according to current tests. It seems to reduce to > probability of corruption, though. > >> however, is that leave_mm sets the cr3 and just clears >> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >> cr3 and active_mm. I do not know what could happen if Xenomai could >> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >> understand, switch_mm called by Xenomai upon return to root would re-set >> the bit, and re-set cr3, which would be set to the kernel cr3 right >> after that, but this would result in the active_mm.cpu_vm_mask bit being >> set instead of cleared as expected. So, maybe an irqs off section is >> missing in leave_mm. > > leave_mm is already protected by its caller smp_invalidate_interrupt - > but now I'm parsing context_switch /wrt to lazy tlb. > Hmm... lazy tlb: This means a new task is switched in and has active_mm != mm. But do_page_fault reads task->mm... Just thoughts, no clear picture yet. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 16:04 ` Jan Kiszka @ 2009-07-01 17:56 ` Jan Kiszka 2009-07-01 18:15 ` Philippe Gerum 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 17:56 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Jan Kiszka wrote: > Jan Kiszka wrote: >> Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> Gilles Chanteperdrix wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>> test system that can produce this is highly contended. >>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>> in a preemptible manner now?). >>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>> >>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>> meanwhile... >>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>> switches on x86. >>>>>> >>>>> Yes. >>>>> >>>>> The last question I asked myself (but couldn't answer yet due to other >>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>> switch_mm for its Linux callers? >>>> Ha, that's the point: only activate_mm is protected, but we have more >>>> spots in 2.6.29 and maybe other kernels, too! >>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >> Both need protection (I pushed IRQ disabling into switch_mm), but that >> is not enough according to current tests. It seems to reduce to >> probability of corruption, though. >> >>> however, is that leave_mm sets the cr3 and just clears >>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>> cr3 and active_mm. I do not know what could happen if Xenomai could >>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>> understand, switch_mm called by Xenomai upon return to root would re-set >>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>> set instead of cleared as expected. So, maybe an irqs off section is >>> missing in leave_mm. >> leave_mm is already protected by its caller smp_invalidate_interrupt - >> but now I'm parsing context_switch /wrt to lazy tlb. >> > > Hmm... lazy tlb: This means a new task is switched in and has active_mm > != mm. But do_page_fault reads task->mm... Just thoughts, no clear > picture yet. > Looking closer at the call sites of switch_mm, I think our the problem is mostly related to use_mm from fs/aio.c (customer is using aio heavily). But other callers need protection, too. We are going to test this patch tomorrow: diff --git a/fs/aio.c b/fs/aio.c index 76da125..d90fca3 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) { struct mm_struct *active_mm; struct task_struct *tsk = current; + unsigned long flags; task_lock(tsk); active_mm = tsk->active_mm; atomic_inc(&mm->mm_count); + local_irq_save_hw_cond(flags); tsk->mm = mm; tsk->active_mm = mm; switch_mm(active_mm, mm, tsk); + local_irq_restore_hw_cond(flags); task_unlock(tsk); mmdrop(active_mm); diff --git a/kernel/sched.c b/kernel/sched.c index aa8f86c..8c545a4 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, next->active_mm = oldmm; atomic_inc(&oldmm->mm_count); enter_lazy_tlb(oldmm, next); - } else + } else { + unsigned long flags; + local_irq_save_hw_cond(flags); switch_mm(oldmm, mm, next); + local_irq_restore_hw_cond(flags); + } if (unlikely(!prev->mm)) { prev->active_mm = NULL; @@ -6406,8 +6410,12 @@ void idle_task_exit(void) BUG_ON(cpu_online(smp_processor_id())); - if (mm != &init_mm) + if (mm != &init_mm) { + unsigned long flags; + local_irq_save_hw_cond(flags); switch_mm(mm, &init_mm, current); + local_irq_restore_hw_cond(flags); + } mmdrop(mm); } Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 17:56 ` Jan Kiszka @ 2009-07-01 18:15 ` Philippe Gerum 2009-07-01 18:27 ` Philippe Gerum 2009-07-01 18:56 ` Jan Kiszka 0 siblings, 2 replies; 45+ messages in thread From: Philippe Gerum @ 2009-07-01 18:15 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: > Jan Kiszka wrote: > > Jan Kiszka wrote: > >> Gilles Chanteperdrix wrote: > >>> Jan Kiszka wrote: > >>>> Jan Kiszka wrote: > >>>>> Gilles Chanteperdrix wrote: > >>>>>> Jan Kiszka wrote: > >>>>>>> Jan Kiszka wrote: > >>>>>>>> It's still unclear what goes on precisely, we are still digging, but the > >>>>>>>> test system that can produce this is highly contended. > >>>>>>> Short update: Further instrumentation revealed that cr3 differs from > >>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel > >>>>>>> tries to fixup the wrong mm. And that means we have some open race > >>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run > >>>>>>> in a preemptible manner now?). > >>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? > >>>>>> > >>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now > >>>>>>> checking if it makes a difference. Digging deeper into the code in the > >>>>>>> meanwhile... > >>>>>> As you have found out in the mean time, we do not use unlocked context > >>>>>> switches on x86. > >>>>>> > >>>>> Yes. > >>>>> > >>>>> The last question I asked myself (but couldn't answer yet due to other > >>>>> activity) was: Where are the local_irq_disable/enable_hw around > >>>>> switch_mm for its Linux callers? > >>>> Ha, that's the point: only activate_mm is protected, but we have more > >>>> spots in 2.6.29 and maybe other kernels, too! > >>> Ok, I do not see where switch_mm is called with IRQs off. What I found, > >> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. > >> Both need protection (I pushed IRQ disabling into switch_mm), but that > >> is not enough according to current tests. It seems to reduce to > >> probability of corruption, though. > >> > >>> however, is that leave_mm sets the cr3 and just clears > >>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between > >>> cr3 and active_mm. I do not know what could happen if Xenomai could > >>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I > >>> understand, switch_mm called by Xenomai upon return to root would re-set > >>> the bit, and re-set cr3, which would be set to the kernel cr3 right > >>> after that, but this would result in the active_mm.cpu_vm_mask bit being > >>> set instead of cleared as expected. So, maybe an irqs off section is > >>> missing in leave_mm. > >> leave_mm is already protected by its caller smp_invalidate_interrupt - > >> but now I'm parsing context_switch /wrt to lazy tlb. > >> > > > > Hmm... lazy tlb: This means a new task is switched in and has active_mm > > != mm. But do_page_fault reads task->mm... Just thoughts, no clear > > picture yet. > > > > Looking closer at the call sites of switch_mm, I think our the problem > is mostly related to use_mm from fs/aio.c (customer is using aio > heavily). But other callers need protection, too. We are going to test > this patch tomorrow: > > diff --git a/fs/aio.c b/fs/aio.c > index 76da125..d90fca3 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) > { > struct mm_struct *active_mm; > struct task_struct *tsk = current; > + unsigned long flags; > > task_lock(tsk); > active_mm = tsk->active_mm; > atomic_inc(&mm->mm_count); > + local_irq_save_hw_cond(flags); > tsk->mm = mm; > tsk->active_mm = mm; > switch_mm(active_mm, mm, tsk); > + local_irq_restore_hw_cond(flags); > task_unlock(tsk); > > mmdrop(active_mm); > diff --git a/kernel/sched.c b/kernel/sched.c > index aa8f86c..8c545a4 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, > next->active_mm = oldmm; > atomic_inc(&oldmm->mm_count); > enter_lazy_tlb(oldmm, next); > - } else > + } else { > + unsigned long flags; > + local_irq_save_hw_cond(flags); > switch_mm(oldmm, mm, next); > + local_irq_restore_hw_cond(flags); > + } > > if (unlikely(!prev->mm)) { > prev->active_mm = NULL; > @@ -6406,8 +6410,12 @@ void idle_task_exit(void) > > BUG_ON(cpu_online(smp_processor_id())); > > - if (mm != &init_mm) > + if (mm != &init_mm) { > + unsigned long flags; > + local_irq_save_hw_cond(flags); > switch_mm(mm, &init_mm, current); > + local_irq_restore_hw_cond(flags); > + } > mmdrop(mm); > } Please fix the callee instead of ironing the call sites. This would avoid further issues as upstream emits additional switch_mm calls over time, and make ironing activate_mm useless. > > > Jan > -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 18:15 ` Philippe Gerum @ 2009-07-01 18:27 ` Philippe Gerum 2009-07-01 18:58 ` Jan Kiszka 2009-07-02 2:05 ` Gilles Chanteperdrix 2009-07-01 18:56 ` Jan Kiszka 1 sibling, 2 replies; 45+ messages in thread From: Philippe Gerum @ 2009-07-01 18:27 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: > On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: > > Jan Kiszka wrote: > > > Jan Kiszka wrote: > > >> Gilles Chanteperdrix wrote: > > >>> Jan Kiszka wrote: > > >>>> Jan Kiszka wrote: > > >>>>> Gilles Chanteperdrix wrote: > > >>>>>> Jan Kiszka wrote: > > >>>>>>> Jan Kiszka wrote: > > >>>>>>>> It's still unclear what goes on precisely, we are still digging, but the > > >>>>>>>> test system that can produce this is highly contended. > > >>>>>>> Short update: Further instrumentation revealed that cr3 differs from > > >>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel > > >>>>>>> tries to fixup the wrong mm. And that means we have some open race > > >>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run > > >>>>>>> in a preemptible manner now?). > > >>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? > > >>>>>> > > >>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now > > >>>>>>> checking if it makes a difference. Digging deeper into the code in the > > >>>>>>> meanwhile... > > >>>>>> As you have found out in the mean time, we do not use unlocked context > > >>>>>> switches on x86. > > >>>>>> > > >>>>> Yes. > > >>>>> > > >>>>> The last question I asked myself (but couldn't answer yet due to other > > >>>>> activity) was: Where are the local_irq_disable/enable_hw around > > >>>>> switch_mm for its Linux callers? > > >>>> Ha, that's the point: only activate_mm is protected, but we have more > > >>>> spots in 2.6.29 and maybe other kernels, too! > > >>> Ok, I do not see where switch_mm is called with IRQs off. What I found, > > >> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. > > >> Both need protection (I pushed IRQ disabling into switch_mm), but that > > >> is not enough according to current tests. It seems to reduce to > > >> probability of corruption, though. > > >> > > >>> however, is that leave_mm sets the cr3 and just clears > > >>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between > > >>> cr3 and active_mm. I do not know what could happen if Xenomai could > > >>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I > > >>> understand, switch_mm called by Xenomai upon return to root would re-set > > >>> the bit, and re-set cr3, which would be set to the kernel cr3 right > > >>> after that, but this would result in the active_mm.cpu_vm_mask bit being > > >>> set instead of cleared as expected. So, maybe an irqs off section is > > >>> missing in leave_mm. > > >> leave_mm is already protected by its caller smp_invalidate_interrupt - > > >> but now I'm parsing context_switch /wrt to lazy tlb. > > >> > > > > > > Hmm... lazy tlb: This means a new task is switched in and has active_mm > > > != mm. But do_page_fault reads task->mm... Just thoughts, no clear > > > picture yet. > > > > > > > Looking closer at the call sites of switch_mm, I think our the problem > > is mostly related to use_mm from fs/aio.c (customer is using aio > > heavily). But other callers need protection, too. We are going to test > > this patch tomorrow: > > > > diff --git a/fs/aio.c b/fs/aio.c > > index 76da125..d90fca3 100644 > > --- a/fs/aio.c > > +++ b/fs/aio.c > > @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) > > { > > struct mm_struct *active_mm; > > struct task_struct *tsk = current; > > + unsigned long flags; > > > > task_lock(tsk); > > active_mm = tsk->active_mm; > > atomic_inc(&mm->mm_count); > > + local_irq_save_hw_cond(flags); > > tsk->mm = mm; > > tsk->active_mm = mm; > > switch_mm(active_mm, mm, tsk); > > + local_irq_restore_hw_cond(flags); > > task_unlock(tsk); > > > > mmdrop(active_mm); > > diff --git a/kernel/sched.c b/kernel/sched.c > > index aa8f86c..8c545a4 100644 > > --- a/kernel/sched.c > > +++ b/kernel/sched.c > > @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, > > next->active_mm = oldmm; > > atomic_inc(&oldmm->mm_count); > > enter_lazy_tlb(oldmm, next); > > - } else > > + } else { > > + unsigned long flags; > > + local_irq_save_hw_cond(flags); > > switch_mm(oldmm, mm, next); > > + local_irq_restore_hw_cond(flags); > > + } > > > > if (unlikely(!prev->mm)) { > > prev->active_mm = NULL; > > @@ -6406,8 +6410,12 @@ void idle_task_exit(void) > > > > BUG_ON(cpu_online(smp_processor_id())); > > > > - if (mm != &init_mm) > > + if (mm != &init_mm) { > > + unsigned long flags; > > + local_irq_save_hw_cond(flags); > > switch_mm(mm, &init_mm, current); > > + local_irq_restore_hw_cond(flags); > > + } > > mmdrop(mm); > > } > > Please fix the callee instead of ironing the call sites. This would > avoid further issues as upstream emits additional switch_mm calls over > time, and make ironing activate_mm useless. > Btw, this how the powerpc port works in locked switch mode, and this particular bug does not apply in unlocked switch mode anyway; this is why we don't have that problem on this arch. The ARM port always works in unlocked switch mode IIRC for latency reasons, so this should be a non-issue here as well. Gilles, would you confirm this? > > > > > > Jan > > -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 18:27 ` Philippe Gerum @ 2009-07-01 18:58 ` Jan Kiszka 2009-07-01 19:14 ` Jan Kiszka 2009-07-02 2:05 ` Gilles Chanteperdrix 1 sibling, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 18:58 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 5541 bytes --] Philippe Gerum wrote: > On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> Gilles Chanteperdrix wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>> Jan Kiszka wrote: >>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>>> in a preemptible manner now?). >>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>> >>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>>> meanwhile... >>>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>>> switches on x86. >>>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>>> switch_mm for its Linux callers? >>>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>>> is not enough according to current tests. It seems to reduce to >>>>> probability of corruption, though. >>>>> >>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>>> missing in leave_mm. >>>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>> >>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>>> picture yet. >>>> >>> Looking closer at the call sites of switch_mm, I think our the problem >>> is mostly related to use_mm from fs/aio.c (customer is using aio >>> heavily). But other callers need protection, too. We are going to test >>> this patch tomorrow: >>> >>> diff --git a/fs/aio.c b/fs/aio.c >>> index 76da125..d90fca3 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>> { >>> struct mm_struct *active_mm; >>> struct task_struct *tsk = current; >>> + unsigned long flags; >>> >>> task_lock(tsk); >>> active_mm = tsk->active_mm; >>> atomic_inc(&mm->mm_count); >>> + local_irq_save_hw_cond(flags); >>> tsk->mm = mm; >>> tsk->active_mm = mm; >>> switch_mm(active_mm, mm, tsk); >>> + local_irq_restore_hw_cond(flags); >>> task_unlock(tsk); >>> >>> mmdrop(active_mm); >>> diff --git a/kernel/sched.c b/kernel/sched.c >>> index aa8f86c..8c545a4 100644 >>> --- a/kernel/sched.c >>> +++ b/kernel/sched.c >>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >>> next->active_mm = oldmm; >>> atomic_inc(&oldmm->mm_count); >>> enter_lazy_tlb(oldmm, next); >>> - } else >>> + } else { >>> + unsigned long flags; >>> + local_irq_save_hw_cond(flags); >>> switch_mm(oldmm, mm, next); >>> + local_irq_restore_hw_cond(flags); >>> + } >>> >>> if (unlikely(!prev->mm)) { >>> prev->active_mm = NULL; >>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>> >>> BUG_ON(cpu_online(smp_processor_id())); >>> >>> - if (mm != &init_mm) >>> + if (mm != &init_mm) { >>> + unsigned long flags; >>> + local_irq_save_hw_cond(flags); >>> switch_mm(mm, &init_mm, current); >>> + local_irq_restore_hw_cond(flags); >>> + } >>> mmdrop(mm); >>> } >> Please fix the callee instead of ironing the call sites. This would >> avoid further issues as upstream emits additional switch_mm calls over >> time, and make ironing activate_mm useless. >> > > Btw, this how the powerpc port works in locked switch mode, and this > particular bug does not apply in unlocked switch mode anyway; this is > why we don't have that problem on this arch. The ARM port always works > in unlocked switch mode IIRC for latency reasons, so this should be a > non-issue here as well. Gilles, would you confirm this? The aio issue (if it turns out to be one, /me still needs to understand the concrete patch) should bite any arch, but we may solve it differently where switch_mm is actually reentrant. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 18:58 ` Jan Kiszka @ 2009-07-01 19:14 ` Jan Kiszka 0 siblings, 0 replies; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 19:14 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 6193 bytes --] Jan Kiszka wrote: > Philippe Gerum wrote: >> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> Gilles Chanteperdrix wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>>>> in a preemptible manner now?). >>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>>> >>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>>>> meanwhile... >>>>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>>>> switches on x86. >>>>>>>>>> >>>>>>>>> Yes. >>>>>>>>> >>>>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>>>> switch_mm for its Linux callers? >>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>>>> is not enough according to current tests. It seems to reduce to >>>>>> probability of corruption, though. >>>>>> >>>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>>>> missing in leave_mm. >>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>>> >>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>>>> picture yet. >>>>> >>>> Looking closer at the call sites of switch_mm, I think our the problem >>>> is mostly related to use_mm from fs/aio.c (customer is using aio >>>> heavily). But other callers need protection, too. We are going to test >>>> this patch tomorrow: >>>> >>>> diff --git a/fs/aio.c b/fs/aio.c >>>> index 76da125..d90fca3 100644 >>>> --- a/fs/aio.c >>>> +++ b/fs/aio.c >>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>> { >>>> struct mm_struct *active_mm; >>>> struct task_struct *tsk = current; >>>> + unsigned long flags; >>>> >>>> task_lock(tsk); >>>> active_mm = tsk->active_mm; >>>> atomic_inc(&mm->mm_count); >>>> + local_irq_save_hw_cond(flags); >>>> tsk->mm = mm; >>>> tsk->active_mm = mm; >>>> switch_mm(active_mm, mm, tsk); >>>> + local_irq_restore_hw_cond(flags); >>>> task_unlock(tsk); >>>> >>>> mmdrop(active_mm); >>>> diff --git a/kernel/sched.c b/kernel/sched.c >>>> index aa8f86c..8c545a4 100644 >>>> --- a/kernel/sched.c >>>> +++ b/kernel/sched.c >>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >>>> next->active_mm = oldmm; >>>> atomic_inc(&oldmm->mm_count); >>>> enter_lazy_tlb(oldmm, next); >>>> - } else >>>> + } else { >>>> + unsigned long flags; >>>> + local_irq_save_hw_cond(flags); >>>> switch_mm(oldmm, mm, next); >>>> + local_irq_restore_hw_cond(flags); >>>> + } >>>> >>>> if (unlikely(!prev->mm)) { >>>> prev->active_mm = NULL; >>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>>> >>>> BUG_ON(cpu_online(smp_processor_id())); >>>> >>>> - if (mm != &init_mm) >>>> + if (mm != &init_mm) { >>>> + unsigned long flags; >>>> + local_irq_save_hw_cond(flags); >>>> switch_mm(mm, &init_mm, current); >>>> + local_irq_restore_hw_cond(flags); >>>> + } >>>> mmdrop(mm); >>>> } >>> Please fix the callee instead of ironing the call sites. This would >>> avoid further issues as upstream emits additional switch_mm calls over >>> time, and make ironing activate_mm useless. >>> >> Btw, this how the powerpc port works in locked switch mode, and this >> particular bug does not apply in unlocked switch mode anyway; this is >> why we don't have that problem on this arch. The ARM port always works >> in unlocked switch mode IIRC for latency reasons, so this should be a >> non-issue here as well. Gilles, would you confirm this? > > The aio issue (if it turns out to be one, /me still needs to understand > the concrete patch) should bite any arch, but we may solve it > differently where switch_mm is actually reentrant. Took a beer to get my mind free, here is the potential race: -> aio thread runs with mm == init_mm -> aio thread calls into use_mm -> preemption by Xenomai right after tsk->active_mm = mm -> Xenomai switches from the aio thread to an RT thread with identical mm (definitely possible in our scenario) -> switch_mm will not update cr3 as prev == next -> next fault of the RT thread (e.g. in __xn_put_user...) will attempt to fix the right mm, but the CPU will continue to fail over init_mm Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 18:27 ` Philippe Gerum 2009-07-01 18:58 ` Jan Kiszka @ 2009-07-02 2:05 ` Gilles Chanteperdrix 2009-07-02 6:24 ` Jan Kiszka 1 sibling, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-02 2:05 UTC (permalink / raw) To: Philippe Gerum; +Cc: Jan Kiszka, xenomai-core Philippe Gerum wrote: > On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> Gilles Chanteperdrix wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>> Jan Kiszka wrote: >>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>>> in a preemptible manner now?). >>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>> >>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>>> meanwhile... >>>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>>> switches on x86. >>>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>>> switch_mm for its Linux callers? >>>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>>> is not enough according to current tests. It seems to reduce to >>>>> probability of corruption, though. >>>>> >>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>>> missing in leave_mm. >>>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>> >>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>>> picture yet. >>>> >>> Looking closer at the call sites of switch_mm, I think our the problem >>> is mostly related to use_mm from fs/aio.c (customer is using aio >>> heavily). But other callers need protection, too. We are going to test >>> this patch tomorrow: >>> >>> diff --git a/fs/aio.c b/fs/aio.c >>> index 76da125..d90fca3 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>> { >>> struct mm_struct *active_mm; >>> struct task_struct *tsk = current; >>> + unsigned long flags; >>> >>> task_lock(tsk); >>> active_mm = tsk->active_mm; >>> atomic_inc(&mm->mm_count); >>> + local_irq_save_hw_cond(flags); >>> tsk->mm = mm; >>> tsk->active_mm = mm; >>> switch_mm(active_mm, mm, tsk); >>> + local_irq_restore_hw_cond(flags); >>> task_unlock(tsk); >>> >>> mmdrop(active_mm); >>> diff --git a/kernel/sched.c b/kernel/sched.c >>> index aa8f86c..8c545a4 100644 >>> --- a/kernel/sched.c >>> +++ b/kernel/sched.c >>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >>> next->active_mm = oldmm; >>> atomic_inc(&oldmm->mm_count); >>> enter_lazy_tlb(oldmm, next); >>> - } else >>> + } else { >>> + unsigned long flags; >>> + local_irq_save_hw_cond(flags); >>> switch_mm(oldmm, mm, next); >>> + local_irq_restore_hw_cond(flags); >>> + } >>> >>> if (unlikely(!prev->mm)) { >>> prev->active_mm = NULL; >>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>> >>> BUG_ON(cpu_online(smp_processor_id())); >>> >>> - if (mm != &init_mm) >>> + if (mm != &init_mm) { >>> + unsigned long flags; >>> + local_irq_save_hw_cond(flags); >>> switch_mm(mm, &init_mm, current); >>> + local_irq_restore_hw_cond(flags); >>> + } >>> mmdrop(mm); >>> } >> Please fix the callee instead of ironing the call sites. This would >> avoid further issues as upstream emits additional switch_mm calls over >> time, and make ironing activate_mm useless. >> > > Btw, this how the powerpc port works in locked switch mode, and this > particular bug does not apply in unlocked switch mode anyway; this is > why we don't have that problem on this arch. The ARM port always works > in unlocked switch mode IIRC for latency reasons, so this should be a > non-issue here as well. Gilles, would you confirm this? Yes, on arm switch_mm always has irqs on. We handle the case where switch_mm was interrupted in the middle of its operations by restarting the mm switch after xenomai switches back to Linux. -- Gilles. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-02 2:05 ` Gilles Chanteperdrix @ 2009-07-02 6:24 ` Jan Kiszka 2009-07-02 6:59 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-02 6:24 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 6309 bytes --] Gilles Chanteperdrix wrote: > Philippe Gerum wrote: >> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> Gilles Chanteperdrix wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>>>> in a preemptible manner now?). >>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>>> >>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>>>> meanwhile... >>>>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>>>> switches on x86. >>>>>>>>>> >>>>>>>>> Yes. >>>>>>>>> >>>>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>>>> switch_mm for its Linux callers? >>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>>>> is not enough according to current tests. It seems to reduce to >>>>>> probability of corruption, though. >>>>>> >>>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>>>> missing in leave_mm. >>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>>> >>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>>>> picture yet. >>>>> >>>> Looking closer at the call sites of switch_mm, I think our the problem >>>> is mostly related to use_mm from fs/aio.c (customer is using aio >>>> heavily). But other callers need protection, too. We are going to test >>>> this patch tomorrow: >>>> >>>> diff --git a/fs/aio.c b/fs/aio.c >>>> index 76da125..d90fca3 100644 >>>> --- a/fs/aio.c >>>> +++ b/fs/aio.c >>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>> { >>>> struct mm_struct *active_mm; >>>> struct task_struct *tsk = current; >>>> + unsigned long flags; >>>> >>>> task_lock(tsk); >>>> active_mm = tsk->active_mm; >>>> atomic_inc(&mm->mm_count); >>>> + local_irq_save_hw_cond(flags); >>>> tsk->mm = mm; >>>> tsk->active_mm = mm; >>>> switch_mm(active_mm, mm, tsk); >>>> + local_irq_restore_hw_cond(flags); >>>> task_unlock(tsk); >>>> >>>> mmdrop(active_mm); >>>> diff --git a/kernel/sched.c b/kernel/sched.c >>>> index aa8f86c..8c545a4 100644 >>>> --- a/kernel/sched.c >>>> +++ b/kernel/sched.c >>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >>>> next->active_mm = oldmm; >>>> atomic_inc(&oldmm->mm_count); >>>> enter_lazy_tlb(oldmm, next); >>>> - } else >>>> + } else { >>>> + unsigned long flags; >>>> + local_irq_save_hw_cond(flags); >>>> switch_mm(oldmm, mm, next); >>>> + local_irq_restore_hw_cond(flags); >>>> + } >>>> >>>> if (unlikely(!prev->mm)) { >>>> prev->active_mm = NULL; >>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>>> >>>> BUG_ON(cpu_online(smp_processor_id())); >>>> >>>> - if (mm != &init_mm) >>>> + if (mm != &init_mm) { >>>> + unsigned long flags; >>>> + local_irq_save_hw_cond(flags); >>>> switch_mm(mm, &init_mm, current); >>>> + local_irq_restore_hw_cond(flags); >>>> + } >>>> mmdrop(mm); >>>> } >>> Please fix the callee instead of ironing the call sites. This would >>> avoid further issues as upstream emits additional switch_mm calls over >>> time, and make ironing activate_mm useless. >>> >> Btw, this how the powerpc port works in locked switch mode, and this >> particular bug does not apply in unlocked switch mode anyway; this is >> why we don't have that problem on this arch. The ARM port always works >> in unlocked switch mode IIRC for latency reasons, so this should be a >> non-issue here as well. Gilles, would you confirm this? > > Yes, on arm switch_mm always has irqs on. We handle the case where > switch_mm was interrupted in the middle of its operations by restarting > the mm switch after xenomai switches back to Linux. > On archs with non-atomic switch_mm(), use_mm() will require a different strategy. I'm thinking about something like use_mm(): set_some_flag(); barrier(); current->mm = new_mm; current->active_mm = new_mm; switch_mm(old_active_mm, new_mm, current); clear_some_flag(); and switch_mm(): ... if (likely(prev != next) || some_flag_set()) { clear_some_flag(); ... ie. enforce mm switch if we may have interrupted use_mm at an unpleasant time. I just don't know yet where to attach that some_flag to. Should be the current task, but can we always access it from switch_mm? Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-02 6:24 ` Jan Kiszka @ 2009-07-02 6:59 ` Gilles Chanteperdrix 2009-07-02 7:16 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-02 6:59 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Philippe Gerum wrote: >>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Gilles Chanteperdrix wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Jan Kiszka wrote: >>>>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>>>>> in a preemptible manner now?). >>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>>>> >>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>>>>> meanwhile... >>>>>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>>>>> switches on x86. >>>>>>>>>>> >>>>>>>>>> Yes. >>>>>>>>>> >>>>>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>>>>> switch_mm for its Linux callers? >>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>>>>> is not enough according to current tests. It seems to reduce to >>>>>>> probability of corruption, though. >>>>>>> >>>>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>>>>> missing in leave_mm. >>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>>>> >>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>>>>> picture yet. >>>>>> >>>>> Looking closer at the call sites of switch_mm, I think our the problem >>>>> is mostly related to use_mm from fs/aio.c (customer is using aio >>>>> heavily). But other callers need protection, too. We are going to test >>>>> this patch tomorrow: >>>>> >>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>> index 76da125..d90fca3 100644 >>>>> --- a/fs/aio.c >>>>> +++ b/fs/aio.c >>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>> { >>>>> struct mm_struct *active_mm; >>>>> struct task_struct *tsk = current; >>>>> + unsigned long flags; >>>>> >>>>> task_lock(tsk); >>>>> active_mm = tsk->active_mm; >>>>> atomic_inc(&mm->mm_count); >>>>> + local_irq_save_hw_cond(flags); >>>>> tsk->mm = mm; >>>>> tsk->active_mm = mm; >>>>> switch_mm(active_mm, mm, tsk); >>>>> + local_irq_restore_hw_cond(flags); >>>>> task_unlock(tsk); >>>>> >>>>> mmdrop(active_mm); >>>>> diff --git a/kernel/sched.c b/kernel/sched.c >>>>> index aa8f86c..8c545a4 100644 >>>>> --- a/kernel/sched.c >>>>> +++ b/kernel/sched.c >>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >>>>> next->active_mm = oldmm; >>>>> atomic_inc(&oldmm->mm_count); >>>>> enter_lazy_tlb(oldmm, next); >>>>> - } else >>>>> + } else { >>>>> + unsigned long flags; >>>>> + local_irq_save_hw_cond(flags); >>>>> switch_mm(oldmm, mm, next); >>>>> + local_irq_restore_hw_cond(flags); >>>>> + } >>>>> >>>>> if (unlikely(!prev->mm)) { >>>>> prev->active_mm = NULL; >>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>>>> >>>>> BUG_ON(cpu_online(smp_processor_id())); >>>>> >>>>> - if (mm != &init_mm) >>>>> + if (mm != &init_mm) { >>>>> + unsigned long flags; >>>>> + local_irq_save_hw_cond(flags); >>>>> switch_mm(mm, &init_mm, current); >>>>> + local_irq_restore_hw_cond(flags); >>>>> + } >>>>> mmdrop(mm); >>>>> } >>>> Please fix the callee instead of ironing the call sites. This would >>>> avoid further issues as upstream emits additional switch_mm calls over >>>> time, and make ironing activate_mm useless. >>>> >>> Btw, this how the powerpc port works in locked switch mode, and this >>> particular bug does not apply in unlocked switch mode anyway; this is >>> why we don't have that problem on this arch. The ARM port always works >>> in unlocked switch mode IIRC for latency reasons, so this should be a >>> non-issue here as well. Gilles, would you confirm this? >> Yes, on arm switch_mm always has irqs on. We handle the case where >> switch_mm was interrupted in the middle of its operations by restarting >> the mm switch after xenomai switches back to Linux. >> > > On archs with non-atomic switch_mm(), use_mm() will require a different > strategy. I'm thinking about something like > > use_mm(): > set_some_flag(); > barrier(); > current->mm = new_mm; > current->active_mm = new_mm; > switch_mm(old_active_mm, new_mm, current); > clear_some_flag(); > > and switch_mm(): > ... > if (likely(prev != next) || some_flag_set()) { > clear_some_flag(); > ... > > ie. enforce mm switch if we may have interrupted use_mm at an unpleasant > time. I just don't know yet where to attach that some_flag to. Should be > the current task, but can we always access it from switch_mm? These mechanisms are already in place. All you have to do is: use_mm() ipipe_active_mm = NULL; barrier(); current->mm = new_mm; current->active_mm = new_mm; switch_mm(old_active_mm, new_mm, current); -- Gilles. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-02 6:59 ` Gilles Chanteperdrix @ 2009-07-02 7:16 ` Jan Kiszka 2009-07-02 7:44 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-02 7:16 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 1149 bytes --] Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> On archs with non-atomic switch_mm(), use_mm() will require a different >> strategy. I'm thinking about something like >> >> use_mm(): >> set_some_flag(); >> barrier(); >> current->mm = new_mm; >> current->active_mm = new_mm; >> switch_mm(old_active_mm, new_mm, current); >> clear_some_flag(); >> >> and switch_mm(): >> ... >> if (likely(prev != next) || some_flag_set()) { >> clear_some_flag(); >> ... >> >> ie. enforce mm switch if we may have interrupted use_mm at an unpleasant >> time. I just don't know yet where to attach that some_flag to. Should be >> the current task, but can we always access it from switch_mm? > > These mechanisms are already in place. All you have to do is: > > use_mm() > ipipe_active_mm = NULL; > barrier(); > current->mm = new_mm; > current->active_mm = new_mm; > switch_mm(old_active_mm, new_mm, current); > As far as I understand, ipipe_active_mm has different semantics on ARM. Specifically, it has no relation to the initial "next != prev" check. The test we need is likely orthogonal to this. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-02 7:16 ` Jan Kiszka @ 2009-07-02 7:44 ` Gilles Chanteperdrix 0 siblings, 0 replies; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-02 7:44 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> On archs with non-atomic switch_mm(), use_mm() will require a different >>> strategy. I'm thinking about something like >>> >>> use_mm(): >>> set_some_flag(); >>> barrier(); >>> current->mm = new_mm; >>> current->active_mm = new_mm; >>> switch_mm(old_active_mm, new_mm, current); >>> clear_some_flag(); >>> >>> and switch_mm(): >>> ... >>> if (likely(prev != next) || some_flag_set()) { >>> clear_some_flag(); >>> ... >>> >>> ie. enforce mm switch if we may have interrupted use_mm at an >>> unpleasant >>> time. I just don't know yet where to attach that some_flag to. Should >>> be >>> the current task, but can we always access it from switch_mm? >> >> These mechanisms are already in place. All you have to do is: >> >> use_mm() >> ipipe_active_mm = NULL; >> barrier(); >> current->mm = new_mm; >> current->active_mm = new_mm; >> switch_mm(old_active_mm, new_mm, current); >> > > As far as I understand, ipipe_active_mm has different semantics on ARM. > Specifically, it has no relation to the initial "next != prev" check. > The test we need is likely orthogonal to this. Well except that prev is ipipe_active_mm, so if NULL next != prev is always true. ipipe_active_mm set to NULL means that the mm state is currently unknown, what it triggers is that if Xenomai switches at this moment, it will forcibly reload the mm state, and when switching back to Linux it will not touch the mm state, but mark the MMSWITCHINT flag in the current thread state, switch_mm will detect the flag and restart the mm switch. This mechanism is well tested and we would have had problems before if it was not working (the mm switch on ARM does a full cache flush which stands for about 100us, so the chances to be preempted at this moment are in fact pretty high). So, I still think that setting ipipe_active_mm to NULL at the beginning of use_mm is all what is needed, else I do not understand what you are talking about. -- Gilles. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 18:15 ` Philippe Gerum 2009-07-01 18:27 ` Philippe Gerum @ 2009-07-01 18:56 ` Jan Kiszka 2009-07-02 7:11 ` Philippe Gerum 1 sibling, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-01 18:56 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 4968 bytes --] Philippe Gerum wrote: > On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >> Jan Kiszka wrote: >>> Jan Kiszka wrote: >>>> Gilles Chanteperdrix wrote: >>>>> Jan Kiszka wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Gilles Chanteperdrix wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Jan Kiszka wrote: >>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>> in a preemptible manner now?). >>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>> >>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>> meanwhile... >>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>> switches on x86. >>>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>> switch_mm for its Linux callers? >>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>> is not enough according to current tests. It seems to reduce to >>>> probability of corruption, though. >>>> >>>>> however, is that leave_mm sets the cr3 and just clears >>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>> missing in leave_mm. >>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>> >>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>> picture yet. >>> >> Looking closer at the call sites of switch_mm, I think our the problem >> is mostly related to use_mm from fs/aio.c (customer is using aio >> heavily). But other callers need protection, too. We are going to test >> this patch tomorrow: >> >> diff --git a/fs/aio.c b/fs/aio.c >> index 76da125..d90fca3 100644 >> --- a/fs/aio.c >> +++ b/fs/aio.c >> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >> { >> struct mm_struct *active_mm; >> struct task_struct *tsk = current; >> + unsigned long flags; >> >> task_lock(tsk); >> active_mm = tsk->active_mm; >> atomic_inc(&mm->mm_count); >> + local_irq_save_hw_cond(flags); >> tsk->mm = mm; >> tsk->active_mm = mm; >> switch_mm(active_mm, mm, tsk); >> + local_irq_restore_hw_cond(flags); >> task_unlock(tsk); >> >> mmdrop(active_mm); >> diff --git a/kernel/sched.c b/kernel/sched.c >> index aa8f86c..8c545a4 100644 >> --- a/kernel/sched.c >> +++ b/kernel/sched.c >> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >> next->active_mm = oldmm; >> atomic_inc(&oldmm->mm_count); >> enter_lazy_tlb(oldmm, next); >> - } else >> + } else { >> + unsigned long flags; >> + local_irq_save_hw_cond(flags); >> switch_mm(oldmm, mm, next); >> + local_irq_restore_hw_cond(flags); >> + } >> >> if (unlikely(!prev->mm)) { >> prev->active_mm = NULL; >> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >> >> BUG_ON(cpu_online(smp_processor_id())); >> >> - if (mm != &init_mm) >> + if (mm != &init_mm) { >> + unsigned long flags; >> + local_irq_save_hw_cond(flags); >> switch_mm(mm, &init_mm, current); >> + local_irq_restore_hw_cond(flags); >> + } >> mmdrop(mm); >> } > > Please fix the callee instead of ironing the call sites. This would > avoid further issues as upstream emits additional switch_mm calls over > time, and make ironing activate_mm useless. This was my first idea, too, but it didn't work, see use_mm(). I don't think we will get around reviewing switch_mm users on kernel updates. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-01 18:56 ` Jan Kiszka @ 2009-07-02 7:11 ` Philippe Gerum 0 siblings, 0 replies; 45+ messages in thread From: Philippe Gerum @ 2009-07-02 7:11 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core On Wed, 2009-07-01 at 20:56 +0200, Jan Kiszka wrote: > Philippe Gerum wrote: > > On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: > >> Jan Kiszka wrote: > >>> Jan Kiszka wrote: > >>>> Gilles Chanteperdrix wrote: > >>>>> Jan Kiszka wrote: > >>>>>> Jan Kiszka wrote: > >>>>>>> Gilles Chanteperdrix wrote: > >>>>>>>> Jan Kiszka wrote: > >>>>>>>>> Jan Kiszka wrote: > >>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the > >>>>>>>>>> test system that can produce this is highly contended. > >>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from > >>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel > >>>>>>>>> tries to fixup the wrong mm. And that means we have some open race > >>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run > >>>>>>>>> in a preemptible manner now?). > >>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? > >>>>>>>> > >>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now > >>>>>>>>> checking if it makes a difference. Digging deeper into the code in the > >>>>>>>>> meanwhile... > >>>>>>>> As you have found out in the mean time, we do not use unlocked context > >>>>>>>> switches on x86. > >>>>>>>> > >>>>>>> Yes. > >>>>>>> > >>>>>>> The last question I asked myself (but couldn't answer yet due to other > >>>>>>> activity) was: Where are the local_irq_disable/enable_hw around > >>>>>>> switch_mm for its Linux callers? > >>>>>> Ha, that's the point: only activate_mm is protected, but we have more > >>>>>> spots in 2.6.29 and maybe other kernels, too! > >>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, > >>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. > >>>> Both need protection (I pushed IRQ disabling into switch_mm), but that > >>>> is not enough according to current tests. It seems to reduce to > >>>> probability of corruption, though. > >>>> > >>>>> however, is that leave_mm sets the cr3 and just clears > >>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between > >>>>> cr3 and active_mm. I do not know what could happen if Xenomai could > >>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I > >>>>> understand, switch_mm called by Xenomai upon return to root would re-set > >>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right > >>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being > >>>>> set instead of cleared as expected. So, maybe an irqs off section is > >>>>> missing in leave_mm. > >>>> leave_mm is already protected by its caller smp_invalidate_interrupt - > >>>> but now I'm parsing context_switch /wrt to lazy tlb. > >>>> > >>> Hmm... lazy tlb: This means a new task is switched in and has active_mm > >>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear > >>> picture yet. > >>> > >> Looking closer at the call sites of switch_mm, I think our the problem > >> is mostly related to use_mm from fs/aio.c (customer is using aio > >> heavily). But other callers need protection, too. We are going to test > >> this patch tomorrow: > >> > >> diff --git a/fs/aio.c b/fs/aio.c > >> index 76da125..d90fca3 100644 > >> --- a/fs/aio.c > >> +++ b/fs/aio.c > >> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) > >> { > >> struct mm_struct *active_mm; > >> struct task_struct *tsk = current; > >> + unsigned long flags; > >> > >> task_lock(tsk); > >> active_mm = tsk->active_mm; > >> atomic_inc(&mm->mm_count); > >> + local_irq_save_hw_cond(flags); > >> tsk->mm = mm; > >> tsk->active_mm = mm; > >> switch_mm(active_mm, mm, tsk); > >> + local_irq_restore_hw_cond(flags); > >> task_unlock(tsk); > >> > >> mmdrop(active_mm); > >> diff --git a/kernel/sched.c b/kernel/sched.c > >> index aa8f86c..8c545a4 100644 > >> --- a/kernel/sched.c > >> +++ b/kernel/sched.c > >> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, > >> next->active_mm = oldmm; > >> atomic_inc(&oldmm->mm_count); > >> enter_lazy_tlb(oldmm, next); > >> - } else > >> + } else { > >> + unsigned long flags; > >> + local_irq_save_hw_cond(flags); > >> switch_mm(oldmm, mm, next); > >> + local_irq_restore_hw_cond(flags); > >> + } > >> > >> if (unlikely(!prev->mm)) { > >> prev->active_mm = NULL; > >> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) > >> > >> BUG_ON(cpu_online(smp_processor_id())); > >> > >> - if (mm != &init_mm) > >> + if (mm != &init_mm) { > >> + unsigned long flags; > >> + local_irq_save_hw_cond(flags); > >> switch_mm(mm, &init_mm, current); > >> + local_irq_restore_hw_cond(flags); > >> + } > >> mmdrop(mm); > >> } > > > > Please fix the callee instead of ironing the call sites. This would > > avoid further issues as upstream emits additional switch_mm calls over > > time, and make ironing activate_mm useless. > > This was my first idea, too, but it didn't work, see use_mm(). I don't > think we will get around reviewing switch_mm users on kernel updates. > At the very least, you won't have to patch separately the spots that may work this way. Avoiding uselessly invasive changes is not a substitute for proper reviewing in any case. > Jan > -- Philippe. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka 2009-06-29 18:09 ` Philippe Gerum 2009-06-30 8:32 ` Jan Kiszka @ 2009-07-02 17:14 ` Jan Kiszka 2009-07-03 14:54 ` Gilles Chanteperdrix 2 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-02 17:14 UTC (permalink / raw) To: Philippe Gerum, Gilles Chanteperdrix; +Cc: xenomai-core Hi again, this is now basically the patch which seems to stabilized x86 /wrt mmu switches again: There were 3 race windows between setting active_mm of the current task and actually switching it (that's a noarch issue), there were several calls into switch_mm without proper hard interrupt protection (on archs that have no preemptible switch_mm, like x86) and there was a race in x86's leave_mm (as Gilles already remarked earlier in this thread - while I was looking at an old tree where smp_invalidate_interrupt took care of this). The patch is thought as a basis for further discussions about o how to solve all the issues for all archs technically (ideally without the need to patch noarch parts in an arch-specific way...) o if anyone thinks there could be more spots like these (I've checked the code only for x86 so far) Jan --------> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index dddfb84..d261b77 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -29,12 +29,17 @@ void destroy_context(struct mm_struct *mm); #define activate_mm(prev, next) \ do { \ - unsigned long flags; \ paravirt_activate_mm((prev), (next)); \ - local_irq_save_hw_cond(flags); \ - switch_mm((prev), (next), NULL); \ - local_irq_restore_hw_cond(flags); \ + __switch_mm((prev), (next), NULL); \ } while (0); +static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) +{ + unsigned long flags; + local_irq_save_hw_cond(flags); + __switch_mm(prev, next, tsk); + local_irq_restore_hw_cond(flags); +} #endif /* _ASM_X86_MMU_CONTEXT_H */ diff --git a/arch/x86/include/asm/mmu_context_32.h b/arch/x86/include/asm/mmu_context_32.h index 7e98ce1..e51cd09 100644 --- a/arch/x86/include/asm/mmu_context_32.h +++ b/arch/x86/include/asm/mmu_context_32.h @@ -9,12 +9,13 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) #endif } -static inline void switch_mm(struct mm_struct *prev, - struct mm_struct *next, - struct task_struct *tsk) +static inline void __switch_mm(struct mm_struct *prev, + struct mm_struct *next, + struct task_struct *tsk) { int cpu = smp_processor_id(); + WARN_ON_ONCE(!irqs_disabled_hw()); if (likely(prev != next)) { /* stop flush ipis for the previous mm */ cpu_clear(cpu, prev->cpu_vm_mask); diff --git a/arch/x86/include/asm/mmu_context_64.h b/arch/x86/include/asm/mmu_context_64.h index 677d36e..0118200 100644 --- a/arch/x86/include/asm/mmu_context_64.h +++ b/arch/x86/include/asm/mmu_context_64.h @@ -11,10 +11,12 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) #endif } -static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, - struct task_struct *tsk) +static inline void __switch_mm(struct mm_struct *prev, struct mm_struct *next, + struct task_struct *tsk) { unsigned cpu = smp_processor_id(); + + WARN_ON_ONCE(!irqs_disabled_hw()); if (likely(prev != next)) { /* stop flush ipis for the previous mm */ cpu_clear(cpu, prev->cpu_vm_mask); diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c index 08028a7..f68f2a6 100644 --- a/arch/x86/kernel/tlb_32.c +++ b/arch/x86/kernel/tlb_32.c @@ -34,9 +34,13 @@ static DEFINE_SPINLOCK(tlbstate_lock); */ void leave_mm(int cpu) { + unsigned long flags; + BUG_ON(x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK); + local_irq_save_hw_cond(flags); cpu_clear(cpu, x86_read_percpu(cpu_tlbstate.active_mm)->cpu_vm_mask); load_cr3(swapper_pg_dir); + local_irq_restore_hw_cond(flags); } EXPORT_SYMBOL_GPL(leave_mm); diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c index ce54e12..9829990 100644 --- a/arch/x86/kernel/tlb_64.c +++ b/arch/x86/kernel/tlb_64.c @@ -62,10 +62,14 @@ static DEFINE_PER_CPU(union smp_flush_state, flush_state); */ void leave_mm(int cpu) { + unsigned long flags; + if (read_pda(mmu_state) == TLBSTATE_OK) BUG(); + local_irq_save_hw_cond(flags); cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask); load_cr3(swapper_pg_dir); + local_irq_restore_hw_cond(flags); } EXPORT_SYMBOL_GPL(leave_mm); diff --git a/fs/aio.c b/fs/aio.c index 76da125..0286f0f 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) { struct mm_struct *active_mm; struct task_struct *tsk = current; + unsigned long flags; task_lock(tsk); active_mm = tsk->active_mm; atomic_inc(&mm->mm_count); tsk->mm = mm; + local_irq_save_hw_cond(flags); tsk->active_mm = mm; - switch_mm(active_mm, mm, tsk); + __switch_mm(active_mm, mm, tsk); + local_irq_restore_hw_cond(flags); task_unlock(tsk); mmdrop(active_mm); diff --git a/fs/exec.c b/fs/exec.c index 3b36c69..06591ac 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) { struct task_struct *tsk; struct mm_struct * old_mm, *active_mm; + unsigned long flags; /* Notify parent that we're no longer interested in the old VM */ tsk = current; @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) task_lock(tsk); active_mm = tsk->active_mm; tsk->mm = mm; + local_irq_save_hw_cond(flags); tsk->active_mm = mm; activate_mm(active_mm, mm); + local_irq_restore_hw_cond(flags); task_unlock(tsk); arch_pick_mmap_layout(mm); if (old_mm) { diff --git a/kernel/fork.c b/kernel/fork.c index 01a836b..cf3b68a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) } if (new_mm) { + unsigned long flags; mm = current->mm; active_mm = current->active_mm; current->mm = new_mm; + local_irq_save_hw_cond(flags); current->active_mm = new_mm; activate_mm(active_mm, new_mm); + local_irq_restore_hw_cond(flags); new_mm = mm; } -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-02 17:14 ` Jan Kiszka @ 2009-07-03 14:54 ` Gilles Chanteperdrix 2009-07-03 15:06 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-03 14:54 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Hi again, > > this is now basically the patch which seems to stabilized x86 /wrt mmu > switches again: > > There were 3 race windows between setting active_mm of the current task > and actually switching it (that's a noarch issue), there were several > calls into switch_mm without proper hard interrupt protection (on archs > that have no preemptible switch_mm, like x86) and there was a race in > x86's leave_mm (as Gilles already remarked earlier in this thread - > while I was looking at an old tree where smp_invalidate_interrupt took > care of this). > > The patch is thought as a basis for further discussions about > > o how to solve all the issues for all archs technically (ideally > without the need to patch noarch parts in an arch-specific way...) > > o if anyone thinks there could be more spots like these (I've checked > the code only for x86 so far) > diff --git a/fs/aio.c b/fs/aio.c > index 76da125..0286f0f 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) > { > struct mm_struct *active_mm; > struct task_struct *tsk = current; > + unsigned long flags; > > task_lock(tsk); > active_mm = tsk->active_mm; > atomic_inc(&mm->mm_count); > tsk->mm = mm; > + local_irq_save_hw_cond(flags); > tsk->active_mm = mm; > - switch_mm(active_mm, mm, tsk); > + __switch_mm(active_mm, mm, tsk); > + local_irq_restore_hw_cond(flags); > task_unlock(tsk); > > mmdrop(active_mm); > diff --git a/fs/exec.c b/fs/exec.c > index 3b36c69..06591ac 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) > { > struct task_struct *tsk; > struct mm_struct * old_mm, *active_mm; > + unsigned long flags; > > /* Notify parent that we're no longer interested in the old VM */ > tsk = current; > @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) > task_lock(tsk); > active_mm = tsk->active_mm; > tsk->mm = mm; > + local_irq_save_hw_cond(flags); > tsk->active_mm = mm; > activate_mm(active_mm, mm); > + local_irq_restore_hw_cond(flags); > task_unlock(tsk); > arch_pick_mmap_layout(mm); > if (old_mm) { > diff --git a/kernel/fork.c b/kernel/fork.c > index 01a836b..cf3b68a 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) > } > > if (new_mm) { > + unsigned long flags; > mm = current->mm; > active_mm = current->active_mm; > current->mm = new_mm; > + local_irq_save_hw_cond(flags); > current->active_mm = new_mm; > activate_mm(active_mm, new_mm); > + local_irq_restore_hw_cond(flags); > new_mm = mm; > } Ok. These are patches of the noarch code which are dependent on x86 implementation. I think we need something like mm_change_enter(flags); mm_change_leave(flags); which would resolve to ipipe_active_mm = NULL on architectures with unlocked context switches, and to local_irq_save_hw/local_irq_restore on x86? > > -- Gilles. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-03 14:54 ` Gilles Chanteperdrix @ 2009-07-03 15:06 ` Jan Kiszka 2009-07-04 16:39 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-03 15:06 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Hi again, >> >> this is now basically the patch which seems to stabilized x86 /wrt mmu >> switches again: >> >> There were 3 race windows between setting active_mm of the current task >> and actually switching it (that's a noarch issue), there were several >> calls into switch_mm without proper hard interrupt protection (on archs >> that have no preemptible switch_mm, like x86) and there was a race in >> x86's leave_mm (as Gilles already remarked earlier in this thread - >> while I was looking at an old tree where smp_invalidate_interrupt took >> care of this). >> >> The patch is thought as a basis for further discussions about >> >> o how to solve all the issues for all archs technically (ideally >> without the need to patch noarch parts in an arch-specific way...) >> >> o if anyone thinks there could be more spots like these (I've checked >> the code only for x86 so far) >> diff --git a/fs/aio.c b/fs/aio.c >> index 76da125..0286f0f 100644 >> --- a/fs/aio.c >> +++ b/fs/aio.c >> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >> { >> struct mm_struct *active_mm; >> struct task_struct *tsk = current; >> + unsigned long flags; >> >> task_lock(tsk); >> active_mm = tsk->active_mm; >> atomic_inc(&mm->mm_count); >> tsk->mm = mm; >> + local_irq_save_hw_cond(flags); >> tsk->active_mm = mm; >> - switch_mm(active_mm, mm, tsk); >> + __switch_mm(active_mm, mm, tsk); >> + local_irq_restore_hw_cond(flags); >> task_unlock(tsk); >> >> mmdrop(active_mm); >> diff --git a/fs/exec.c b/fs/exec.c >> index 3b36c69..06591ac 100644 >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >> { >> struct task_struct *tsk; >> struct mm_struct * old_mm, *active_mm; >> + unsigned long flags; >> >> /* Notify parent that we're no longer interested in the old VM */ >> tsk = current; >> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >> task_lock(tsk); >> active_mm = tsk->active_mm; >> tsk->mm = mm; >> + local_irq_save_hw_cond(flags); >> tsk->active_mm = mm; >> activate_mm(active_mm, mm); >> + local_irq_restore_hw_cond(flags); >> task_unlock(tsk); >> arch_pick_mmap_layout(mm); >> if (old_mm) { >> diff --git a/kernel/fork.c b/kernel/fork.c >> index 01a836b..cf3b68a 100644 >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >> } >> >> if (new_mm) { >> + unsigned long flags; >> mm = current->mm; >> active_mm = current->active_mm; >> current->mm = new_mm; >> + local_irq_save_hw_cond(flags); >> current->active_mm = new_mm; >> activate_mm(active_mm, new_mm); >> + local_irq_restore_hw_cond(flags); >> new_mm = mm; >> } > > Ok. These are patches of the noarch code which are dependent on x86 > implementation. > > I think we need something like > > mm_change_enter(flags); > mm_change_leave(flags); > > which would resolve to ipipe_active_mm = NULL on architectures with > unlocked context switches, and to local_irq_save_hw/local_irq_restore on > x86? + we have to change switch_mm in the lockless case, and maybe also its caller (if I look at the arm code), to use ipipe_active_mm in order to decide if to switch or not - see my explanation of the possible race in aio.c. Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-03 15:06 ` Jan Kiszka @ 2009-07-04 16:39 ` Gilles Chanteperdrix 2009-07-05 12:01 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-04 16:39 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Hi again, >>> >>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>> switches again: >>> >>> There were 3 race windows between setting active_mm of the current task >>> and actually switching it (that's a noarch issue), there were several >>> calls into switch_mm without proper hard interrupt protection (on archs >>> that have no preemptible switch_mm, like x86) and there was a race in >>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>> while I was looking at an old tree where smp_invalidate_interrupt took >>> care of this). >>> >>> The patch is thought as a basis for further discussions about >>> >>> o how to solve all the issues for all archs technically (ideally >>> without the need to patch noarch parts in an arch-specific way...) >>> >>> o if anyone thinks there could be more spots like these (I've checked >>> the code only for x86 so far) >>> diff --git a/fs/aio.c b/fs/aio.c >>> index 76da125..0286f0f 100644 >>> --- a/fs/aio.c >>> +++ b/fs/aio.c >>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>> { >>> struct mm_struct *active_mm; >>> struct task_struct *tsk = current; >>> + unsigned long flags; >>> >>> task_lock(tsk); >>> active_mm = tsk->active_mm; >>> atomic_inc(&mm->mm_count); >>> tsk->mm = mm; >>> + local_irq_save_hw_cond(flags); >>> tsk->active_mm = mm; >>> - switch_mm(active_mm, mm, tsk); >>> + __switch_mm(active_mm, mm, tsk); >>> + local_irq_restore_hw_cond(flags); >>> task_unlock(tsk); >>> >>> mmdrop(active_mm); >>> diff --git a/fs/exec.c b/fs/exec.c >>> index 3b36c69..06591ac 100644 >>> --- a/fs/exec.c >>> +++ b/fs/exec.c >>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>> { >>> struct task_struct *tsk; >>> struct mm_struct * old_mm, *active_mm; >>> + unsigned long flags; >>> >>> /* Notify parent that we're no longer interested in the old VM */ >>> tsk = current; >>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>> task_lock(tsk); >>> active_mm = tsk->active_mm; >>> tsk->mm = mm; >>> + local_irq_save_hw_cond(flags); >>> tsk->active_mm = mm; >>> activate_mm(active_mm, mm); >>> + local_irq_restore_hw_cond(flags); >>> task_unlock(tsk); >>> arch_pick_mmap_layout(mm); >>> if (old_mm) { >>> diff --git a/kernel/fork.c b/kernel/fork.c >>> index 01a836b..cf3b68a 100644 >>> --- a/kernel/fork.c >>> +++ b/kernel/fork.c >>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>> } >>> >>> if (new_mm) { >>> + unsigned long flags; >>> mm = current->mm; >>> active_mm = current->active_mm; >>> current->mm = new_mm; >>> + local_irq_save_hw_cond(flags); >>> current->active_mm = new_mm; >>> activate_mm(active_mm, new_mm); >>> + local_irq_restore_hw_cond(flags); >>> new_mm = mm; >>> } >> Ok. These are patches of the noarch code which are dependent on x86 >> implementation. >> >> I think we need something like >> >> mm_change_enter(flags); >> mm_change_leave(flags); >> >> which would resolve to ipipe_active_mm = NULL on architectures with >> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >> x86? > > + we have to change switch_mm in the lockless case, and maybe also its > caller (if I look at the arm code), to use ipipe_active_mm in order to > decide if to switch or not - see my explanation of the possible race in > aio.c. This is already the way it is currently working. the "prev" passed to switch_mm is always ipipe_active_mm. -- Gilles. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-04 16:39 ` Gilles Chanteperdrix @ 2009-07-05 12:01 ` Jan Kiszka 2009-07-05 14:56 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-05 12:01 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 4249 bytes --] Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >>>> Hi again, >>>> >>>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>>> switches again: >>>> >>>> There were 3 race windows between setting active_mm of the current task >>>> and actually switching it (that's a noarch issue), there were several >>>> calls into switch_mm without proper hard interrupt protection (on archs >>>> that have no preemptible switch_mm, like x86) and there was a race in >>>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>>> while I was looking at an old tree where smp_invalidate_interrupt took >>>> care of this). >>>> >>>> The patch is thought as a basis for further discussions about >>>> >>>> o how to solve all the issues for all archs technically (ideally >>>> without the need to patch noarch parts in an arch-specific way...) >>>> >>>> o if anyone thinks there could be more spots like these (I've checked >>>> the code only for x86 so far) >>>> diff --git a/fs/aio.c b/fs/aio.c >>>> index 76da125..0286f0f 100644 >>>> --- a/fs/aio.c >>>> +++ b/fs/aio.c >>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>> { >>>> struct mm_struct *active_mm; >>>> struct task_struct *tsk = current; >>>> + unsigned long flags; >>>> >>>> task_lock(tsk); >>>> active_mm = tsk->active_mm; >>>> atomic_inc(&mm->mm_count); >>>> tsk->mm = mm; >>>> + local_irq_save_hw_cond(flags); >>>> tsk->active_mm = mm; >>>> - switch_mm(active_mm, mm, tsk); >>>> + __switch_mm(active_mm, mm, tsk); >>>> + local_irq_restore_hw_cond(flags); >>>> task_unlock(tsk); >>>> >>>> mmdrop(active_mm); >>>> diff --git a/fs/exec.c b/fs/exec.c >>>> index 3b36c69..06591ac 100644 >>>> --- a/fs/exec.c >>>> +++ b/fs/exec.c >>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>>> { >>>> struct task_struct *tsk; >>>> struct mm_struct * old_mm, *active_mm; >>>> + unsigned long flags; >>>> >>>> /* Notify parent that we're no longer interested in the old VM */ >>>> tsk = current; >>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>>> task_lock(tsk); >>>> active_mm = tsk->active_mm; >>>> tsk->mm = mm; >>>> + local_irq_save_hw_cond(flags); >>>> tsk->active_mm = mm; >>>> activate_mm(active_mm, mm); >>>> + local_irq_restore_hw_cond(flags); >>>> task_unlock(tsk); >>>> arch_pick_mmap_layout(mm); >>>> if (old_mm) { >>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>> index 01a836b..cf3b68a 100644 >>>> --- a/kernel/fork.c >>>> +++ b/kernel/fork.c >>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>>> } >>>> >>>> if (new_mm) { >>>> + unsigned long flags; >>>> mm = current->mm; >>>> active_mm = current->active_mm; >>>> current->mm = new_mm; >>>> + local_irq_save_hw_cond(flags); >>>> current->active_mm = new_mm; >>>> activate_mm(active_mm, new_mm); >>>> + local_irq_restore_hw_cond(flags); >>>> new_mm = mm; >>>> } >>> Ok. These are patches of the noarch code which are dependent on x86 >>> implementation. >>> >>> I think we need something like >>> >>> mm_change_enter(flags); >>> mm_change_leave(flags); >>> >>> which would resolve to ipipe_active_mm = NULL on architectures with >>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >>> x86? >> + we have to change switch_mm in the lockless case, and maybe also its >> caller (if I look at the arm code), to use ipipe_active_mm in order to >> decide if to switch or not - see my explanation of the possible race in >> aio.c. > > This is already the way it is currently working. the "prev" passed to > switch_mm is always ipipe_active_mm. > Ah, I missed the difference in arm's xnarch_leave_root. Now I got it. OK, but this needs to be thought through in details again, specifically also the enter-root scenarios. I have no safe feeling about it yet, partly because switch_mm fiddles with ipipe_active_mm as well, but more generally because we used to have such critical races in this area for a "fairly" long time. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-05 12:01 ` Jan Kiszka @ 2009-07-05 14:56 ` Gilles Chanteperdrix 2009-07-05 17:12 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-05 14:56 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Gilles Chanteperdrix wrote: >>>> Jan Kiszka wrote: >>>>> Hi again, >>>>> >>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>>>> switches again: >>>>> >>>>> There were 3 race windows between setting active_mm of the current task >>>>> and actually switching it (that's a noarch issue), there were several >>>>> calls into switch_mm without proper hard interrupt protection (on archs >>>>> that have no preemptible switch_mm, like x86) and there was a race in >>>>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>>>> while I was looking at an old tree where smp_invalidate_interrupt took >>>>> care of this). >>>>> >>>>> The patch is thought as a basis for further discussions about >>>>> >>>>> o how to solve all the issues for all archs technically (ideally >>>>> without the need to patch noarch parts in an arch-specific way...) >>>>> >>>>> o if anyone thinks there could be more spots like these (I've checked >>>>> the code only for x86 so far) >>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>> index 76da125..0286f0f 100644 >>>>> --- a/fs/aio.c >>>>> +++ b/fs/aio.c >>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>> { >>>>> struct mm_struct *active_mm; >>>>> struct task_struct *tsk = current; >>>>> + unsigned long flags; >>>>> >>>>> task_lock(tsk); >>>>> active_mm = tsk->active_mm; >>>>> atomic_inc(&mm->mm_count); >>>>> tsk->mm = mm; >>>>> + local_irq_save_hw_cond(flags); >>>>> tsk->active_mm = mm; >>>>> - switch_mm(active_mm, mm, tsk); >>>>> + __switch_mm(active_mm, mm, tsk); >>>>> + local_irq_restore_hw_cond(flags); >>>>> task_unlock(tsk); >>>>> >>>>> mmdrop(active_mm); >>>>> diff --git a/fs/exec.c b/fs/exec.c >>>>> index 3b36c69..06591ac 100644 >>>>> --- a/fs/exec.c >>>>> +++ b/fs/exec.c >>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>>>> { >>>>> struct task_struct *tsk; >>>>> struct mm_struct * old_mm, *active_mm; >>>>> + unsigned long flags; >>>>> >>>>> /* Notify parent that we're no longer interested in the old VM */ >>>>> tsk = current; >>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>>>> task_lock(tsk); >>>>> active_mm = tsk->active_mm; >>>>> tsk->mm = mm; >>>>> + local_irq_save_hw_cond(flags); >>>>> tsk->active_mm = mm; >>>>> activate_mm(active_mm, mm); >>>>> + local_irq_restore_hw_cond(flags); >>>>> task_unlock(tsk); >>>>> arch_pick_mmap_layout(mm); >>>>> if (old_mm) { >>>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>>> index 01a836b..cf3b68a 100644 >>>>> --- a/kernel/fork.c >>>>> +++ b/kernel/fork.c >>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>>>> } >>>>> >>>>> if (new_mm) { >>>>> + unsigned long flags; >>>>> mm = current->mm; >>>>> active_mm = current->active_mm; >>>>> current->mm = new_mm; >>>>> + local_irq_save_hw_cond(flags); >>>>> current->active_mm = new_mm; >>>>> activate_mm(active_mm, new_mm); >>>>> + local_irq_restore_hw_cond(flags); >>>>> new_mm = mm; >>>>> } >>>> Ok. These are patches of the noarch code which are dependent on x86 >>>> implementation. >>>> >>>> I think we need something like >>>> >>>> mm_change_enter(flags); >>>> mm_change_leave(flags); >>>> >>>> which would resolve to ipipe_active_mm = NULL on architectures with >>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >>>> x86? >>> + we have to change switch_mm in the lockless case, and maybe also its >>> caller (if I look at the arm code), to use ipipe_active_mm in order to >>> decide if to switch or not - see my explanation of the possible race in >>> aio.c. >> This is already the way it is currently working. the "prev" passed to >> switch_mm is always ipipe_active_mm. >> > > Ah, I missed the difference in arm's xnarch_leave_root. Now I got it. > > OK, but this needs to be thought through in details again, specifically > also the enter-root scenarios. I have no safe feeling about it yet, > partly because switch_mm fiddles with ipipe_active_mm as well, but more > generally because we used to have such critical races in this area for a > "fairly" long time. When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit is set, in which case the switch_mm preempted by Xenomai is restarted. -- Gilles. ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-05 14:56 ` Gilles Chanteperdrix @ 2009-07-05 17:12 ` Jan Kiszka 2009-07-06 7:54 ` Gilles Chanteperdrix 0 siblings, 1 reply; 45+ messages in thread From: Jan Kiszka @ 2009-07-05 17:12 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 5555 bytes --] Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >>>> Gilles Chanteperdrix wrote: >>>>> Jan Kiszka wrote: >>>>>> Hi again, >>>>>> >>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>>>>> switches again: >>>>>> >>>>>> There were 3 race windows between setting active_mm of the current task >>>>>> and actually switching it (that's a noarch issue), there were several >>>>>> calls into switch_mm without proper hard interrupt protection (on archs >>>>>> that have no preemptible switch_mm, like x86) and there was a race in >>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>>>>> while I was looking at an old tree where smp_invalidate_interrupt took >>>>>> care of this). >>>>>> >>>>>> The patch is thought as a basis for further discussions about >>>>>> >>>>>> o how to solve all the issues for all archs technically (ideally >>>>>> without the need to patch noarch parts in an arch-specific way...) >>>>>> >>>>>> o if anyone thinks there could be more spots like these (I've checked >>>>>> the code only for x86 so far) >>>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>>> index 76da125..0286f0f 100644 >>>>>> --- a/fs/aio.c >>>>>> +++ b/fs/aio.c >>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>>> { >>>>>> struct mm_struct *active_mm; >>>>>> struct task_struct *tsk = current; >>>>>> + unsigned long flags; >>>>>> >>>>>> task_lock(tsk); >>>>>> active_mm = tsk->active_mm; >>>>>> atomic_inc(&mm->mm_count); >>>>>> tsk->mm = mm; >>>>>> + local_irq_save_hw_cond(flags); >>>>>> tsk->active_mm = mm; >>>>>> - switch_mm(active_mm, mm, tsk); >>>>>> + __switch_mm(active_mm, mm, tsk); >>>>>> + local_irq_restore_hw_cond(flags); >>>>>> task_unlock(tsk); >>>>>> >>>>>> mmdrop(active_mm); >>>>>> diff --git a/fs/exec.c b/fs/exec.c >>>>>> index 3b36c69..06591ac 100644 >>>>>> --- a/fs/exec.c >>>>>> +++ b/fs/exec.c >>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>>>>> { >>>>>> struct task_struct *tsk; >>>>>> struct mm_struct * old_mm, *active_mm; >>>>>> + unsigned long flags; >>>>>> >>>>>> /* Notify parent that we're no longer interested in the old VM */ >>>>>> tsk = current; >>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>>>>> task_lock(tsk); >>>>>> active_mm = tsk->active_mm; >>>>>> tsk->mm = mm; >>>>>> + local_irq_save_hw_cond(flags); >>>>>> tsk->active_mm = mm; >>>>>> activate_mm(active_mm, mm); >>>>>> + local_irq_restore_hw_cond(flags); >>>>>> task_unlock(tsk); >>>>>> arch_pick_mmap_layout(mm); >>>>>> if (old_mm) { >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>>>> index 01a836b..cf3b68a 100644 >>>>>> --- a/kernel/fork.c >>>>>> +++ b/kernel/fork.c >>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>>>>> } >>>>>> >>>>>> if (new_mm) { >>>>>> + unsigned long flags; >>>>>> mm = current->mm; >>>>>> active_mm = current->active_mm; >>>>>> current->mm = new_mm; >>>>>> + local_irq_save_hw_cond(flags); >>>>>> current->active_mm = new_mm; >>>>>> activate_mm(active_mm, new_mm); >>>>>> + local_irq_restore_hw_cond(flags); >>>>>> new_mm = mm; >>>>>> } >>>>> Ok. These are patches of the noarch code which are dependent on x86 >>>>> implementation. >>>>> >>>>> I think we need something like >>>>> >>>>> mm_change_enter(flags); >>>>> mm_change_leave(flags); >>>>> >>>>> which would resolve to ipipe_active_mm = NULL on architectures with >>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >>>>> x86? >>>> + we have to change switch_mm in the lockless case, and maybe also its >>>> caller (if I look at the arm code), to use ipipe_active_mm in order to >>>> decide if to switch or not - see my explanation of the possible race in >>>> aio.c. >>> This is already the way it is currently working. the "prev" passed to >>> switch_mm is always ipipe_active_mm. >>> >> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it. >> >> OK, but this needs to be thought through in details again, specifically >> also the enter-root scenarios. I have no safe feeling about it yet, >> partly because switch_mm fiddles with ipipe_active_mm as well, but more >> generally because we used to have such critical races in this area for a >> "fairly" long time. > > When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit > is set, in which case the switch_mm preempted by Xenomai is restarted. > Right. But now we are also dealing with preemptions outside that loop. And we are dealing with archs that have atomic switch_mm, thus no such loops. The goal should be to find a solution against the races outside switch_mm that is applicable to all archs. But the question is still unanswered to me if we can simply reuse existing ipipe_active_mm without disturbing its current use in non-atomic switch_mm. Moreover, part of the problems on x86 was the non-atomic update of cpu_vm_mask vs. actually switching the hardware. How do you deal with cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection? My impression is that only the (costly) cpu_switch_mm should be allowed to be preemptible to have a safe state around it. I'm also asking as we would have to address this on x86 & co. once switch_mm could be called with undefined prev_mm. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-05 17:12 ` Jan Kiszka @ 2009-07-06 7:54 ` Gilles Chanteperdrix 2009-07-07 18:45 ` Jan Kiszka 0 siblings, 1 reply; 45+ messages in thread From: Gilles Chanteperdrix @ 2009-07-06 7:54 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Jan Kiszka wrote: >>> Gilles Chanteperdrix wrote: >>>> Jan Kiszka wrote: >>>>> Gilles Chanteperdrix wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Hi again, >>>>>>> >>>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>>>>>> switches again: >>>>>>> >>>>>>> There were 3 race windows between setting active_mm of the current task >>>>>>> and actually switching it (that's a noarch issue), there were several >>>>>>> calls into switch_mm without proper hard interrupt protection (on archs >>>>>>> that have no preemptible switch_mm, like x86) and there was a race in >>>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>>>>>> while I was looking at an old tree where smp_invalidate_interrupt took >>>>>>> care of this). >>>>>>> >>>>>>> The patch is thought as a basis for further discussions about >>>>>>> >>>>>>> o how to solve all the issues for all archs technically (ideally >>>>>>> without the need to patch noarch parts in an arch-specific way...) >>>>>>> >>>>>>> o if anyone thinks there could be more spots like these (I've checked >>>>>>> the code only for x86 so far) >>>>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>>>> index 76da125..0286f0f 100644 >>>>>>> --- a/fs/aio.c >>>>>>> +++ b/fs/aio.c >>>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>>>> { >>>>>>> struct mm_struct *active_mm; >>>>>>> struct task_struct *tsk = current; >>>>>>> + unsigned long flags; >>>>>>> >>>>>>> task_lock(tsk); >>>>>>> active_mm = tsk->active_mm; >>>>>>> atomic_inc(&mm->mm_count); >>>>>>> tsk->mm = mm; >>>>>>> + local_irq_save_hw_cond(flags); >>>>>>> tsk->active_mm = mm; >>>>>>> - switch_mm(active_mm, mm, tsk); >>>>>>> + __switch_mm(active_mm, mm, tsk); >>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>> task_unlock(tsk); >>>>>>> >>>>>>> mmdrop(active_mm); >>>>>>> diff --git a/fs/exec.c b/fs/exec.c >>>>>>> index 3b36c69..06591ac 100644 >>>>>>> --- a/fs/exec.c >>>>>>> +++ b/fs/exec.c >>>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>>>>>> { >>>>>>> struct task_struct *tsk; >>>>>>> struct mm_struct * old_mm, *active_mm; >>>>>>> + unsigned long flags; >>>>>>> >>>>>>> /* Notify parent that we're no longer interested in the old VM */ >>>>>>> tsk = current; >>>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>>>>>> task_lock(tsk); >>>>>>> active_mm = tsk->active_mm; >>>>>>> tsk->mm = mm; >>>>>>> + local_irq_save_hw_cond(flags); >>>>>>> tsk->active_mm = mm; >>>>>>> activate_mm(active_mm, mm); >>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>> task_unlock(tsk); >>>>>>> arch_pick_mmap_layout(mm); >>>>>>> if (old_mm) { >>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>>>>> index 01a836b..cf3b68a 100644 >>>>>>> --- a/kernel/fork.c >>>>>>> +++ b/kernel/fork.c >>>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>>>>>> } >>>>>>> >>>>>>> if (new_mm) { >>>>>>> + unsigned long flags; >>>>>>> mm = current->mm; >>>>>>> active_mm = current->active_mm; >>>>>>> current->mm = new_mm; >>>>>>> + local_irq_save_hw_cond(flags); >>>>>>> current->active_mm = new_mm; >>>>>>> activate_mm(active_mm, new_mm); >>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>> new_mm = mm; >>>>>>> } >>>>>> Ok. These are patches of the noarch code which are dependent on x86 >>>>>> implementation. >>>>>> >>>>>> I think we need something like >>>>>> >>>>>> mm_change_enter(flags); >>>>>> mm_change_leave(flags); >>>>>> >>>>>> which would resolve to ipipe_active_mm = NULL on architectures with >>>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >>>>>> x86? >>>>> + we have to change switch_mm in the lockless case, and maybe also its >>>>> caller (if I look at the arm code), to use ipipe_active_mm in order to >>>>> decide if to switch or not - see my explanation of the possible race in >>>>> aio.c. >>>> This is already the way it is currently working. the "prev" passed to >>>> switch_mm is always ipipe_active_mm. >>>> >>> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it. >>> >>> OK, but this needs to be thought through in details again, specifically >>> also the enter-root scenarios. I have no safe feeling about it yet, >>> partly because switch_mm fiddles with ipipe_active_mm as well, but more >>> generally because we used to have such critical races in this area for a >>> "fairly" long time. >> When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit >> is set, in which case the switch_mm preempted by Xenomai is restarted. >> > > Right. But now we are also dealing with preemptions outside that loop. > And we are dealing with archs that have atomic switch_mm, thus no such > loops. > > The goal should be to find a solution against the races outside > switch_mm that is applicable to all archs. But the question is still > unanswered to me if we can simply reuse existing ipipe_active_mm without > disturbing its current use in non-atomic switch_mm. > > Moreover, part of the problems on x86 was the non-atomic update of > cpu_vm_mask vs. actually switching the hardware. How do you deal with > cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection? > My impression is that only the (costly) cpu_switch_mm should be allowed > to be preemptible to have a safe state around it. I'm also asking as we > would have to address this on x86 & co. once switch_mm could be called > with undefined prev_mm. when ipipe_active_mm is set to NULL, switch_mm does not touch the current mm cpu_vm_mask, since it does not know what the current mm is. However, we have a problem when switching back to root: we do not clear the previous mm cpu_vm_mask since we do not call switch_mm in xnarch_switch_to, we should fix this. So, I think we can extend the zone where ipipe_active_mm is set to NULL, to protect this area from the effects of preemption by Xenomai. So, IMO, what we need is something like ipipe_mm_switch_protect which sets ipipe_active_mm to NULL on ARM, and results in local_irq_save_hw on x86. > > Jan > -- Gilles ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Xenomai-core] x86: Endless minor faults 2009-07-06 7:54 ` Gilles Chanteperdrix @ 2009-07-07 18:45 ` Jan Kiszka 0 siblings, 0 replies; 45+ messages in thread From: Jan Kiszka @ 2009-07-07 18:45 UTC (permalink / raw) To: Gilles Chanteperdrix; +Cc: xenomai-core [-- Attachment #1: Type: text/plain, Size: 6695 bytes --] Gilles Chanteperdrix wrote: > Jan Kiszka wrote: >> Gilles Chanteperdrix wrote: >>> Jan Kiszka wrote: >>>> Gilles Chanteperdrix wrote: >>>>> Jan Kiszka wrote: >>>>>> Gilles Chanteperdrix wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Hi again, >>>>>>>> >>>>>>>> this is now basically the patch which seems to stabilized x86 /wrt mmu >>>>>>>> switches again: >>>>>>>> >>>>>>>> There were 3 race windows between setting active_mm of the current task >>>>>>>> and actually switching it (that's a noarch issue), there were several >>>>>>>> calls into switch_mm without proper hard interrupt protection (on archs >>>>>>>> that have no preemptible switch_mm, like x86) and there was a race in >>>>>>>> x86's leave_mm (as Gilles already remarked earlier in this thread - >>>>>>>> while I was looking at an old tree where smp_invalidate_interrupt took >>>>>>>> care of this). >>>>>>>> >>>>>>>> The patch is thought as a basis for further discussions about >>>>>>>> >>>>>>>> o how to solve all the issues for all archs technically (ideally >>>>>>>> without the need to patch noarch parts in an arch-specific way...) >>>>>>>> >>>>>>>> o if anyone thinks there could be more spots like these (I've checked >>>>>>>> the code only for x86 so far) >>>>>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>>>>> index 76da125..0286f0f 100644 >>>>>>>> --- a/fs/aio.c >>>>>>>> +++ b/fs/aio.c >>>>>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>>>>> { >>>>>>>> struct mm_struct *active_mm; >>>>>>>> struct task_struct *tsk = current; >>>>>>>> + unsigned long flags; >>>>>>>> >>>>>>>> task_lock(tsk); >>>>>>>> active_mm = tsk->active_mm; >>>>>>>> atomic_inc(&mm->mm_count); >>>>>>>> tsk->mm = mm; >>>>>>>> + local_irq_save_hw_cond(flags); >>>>>>>> tsk->active_mm = mm; >>>>>>>> - switch_mm(active_mm, mm, tsk); >>>>>>>> + __switch_mm(active_mm, mm, tsk); >>>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>>> task_unlock(tsk); >>>>>>>> >>>>>>>> mmdrop(active_mm); >>>>>>>> diff --git a/fs/exec.c b/fs/exec.c >>>>>>>> index 3b36c69..06591ac 100644 >>>>>>>> --- a/fs/exec.c >>>>>>>> +++ b/fs/exec.c >>>>>>>> @@ -718,6 +718,7 @@ static int exec_mmap(struct mm_struct *mm) >>>>>>>> { >>>>>>>> struct task_struct *tsk; >>>>>>>> struct mm_struct * old_mm, *active_mm; >>>>>>>> + unsigned long flags; >>>>>>>> >>>>>>>> /* Notify parent that we're no longer interested in the old VM */ >>>>>>>> tsk = current; >>>>>>>> @@ -740,8 +741,10 @@ static int exec_mmap(struct mm_struct *mm) >>>>>>>> task_lock(tsk); >>>>>>>> active_mm = tsk->active_mm; >>>>>>>> tsk->mm = mm; >>>>>>>> + local_irq_save_hw_cond(flags); >>>>>>>> tsk->active_mm = mm; >>>>>>>> activate_mm(active_mm, mm); >>>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>>> task_unlock(tsk); >>>>>>>> arch_pick_mmap_layout(mm); >>>>>>>> if (old_mm) { >>>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>>>>>> index 01a836b..cf3b68a 100644 >>>>>>>> --- a/kernel/fork.c >>>>>>>> +++ b/kernel/fork.c >>>>>>>> @@ -1665,11 +1665,14 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags) >>>>>>>> } >>>>>>>> >>>>>>>> if (new_mm) { >>>>>>>> + unsigned long flags; >>>>>>>> mm = current->mm; >>>>>>>> active_mm = current->active_mm; >>>>>>>> current->mm = new_mm; >>>>>>>> + local_irq_save_hw_cond(flags); >>>>>>>> current->active_mm = new_mm; >>>>>>>> activate_mm(active_mm, new_mm); >>>>>>>> + local_irq_restore_hw_cond(flags); >>>>>>>> new_mm = mm; >>>>>>>> } >>>>>>> Ok. These are patches of the noarch code which are dependent on x86 >>>>>>> implementation. >>>>>>> >>>>>>> I think we need something like >>>>>>> >>>>>>> mm_change_enter(flags); >>>>>>> mm_change_leave(flags); >>>>>>> >>>>>>> which would resolve to ipipe_active_mm = NULL on architectures with >>>>>>> unlocked context switches, and to local_irq_save_hw/local_irq_restore on >>>>>>> x86? >>>>>> + we have to change switch_mm in the lockless case, and maybe also its >>>>>> caller (if I look at the arm code), to use ipipe_active_mm in order to >>>>>> decide if to switch or not - see my explanation of the possible race in >>>>>> aio.c. >>>>> This is already the way it is currently working. the "prev" passed to >>>>> switch_mm is always ipipe_active_mm. >>>>> >>>> Ah, I missed the difference in arm's xnarch_leave_root. Now I got it. >>>> >>>> OK, but this needs to be thought through in details again, specifically >>>> also the enter-root scenarios. I have no safe feeling about it yet, >>>> partly because switch_mm fiddles with ipipe_active_mm as well, but more >>>> generally because we used to have such critical races in this area for a >>>> "fairly" long time. >>> When reentering root with ipipe_active_mm NULL, the TIF_MMSWTICH_INT bit >>> is set, in which case the switch_mm preempted by Xenomai is restarted. >>> >> Right. But now we are also dealing with preemptions outside that loop. >> And we are dealing with archs that have atomic switch_mm, thus no such >> loops. >> >> The goal should be to find a solution against the races outside >> switch_mm that is applicable to all archs. But the question is still >> unanswered to me if we can simply reuse existing ipipe_active_mm without >> disturbing its current use in non-atomic switch_mm. >> >> Moreover, part of the problems on x86 was the non-atomic update of >> cpu_vm_mask vs. actually switching the hardware. How do you deal with >> cpu_vm_mask on ARM? Why can you call switch_mm without IRQ protection? >> My impression is that only the (costly) cpu_switch_mm should be allowed >> to be preemptible to have a safe state around it. I'm also asking as we >> would have to address this on x86 & co. once switch_mm could be called >> with undefined prev_mm. > > when ipipe_active_mm is set to NULL, switch_mm does not touch the > current mm cpu_vm_mask, since it does not know what the current mm is. > However, we have a problem when switching back to root: we do not clear > the previous mm cpu_vm_mask since we do not call switch_mm in > xnarch_switch_to, we should fix this. So, I think we can extend the zone > where ipipe_active_mm is set to NULL, to protect this area from the > effects of preemption by Xenomai. > > So, IMO, what we need is something like ipipe_mm_switch_protect which > sets ipipe_active_mm to NULL on ARM, and results in local_irq_save_hw on > x86. That sounds like a good step forward. I will split up my patch into a non-arch part that instruments the code and an x86 part that installs the required protection for that arch. Jan [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 257 bytes --] ^ permalink raw reply [flat|nested] 45+ messages in thread
end of thread, other threads:[~2009-07-07 18:45 UTC | newest] Thread overview: 45+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-06-29 17:35 [Xenomai-core] x86: Endless minor faults Jan Kiszka 2009-06-29 18:09 ` Philippe Gerum 2009-06-29 18:20 ` Jan Kiszka 2009-06-30 8:32 ` Jan Kiszka 2009-06-30 8:36 ` Gilles Chanteperdrix 2009-06-30 8:44 ` Jan Kiszka 2009-06-30 8:42 ` Gilles Chanteperdrix 2009-06-30 8:56 ` Jan Kiszka 2009-06-30 9:20 ` Philippe Gerum 2009-06-30 9:21 ` Gilles Chanteperdrix 2009-06-30 9:25 ` Philippe Gerum 2009-06-30 9:26 ` Gilles Chanteperdrix 2009-06-30 9:27 ` Philippe Gerum 2009-06-30 9:34 ` Gilles Chanteperdrix 2009-06-30 16:11 ` Jan Kiszka 2009-07-01 11:56 ` Jan Kiszka 2009-07-01 12:05 ` Jan Kiszka 2009-07-01 12:24 ` Gilles Chanteperdrix 2009-07-01 12:39 ` Jan Kiszka 2009-07-01 12:41 ` Gilles Chanteperdrix 2009-07-01 12:41 ` Jan Kiszka 2009-07-01 15:51 ` Gilles Chanteperdrix 2009-07-01 16:01 ` Jan Kiszka 2009-07-01 16:04 ` Jan Kiszka 2009-07-01 17:56 ` Jan Kiszka 2009-07-01 18:15 ` Philippe Gerum 2009-07-01 18:27 ` Philippe Gerum 2009-07-01 18:58 ` Jan Kiszka 2009-07-01 19:14 ` Jan Kiszka 2009-07-02 2:05 ` Gilles Chanteperdrix 2009-07-02 6:24 ` Jan Kiszka 2009-07-02 6:59 ` Gilles Chanteperdrix 2009-07-02 7:16 ` Jan Kiszka 2009-07-02 7:44 ` Gilles Chanteperdrix 2009-07-01 18:56 ` Jan Kiszka 2009-07-02 7:11 ` Philippe Gerum 2009-07-02 17:14 ` Jan Kiszka 2009-07-03 14:54 ` Gilles Chanteperdrix 2009-07-03 15:06 ` Jan Kiszka 2009-07-04 16:39 ` Gilles Chanteperdrix 2009-07-05 12:01 ` Jan Kiszka 2009-07-05 14:56 ` Gilles Chanteperdrix 2009-07-05 17:12 ` Jan Kiszka 2009-07-06 7:54 ` Gilles Chanteperdrix 2009-07-07 18:45 ` Jan Kiszka
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.