* [PATCH] xen mmu: fix a race window causing leave_mm BUG()
@ 2011-04-29 4:10 Tian, Kevin
2011-05-10 20:27 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 6+ messages in thread
From: Tian, Kevin @ 2011-04-29 4:10 UTC (permalink / raw)
To: xen devel; +Cc: jeremy@goop.org, MaoXiaoyun
[-- Attachment #1: Type: text/plain, Size: 1121 bytes --]
xen mmu: fix a race window causing leave_mm BUG()
There is a race window in xen_drop_mm_ref, where a remote CPU may leave
the dirty bitmap between the check on this CPU and the point where the
remote CPU handles the drop request. So in drop_other_mm_ref we need to
check whether the TLB state is still lazy before calling into leave_mm.
This bug is rarely observed in earlier kernels, but is exacerbated by
commit 831d52bc153971b70e64eccfbed2b232394f22f8, which clears the bitmap
after changing the TLB state.
Thanks to Maxiaoyun <tinnycloud@hotmail.com> for verifying it.
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4e5a611..74c6e4a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1260,7 +1260,7 @@ static void drop_other_mm_ref(void *info)
 
 	active_mm = percpu_read(cpu_tlbstate.active_mm);
 
-	if (active_mm == mm)
+	if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
 		leave_mm(smp_processor_id());
 
 	/* If this cpu still has a stale cr3 reference, then make sure
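For readers without the kernel tree at hand, the effect of the one-line change can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct mm, old_guard(), new_guard() and would_bug() are hypothetical names that do not exist in the kernel, and the model assumes (as the subject line suggests) that leave_mm() hits BUG() when it is called while the TLB state is TLBSTATE_OK.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the kernel's TLBSTATE_* constants. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

/* Hypothetical stand-in for struct mm_struct; only pointer identity matters. */
struct mm { int unused; };

/* Guard before the patch: only compares the mm pointers. */
static bool old_guard(const struct mm *active_mm, const struct mm *mm, int state)
{
	(void)state;	/* the unpatched code never looks at the TLB state */
	return active_mm == mm;
}

/* Guard after the patch: additionally requires a non-OK (lazy) TLB state. */
static bool new_guard(const struct mm *active_mm, const struct mm *mm, int state)
{
	return active_mm == mm && state != TLBSTATE_OK;
}

/*
 * Model of the failure: leave_mm() is reached (the guard passed) while the
 * TLB state is TLBSTATE_OK, which is assumed to be the BUG() condition.
 */
static bool would_bug(bool guard_passed, int state)
{
	return guard_passed && state == TLBSTATE_OK;
}
```

With the same mm and a TLBSTATE_OK state, would_bug(old_guard(...)) is true and would_bug(new_guard(...)) is false; for a lazy CPU both guards still let leave_mm() run.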
[-- Attachment #2: 20100429_fix_leave_mm_bug.patch --]
[-- Type: application/octet-stream, Size: 1224 bytes --]
commit d49e9a336371c5ab171d9eccec922b0d0db9e67d
Author: Kevin Tian <kevin.tian@intel.com>
Date: Fri Apr 29 10:42:05 2011 +0800
xen mmu: fix a race window causing leave_mm BUG()
There is a race window in xen_drop_mm_ref, where a remote CPU may leave
the dirty bitmap between the check on this CPU and the point where the
remote CPU handles the drop request. So in drop_other_mm_ref we need to
check whether the TLB state is still lazy before calling into leave_mm.
This bug is rarely observed in earlier kernels, but is exacerbated by
commit 831d52bc153971b70e64eccfbed2b232394f22f8, which clears the bitmap
after changing the TLB state.
Thanks to Maxiaoyun <tinnycloud@hotmail.com> for verifying it.
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4e5a611..91c9527 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1260,7 +1260,7 @@ static void drop_other_mm_ref(void *info)
 
 	active_mm = percpu_read(cpu_tlbstate.active_mm);
 
-	if (active_mm == mm)
+	if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
 		leave_mm(smp_processor_id());
 
 	/* If this cpu still has a stale cr3 reference, then make sure
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
* Re: [PATCH] xen mmu: fix a race window causing leave_mm BUG()
2011-04-29 4:10 [PATCH] xen mmu: fix a race window causing leave_mm BUG() Tian, Kevin
@ 2011-05-10 20:27 ` Konrad Rzeszutek Wilk
2011-05-11 1:20 ` Tian, Kevin
0 siblings, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-10 20:27 UTC (permalink / raw)
To: Tian, Kevin; +Cc: jeremy@goop.org, xen devel, MaoXiaoyun
On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> xen mmu: fix a race window causing leave_mm BUG()
I have this in my mailbox and I am wondering whether this is still an issue with the 2.6.39-type kernels?
How do you reproduce the failure? When using LVM?
* RE: [PATCH] xen mmu: fix a race window causing leave_mm BUG()
2011-05-10 20:27 ` Konrad Rzeszutek Wilk
@ 2011-05-11 1:20 ` Tian, Kevin
2011-05-11 9:44 ` Ian Campbell
0 siblings, 1 reply; 6+ messages in thread
From: Tian, Kevin @ 2011-05-11 1:20 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: jeremy@goop.org, xen devel, MaoXiaoyun
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, May 11, 2011 4:27 AM
>
> On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > xen mmu: fix a race window causing leave_mm BUG()
>
> I've this in mailbox and I am wondering whether this still an issue with the
> 2.6.39 type kernels?
> How do you reproduce the failure? When using LVM?
This issue was reported by Xiaoyun. It occurred occasionally during his
extensive testing, after dozens of hours of running. From the symptoms and
the information provided by Xiaoyun, I found this potential race window, and
Xiaoyun has verified that this patch solves his stability issue.
The original thread is at:
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
His kernel is based on 2.6.38, and I checked the latest 2.6.39 from the repo
you maintain; the same issue still exists.
BTW, I didn't reproduce it myself, and I'm not sure whether Xiaoyun uses LVM.
But I think it has nothing to do with the storage type; it is purely an MMU
design issue.
Thanks
Kevin
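To make the race window described above concrete, here is a deterministic user-space replay of one interleaving that is consistent with the patch description. It is a sketch under loud assumptions, not the kernel code: struct remote_cpu and all function names are hypothetical, the cpumask check is reduced to a single boolean, and the real xen_drop_mm_ref/switch_mm sequence is more involved than these three steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the kernel's TLBSTATE_* constants. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

/* Hypothetical model of one remote CPU as seen by the requesting CPU. */
struct remote_cpu {
	int mm_id;		/* stands in for cpu_tlbstate.active_mm */
	int state;		/* stands in for cpu_tlbstate.state */
	bool in_bitmap;		/* this CPU's bit in the bitmap the requester checks */
	bool ipi_pending;	/* a drop request has been queued for this CPU */
};

/* Step 1 (requesting CPU): the bitmap check decides to send a drop request. */
static void requester_check(struct remote_cpu *r)
{
	if (r->in_bitmap)
		r->ipi_pending = true;
}

/* Step 2 (remote CPU): the TLB state stops being lazy inside the window. */
static void remote_leaves_lazy_mode(struct remote_cpu *r)
{
	r->state = TLBSTATE_OK;
}

/*
 * Step 3 (remote CPU): the drop request is handled. Returns true if the
 * modelled leave_mm() would have been entered with TLBSTATE_OK.
 */
static bool remote_handles_ipi(struct remote_cpu *r, int mm_id, bool patched)
{
	bool would_bug = false;

	if (!r->ipi_pending)
		return false;
	r->ipi_pending = false;

	if (r->mm_id == mm_id && (!patched || r->state != TLBSTATE_OK)) {
		if (r->state == TLBSTATE_OK)
			would_bug = true;	/* stale decision from step 1 */
		else
			r->in_bitmap = false;	/* normal lazy-mode drop */
	}
	return would_bug;
}

/* Replays steps 1-3 in the problematic order for one lazy remote CPU. */
static bool replay_race(bool patched)
{
	struct remote_cpu r = {
		.mm_id = 7, .state = TLBSTATE_LAZY,
		.in_bitmap = true, .ipi_pending = false,
	};

	requester_check(&r);
	remote_leaves_lazy_mode(&r);
	return remote_handles_ipi(&r, 7, patched);
}
```

In this model replay_race(false) reports the failure and replay_race(true) does not, because the patched handler re-checks the TLB state at the time the request is actually handled rather than trusting the earlier check.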
* RE: [PATCH] xen mmu: fix a race window causing leave_mm BUG()
2011-05-11 1:20 ` Tian, Kevin
@ 2011-05-11 9:44 ` Ian Campbell
2011-05-11 12:34 ` Tian, Kevin
0 siblings, 1 reply; 6+ messages in thread
From: Ian Campbell @ 2011-05-11 9:44 UTC (permalink / raw)
To: Tian, Kevin; +Cc: jeremy@goop.org, xen devel, MaoXiaoyun, Konrad Rzeszutek Wilk
On Wed, 2011-05-11 at 02:20 +0100, Tian, Kevin wrote:
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Wednesday, May 11, 2011 4:27 AM
> >
> > On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > > xen mmu: fix a race window causing leave_mm BUG()
> >
> > I've this in mailbox and I am wondering whether this still an issue with the
> > 2.6.39 type kernels?
> > How do you reproduce the failure? When using LVM?
>
> this issue is reported by Xiaoyun when he did extensive test which happened
> occasionally after dozen of hours running. From the phenomenon and info
> provided by Xiaoyun, I found this potential race window and Xiaoyun has
> verified this patch solving his stability issue.
>
> the original thread is at:
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
>
> his kernel is based on 2.6.38, and I checked latest 2.6.39 from your maintained
> repo, and same issue still exists.
>
> btw, I didn't reproduce it myself, and not sure whether Xiaoyun uses LVM. But
> I think it has nothing to do with storage type, and a pure mmu design issue.
Is there a specific stack trace (or two) which is associated with this
bug? I'm wondering if http://bugs.debian.org/613073 might be the same
thing...
Ian.
* RE: [PATCH] xen mmu: fix a race window causing leave_mm BUG()
2011-05-11 9:44 ` Ian Campbell
@ 2011-05-11 12:34 ` Tian, Kevin
2011-05-11 15:44 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 6+ messages in thread
From: Tian, Kevin @ 2011-05-11 12:34 UTC (permalink / raw)
To: Ian Campbell
Cc: jeremy@goop.org, xen devel, MaoXiaoyun, Konrad Rzeszutek Wilk
[-- Attachment #1: Type: text/plain, Size: 2090 bytes --]
> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Wednesday, May 11, 2011 5:44 PM
>
> On Wed, 2011-05-11 at 02:20 +0100, Tian, Kevin wrote:
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > Sent: Wednesday, May 11, 2011 4:27 AM
> > >
> > > On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > > > xen mmu: fix a race window causing leave_mm BUG()
> > >
> > > > I've this in mailbox and I am wondering whether this still an issue
> > > > with the 2.6.39 type kernels?
> > > How do you reproduce the failure? When using LVM?
> >
> > this issue is reported by Xiaoyun when he did extensive test which
> > happened occasionally after dozen of hours running. From the
> > phenomenon and info provided by Xiaoyun, I found this potential race
> > window and Xiaoyun has verified this patch solving his stability issue.
> >
> > the original thread is at:
> > > http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
> >
> > his kernel is based on 2.6.38, and I checked latest 2.6.39 from your
> > maintained repo, and same issue still exists.
> >
> > > btw, I didn't reproduce it myself, and not sure whether Xiaoyun uses
> > > LVM. But I think it has nothing to do with storage type, and a pure mmu
> > > design issue.
>
> Is there a specific stack trace (or two) which is associated with this bug? I'm
> wondering if http://bugs.debian.org/613073 might be the same thing...
>
If you look at the thread above:
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00657.html
[<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
[<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
[<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
[<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
[<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
[<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
[<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
[<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
...
Thanks
Kevin
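As a reading aid for the trace above: each frame has the form [<address>] symbol+0xoffset/0xsize, where the offset is the position inside the function and the size is the function's length. A minimal sketch of a parser for that format follows; struct frame and parse_frame() are hypothetical helper names introduced only for this illustration and are not part of any kernel tooling.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one decoded stack-trace frame. */
struct frame {
	unsigned long long addr;	/* return address printed in [<...>] */
	char symbol[64];		/* function name */
	unsigned int offset;		/* offset into the function */
	unsigned int size;		/* total size of the function */
};

/* Returns 1 if the line matches "[<addr>] symbol+0xoff/0xsize", else 0. */
static int parse_frame(const char *line, struct frame *f)
{
	int n = sscanf(line, " [<%llx>] %63[^+]+0x%x/0x%x",
		       &f->addr, f->symbol, &f->offset, &f->size);

	return n == 4;
}
```

Applied to the first frame, this yields the symbol drop_other_mm_ref with offset 0x2a into a function of size 0x53, i.e. the faulting call sits roughly halfway through drop_other_mm_ref.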
* Re: [PATCH] xen mmu: fix a race window causing leave_mm BUG()
2011-05-11 12:34 ` Tian, Kevin
@ 2011-05-11 15:44 ` Konrad Rzeszutek Wilk
0 siblings, 0 replies; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-05-11 15:44 UTC (permalink / raw)
To: Tian, Kevin; +Cc: jeremy@goop.org, xen devel, Ian Campbell, MaoXiaoyun
On Wed, May 11, 2011 at 08:34:46PM +0800, Tian, Kevin wrote:
> > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > Sent: Wednesday, May 11, 2011 5:44 PM
> >
> > On Wed, 2011-05-11 at 02:20 +0100, Tian, Kevin wrote:
> > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > > Sent: Wednesday, May 11, 2011 4:27 AM
> > > >
> > > > On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > > > > xen mmu: fix a race window causing leave_mm BUG()
> > > >
> > > > > I've this in mailbox and I am wondering whether this still an issue
> > > > > with the 2.6.39 type kernels?
> > > > How do you reproduce the failure? When using LVM?
> > >
> > > this issue is reported by Xiaoyun when he did extensive test which
> > > happened occasionally after dozen of hours running. From the
> > > phenomenon and info provided by Xiaoyun, I found this potential race
> > > window and Xiaoyun has verified this patch solving his stability issue.
> > >
> > > the original thread is at:
> > > > http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
> > >
> > > his kernel is based on 2.6.38, and I checked latest 2.6.39 from your
> > > maintained repo, and same issue still exists.
> > >
> > > > btw, I didn't reproduce it myself, and not sure whether Xiaoyun uses
> > > > LVM. But I think it has nothing to do with storage type, and a pure mmu
> > > > design issue.
> >
> > Is there a specific stack trace (or two) which is associated with this bug? I'm
> > wondering if http://bugs.debian.org/613073 might be the same thing...
> >
>
> If you look into above thread:
>
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00657.html
>
> [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
> [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
> [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
> [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
> [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
Can you resend the patch to me, based on top of v2.6.39-rc7, with the above
stack dump? And please resend it as an attachment. Your mailer mangles
the patch.