* [PATCH 0/2] Fix some machine check application recovery cases @ 2014-05-20 17:35 Tony Luck 2014-05-20 16:28 ` [PATCH 1/2] memory-failure: Send right signal code to correct thread Tony Luck 2014-05-20 16:46 ` [PATCH 2/2] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED Tony Luck 0 siblings, 2 replies; 31+ messages in thread From: Tony Luck @ 2014-05-20 17:35 UTC (permalink / raw) To: linux-kernel, linux-mm; +Cc: Andi Kleen, Borislav Petkov, Chen Gong Testing recovery in multi-threaded applications showed a couple of issues in our code. Tony Luck (2): memory-failure: Send right signal code to correct thread memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED mm/memory-failure.c | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) -- 1.8.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-20 17:35 [PATCH 0/2] Fix some machine check application recovery cases Tony Luck @ 2014-05-20 16:28 ` Tony Luck 2014-05-20 17:54 ` Naoya Horiguchi ` (2 more replies) 2014-05-20 16:46 ` [PATCH 2/2] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED Tony Luck 1 sibling, 3 replies; 31+ messages in thread From: Tony Luck @ 2014-05-20 16:28 UTC (permalink / raw) To: linux-kernel, linux-mm; +Cc: Andi Kleen, Borislav Petkov, Chen Gong When a thread in a multi-threaded application hits a machine check because of an uncorrectable error in memory - we want to send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that if the active thread is not the primary thread in the process. collect_procs() just finds primary threads and this test: if ((flags & MF_ACTION_REQUIRED) && t == current) { will see that the thread we found isn't the current thread and so send a si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active thread at this time). We can fix this by checking whether "current" shares the same mm with the process that collect_procs() said owned the page. If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). 
Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 35ef28acf137..642c8434b166 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -204,9 +204,9 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno,
 #endif
 	si.si_addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
-	if ((flags & MF_ACTION_REQUIRED) && t == current) {
+	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
 		si.si_code = BUS_MCEERR_AR;
-		ret = force_sig_info(SIGBUS, &si, t);
+		ret = force_sig_info(SIGBUS, &si, current);
 	} else {
 		/*
 		 * Don't use force here, it's convenient if the signal
-- 
1.8.4.1

^ permalink raw reply related [flat|nested] 31+ messages in thread
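A small user-space model of the test changed in the hunk above may help when reasoning about which thread receives which si_code. It is only a sketch: mm_model, task_model, pick_old() and pick_new() are hypothetical stand-ins, not kernel definitions, and only the two conditions are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct mm_model { int unused; };
struct task_model { int tid; const struct mm_model *mm; };

/* Models the two si_code values BUS_MCEERR_AO and BUS_MCEERR_AR. */
enum mce_code { MCEERR_AO, MCEERR_AR };

struct delivery {
	const struct task_model *target;	/* thread that would get SIGBUS */
	enum mce_code code;			/* si_code that would be used */
};

/* Old test: AR only if collect_procs() returned the faulting thread itself. */
static struct delivery pick_old(const struct task_model *t,
				const struct task_model *cur,
				bool action_required)
{
	struct delivery d = { t, MCEERR_AO };

	assert(t != NULL && cur != NULL);
	if (action_required && t == cur)
		d.code = MCEERR_AR;
	return d;
}

/* Patched test: AR goes to the faulting thread whenever it shares t's mm. */
static struct delivery pick_new(const struct task_model *t,
				const struct task_model *cur,
				bool action_required)
{
	struct delivery d = { t, MCEERR_AO };

	assert(t != NULL && cur != NULL);
	if (action_required && t->mm == cur->mm) {
		d.target = cur;
		d.code = MCEERR_AR;
	}
	return d;
}
```

With a primary thread and a worker that share one mm, and the worker as the faulting thread, pick_old() selects the primary thread with the AO code, while pick_new() selects the worker with the AR code. That is the behavior change the commit message describes.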
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-20 16:28 ` [PATCH 1/2] memory-failure: Send right signal code to correct thread Tony Luck @ 2014-05-20 17:54 ` Naoya Horiguchi [not found] ` <1400608486-alyqz521@n-horiguchi@ah.jp.nec.com> 2014-05-23 3:34 ` Chen, Gong 2 siblings, 0 replies; 31+ messages in thread From: Naoya Horiguchi @ 2014-05-20 17:54 UTC (permalink / raw) To: Tony Luck; +Cc: linux-kernel, linux-mm, Andi Kleen, bp, gong.chen On Tue, May 20, 2014 at 09:28:00AM -0700, Tony Luck wrote: > When a thread in a multi-threaded application hits a machine > check because of an uncorrectable error in memory - we want to > send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. > Currently we fail to do that if the active thread is not the > primary thread in the process. collect_procs() just finds primary > threads and this test: > if ((flags & MF_ACTION_REQUIRED) && t == current) { > will see that the thread we found isn't the current thread > and so send a si.si_code = BUS_MCEERR_AO to the primary > (and nothing to the active thread at this time). > > We can fix this by checking whether "current" shares the same > mm with the process that collect_procs() said owned the page. > If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). > > Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> > Signed-off-by: Tony Luck <tony.luck@intel.com> Looks good to me, thank you. Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> and I think this is worth going into stable trees. 
Naoya > --- > mm/memory-failure.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 35ef28acf137..642c8434b166 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -204,9 +204,9 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno, > #endif > si.si_addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT; > > - if ((flags & MF_ACTION_REQUIRED) && t == current) { > + if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { > si.si_code = BUS_MCEERR_AR; > - ret = force_sig_info(SIGBUS, &si, t); > + ret = force_sig_info(SIGBUS, &si, current); > } else { > /* > * Don't use force here, it's convenient if the signal > -- > 1.8.4.1 ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <1400608486-alyqz521@n-horiguchi@ah.jp.nec.com>]
* RE: [PATCH 1/2] memory-failure: Send right signal code to correct thread [not found] ` <1400608486-alyqz521@n-horiguchi@ah.jp.nec.com> @ 2014-05-20 20:56 ` Luck, Tony 0 siblings, 0 replies; 31+ messages in thread From: Luck, Tony @ 2014-05-20 20:56 UTC (permalink / raw) To: Naoya Horiguchi Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen, bp@suse.de, gong.chen@linux.jf.intel.com > Looks good to me, thank you. > Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Thanks for your time reviewing this > and I think this is worth going into stable trees. Good point. I should dig in the git history and make one of those fancy "Fixes: sha1 title" tags too. -Tony ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-20 16:28 ` [PATCH 1/2] memory-failure: Send right signal code to correct thread Tony Luck 2014-05-20 17:54 ` Naoya Horiguchi [not found] ` <1400608486-alyqz521@n-horiguchi@ah.jp.nec.com> @ 2014-05-23 3:34 ` Chen, Gong 2014-05-23 16:48 ` Tony Luck 2 siblings, 1 reply; 31+ messages in thread From: Chen, Gong @ 2014-05-23 3:34 UTC (permalink / raw) To: Tony Luck; +Cc: linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, Chen Gong [-- Attachment #1: Type: text/plain, Size: 2206 bytes --] On Tue, May 20, 2014 at 09:28:00AM -0700, Luck, Tony wrote: > When a thread in a multi-threaded application hits a machine > check because of an uncorrectable error in memory - we want to > send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. > Currently we fail to do that if the active thread is not the > primary thread in the process. collect_procs() just finds primary > threads and this test: > if ((flags & MF_ACTION_REQUIRED) && t == current) { > will see that the thread we found isn't the current thread > and so send a si.si_code = BUS_MCEERR_AO to the primary > (and nothing to the active thread at this time). > > We can fix this by checking whether "current" shares the same > mm with the process that collect_procs() said owned the page. > If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). 
> > Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> > Signed-off-by: Tony Luck <tony.luck@intel.com> > --- > mm/memory-failure.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 35ef28acf137..642c8434b166 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -204,9 +204,9 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno, > #endif > si.si_addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT; > > - if ((flags & MF_ACTION_REQUIRED) && t == current) { > + if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { > si.si_code = BUS_MCEERR_AR; > - ret = force_sig_info(SIGBUS, &si, t); > + ret = force_sig_info(SIGBUS, &si, current); > } else { > /* > * Don't use force here, it's convenient if the signal > -- > 1.8.4.1 Very interesting. I remembered there was a thread about AO error. Here is the link: http://www.spinics.net/lists/linux-mm/msg66653.html. According to this link, I have two concerns: 1) how to handle the similar scenario like it in this link. I mean once the main thread doesn't handle AR error but a thread does this, if SIGBUS can't be handled at once. 2) why that patch isn't merged. From that thread, Naoya should mean "acknowledge" :-). [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-23 3:34 ` Chen, Gong @ 2014-05-23 16:48 ` Tony Luck 2014-05-27 16:16 ` Kamil Iskra 0 siblings, 1 reply; 31+ messages in thread From: Tony Luck @ 2014-05-23 16:48 UTC (permalink / raw) To: Tony Luck, Linux Kernel Mailing List, linux-mm@kvack.org, Andi Kleen, Borislav Petkov, Chen Gong, iskra Added Kamil (hope I got the right one - the spinics.net archive obfuscates the e-mail addresses). >> - if ((flags & MF_ACTION_REQUIRED) && t == current) { >> + if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { >> si.si_code = BUS_MCEERR_AR; >> - ret = force_sig_info(SIGBUS, &si, t); >> + ret = force_sig_info(SIGBUS, &si, current); >> } else { >> /* >> * Don't use force here, it's convenient if the signal >> -- >> 1.8.4.1 > Very interesting. I remembered there was a thread about AO error. Here is > the link: http://www.spinics.net/lists/linux-mm/msg66653.html. > According to this link, I have two concerns: > > 1) how to handle the similar scenario like it in this link. I mean once > the main thread doesn't handle AR error but a thread does this, if SIGBUS > can't be handled at once. > 2) why that patch isn't merged. From that thread, Naoya should mean > "acknowledge" :-). That's an interesting thread ... and looks like it helps out in a case where there are only AO signals. But the "AR" case complicates things. Kamil points out at the start of the thread: > Also, do I understand it correctly that "action required" faults *must* be > handled by the thread that triggered the error? I guess it makes sense for > it to be that way, even if it circumvents the "dedicated handling thread" > idea... this is absolutely true ... in the BUS_MCEERR_AR case the current thread is executing an instruction that is attempting to consume poison data ... 
and we cannot let that instruction retire, so we have to signal that thread - if it can fix the problem by mapping a new page to the location that was lost, and refilling it with the right data - the handler can return to resume - otherwise it can longjmp() somewhere or exit. This means that the idea of having a multi-threaded application where just one thread has a SIGBUS handler and we gently steer the BUS_MCEERR_AO signals to that thread to be handled is flawed. Every thread needs to have a SIGBUS handler - so that we can handle the "AR" case. [Digression: what does happen to a process with a thread with no SIGBUS handler if we in fact send it a SIGBUS? Does just that thread die (default action for SIGBUS)? Or does the whole process get killed? If just one thread is terminated ... then perhaps someone could write a recovery aware application that worked like this - though it sounds like that would be working blindfold with one hand tied behind your back. How would the remaining threads know why their buddy just died? The siginfo_t describing the problem isn't available] If we want steerable AO signals to a dedicated thread - we'd have to use different signals for AO & AR. So every thread can have an AR handler, but just one have the AO handler. Or something more exotic with prctl to designate the preferred target for AO signals? Or just live with the fact that every thread needs a handler for AR ... and have the application internally pass AO activity from the thread that originally got the SIGBUS to some worker thread. -Tony ^ permalink raw reply [flat|nested] 31+ messages in thread
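The last option listed above (whichever thread gets the SIGBUS handles AR itself and hands AO events to a worker) could look roughly like the user-space sketch below. The names ao_pipe, sigbus_handler(), install_sigbus_forwarding() and read_ao_address() are made up for illustration. The AR branch simply exits, because real recovery (remapping and refilling the page, or a siglongjmp()) is application specific. One detail worth keeping in mind: sigaction() sets a process-wide disposition, so a single call installs the handler for all threads.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for old headers; values as in the Linux siginfo definitions. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Illustrative channel from the signal handler to a worker thread. */
static int ao_pipe[2] = { -1, -1 };

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	int saved_errno = errno;

	(void)sig;
	(void)ctx;
	if (si->si_code == BUS_MCEERR_AO) {
		/* Action optional: queue the address and resume. */
		uintptr_t addr = (uintptr_t)si->si_addr;
		ssize_t n = write(ao_pipe[1], &addr, sizeof addr);

		(void)n;	/* nothing useful can be done on failure here */
		errno = saved_errno;
		return;
	}
	/*
	 * BUS_MCEERR_AR (or any other SIGBUS): the faulting instruction
	 * must not be retried as-is. This sketch just gives up.
	 */
	_exit(1);
}

/* Creates the pipe and installs the handler; returns 0 on success. */
static int install_sigbus_forwarding(void)
{
	struct sigaction sa;

	if (pipe(ao_pipe) != 0)
		return -1;
	memset(&sa, 0, sizeof sa);
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Called by the worker thread: blocks until one AO address arrives. */
static int read_ao_address(uintptr_t *addr)
{
	assert(addr != NULL);
	if (read(ao_pipe[0], addr, sizeof *addr) != (ssize_t)sizeof *addr)
		return -1;
	return 0;
}
```

A worker thread would loop on read_ao_address() and do the slow part (dropping caches, re-reading data from disk) outside signal context, which keeps the handler restricted to async-signal-safe calls.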
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-23 16:48 ` Tony Luck @ 2014-05-27 16:16 ` Kamil Iskra 2014-05-27 17:50 ` Naoya Horiguchi [not found] ` <5384d07e.4504e00a.2680.ffff8c31SMTPIN_ADDED_BROKEN@mx.google.com> 0 siblings, 2 replies; 31+ messages in thread From: Kamil Iskra @ 2014-05-27 16:16 UTC (permalink / raw) To: Tony Luck Cc: Tony Luck, Linux Kernel Mailing List, linux-mm@kvack.org, Andi Kleen, Borislav Petkov, Chen Gong On Fri, May 23, 2014 at 09:48:42 -0700, Tony Luck wrote: Tony, > Added Kamil (hope I got the right one - the spinics.net archive obfuscates > the e-mail addresses). Yes, you got the right address :-). > >> - if ((flags & MF_ACTION_REQUIRED) && t == current) { > >> + if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { > >> si.si_code = BUS_MCEERR_AR; > >> - ret = force_sig_info(SIGBUS, &si, t); > >> + ret = force_sig_info(SIGBUS, &si, current); > >> } else { > >> /* > >> * Don't use force here, it's convenient if the signal > >> -- > >> 1.8.4.1 > > Very interesting. I remembered there was a thread about AO error. Here is > > the link: http://www.spinics.net/lists/linux-mm/msg66653.html. > > According to this link, I have two concerns: > > > > 1) how to handle the similar scenario like it in this link. I mean once > > the main thread doesn't handle AR error but a thread does this, if SIGBUS > > can't be handled at once. > > 2) why that patch isn't merged. From that thread, Naoya should mean > > "acknowledge" :-). > That's an interesting thread ... and looks like it helps out in a case > where there are only AO signals. Unfortunately, I got distracted by other pressing work at the time and didn't follow up on my patch/didn't follow the correct kernel workflow on patch submission procedures. 
I haven't checked any developments in that area so I don't even know if my patch is still applicable -- do you think it makes sense for me to revisit the issue at this time, or will the patch that you are working on make my old patch redundant? > But the "AR" case complicates things. Kamil points out at the start > of the thread: > > Also, do I understand it correctly that "action required" faults *must* be > > handled by the thread that triggered the error? I guess it makes sense for > > it to be that way, even if it circumvents the "dedicated handling thread" > > idea... > this is absolutely true ... in the BUS_MCEERR_AR case the current > thread is executing an instruction that is attempting to consume poison > data ... and we cannot let that instruction retire, so we have to signal that > thread - if it can fix the problem by mapping a new page to the location > that was lost, and refilling it with the right data - the handler can return > to resume - otherwise it can longjmp() somewhere or exit. Exactly. > This means that the idea of having a multi-threaded application where > just one thread has a SIGBUS handler and we gently steer the > BUS_MCEERR_AO signals to that thread to be handled is flawed. > Every thread needs to have a SIGBUS handler - so that we can handle > the "AR" case. [Digression: what does happen to a process with a thread > with no SIGBUS handler if we in fact send it a SIGBUS? Does just that > thread die (default action for SIGBUS)? Or does the whole process get > killed? If just one thread is terminated ... then perhaps someone could > write a recovery aware application that worked like this - though it sounds > like that would be working blindfold with one hand tied behind your back. > How would the remaining threads know why their buddy just died? The > siginfo_t describing the problem isn't available] I believe I experimented with this and the whole process would get killed. 
> If we want steerable AO signals to a dedicated thread - we'd have to > use different signals for AO & AR. So every thread can have an AR > handler, but just one have the AO handler. Or something more exotic > with prctl to designate the preferred target for AO signals? > > Or just live with the fact that every thread needs a handler for AR ... > and have the application internally pass AO activity from the > thread that originally got the SIGBUS to some worker thread. Yes, you make a very valid point that my patch was not complete... but then, neither was what was there before it. So my patch was only an incremental improvement, enough to play with when artificially injecting fault events, but not enough to *really* solve the problem. If you have a complete solution in mind instead, that would be great. Kamil -- Kamil Iskra, PhD Argonne National Laboratory, Mathematics and Computer Science Division 9700 South Cass Avenue, Building 240, Argonne, IL 60439, USA phone: +1-630-252-7197 fax: +1-630-252-5986 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-27 16:16 ` Kamil Iskra @ 2014-05-27 17:50 ` Naoya Horiguchi [not found] ` <5384d07e.4504e00a.2680.ffff8c31SMTPIN_ADDED_BROKEN@mx.google.com> 1 sibling, 0 replies; 31+ messages in thread From: Naoya Horiguchi @ 2014-05-27 17:50 UTC (permalink / raw) To: iskra Cc: tony.luck, Tony Luck, linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, gong.chen On Tue, May 27, 2014 at 11:16:13AM -0500, Kamil Iskra wrote: > On Fri, May 23, 2014 at 09:48:42 -0700, Tony Luck wrote: > > Tony, > > > Added Kamil (hope I got the right one - the spinics.net archive obfuscates > > the e-mail addresses). > > Yes, you got the right address :-). > > > >> - if ((flags & MF_ACTION_REQUIRED) && t == current) { > > >> + if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { > > >> si.si_code = BUS_MCEERR_AR; > > >> - ret = force_sig_info(SIGBUS, &si, t); > > >> + ret = force_sig_info(SIGBUS, &si, current); > > >> } else { > > >> /* > > >> * Don't use force here, it's convenient if the signal > > >> -- > > >> 1.8.4.1 > > > Very interesting. I remembered there was a thread about AO error. Here is > > > the link: http://www.spinics.net/lists/linux-mm/msg66653.html. > > > According to this link, I have two concerns: > > > > > > 1) how to handle the similar scenario like it in this link. I mean once > > > the main thread doesn't handle AR error but a thread does this, if SIGBUS > > > can't be handled at once. > > > 2) why that patch isn't merged. From that thread, Naoya should mean > > > "acknowledge" :-). > > That's an interesting thread ... and looks like it helps out in a case > > where there are only AO signals. > > Unfortunately, I got distracted by other pressing work at the time and > didn't follow up on my patch/didn't follow the correct kernel workflow on > patch submission procedures. 
I haven't checked any developments in that > area so I don't even know if my patch is still applicable -- do you think > it makes sense for me to revisit the issue at this time, or will the patch > that you are working on make my old patch redundant? > > > But the "AR" case complicates things. Kamil points out at the start > > of the thread: > > > Also, do I understand it correctly that "action required" faults *must* be > > > handled by the thread that triggered the error? I guess it makes sense for > > > it to be that way, even if it circumvents the "dedicated handling thread" > > > idea... > > this is absolutely true ... in the BUS_MCEERR_AR case the current > > thread is executing an instruction that is attempting to consume poison > > data ... and we cannot let that instruction retire, so we have to signal that > > thread - if it can fix the problem by mapping a new page to the location > > that was lost, and refilling it with the right data - the handler can return > > to resume - otherwise it can longjmp() somewhere or exit. > > Exactly. > > > This means that the idea of having a multi-threaded application where > > just one thread has a SIGBUS handler and we gently steer the > > BUS_MCEERR_AO signals to that thread to be handled is flawed. > > Every thread needs to have a SIGBUS handler - so that we can handle > > the "AR" case. [Digression: what does happen to a process with a thread > > with no SIGBUS handler if we in fact send it a SIGBUS? Does just that > > thread die (default action for SIGBUS)? Or does the whole process get > > killed? If just one thread is terminated ... then perhaps someone could > > write a recovery aware application that worked like this - though it sounds > > like that would be working blindfold with one hand tied behind your back. > > How would the remaining threads know why their buddy just died? 
The > > siginfo_t describing the problem isn't available] > > I believe I experimented with this and the whole process would get killed. > > > If we want steerable AO signals to a dedicated thread - we'd have to > > use different signals for AO & AR. I think that a user process can distinguish which kind of SIGBUS it got via si_code in the siginfo_t passed to its handler, so we don't need different signals. If that's right, would the following solve Kamil's problem? - apply Kamil's patch - make sure that every thread in a recovery aware application should have a SIGBUS handler, inside which * code for SIGBUS(BUS_MCEERR_AR) is enabled for every thread * code for SIGBUS(BUS_MCEERR_AO) is enabled only for a dedicated thread One concern is that with Kamil's patch, some existing user who expects that only the main thread of an "early kill" process receives SIGBUS(BUS_MCEERR_AO) could be surprised by this change, because other threads start to get SIGBUS and if those threads are not prepared for it, they're just killed (IOW, the behavior of these threads could change.) A good example is qemu: is it safe from Kamil's change? Thanks, Naoya Horiguchi > So every thread can have an AR > > handler, but just one have the AO handler. Or something more exotic > > with prctl to designate the preferred target for AO signals? > > > > Or just live with the fact that every thread needs a handler for AR ... > > and have the application internally pass AO activity from the > > thread that originally got the SIGBUS to some worker thread. > > Yes, you make a very valid point that my patch was not complete... but > then, neither was what was there before it. So my patch was only an > incremental improvement, enough to play with when artificially injecting > fault events, but not enough to *really* solve the problem. If you have a > complete solution in mind instead, that would be great. ^ permalink raw reply [flat|nested] 31+ messages in thread
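The scheme proposed in this message (AR code enabled in every thread, AO code enabled only in one dedicated thread, both selected by si_code) can be sketched as a pure classification step. This is an illustration under assumptions, not code from the thread: the enum, the is_ao_thread flag and classify_sigbus() are hypothetical names, and si_code is read from the siginfo_t that a handler installed with SA_SIGINFO receives.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Fallbacks for old headers; values as in the Linux siginfo definitions. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

enum sigbus_action {
	ACT_RECOVER_AR,	/* every thread: must deal with the fault now */
	ACT_HANDLE_AO,	/* dedicated thread: process the advisory event */
	ACT_IGNORE_AO,	/* any other thread: return and carry on */
	ACT_OTHER	/* SIGBUS unrelated to memory failure */
};

/* Set to true only in the thread designated to process AO events. */
static _Thread_local bool is_ao_thread;

/* Decide what the calling thread should do with this SIGBUS. */
static enum sigbus_action classify_sigbus(const siginfo_t *si)
{
	assert(si != NULL);	/* sketch only; avoid assert() in real handlers */
	switch (si->si_code) {
	case BUS_MCEERR_AR:
		return ACT_RECOVER_AR;
	case BUS_MCEERR_AO:
		return is_ao_thread ? ACT_HANDLE_AO : ACT_IGNORE_AO;
	default:
		return ACT_OTHER;
	}
}
```

A thread-local flag is used here because the SIGBUS handler itself is shared by the whole process, so the per-thread difference has to live in per-thread state.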
[parent not found: <5384d07e.4504e00a.2680.ffff8c31SMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread [not found] ` <5384d07e.4504e00a.2680.ffff8c31SMTPIN_ADDED_BROKEN@mx.google.com> @ 2014-05-27 22:53 ` Tony Luck 2014-05-28 0:15 ` Naoya Horiguchi [not found] ` <53852abb.867ce00a.3cef.3c7eSMTPIN_ADDED_BROKEN@mx.google.com> 0 siblings, 2 replies; 31+ messages in thread From: Tony Luck @ 2014-05-27 22:53 UTC (permalink / raw) To: Naoya Horiguchi Cc: Kamil Iskra, Linux Kernel Mailing List, linux-mm@kvack.org, Andi Kleen, Borislav Petkov, Chen Gong > - make sure that every thread in a recovery aware application should have > a SIGBUS handler, inside which > * code for SIGBUS(BUS_MCEERR_AR) is enabled for every thread > * code for SIGBUS(BUS_MCEERR_AO) is enabled only for a dedicated thread But how does the kernel know which is the special thread that should see the "AO" signal? Broadcasting the signal to all threads seems to be just as likely to cause problems to an application as the h/w broadcasting MCE to all processors. -Tony ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread 2014-05-27 22:53 ` Tony Luck @ 2014-05-28 0:15 ` Naoya Horiguchi [not found] ` <53852abb.867ce00a.3cef.3c7eSMTPIN_ADDED_BROKEN@mx.google.com> 1 sibling, 0 replies; 31+ messages in thread From: Naoya Horiguchi @ 2014-05-28 0:15 UTC (permalink / raw) To: tony.luck Cc: iskra, linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, gong.chen On Tue, May 27, 2014 at 03:53:55PM -0700, Tony Luck wrote: > > - make sure that every thread in a recovery aware application should have > > a SIGBUS handler, inside which > > * code for SIGBUS(BUS_MCEERR_AR) is enabled for every thread > > * code for SIGBUS(BUS_MCEERR_AO) is enabled only for a dedicated thread > > But how does the kernel know which is the special thread that > should see the "AO" signal? Broadcasting the signal to all > threads seems to be just as likely to cause problems to > an application as the h/w broadcasting MCE to all processors. I thought that kernel doesn't have to know about which thread is the special one if the AO signal is broadcasted to all threads, because in such case the special thread always gets the AO signal. The reported problem happens only when the application sets PF_MCE_EARLY flag, and such application is surely recovery aware, so we can assume that the coders must implement SIGBUS handler for all threads. Then all other threads but the special one can intentionally ignore AO signal. This is to avoid the default behavior for SIGBUS ("kill all threads" as Kamil said in the previous email.) And I hope that downside of signal broadcasting is smaller than MCE broadcasting because the range of broadcasting is limited to a process group, not to the whole system. # I don't intend to rule out other possibilities like adding another prctl # flag, so if you have a patch, that would be great. Thanks, Naoya Horiguchi ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <53852abb.867ce00a.3cef.3c7eSMTPIN_ADDED_BROKEN@mx.google.com>]
* Re: [PATCH 1/2] memory-failure: Send right signal code to correct thread [not found] ` <53852abb.867ce00a.3cef.3c7eSMTPIN_ADDED_BROKEN@mx.google.com> @ 2014-05-28 5:09 ` Tony Luck 2014-05-28 18:47 ` [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread Naoya Horiguchi [not found] ` <53862f6c.91148c0a.5fb0.2d0cSMTPIN_ADDED_BROKEN@mx.google.com> 0 siblings, 2 replies; 31+ messages in thread From: Tony Luck @ 2014-05-28 5:09 UTC (permalink / raw) To: Naoya Horiguchi Cc: iskra@mcs.anl.gov, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andi Kleen, Borislav Petkov, gong.chen@linux.jf.intel.com I'm exploring options to see what writers of threaded applications might want/need. I'm very doubtful that they would really want "broadcast to all threads". What if there are hundreds or thousands of threads? We send the signals from the context of the thread that hit the error. But that might take a while. Meanwhile any of those threads that were already scheduled on other CPUs are back running again. So there are big races even if we broadcast. Sent from my iPhone > On May 27, 2014, at 17:15, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote: > > On Tue, May 27, 2014 at 03:53:55PM -0700, Tony Luck wrote: >>> - make sure that every thread in a recovery aware application should have >>> a SIGBUS handler, inside which >>> * code for SIGBUS(BUS_MCEERR_AR) is enabled for every thread >>> * code for SIGBUS(BUS_MCEERR_AO) is enabled only for a dedicated thread >> >> But how does the kernel know which is the special thread that >> should see the "AO" signal? Broadcasting the signal to all >> threads seems to be just as likely to cause problems to >> an application as the h/w broadcasting MCE to all processors. > > I thought that kernel doesn't have to know about which thread is the > special one if the AO signal is broadcasted to all threads, because > in such case the special thread always gets the AO signal. 
> > The reported problem happens only when the application sets PF_MCE_EARLY flag, > and such application is surely recovery aware, so we can assume that the > coders must implement SIGBUS handler for all threads. Then all other threads > but the special one can intentionally ignore AO signal. This is to avoid the > default behavior for SIGBUS ("kill all threads" as Kamil said in the previous > email.) > > And I hope that downside of signal broadcasting is smaller than MCE > broadcasting because the range of broadcasting is limited to a process group, > not to the whole system. > > # I don't intend to rule out other possibilities like adding another prctl > # flag, so if you have a patch, that would be great. > > Thanks, > Naoya Horiguchi ^ permalink raw reply [flat|nested] 31+ messages in thread
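For readers unfamiliar with how an application ends up with the PF_MCE_EARLY flag mentioned in the quoted text, the sketch below shows the user-space side. The prctl() options are the existing PR_MCE_KILL interface, which applies to the calling thread; the wrapper names request_early_kill() and current_kill_policy() are made up for this example. Whether the kernel honors the flag on a thread other than the main one depends on the behavior being debated in this thread, so treat that part as an assumption.

```c
#include <assert.h>
#include <sys/prctl.h>

/* Fallbacks for old headers; values as in <linux/prctl.h>. */
#ifndef PR_MCE_KILL
#define PR_MCE_KILL 33
#endif
#ifndef PR_MCE_KILL_SET
#define PR_MCE_KILL_SET 1
#endif
#ifndef PR_MCE_KILL_EARLY
#define PR_MCE_KILL_EARLY 1
#endif
#ifndef PR_MCE_KILL_GET
#define PR_MCE_KILL_GET 34
#endif

/*
 * Ask for "early kill" (SIGBUS as soon as a page is found to be
 * poisoned) for the calling thread. Returns 0, or -1 with errno set.
 */
static int request_early_kill(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Returns the calling thread's current policy, or -1 with errno set. */
static int current_kill_policy(void)
{
	return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

A thread meant to receive AO signals would call request_early_kill() from its own start routine rather than from main(), since the setting is per thread.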
* [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread 2014-05-28 5:09 ` Tony Luck @ 2014-05-28 18:47 ` Naoya Horiguchi [not found] ` <53862f6c.91148c0a.5fb0.2d0cSMTPIN_ADDED_BROKEN@mx.google.com> 1 sibling, 0 replies; 31+ messages in thread From: Naoya Horiguchi @ 2014-05-28 18:47 UTC (permalink / raw) To: tony.luck Cc: iskra, linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, gong.chen On Tue, May 27, 2014 at 10:09:54PM -0700, Tony Luck wrote: > I'm exploring options to see what writers of threaded applications might want/need. I'm very doubtful that they would really want "broadcast to all threads". What if there are hundreds or thousands of threads? We send the signals from the context of the thread that hit the error. But that might take a while. Meanwhile any of those threads that were already scheduled on other CPUs are back running again. So there are big races even if we broadcast. I see, so this approach is not good. I studied another approach and found that we have PF_MCE_EARLY flags on each thread, so we can implement a dedicated thread by setting the flag on that thread. IOW, current code assumes that PF_MCE_EARLY is always set on the main thread (otherwise ignored), so we can change this behavior. The following patch makes kernel aware of PF_MCE_EARLY flag on threads. Could you take a look? Thanks, Naoya Horiguchi --- Date: Wed, 28 May 2014 03:38:33 -0400 Subject: [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) Currently memory error handler handles action optional errors in the deferred manner by default. And if a recovery aware application wants to handle it immediately, it can do it by setting PF_MCE_EARLY flag. However, such signal can be sent only to the main thread, so it's problematic if the application wants to have a dedicated thread to handle such signals. So this patch adds dedicated thread support to memory error handler.
We have PF_MCE_EARLY flags for each thread separately, so with this patch AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main thread. If you want to implement a dedicated thread, you call prctl() to set PF_MCE_EARLY on the thread. Memory error handler collects processes to be killed, so this patch lets it check PF_MCE_EARLY flag on each thread in the collecting routines. No behavioral change for all non-early kill cases. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> --- Documentation/vm/hwpoison.txt | 5 ++++ mm/memory-failure.c | 68 ++++++++++++++++++++++++++++++------------- 2 files changed, 53 insertions(+), 20 deletions(-) diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt index 550068466605..1906fd3bea0e 100644 --- a/Documentation/vm/hwpoison.txt +++ b/Documentation/vm/hwpoison.txt @@ -84,6 +84,11 @@ PR_MCE_KILL PR_MCE_KILL_EARLY: Early kill PR_MCE_KILL_LATE: Late kill PR_MCE_KILL_DEFAULT: Use system global default + Note that if you want to have a dedicated thread which handles + the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should + call prctl() on the thread. Otherwise, the SIGBUS is sent to + the main thread. + PR_MCE_KILL_GET return current mode diff --git a/mm/memory-failure.c b/mm/memory-failure.c index a18007ada3cb..3bd0428b2534 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -294,6 +294,46 @@ struct to_kill { */ /* + * Find a dedicated thread which is supposed to handle SIGBUS(BUS_MCEERR_AO) + * on behalf of the thread group. Return task_struct of the (first found) + * dedicated thread if found, and return NULL otherwise. 
+ */ +static struct task_struct *find_early_kill_thread(struct task_struct *tsk) +{ + struct task_struct *t; + rcu_read_lock(); + for_each_thread(tsk, t) + if (t->flags & PF_MCE_PROCESS && t->flags & PF_MCE_EARLY) + goto found; + t = NULL; +found: + rcu_read_unlock(); + return t; +} + +/* + * Determine whether a given process is "early kill" process which expects + * to be signaled when some page under the process is hwpoisoned. + * Return task_struct of the dedicated thread (main thread unless explicitly + * specified) if the process is "early kill," and otherwise returns NULL. + */ +static struct task_struct *task_early_kill(struct task_struct *tsk, + int force_early) +{ + struct task_struct *t; + if (!tsk->mm) + return NULL; + if (force_early) + return tsk; + t = find_early_kill_thread(tsk); + if (t) + return t; + if (sysctl_memory_failure_early_kill) + return tsk; + return NULL; +} + +/* * Schedule a process for later kill. * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM. * TBD would GFP_NOIO be enough? @@ -380,17 +420,6 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno, } } -static int task_early_kill(struct task_struct *tsk, int force_early) -{ - if (!tsk->mm) - return 0; - if (force_early) - return 1; - if (tsk->flags & PF_MCE_PROCESS) - return !!(tsk->flags & PF_MCE_EARLY); - return sysctl_memory_failure_early_kill; -} - /* * Collect processes when the error hit an anonymous page. 
*/ @@ -410,16 +439,16 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, read_lock(&tasklist_lock); for_each_process (tsk) { struct anon_vma_chain *vmac; - - if (!task_early_kill(tsk, force_early)) + struct task_struct *t = task_early_kill(tsk, force_early); + if (!t) continue; anon_vma_interval_tree_foreach(vmac, &av->rb_root, pgoff, pgoff) { vma = vmac->vma; if (!page_mapped_in_vma(page, vma)) continue; - if (vma->vm_mm == tsk->mm) - add_to_kill(tsk, page, vma, to_kill, tkc); + if (vma->vm_mm == t->mm) + add_to_kill(t, page, vma, to_kill, tkc); } } read_unlock(&tasklist_lock); @@ -440,10 +469,9 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, read_lock(&tasklist_lock); for_each_process(tsk) { pgoff_t pgoff = page_pgoff(page); - - if (!task_early_kill(tsk, force_early)) + struct task_struct *t = task_early_kill(tsk, force_early); + if (!t) continue; - vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { /* @@ -453,8 +481,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, * Assume applications who requested early kill want * to be informed of all such data corruptions. */ - if (vma->vm_mm == tsk->mm) - add_to_kill(tsk, page, vma, to_kill, tkc); + if (vma->vm_mm == t->mm) + add_to_kill(t, page, vma, to_kill, tkc); } } read_unlock(&tasklist_lock); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 31+ messages in thread
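For application writers, the user-space side of the patch above might look roughly like the following. This is a minimal sketch, not code from the thread: the names `classify_sigbus`, `become_mce_handler_thread`, `mce_last_kind` and `mce_last_addr` are hypothetical, and it assumes a Linux libc that exposes `PR_MCE_KILL` through `<sys/prctl.h>` and `BUS_MCEERR_AO`/`BUS_MCEERR_AR` through `<signal.h>`.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>

/* How a SIGBUS should be treated, judged by si_code alone. */
enum mce_kind {
	MCE_NOT_MEMORY_ERROR,	/* some other SIGBUS (alignment, bad mapping, ...) */
	MCE_ACTION_OPTIONAL,	/* BUS_MCEERR_AO: page is poisoned, not yet consumed */
	MCE_ACTION_REQUIRED	/* BUS_MCEERR_AR: this thread touched the bad data */
};

enum mce_kind classify_sigbus(int si_code)
{
	if (si_code == BUS_MCEERR_AO)
		return MCE_ACTION_OPTIONAL;
	if (si_code == BUS_MCEERR_AR)
		return MCE_ACTION_REQUIRED;
	return MCE_NOT_MEMORY_ERROR;
}

/* Filled in by the handler; a real application would do its recovery here. */
static volatile sig_atomic_t last_kind = MCE_NOT_MEMORY_ERROR;
static void *volatile last_addr;

static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	last_kind = classify_sigbus(si->si_code);
	last_addr = si->si_addr;	/* address of the poisoned page */
}

/*
 * Hypothetical helper: call it from the thread that should receive
 * SIGBUS(BUS_MCEERR_AO) on behalf of the process.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int become_mce_handler_thread(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, NULL) != 0)
		return -1;

	/* Marks the calling thread as "early kill". */
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

int mce_last_kind(void)
{
	return last_kind;
}

void *mce_last_addr(void)
{
	return last_addr;
}
```

The signal disposition is shared by the whole process, so the handler is installed once; what makes the thread "dedicated" is only that prctl() is called from it.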
* Re: [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread [not found] ` <53862f6c.91148c0a.5fb0.2d0cSMTPIN_ADDED_BROKEN@mx.google.com> @ 2014-05-28 22:00 ` Tony Luck 2014-05-29 1:45 ` Naoya Horiguchi ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Tony Luck @ 2014-05-28 22:00 UTC (permalink / raw) To: Naoya Horiguchi Cc: Kamil Iskra, Linux Kernel Mailing List, linux-mm@kvack.org, Andi Kleen, Borislav Petkov, Chen Gong On Wed, May 28, 2014 at 11:47 AM, Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote: > Could you take a look? It looks good - and should be a workable API for application writers to use. > @@ -84,6 +84,11 @@ PR_MCE_KILL > PR_MCE_KILL_EARLY: Early kill > PR_MCE_KILL_LATE: Late kill > PR_MCE_KILL_DEFAULT: Use system global default > + Note that if you want to have a dedicated thread which handles > + the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should > + call prctl() on the thread. Otherwise, the SIGBUS is sent to > + the main thread. Perhaps be more explicit here that the user should call prctl(PR_MCE_KILL_EARLY) on the designated thread to get this behavior? The user could also mark more than one thread in this way - in which case the kernel will pick the first one it sees (is that oldest, or newest?) that is marked. Not sure if this would ever be useful unless you want to pass responsibility around in an application that is dynamically creating and removing threads. > + if (t->flags & PF_MCE_PROCESS && t->flags & PF_MCE_EARLY) This is correct - but made me twitch to add extra brackets: if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY)) or if ((t->flags & (PF_MCE_PROCESS|PF_MCE_EARLY)) == PF_MCE_PROCESS|PF_MCE_EARLY) [oops, no ... that's too long and no clearer] -Tony -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 31+ messages in thread
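The bracket question in the review above can be checked with a small stand-alone sketch. The bit values below are illustrative stand-ins, not the kernel's PF_* values, and all function names are hypothetical. It shows that the bare and bracketed forms agree (`&` binds tighter than `&&`), and also that the third form as typed in the mail would not mean what it appears to, because `==` binds tighter than `|`.

```c
/* Illustrative bit values only; the kernel's real PF_* values are not assumed. */
#define DEMO_PF_MCE_PROCESS 0x1u
#define DEMO_PF_MCE_EARLY   0x2u

/* Form used in the first posting of the patch. */
int both_set_bare(unsigned int flags)
{
	return flags & DEMO_PF_MCE_PROCESS && flags & DEMO_PF_MCE_EARLY;
}

/* Form with the extra brackets suggested in review. */
int both_set_bracketed(unsigned int flags)
{
	return (flags & DEMO_PF_MCE_PROCESS) && (flags & DEMO_PF_MCE_EARLY);
}

/* Mask-and-compare form with the right-hand side fully parenthesized. */
int both_set_masked(unsigned int flags)
{
	const unsigned int mask = DEMO_PF_MCE_PROCESS | DEMO_PF_MCE_EARLY;

	return (flags & mask) == mask;
}

/*
 * How a C compiler parses "(flags & (A|B)) == A|B":
 * the comparison happens first, then the result is OR-ed with B.
 * Written out explicitly here so the grouping is visible.
 */
int masked_as_typed(unsigned int flags)
{
	const unsigned int mask = DEMO_PF_MCE_PROCESS | DEMO_PF_MCE_EARLY;

	return ((flags & mask) == DEMO_PF_MCE_PROCESS) | DEMO_PF_MCE_EARLY;
}
```

Since `masked_as_typed()` always has the EARLY bit OR-ed in, it is non-zero for every input, which is one more reason to prefer the simple bracketed form.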
* Re: [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread 2014-05-28 22:00 ` Tony Luck @ 2014-05-29 1:45 ` Naoya Horiguchi [not found] ` <5386915f.4772e50a.0657.ffffcda4SMTPIN_ADDED_BROKEN@mx.google.com> [not found] ` <1401327939-cvm7qh0m@n-horiguchi@ah.jp.nec.com> 2 siblings, 0 replies; 31+ messages in thread From: Naoya Horiguchi @ 2014-05-29 1:45 UTC (permalink / raw) To: tony.luck Cc: iskra, linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, gong.chen On Wed, May 28, 2014 at 03:00:11PM -0700, Tony Luck wrote: > On Wed, May 28, 2014 at 11:47 AM, Naoya Horiguchi > <n-horiguchi@ah.jp.nec.com> wrote: > > Could you take a look? > > It looks good - and should be a workable API for > application writers to use. > > > @@ -84,6 +84,11 @@ PR_MCE_KILL > > PR_MCE_KILL_EARLY: Early kill > > PR_MCE_KILL_LATE: Late kill > > PR_MCE_KILL_DEFAULT: Use system global default > > + Note that if you want to have a dedicated thread which handles > > + the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should > > + call prctl() on the thread. Otherwise, the SIGBUS is sent to > > + the main thread. > > Perhaps be more explicit here that the user should call > prctl(PR_MCE_KILL_EARLY) on the designated thread > to get this behavior? OK. > The user could also mark more than > one thread in this way - in which case the kernel will pick > the first one it sees (is that oldest, or newest?) that is marked. > Not sure if this would ever be useful unless you want to pass > responsibility around in an application that is dynamically > creating and removing threads. I'm not sure which is better to send signal to first-found marked thread or to all marked threads. If we have a good reason to do the latter, I'm ok about it. Any idea? > > > + if (t->flags & PF_MCE_PROCESS && t->flags & PF_MCE_EARLY) > > This is correct - but made me twitch to add extra brackets: > > if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY)) OK, I'll take this. 
Thanks, Naoya Horiguchi
* Re: [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread
  [not found] ` <5386915f.4772e50a.0657.ffffcda4SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-05-29 17:03 ` Tony Luck
  2014-05-29 18:38 ` Naoya Horiguchi
  0 siblings, 1 reply; 31+ messages in thread
From: Tony Luck @ 2014-05-29 17:03 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kamil Iskra, Linux Kernel Mailing List, linux-mm@kvack.org, Andi Kleen, Borislav Petkov, Chen Gong

> OK, I'll take this.

If you didn't already apply it, then add a
"Reviewed-by: Tony Luck <tony.luck@intel.com>"

I see that this patch is on top of my earlier ones (includes the "force_early" argument). That means you have both of those queued too?

Thanks

-Tony
* Re: [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread
  2014-05-29 17:03 ` Tony Luck
@ 2014-05-29 18:38 ` Naoya Horiguchi
  2014-05-30 6:51 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Naoya Horiguchi
  0 siblings, 1 reply; 31+ messages in thread
From: Naoya Horiguchi @ 2014-05-29 18:38 UTC (permalink / raw)
  To: tony.luck
  Cc: iskra, linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, gong.chen

On Thu, May 29, 2014 at 10:03:17AM -0700, Tony Luck wrote:
> > OK, I'll take this.
>
> If you didn't already apply it, then add a "Reviewed-by: Tony Luck
> <tony.luck@intel.com>"

Thank you.

> I see that this patch is on top of my earlier ones (includes the
> "force_early" argument).

Right.

> That means you have both of those queued too?

Yes, so I'll publish my tree and ask Andrew to pull it later.

Thanks,
Naoya Horiguchi
* [PATCH 0/3] HWPOISON: improve memory error handling for multithread process
  2014-05-29 18:38 ` Naoya Horiguchi
@ 2014-05-30 6:51 ` Naoya Horiguchi
  2014-05-30 6:51 ` [PATCH 1/3] memory-failure: Send right signal code to correct thread Naoya Horiguchi
  ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Naoya Horiguchi @ 2014-05-30 6:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

This patchset summarizes the recent discussion about memory error handling in multithreaded applications. Patches 1 and 2 are for action required errors, and patch 3 is for action optional errors.

This patchset is based on mmotm-2014-05-21-16-57. The patches are also available on the following tree/branch.

  git@github.com:Naoya-Horiguchi/linux.git hwpoison/master

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (1):
      mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO)

Tony Luck (2):
      memory-failure: Send right signal code to correct thread
      memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED

 Documentation/vm/hwpoison.txt |  5 +++
 mm/memory-failure.c           | 75 ++++++++++++++++++++++++++++++-------------
 2 files changed, 58 insertions(+), 22 deletions(-)
* [PATCH 1/3] memory-failure: Send right signal code to correct thread 2014-05-30 6:51 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Naoya Horiguchi @ 2014-05-30 6:51 ` Naoya Horiguchi 2014-06-02 22:44 ` Andrew Morton 2014-05-30 6:51 ` [PATCH 2/3] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED Naoya Horiguchi ` (2 subsequent siblings) 3 siblings, 1 reply; 31+ messages in thread From: Naoya Horiguchi @ 2014-05-30 6:51 UTC (permalink / raw) To: Andrew Morton Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm From: Tony Luck <tony.luck@intel.com> When a thread in a multi-threaded application hits a machine check because of an uncorrectable error in memory - we want to send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that if the active thread is not the primary thread in the process. collect_procs() just finds primary threads and this test: if ((flags & MF_ACTION_REQUIRED) && t == current) { will see that the thread we found isn't the current thread and so send a si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active thread at this time). We can fix this by checking whether "current" shares the same mm with the process that collect_procs() said owned the page. If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). 
Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> --- mm/memory-failure.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/memory-failure.c mmotm-2014-05-21-16-57/mm/memory-failure.c index e3154d99b87f..b73098ee91e6 100644 --- mmotm-2014-05-21-16-57.orig/mm/memory-failure.c +++ mmotm-2014-05-21-16-57/mm/memory-failure.c @@ -204,9 +204,9 @@ static int kill_proc(struct task_struct *t, unsigned long addr, int trapno, #endif si.si_addr_lsb = page_size_order(page) + PAGE_SHIFT; - if ((flags & MF_ACTION_REQUIRED) && t == current) { + if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) { si.si_code = BUS_MCEERR_AR; - ret = force_sig_info(SIGBUS, &si, t); + ret = force_sig_info(SIGBUS, &si, current); } else { /* * Don't use force here, it's convenient if the signal -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 31+ messages in thread
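The effect of the two-line change in kill_proc() can be modeled outside the kernel. The sketch below is a user-space toy, not kernel code: `struct demo_task`, `struct demo_mm`, `kill_proc_old()` and `kill_proc_new()` are hypothetical stand-ins, and "sending a signal" is reduced to returning which code would go to which task.

```c
#include <stddef.h>

#define DEMO_MF_ACTION_REQUIRED 0x1	/* stand-in for MF_ACTION_REQUIRED */

struct demo_mm {
	int id;
};

struct demo_task {
	const char *name;
	struct demo_mm *mm;	/* threads of one process share this pointer */
};

enum demo_code { DEMO_MCEERR_AR, DEMO_MCEERR_AO };

struct demo_signal {
	enum demo_code code;
	const struct demo_task *target;
};

/* Pre-patch test: AR only if the collected task *is* the running thread. */
struct demo_signal kill_proc_old(const struct demo_task *t,
				 const struct demo_task *current, int flags)
{
	struct demo_signal s;

	if ((flags & DEMO_MF_ACTION_REQUIRED) && t == current)
		s.code = DEMO_MCEERR_AR;
	else
		s.code = DEMO_MCEERR_AO;
	s.target = t;
	return s;
}

/* Post-patch test: AR goes to the running thread if it shares t's mm. */
struct demo_signal kill_proc_new(const struct demo_task *t,
				 const struct demo_task *current, int flags)
{
	struct demo_signal s;

	if ((flags & DEMO_MF_ACTION_REQUIRED) && t->mm == current->mm) {
		s.code = DEMO_MCEERR_AR;
		s.target = current;
	} else {
		s.code = DEMO_MCEERR_AO;
		s.target = t;
	}
	return s;
}
```

With a primary thread and a worker sharing one `demo_mm`, and the worker hitting the error, the old test yields AO to the primary while the new one yields AR to the worker, which is the case described in the changelog.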
* Re: [PATCH 1/3] memory-failure: Send right signal code to correct thread 2014-05-30 6:51 ` [PATCH 1/3] memory-failure: Send right signal code to correct thread Naoya Horiguchi @ 2014-06-02 22:44 ` Andrew Morton 2014-06-03 1:12 ` Naoya Horiguchi 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2014-06-02 22:44 UTC (permalink / raw) To: Naoya Horiguchi Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm On Fri, 30 May 2014 02:51:08 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote: > From: Tony Luck <tony.luck@intel.com> > > When a thread in a multi-threaded application hits a machine > check because of an uncorrectable error in memory - we want to > send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. > Currently we fail to do that if the active thread is not the > primary thread in the process. collect_procs() just finds primary > threads and this test: > if ((flags & MF_ACTION_REQUIRED) && t == current) { > will see that the thread we found isn't the current thread > and so send a si.si_code = BUS_MCEERR_AO to the primary > (and nothing to the active thread at this time). > > We can fix this by checking whether "current" shares the same > mm with the process that collect_procs() said owned the page. > If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). > > Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> > Signed-off-by: Tony Luck <tony.luck@intel.com> > Cc: Andi Kleen <andi@firstfloor.org> > Cc: Borislav Petkov <bp@suse.de> > Cc: Chen Gong <gong.chen@linux.jf.intel.com> > Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> You were on the patch delivery path, so it should have included your signed-off-by. Documentation/SubmittingPatches section 12 has the details. I have made that change to my copies of patches 1 and 2. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: [PATCH 1/3] memory-failure: Send right signal code to correct thread 2014-06-02 22:44 ` Andrew Morton @ 2014-06-03 1:12 ` Naoya Horiguchi 0 siblings, 0 replies; 31+ messages in thread From: Naoya Horiguchi @ 2014-06-03 1:12 UTC (permalink / raw) To: Andrew Morton Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm On Mon, Jun 02, 2014 at 03:44:31PM -0700, Andrew Morton wrote: > On Fri, 30 May 2014 02:51:08 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote: > > > From: Tony Luck <tony.luck@intel.com> > > > > When a thread in a multi-threaded application hits a machine > > check because of an uncorrectable error in memory - we want to > > send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. > > Currently we fail to do that if the active thread is not the > > primary thread in the process. collect_procs() just finds primary > > threads and this test: > > if ((flags & MF_ACTION_REQUIRED) && t == current) { > > will see that the thread we found isn't the current thread > > and so send a si.si_code = BUS_MCEERR_AO to the primary > > (and nothing to the active thread at this time). > > > > We can fix this by checking whether "current" shares the same > > mm with the process that collect_procs() said owned the page. > > If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). > > > > Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> > > Signed-off-by: Tony Luck <tony.luck@intel.com> > > Cc: Andi Kleen <andi@firstfloor.org> > > Cc: Borislav Petkov <bp@suse.de> > > Cc: Chen Gong <gong.chen@linux.jf.intel.com> > > Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > > You were on the patch delivery path, so it should have included your > signed-off-by. Documentation/SubmittingPatches section 12 has the > details. Sorry, I didn't know that. > I have made that change to my copies of patches 1 and 2. Thank you. 
* [PATCH 2/3] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED
  2014-05-30 6:51 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Naoya Horiguchi
  2014-05-30 6:51 ` [PATCH 1/3] memory-failure: Send right signal code to correct thread Naoya Horiguchi
@ 2014-05-30 6:51 ` Naoya Horiguchi
  2014-05-30 6:51 ` [PATCH 3/3] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) Naoya Horiguchi
  2014-05-30 17:25 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Luck, Tony
  3 siblings, 0 replies; 31+ messages in thread
From: Naoya Horiguchi @ 2014-05-30 6:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

From: Tony Luck <tony.luck@intel.com>

When Linux sees an "action optional" machine check (where h/w has reported an error that is not in the current execution path) we generally do not want to signal a process, since most processes do not have a SIGBUS handler - we'd just prematurely terminate the process for a problem that they might never actually see.

task_early_kill() decides whether to consider a process - and it checks whether this specific process has been marked for early signals with "prctl", or if the system administrator has requested early signals for all processes using /proc/sys/vm/memory_failure_early_kill.

But for the MF_ACTION_REQUIRED case we must not defer. The error is in the execution path of the current thread so we must send the SIGBUS immediately.

Fix by passing a flag argument through collect_procs*() to task_early_kill() so it knows whether we can defer or must take action.
Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> --- mm/memory-failure.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/mm/memory-failure.c mmotm-2014-05-21-16-57/mm/memory-failure.c index b73098ee91e6..fbcdb1d54c55 100644 --- mmotm-2014-05-21-16-57.orig/mm/memory-failure.c +++ mmotm-2014-05-21-16-57/mm/memory-failure.c @@ -380,10 +380,12 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno, } } -static int task_early_kill(struct task_struct *tsk) +static int task_early_kill(struct task_struct *tsk, int force_early) { if (!tsk->mm) return 0; + if (force_early) + return 1; if (tsk->flags & PF_MCE_PROCESS) return !!(tsk->flags & PF_MCE_EARLY); return sysctl_memory_failure_early_kill; @@ -393,7 +395,7 @@ static int task_early_kill(struct task_struct *tsk) * Collect processes when the error hit an anonymous page. */ static void collect_procs_anon(struct page *page, struct list_head *to_kill, - struct to_kill **tkc) + struct to_kill **tkc, int force_early) { struct vm_area_struct *vma; struct task_struct *tsk; @@ -409,7 +411,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, for_each_process (tsk) { struct anon_vma_chain *vmac; - if (!task_early_kill(tsk)) + if (!task_early_kill(tsk, force_early)) continue; anon_vma_interval_tree_foreach(vmac, &av->rb_root, pgoff, pgoff) { @@ -428,7 +430,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, * Collect processes when the error hit a file mapped page. 
*/ static void collect_procs_file(struct page *page, struct list_head *to_kill, - struct to_kill **tkc) + struct to_kill **tkc, int force_early) { struct vm_area_struct *vma; struct task_struct *tsk; @@ -439,7 +441,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, for_each_process(tsk) { pgoff_t pgoff = page_pgoff(page); - if (!task_early_kill(tsk)) + if (!task_early_kill(tsk, force_early)) continue; vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, @@ -465,7 +467,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, * First preallocate one tokill structure outside the spin locks, * so that we can kill at least one process reasonably reliable. */ -static void collect_procs(struct page *page, struct list_head *tokill) +static void collect_procs(struct page *page, struct list_head *tokill, + int force_early) { struct to_kill *tk; @@ -476,9 +479,9 @@ static void collect_procs(struct page *page, struct list_head *tokill) if (!tk) return; if (PageAnon(page)) - collect_procs_anon(page, tokill, &tk); + collect_procs_anon(page, tokill, &tk, force_early); else - collect_procs_file(page, tokill, &tk); + collect_procs_file(page, tokill, &tk, force_early); kfree(tk); } @@ -963,7 +966,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn, * there's nothing that can be done. */ if (kill) - collect_procs(ppage, &tokill); + collect_procs(ppage, &tokill, flags & MF_ACTION_REQUIRED); ret = try_to_unmap(ppage, ttu); if (ret != SWAP_SUCCESS) -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 31+ messages in thread
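The decision order that task_early_kill() has after this patch can be written as a small truth-table function. This is a user-space model under stated assumptions: `task_early_kill_model()` is a hypothetical name, the `DEMO_PF_*` bits are illustrative values rather than the kernel's, and the task's mm and the sysctl are passed in as plain integers.

```c
/* Illustrative bit values only; not the kernel's PF_* definitions. */
#define DEMO_PF_MCE_PROCESS 0x1u	/* task chose its own policy via prctl */
#define DEMO_PF_MCE_EARLY   0x2u	/* that policy is "early kill" */

/*
 * Returns 1 if the task should be collected for signalling, 0 otherwise.
 * The checks are in the same order as in the patched function:
 * no mm, then force_early, then per-task policy, then the global sysctl.
 */
int task_early_kill_model(int has_mm, unsigned int task_flags,
			  int force_early, int sysctl_early_kill)
{
	if (!has_mm)
		return 0;	/* kernel threads are never collected */
	if (force_early)
		return 1;	/* action required: must not defer */
	if (task_flags & DEMO_PF_MCE_PROCESS)
		return !!(task_flags & DEMO_PF_MCE_EARLY);
	return sysctl_early_kill;
}
```

In the patch, the `force_early` argument is derived from `flags & MF_ACTION_REQUIRED` at the collect_procs() call site, so any non-zero value counts as "force".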
* [PATCH 3/3] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO)
  2014-05-30 6:51 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Naoya Horiguchi
  2014-05-30 6:51 ` [PATCH 1/3] memory-failure: Send right signal code to correct thread Naoya Horiguchi
  2014-05-30 6:51 ` [PATCH 2/3] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED Naoya Horiguchi
@ 2014-05-30 6:51 ` Naoya Horiguchi
  2014-06-02 22:42 ` Andrew Morton
  2014-05-30 17:25 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Luck, Tony
  3 siblings, 1 reply; 31+ messages in thread
From: Naoya Horiguchi @ 2014-05-30 6:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

Currently the memory error handler handles action optional errors in a deferred manner by default. If a recovery-aware application wants to handle them immediately, it can do so by setting the PF_MCE_EARLY flag. However, such a signal can be sent only to the main thread, which is problematic if the application wants a dedicated thread to handle such signals.

So this patch adds dedicated thread support to the memory error handler. Each thread has its own PF_MCE_EARLY flag, so with this patch the AO signal is sent to the thread with PF_MCE_EARLY set, not to the main thread. To implement a dedicated thread, call prctl() to set PF_MCE_EARLY on that thread.

The memory error handler collects processes to be killed, so this patch lets it check the PF_MCE_EARLY flag on each thread in the collecting routines.

No behavioral change for all non-early kill cases.
ChangeLog: - document more specifically - add parenthesis in find_early_kill_thread() - move position of find_early_kill_thread() and task_early_kill() Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: Kamil Iskra <iskra@mcs.anl.gov> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> --- Documentation/vm/hwpoison.txt | 5 ++++ mm/memory-failure.c | 58 ++++++++++++++++++++++++++++++++----------- 2 files changed, 48 insertions(+), 15 deletions(-) diff --git mmotm-2014-05-21-16-57.orig/Documentation/vm/hwpoison.txt mmotm-2014-05-21-16-57/Documentation/vm/hwpoison.txt index 550068466605..6ae89a9edf2a 100644 --- mmotm-2014-05-21-16-57.orig/Documentation/vm/hwpoison.txt +++ mmotm-2014-05-21-16-57/Documentation/vm/hwpoison.txt @@ -84,6 +84,11 @@ PR_MCE_KILL PR_MCE_KILL_EARLY: Early kill PR_MCE_KILL_LATE: Late kill PR_MCE_KILL_DEFAULT: Use system global default + Note that if you want to have a dedicated thread which handles + the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should + call prctl(PR_MCE_KILL_EARLY) on the designated thread. Otherwise, + the SIGBUS is sent to the main thread. + PR_MCE_KILL_GET return current mode diff --git mmotm-2014-05-21-16-57.orig/mm/memory-failure.c mmotm-2014-05-21-16-57/mm/memory-failure.c index fbcdb1d54c55..9751e19ab13b 100644 --- mmotm-2014-05-21-16-57.orig/mm/memory-failure.c +++ mmotm-2014-05-21-16-57/mm/memory-failure.c @@ -380,15 +380,44 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno, } } -static int task_early_kill(struct task_struct *tsk, int force_early) +/* + * Find a dedicated thread which is supposed to handle SIGBUS(BUS_MCEERR_AO) + * on behalf of the thread group. Return task_struct of the (first found) + * dedicated thread if found, and return NULL otherwise. 
+ */ +static struct task_struct *find_early_kill_thread(struct task_struct *tsk) +{ + struct task_struct *t; + rcu_read_lock(); + for_each_thread(tsk, t) + if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY)) + goto found; + t = NULL; +found: + rcu_read_unlock(); + return t; +} + +/* + * Determine whether a given process is "early kill" process which expects + * to be signaled when some page under the process is hwpoisoned. + * Return task_struct of the dedicated thread (main thread unless explicitly + * specified) if the process is "early kill," and otherwise returns NULL. + */ +static struct task_struct *task_early_kill(struct task_struct *tsk, + int force_early) { + struct task_struct *t; if (!tsk->mm) - return 0; + return NULL; if (force_early) - return 1; - if (tsk->flags & PF_MCE_PROCESS) - return !!(tsk->flags & PF_MCE_EARLY); - return sysctl_memory_failure_early_kill; + return tsk; + t = find_early_kill_thread(tsk); + if (t) + return t; + if (sysctl_memory_failure_early_kill) + return tsk; + return NULL; } /* @@ -410,16 +439,16 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, read_lock(&tasklist_lock); for_each_process (tsk) { struct anon_vma_chain *vmac; - - if (!task_early_kill(tsk, force_early)) + struct task_struct *t = task_early_kill(tsk, force_early); + if (!t) continue; anon_vma_interval_tree_foreach(vmac, &av->rb_root, pgoff, pgoff) { vma = vmac->vma; if (!page_mapped_in_vma(page, vma)) continue; - if (vma->vm_mm == tsk->mm) - add_to_kill(tsk, page, vma, to_kill, tkc); + if (vma->vm_mm == t->mm) + add_to_kill(t, page, vma, to_kill, tkc); } } read_unlock(&tasklist_lock); @@ -440,10 +469,9 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, read_lock(&tasklist_lock); for_each_process(tsk) { pgoff_t pgoff = page_pgoff(page); - - if (!task_early_kill(tsk, force_early)) + struct task_struct *t = task_early_kill(tsk, force_early); + if (!t) continue; - vma_interval_tree_foreach(vma, 
&mapping->i_mmap, pgoff, pgoff) { /* @@ -453,8 +481,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, * Assume applications who requested early kill want * to be informed of all such data corruptions. */ - if (vma->vm_mm == tsk->mm) - add_to_kill(tsk, page, vma, to_kill, tkc); + if (vma->vm_mm == t->mm) + add_to_kill(t, page, vma, to_kill, tkc); } } read_unlock(&tasklist_lock); -- 1.9.3 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 31+ messages in thread
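The selection logic of find_early_kill_thread() and the reworked task_early_kill() can be exercised with a user-space model. Everything here is a sketch under assumptions: the thread group is a plain array with the main thread at index 0, the `DEMO_PF_*` bits are illustrative, the names `find_early_kill_thread_model()` and `pick_signal_target()` are hypothetical, and the array order stands in for whatever order for_each_thread() uses in the kernel (which this sketch does not claim to reproduce).

```c
#include <stddef.h>

/* Illustrative bit values only; not the kernel's PF_* definitions. */
#define DEMO_PF_MCE_PROCESS 0x1u
#define DEMO_PF_MCE_EARLY   0x2u

struct demo_thread {
	int tid;
	unsigned int flags;
};

struct demo_group {
	int has_mm;			/* 0 models a kernel thread */
	const struct demo_thread *threads;	/* threads[0] is the main thread */
	size_t nr_threads;
};

/* First thread in the array with both bits set, or NULL if there is none. */
const struct demo_thread *find_early_kill_thread_model(const struct demo_group *g)
{
	size_t i;

	for (i = 0; i < g->nr_threads; i++) {
		unsigned int f = g->threads[i].flags;

		if ((f & DEMO_PF_MCE_PROCESS) && (f & DEMO_PF_MCE_EARLY))
			return &g->threads[i];
	}
	return NULL;
}

/* Same decision order as the patched task_early_kill(). */
const struct demo_thread *pick_signal_target(const struct demo_group *g,
					     int force_early,
					     int sysctl_early_kill)
{
	const struct demo_thread *t;

	if (!g->has_mm)
		return NULL;
	if (force_early)
		return &g->threads[0];
	t = find_early_kill_thread_model(g);
	if (t)
		return t;
	if (sysctl_early_kill)
		return &g->threads[0];
	return NULL;
}
```

One thing the model makes visible: as posted, only the early-kill marking is searched for, so a group whose main thread asked for late kill still falls through to the sysctl check. Whether that difference from the previous task_early_kill() is intended is not discussed in the messages above.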

* Re: [PATCH 3/3] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO)
  2014-05-30 6:51 ` [PATCH 3/3] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) Naoya Horiguchi
@ 2014-06-02 22:42 ` Andrew Morton
  2014-06-03 1:03 ` Naoya Horiguchi
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2014-06-02 22:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

On Fri, 30 May 2014 02:51:10 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> Currently memory error handler handles action optional errors in the deferred
> manner by default. And if a recovery aware application wants to handle it
> immediately, it can do it by setting PF_MCE_EARLY flag. However, such signal
> can be sent only to the main thread, so it's problematic if the application
> wants to have a dedicated thread to handle such signals.
>
> So this patch adds dedicated thread support to memory error handler. We have
> PF_MCE_EARLY flags for each thread separately, so with this patch AO signal
> is sent to the thread with PF_MCE_EARLY flag set, not the main thread. If
> you want to implement a dedicated thread, you call prctl() to set PF_MCE_EARLY
> on the thread.
>
> Memory error handler collects processes to be killed, so this patch lets it
> check PF_MCE_EARLY flag on each thread in the collecting routines.
>
> No behavioral change for all non-early kill cases.
>
> ...
>
> --- mmotm-2014-05-21-16-57.orig/mm/memory-failure.c
> +++ mmotm-2014-05-21-16-57/mm/memory-failure.c
> @@ -380,15 +380,44 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno,
>  	}
>  }
>  
> -static int task_early_kill(struct task_struct *tsk, int force_early)
> +/*
> + * Find a dedicated thread which is supposed to handle SIGBUS(BUS_MCEERR_AO)
> + * on behalf of the thread group. Return task_struct of the (first found)
> + * dedicated thread if found, and return NULL otherwise.
> + */
> +static struct task_struct *find_early_kill_thread(struct task_struct *tsk)
> +{
> +	struct task_struct *t;
> +	rcu_read_lock();
> +	for_each_thread(tsk, t)
> +		if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY))
> +			goto found;
> +	t = NULL;
> +found:
> +	rcu_read_unlock();
> +	return t;
> +}
> +
> +/*
> + * Determine whether a given process is "early kill" process which expects
> + * to be signaled when some page under the process is hwpoisoned.
> + * Return task_struct of the dedicated thread (main thread unless explicitly
> + * specified) if the process is "early kill," and otherwise returns NULL.
> + */
> +static struct task_struct *task_early_kill(struct task_struct *tsk,
> +					   int force_early)
>  {
> +	struct task_struct *t;
>  	if (!tsk->mm)
> -		return 0;
> +		return NULL;
>  	if (force_early)
> -		return 1;
> -	if (tsk->flags & PF_MCE_PROCESS)
> -		return !!(tsk->flags & PF_MCE_EARLY);
> -	return sysctl_memory_failure_early_kill;
> +		return tsk;
> +	t = find_early_kill_thread(tsk);
> +	if (t)
> +		return t;
> +	if (sysctl_memory_failure_early_kill)
> +		return tsk;
> +	return NULL;
>  }

The above two functions are to be called under
read_lock(tasklist_lock), which is rather important...

Given this requirement, did find_early_kill_thread() need rcu_read_lock()?

* Re: [PATCH 3/3] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO)
  2014-06-02 22:42 ` Andrew Morton
@ 2014-06-03 1:03 ` Naoya Horiguchi
  0 siblings, 0 replies; 31+ messages in thread
From: Naoya Horiguchi @ 2014-06-03 1:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

On Mon, Jun 02, 2014 at 03:42:07PM -0700, Andrew Morton wrote:
> On Fri, 30 May 2014 02:51:10 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
> > Currently memory error handler handles action optional errors in the deferred
> > manner by default. And if a recovery aware application wants to handle it
> > immediately, it can do it by setting PF_MCE_EARLY flag. However, such signal
> > can be sent only to the main thread, so it's problematic if the application
> > wants to have a dedicated thread to handle such signals.
> >
> > So this patch adds dedicated thread support to memory error handler. We have
> > PF_MCE_EARLY flags for each thread separately, so with this patch AO signal
> > is sent to the thread with PF_MCE_EARLY flag set, not the main thread. If
> > you want to implement a dedicated thread, you call prctl() to set PF_MCE_EARLY
> > on the thread.
> >
> > Memory error handler collects processes to be killed, so this patch lets it
> > check PF_MCE_EARLY flag on each thread in the collecting routines.
> >
> > No behavioral change for all non-early kill cases.
> >
> > ...
> >
> > --- mmotm-2014-05-21-16-57.orig/mm/memory-failure.c
> > +++ mmotm-2014-05-21-16-57/mm/memory-failure.c
> > @@ -380,15 +380,44 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno,
> >  	}
> >  }
> >  
> > -static int task_early_kill(struct task_struct *tsk, int force_early)
> > +/*
> > + * Find a dedicated thread which is supposed to handle SIGBUS(BUS_MCEERR_AO)
> > + * on behalf of the thread group. Return task_struct of the (first found)
> > + * dedicated thread if found, and return NULL otherwise.
> > + */
> > +static struct task_struct *find_early_kill_thread(struct task_struct *tsk)
> > +{
> > +	struct task_struct *t;
> > +	rcu_read_lock();
> > +	for_each_thread(tsk, t)
> > +		if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY))
> > +			goto found;
> > +	t = NULL;
> > +found:
> > +	rcu_read_unlock();
> > +	return t;
> > +}
> > +
> > +/*
> > + * Determine whether a given process is "early kill" process which expects
> > + * to be signaled when some page under the process is hwpoisoned.
> > + * Return task_struct of the dedicated thread (main thread unless explicitly
> > + * specified) if the process is "early kill," and otherwise returns NULL.
> > + */
> > +static struct task_struct *task_early_kill(struct task_struct *tsk,
> > +					   int force_early)
> >  {
> > +	struct task_struct *t;
> >  	if (!tsk->mm)
> > -		return 0;
> > +		return NULL;
> >  	if (force_early)
> > -		return 1;
> > -	if (tsk->flags & PF_MCE_PROCESS)
> > -		return !!(tsk->flags & PF_MCE_EARLY);
> > -	return sysctl_memory_failure_early_kill;
> > +		return tsk;
> > +	t = find_early_kill_thread(tsk);
> > +	if (t)
> > +		return t;
> > +	if (sysctl_memory_failure_early_kill)
> > +		return tsk;
> > +	return NULL;
> >  }
>
> The above two functions are to be called under
> read_lock(tasklist_lock), which is rather important...
>
> Given this requirement, did find_early_kill_thread() need rcu_read_lock()?

Right, we don't need this rcu_read_lock().
The following hunk should fix it.

Thanks,
Naoya Horiguchi

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b0f48e34dec5..6fdc9a2eeb2f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -297,18 +297,17 @@ struct to_kill {
  * Find a dedicated thread which is supposed to handle SIGBUS(BUS_MCEERR_AO)
  * on behalf of the thread group. Return task_struct of the (first found)
  * dedicated thread if found, and return NULL otherwise.
+ *
+ * We already hold read_lock(&tasklist_lock) in the caller, so we don't
+ * have to call rcu_read_lock/unlock() in this function.
  */
 static struct task_struct *find_early_kill_thread(struct task_struct *tsk)
 {
 	struct task_struct *t;
-	rcu_read_lock();
 	for_each_thread(tsk, t)
 		if ((t->flags & PF_MCE_PROCESS) && (t->flags & PF_MCE_EARLY))
-			goto found;
-	t = NULL;
-found:
-	rcu_read_unlock();
-	return t;
+			return t;
+	return NULL;
 }
 
 /*
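
For readers who want to trace the selection logic of patch 3/3 without a kernel tree, the following is a minimal userspace sketch of it, not the kernel code itself. The struct names (model_thread, model_process), the MODEL_* flag values and the explicit sysctl_early_kill parameter are assumptions made for illustration; locking is left out entirely, since the point of the hunk above is that the caller already holds tasklist_lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for the model; the kernel's PF_MCE_* values differ. */
#define MODEL_PF_MCE_PROCESS 0x1u
#define MODEL_PF_MCE_EARLY   0x2u

struct model_thread {
	unsigned int flags;
};

struct model_process {
	struct model_thread *threads;	/* threads[0] stands in for the main thread */
	size_t nr_threads;		/* assumed >= 1 whenever has_mm is set */
	int has_mm;			/* 0 models tsk->mm == NULL */
};

/* Models find_early_kill_thread(): first thread with both flags set. */
struct model_thread *model_find_early_kill_thread(struct model_process *p)
{
	for (size_t i = 0; i < p->nr_threads; i++) {
		unsigned int f = p->threads[i].flags;

		if ((f & MODEL_PF_MCE_PROCESS) && (f & MODEL_PF_MCE_EARLY))
			return &p->threads[i];
	}
	return NULL;
}

/* Models task_early_kill(): which thread, if any, should be signaled. */
struct model_thread *model_task_early_kill(struct model_process *p,
					   int force_early,
					   int sysctl_early_kill)
{
	struct model_thread *t;

	if (!p->has_mm)
		return NULL;
	if (force_early)
		return &p->threads[0];
	t = model_find_early_kill_thread(p);
	if (t)
		return t;
	if (sysctl_early_kill)
		return &p->threads[0];
	return NULL;
}
```

With three threads where only the second has both flags set, model_task_early_kill() returns the second thread for a non-forced call, and the main thread when force_early is nonzero.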

* RE: [PATCH 0/3] HWPOISON: improve memory error handling for multithread process
  2014-05-30 6:51 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Naoya Horiguchi
  ` (2 preceding siblings ...)
  2014-05-30 6:51 ` [PATCH 3/3] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) Naoya Horiguchi
@ 2014-05-30 17:25 ` Luck, Tony
  2014-05-30 18:24 ` Naoya Horiguchi
  [not found] ` <5388cd0e.463edd0a.755d.6f61SMTPIN_ADDED_BROKEN@mx.google.com>
  3 siblings, 2 replies; 31+ messages in thread
From: Luck, Tony @ 2014-05-30 17:25 UTC (permalink / raw)
  To: Naoya Horiguchi, Andrew Morton
  Cc: Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel@vger.kernel.org, linux-mm@kvack.org

> This patchset is the summary of recent discussion about memory error handling
> on multithread application. Patch 1 and 2 is for action required errors, and
> patch 3 is for action optional errors.

Naoya,

You suggested early in the discussion (when there were just two patches) that
they deserved a "Cc: stable@vger.kernel.org". I agreed, and still think the same
way.

-Tony

* Re: [PATCH 0/3] HWPOISON: improve memory error handling for multithread process
  2014-05-30 17:25 ` [PATCH 0/3] HWPOISON: improve memory error handling for multithread process Luck, Tony
@ 2014-05-30 18:24 ` Naoya Horiguchi
  [not found] ` <5388cd0e.463edd0a.755d.6f61SMTPIN_ADDED_BROKEN@mx.google.com>
  1 sibling, 0 replies; 31+ messages in thread
From: Naoya Horiguchi @ 2014-05-30 18:24 UTC (permalink / raw)
  To: Tony Luck
  Cc: Andrew Morton, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

On Fri, May 30, 2014 at 05:25:39PM +0000, Luck, Tony wrote:
> > This patchset is the summary of recent discussion about memory error handling
> > on multithread application. Patch 1 and 2 is for action required errors, and
> > patch 3 is for action optional errors.
>
> Naoya,
>
> You suggested early in the discussion (when there were just two patches) that
> they deserved a "Cc: stable@vger.kernel.org". I agreed, and still think the same
> way.

Correct. AR error handling was added in v3.2-rc5, so adding
"Cc: stable@vger.kernel.org # v3.2+" is fine.

Thanks,
Naoya Horiguchi
[parent not found: <5388cd0e.463edd0a.755d.6f61SMTPIN_ADDED_BROKEN@mx.google.com>]

* Re: [PATCH 0/3] HWPOISON: improve memory error handling for multithread process
  [not found] ` <5388cd0e.463edd0a.755d.6f61SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-06-02 22:43 ` Andrew Morton
  2014-06-02 23:37 ` Luck, Tony
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2014-06-02 22:43 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Tony Luck, Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel, linux-mm

On Fri, 30 May 2014 14:24:52 -0400 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> On Fri, May 30, 2014 at 05:25:39PM +0000, Luck, Tony wrote:
> > > This patchset is the summary of recent discussion about memory error handling
> > > on multithread application. Patch 1 and 2 is for action required errors, and
> > > patch 3 is for action optional errors.
> >
> > Naoya,
> >
> > You suggested early in the discussion (when there were just two patches) that
> > they deserved a "Cc: stable@vger.kernel.org". I agreed, and still think the same
> > way.
>
> Correct. AR error handling was added in v3.2-rc5, so adding
> "Cc: stable@vger.kernel.org # v3.2+" is fine.

I'm not sure that "[PATCH 3/3] mm/memory-failure.c: support dedicated
thread to handle SIGBUS(BUS_MCEERR_AO)" is a -stable thing? That's a
feature addition more than a bugfix?

* RE: [PATCH 0/3] HWPOISON: improve memory error handling for multithread process
  2014-06-02 22:43 ` Andrew Morton
@ 2014-06-02 23:37 ` Luck, Tony
  0 siblings, 0 replies; 31+ messages in thread
From: Luck, Tony @ 2014-06-02 23:37 UTC (permalink / raw)
  To: Andrew Morton, Naoya Horiguchi
  Cc: Andi Kleen, Kamil Iskra, Borislav Petkov, Chen Gong, linux-kernel@vger.kernel.org, linux-mm@kvack.org

> I'm not sure that "[PATCH 3/3] mm/memory-failure.c: support dedicated
> thread to handle SIGBUS(BUS_MCEERR_AO)" is a -stable thing? That's a
> feature addition more than a bugfix?

No - the old behavior was crazy - someone with a multithreaded process might
well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then that
thread would see the SIGBUS with si_code = BUS_MCEERR_AO - even if that thread
wasn't the main thread for the process.

Perhaps the description for the commit should better reflect that?

-Tony
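
As a rough illustration of what the application side of this looks like, here is a hedged userspace sketch. "prctl(PF_MCE_EARLY)" above is shorthand: the call an application actually makes is prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0), which sets the early-kill policy for the calling thread. The helper names (request_early_mce_signal, mce_code_name, install_sigbus_handler) are made up for this example, and a real handler would also need to look at si_addr and decide how to recover.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>

/* Last si_code seen by the handler; written from signal context. */
volatile sig_atomic_t last_sigbus_code;

/*
 * Ask for early (action optional) SIGBUS delivery to the calling thread.
 * Returns 0 on success, -1 with errno set on failure.
 */
int request_early_mce_signal(void)
{
	return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Map a SIGBUS si_code to a short description. */
const char *mce_code_name(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "action required";
	case BUS_MCEERR_AO:
		return "action optional";
	default:
		return "not a machine check";
	}
}

/* Only records the code; si->si_addr would hold the poisoned address. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	last_sigbus_code = si->si_code;
}

/* Install the SIGBUS handler with SA_SIGINFO so si_code is available. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A dedicated handler thread would call install_sigbus_handler() and request_early_mce_signal() from its start routine. With patch 3/3 applied, that thread rather than the main thread is the one expected to receive the BUS_MCEERR_AO signal.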
[parent not found: <1401327939-cvm7qh0m@n-horiguchi@ah.jp.nec.com>]

* Re: [PATCH] mm/memory-failure.c: support dedicated thread to handle SIGBUS(BUS_MCEERR_AO) thread
  [not found] ` <1401327939-cvm7qh0m@n-horiguchi@ah.jp.nec.com>
@ 2014-05-30 19:52 ` Kamil Iskra
  0 siblings, 0 replies; 31+ messages in thread
From: Kamil Iskra @ 2014-05-30 19:52 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: tony.luck, linux-kernel, linux-mm, Andi Kleen, Borislav Petkov, gong.chen

On Wed, May 28, 2014 at 21:45:41 -0400, Naoya Horiguchi wrote:

> > The user could also mark more than
> > one thread in this way - in which case the kernel will pick
> > the first one it sees (is that oldest, or newest?) that is marked.
> > Not sure if this would ever be useful unless you want to pass
> > responsibility around in an application that is dynamically
> > creating and removing threads.
>
> I'm not sure which is better to send signal to first-found marked thread
> or to all marked threads. If we have a good reason to do the latter,
> I'm ok about it. Any idea?

Well, it would be more flexible if the signal were sent to all marked
threads, but I don't know if that constitutes a good enough reason to add
the extra complexity involved. Sometimes better is the enemy of good, and
in this case the patch you proposed should be good enough for any practical
case I can think of.

Naoya, Tony, thank you for taking the leadership on this issue and seeing
it through, and for the courtesy of keeping me in the loop!

Kamil

-- 
Kamil Iskra, PhD
Argonne National Laboratory, Mathematics and Computer Science Division
9700 South Cass Avenue, Building 240, Argonne, IL 60439, USA
phone: +1-630-252-7197 fax: +1-630-252-5986

* [PATCH 2/2] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED
  2014-05-20 17:35 [PATCH 0/2] Fix some machine check application recovery cases Tony Luck
  2014-05-20 16:28 ` [PATCH 1/2] memory-failure: Send right signal code to correct thread Tony Luck
@ 2014-05-20 16:46 ` Tony Luck
  2014-05-20 17:59 ` Naoya Horiguchi
  1 sibling, 1 reply; 31+ messages in thread
From: Tony Luck @ 2014-05-20 16:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: Andi Kleen, Borislav Petkov, Chen Gong

When Linux sees an "action optional" machine check (where h/w has
reported an error that is not in the current execution path) we
generally do not want to signal a process, since most processes
do not have a SIGBUS handler - we'd just prematurely terminate the
process for a problem that they might never actually see.

task_early_kill() decides whether to consider a process - and it
checks whether this specific process has been marked for early signals
with "prctl", or if the system administrator has requested early
signals for all processes using /proc/sys/vm/memory_failure_early_kill.

But for MF_ACTION_REQUIRED case we must not defer. The error is in
the execution path of the current thread so we must send the SIGBUS
immediately.

Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must
take action.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory-failure.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 642c8434b166..f0967f72991c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -380,10 +380,12 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno,
 	}
 }
 
-static int task_early_kill(struct task_struct *tsk)
+static int task_early_kill(struct task_struct *tsk, int force_early)
 {
 	if (!tsk->mm)
 		return 0;
+	if (force_early)
+		return 1;
 	if (tsk->flags & PF_MCE_PROCESS)
 		return !!(tsk->flags & PF_MCE_EARLY);
 	return sysctl_memory_failure_early_kill;
@@ -393,7 +395,7 @@ static int task_early_kill(struct task_struct *tsk)
  * Collect processes when the error hit an anonymous page.
  */
 static void collect_procs_anon(struct page *page, struct list_head *to_kill,
-			      struct to_kill **tkc)
+			      struct to_kill **tkc, int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -409,7 +411,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	for_each_process (tsk) {
 		struct anon_vma_chain *vmac;
 
-		if (!task_early_kill(tsk))
+		if (!task_early_kill(tsk, force_early))
 			continue;
 		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
 					       pgoff, pgoff) {
@@ -428,7 +430,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
  * Collect processes when the error hit a file mapped page.
  */
 static void collect_procs_file(struct page *page, struct list_head *to_kill,
-			      struct to_kill **tkc)
+			      struct to_kill **tkc, int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -439,7 +441,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	for_each_process(tsk) {
 		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 
-		if (!task_early_kill(tsk))
+		if (!task_early_kill(tsk, force_early))
 			continue;
 
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
@@ -465,7 +467,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
  * First preallocate one tokill structure outside the spin locks,
  * so that we can kill at least one process reasonably reliable.
  */
-static void collect_procs(struct page *page, struct list_head *tokill)
+static void collect_procs(struct page *page, struct list_head *tokill,
+				int force_early)
 {
 	struct to_kill *tk;
 
@@ -476,9 +479,9 @@ static void collect_procs(struct page *page, struct list_head *tokill)
 	if (!tk)
 		return;
 	if (PageAnon(page))
-		collect_procs_anon(page, tokill, &tk);
+		collect_procs_anon(page, tokill, &tk, force_early);
 	else
-		collect_procs_file(page, tokill, &tk);
+		collect_procs_file(page, tokill, &tk, force_early);
 	kfree(tk);
 }
 
@@ -963,7 +966,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(ppage, &tokill);
+		collect_procs(ppage, &tokill, flags & MF_ACTION_REQUIRED);
 
 	ret = try_to_unmap(ppage, ttu);
 	if (ret != SWAP_SUCCESS)
-- 
1.8.4.1
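
To make the effect of the new force_early argument concrete, here is a small userspace model of the decision in this patch (the int-returning task_early_kill(), before the later 3/3 rework). It is only a sketch: struct mf_task, the MF_MODEL_* and MF_PF_* values and the sysctl_early_kill parameter are assumptions for illustration, and the real collect_procs_*() additionally check that a VMA of the task covers the poisoned page, which the model leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for the model, not taken from kernel headers. */
#define MF_MODEL_ACTION_REQUIRED 0x2
#define MF_PF_MCE_PROCESS        0x1u
#define MF_PF_MCE_EARLY          0x2u

struct mf_task {
	int has_mm;		/* 0 models tsk->mm == NULL */
	unsigned int flags;	/* MF_PF_* bits set via prctl in the real system */
};

/* Models task_early_kill() as changed by this patch. */
int mf_task_early_kill(const struct mf_task *tsk, int force_early,
		       int sysctl_early_kill)
{
	if (!tsk->has_mm)
		return 0;
	if (force_early)
		return 1;
	if (tsk->flags & MF_PF_MCE_PROCESS)
		return !!(tsk->flags & MF_PF_MCE_EARLY);
	return sysctl_early_kill;
}

/*
 * Counts how many tasks would be considered for signaling. force_early is
 * derived the same way as in hwpoison_user_mappings(): any nonzero value
 * of (flags & ACTION_REQUIRED) forces collection.
 */
size_t mf_collect_count(const struct mf_task *tasks, size_t n, int mf_flags,
			int sysctl_early_kill)
{
	int force_early = mf_flags & MF_MODEL_ACTION_REQUIRED;
	size_t collected = 0;

	for (size_t i = 0; i < n; i++)
		if (mf_task_early_kill(&tasks[i], force_early,
				       sysctl_early_kill))
			collected++;
	return collected;
}
```

In this model an action-optional error with the sysctl off only collects tasks that opted in through prctl, while an action-required error collects every task that has an mm, which is the behavior change the commit message describes.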

* Re: [PATCH 2/2] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED
  2014-05-20 16:46 ` [PATCH 2/2] memory-failure: Don't let collect_procs() skip over processes for MF_ACTION_REQUIRED Tony Luck
@ 2014-05-20 17:59 ` Naoya Horiguchi
  0 siblings, 0 replies; 31+ messages in thread
From: Naoya Horiguchi @ 2014-05-20 17:59 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel, linux-mm, Andi Kleen, bp, gong.chen

On Tue, May 20, 2014 at 09:46:43AM -0700, Tony Luck wrote:
> When Linux sees an "action optional" machine check (where h/w has
> reported an error that is not in the current execution path) we
> generally do not want to signal a process, since most processes
> do not have a SIGBUS handler - we'd just prematurely terminate the
> process for a problem that they might never actually see.
>
> task_early_kill() decides whether to consider a process - and it
> checks whether this specific process has been marked for early signals
> with "prctl", or if the system administrator has requested early
> signals for all processes using /proc/sys/vm/memory_failure_early_kill.
>
> But for MF_ACTION_REQUIRED case we must not defer. The error is in
> the execution path of the current thread so we must send the SIGBUS
> immediately.
>
> Fix by passing a flag argument through collect_procs*() to
> task_early_kill() so it knows whether we can defer or must
> take action.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Thanks,
Naoya Horiguchi

> ---
>  mm/memory-failure.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 642c8434b166..f0967f72991c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -380,10 +380,12 @@ static void kill_procs(struct list_head *to_kill, int forcekill, int trapno,
>  	}
>  }
>  
> -static int task_early_kill(struct task_struct *tsk)
> +static int task_early_kill(struct task_struct *tsk, int force_early)
>  {
>  	if (!tsk->mm)
>  		return 0;
> +	if (force_early)
> +		return 1;
>  	if (tsk->flags & PF_MCE_PROCESS)
>  		return !!(tsk->flags & PF_MCE_EARLY);
>  	return sysctl_memory_failure_early_kill;
> @@ -393,7 +395,7 @@ static int task_early_kill(struct task_struct *tsk)
>   * Collect processes when the error hit an anonymous page.
>   */
>  static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> -			      struct to_kill **tkc)
> +			      struct to_kill **tkc, int force_early)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -409,7 +411,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>  	for_each_process (tsk) {
>  		struct anon_vma_chain *vmac;
>  
> -		if (!task_early_kill(tsk))
> +		if (!task_early_kill(tsk, force_early))
>  			continue;
>  		anon_vma_interval_tree_foreach(vmac, &av->rb_root,
>  					       pgoff, pgoff) {
> @@ -428,7 +430,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>   * Collect processes when the error hit a file mapped page.
>   */
>  static void collect_procs_file(struct page *page, struct list_head *to_kill,
> -			      struct to_kill **tkc)
> +			      struct to_kill **tkc, int force_early)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;
> @@ -439,7 +441,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	for_each_process(tsk) {
>  		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
>  
> -		if (!task_early_kill(tsk))
> +		if (!task_early_kill(tsk, force_early))
>  			continue;
>  
>  		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
> @@ -465,7 +467,8 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>   * First preallocate one tokill structure outside the spin locks,
>   * so that we can kill at least one process reasonably reliable.
>   */
> -static void collect_procs(struct page *page, struct list_head *tokill)
> +static void collect_procs(struct page *page, struct list_head *tokill,
> +				int force_early)
>  {
>  	struct to_kill *tk;
>  
> @@ -476,9 +479,9 @@ static void collect_procs(struct page *page, struct list_head *tokill)
>  	if (!tk)
>  		return;
>  	if (PageAnon(page))
> -		collect_procs_anon(page, tokill, &tk);
> +		collect_procs_anon(page, tokill, &tk, force_early);
>  	else
> -		collect_procs_file(page, tokill, &tk);
> +		collect_procs_file(page, tokill, &tk, force_early);
>  	kfree(tk);
>  }
>  
> @@ -963,7 +966,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	 * there's nothing that can be done.
>  	 */
>  	if (kill)
> -		collect_procs(ppage, &tokill);
> +		collect_procs(ppage, &tokill, flags & MF_ACTION_REQUIRED);
>  
>  	ret = try_to_unmap(ppage, ttu);
>  	if (ret != SWAP_SUCCESS)
> -- 
> 1.8.4.1