From: Gavin Shan <gwshan@linux.vnet.ibm.com>
To: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE
Date: Wed, 23 Sep 2015 09:07:41 +1000
Message-ID: <20150922230741.GA6721@gwshan>
In-Reply-To: <20150922044303.GB2072@Richards-MacBook-Pro.local>
On Tue, Sep 22, 2015 at 12:43:03PM +0800, Wei Yang wrote:
>On Mon, Sep 21, 2015 at 09:49:45PM +1000, Gavin Shan wrote:
>>On Mon, Sep 21, 2015 at 05:29:48PM +0800, Wei Yang wrote:
>>>The current EEH infrastructure avoids handling EEH when a PE is passed
>>>through to a guest, but if this PE is a Child PE of the PE that hit the
>>>EEH error, the host handles it anyway. This leads to a guest hang. The
>>>correct way is to avoid handling it on the host and let the guest recover.
>>>
>>>This patch avoids handling EEH on a passed Child PE.
>>>
>>
>>Ok. It's fixing the problem seen by a guest that owns a VF when its PF hits
>>an EEH error, right? If so, I'm not sure you really tested this code. Does
>>it work for you?
>
>Yes, I injected an error on the Parent Bus PE.
>
>>
>>When the parent PE (PF) is stopped for EEH recovery, it seems impossible
>>that the child PE is unaffected and simply escapes the error. The question
>>is: how can the guest continue to work after EEH recovery on the parent PE?
>
>What I see is that the PF is covering and the VF in the guest is recovering.
>
What do you mean by "covering"? Which PE's error is detected first in your
testing? There is a potential race here. Suppose the VF PE's error is
detected first and the guest tries to recover it. After the recovery has
happened on the guest side, the host detects the PF PE's error and tries to
recover it. During that recovery the VF PE is totally unusable, but the guest
doesn't know that and keeps operating as usual. I'm not sure exactly what
kinds of problems that can cause, but it will cause issues.

On the other hand, suppose the PF PE's error is detected on the host first,
and then the guest detects the error on the VF PE. The host and guest then
try to do recovery at the same time. The host issues a PE reset to the PF PE,
and before that reset has finished, the guest issues a PE reset to the VF PE,
which will cause another EEH error.

So I'm not sure whether you had this patch fully tested. If you did, I guess
the result was just achieved by luck.
>>
>>>Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
>>>---
>>> arch/powerpc/kernel/eeh_pe.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>>diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
>>>index 5cde950..c6d0e9f 100644
>>>--- a/arch/powerpc/kernel/eeh_pe.c
>>>+++ b/arch/powerpc/kernel/eeh_pe.c
>>>@@ -172,6 +172,7 @@ static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe,
>>> * callback returns something other than NULL, or no more PEs
>>> * to be traversed.
>>> */
>>>+static void *__eeh_pe_get(void *data, void *flag);
>>> void *eeh_pe_traverse(struct eeh_pe *root,
>>> eeh_traverse_func fn, void *flag)
>>> {
>>>@@ -179,6 +180,8 @@ void *eeh_pe_traverse(struct eeh_pe *root,
>>> void *ret;
>>>
>>> for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>>+ if (eeh_pe_passed(pe) && (fn != __eeh_pe_get))
>>>+ continue;
>>
>>The code change here seems ugly.
>>
>>The "flag" argument can be extended to carry the information on whether to
>>skip pass-through PEs, so the function calling eeh_pe_traverse() decides
>>whether pass-through PEs are skipped.
>
>I don't get the point. Which "flag" do you mean? Adding a flag to eeh_pe?
>
>>> void *eeh_pe_traverse(struct eeh_pe *root,
>>> eeh_traverse_func fn, void *flag)
                          ^^^^^^^^^^
This one.
The code needn't be changed in a hurry, though. As discussed above, I don't
think this is the right way to fix the issue.
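A rough sketch of one way to read the "flag" suggestion: leave eeh_pe_traverse() generic and let each caller pass a small options struct through its existing void *flag argument, so the callback (not the traversal) decides whether pass-through PEs are skipped. This is a standalone user-space model, not kernel code: struct eeh_pe is heavily simplified, eeh_pe_passed() is modeled as a plain counter check, the PF/VF tree shape is assumed, and struct eeh_traverse_opts, count_pe() and demo_count() are hypothetical names. It only illustrates the mechanism; it does not address the PF/VF reset race described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's struct eeh_pe. */
struct eeh_pe {
	const char *name;
	int pass_dev_cnt;       /* > 0 while the PE is passed through to a guest */
	struct eeh_pe *parent;
	struct eeh_pe *child;   /* first child PE */
	struct eeh_pe *sibling; /* next PE under the same parent */
};

typedef void *(*eeh_traverse_func)(struct eeh_pe *pe, void *flag);

static bool eeh_pe_passed(const struct eeh_pe *pe)
{
	return pe->pass_dev_cnt > 0;
}

/* Depth-first successor of @pe within the subtree rooted at @root. */
static struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, struct eeh_pe *root)
{
	if (pe->child)
		return pe->child;
	while (pe != root) {
		if (pe->sibling)
			return pe->sibling;
		pe = pe->parent;
	}
	return NULL;
}

/* Traversal stays generic: no special case for any particular callback. */
static void *eeh_pe_traverse(struct eeh_pe *root, eeh_traverse_func fn,
			     void *flag)
{
	struct eeh_pe *pe;
	void *ret;

	for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
		ret = fn(pe, flag);
		if (ret)
			return ret;
	}
	return NULL;
}

/* Hypothetical payload passed through "flag"; the caller sets skip_passed. */
struct eeh_traverse_opts {
	bool skip_passed;
	int visited;
};

static void *count_pe(struct eeh_pe *pe, void *flag)
{
	struct eeh_traverse_opts *opts = flag;

	assert(opts != NULL);
	if (opts->skip_passed && eeh_pe_passed(pe))
		return NULL;    /* leave pass-through PEs alone */
	opts->visited++;
	return NULL;            /* NULL means "keep traversing" */
}

/* Model a PF PE with two VF child PEs, one of them passed through. */
int demo_count(bool skip_passed)
{
	struct eeh_pe pf = { .name = "PF" };
	struct eeh_pe vf0 = { .name = "VF0", .pass_dev_cnt = 1, .parent = &pf };
	struct eeh_pe vf1 = { .name = "VF1", .parent = &pf };
	struct eeh_traverse_opts opts = { .skip_passed = skip_passed };

	pf.child = &vf0;
	vf0.sibling = &vf1;
	eeh_pe_traverse(&pf, count_pe, &opts);
	return opts.visited;
}
```

Here demo_count(false) visits all three PEs and demo_count(true) visits two. Compared with testing fn != __eeh_pe_get inside eeh_pe_traverse(), the traversal no longer needs to know about specific callbacks; each caller opts in. As with the `continue` in the patch, skipping a PE does not stop the walk from descending into its children.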
>>
>>> ret = fn(pe, flag);
>>> if (ret) return ret;
>>> }
>>>@@ -210,6 +213,8 @@ void *eeh_pe_dev_traverse(struct eeh_pe *root,
>>
>>>
>>> /* Traverse root PE */
>>> for (pe = root; pe; pe = eeh_pe_next(pe, root)) {
>>>+ if (eeh_pe_passed(pe))
>>>+ continue;
>>> eeh_pe_for_each_dev(pe, edev, tmp) {
>>> ret = fn(edev, flag);
>>> if (ret)
>>>--
>>>2.5.0
>>>
>
>--
>Richard Yang
>Help you, Help me
Thread overview: 5+ messages
2015-09-21 9:29 [PATCH] powerpc/eeh: Avoid to handle EEH on a passed Child PE Wei Yang
2015-09-21 11:49 ` Gavin Shan
2015-09-22 4:43 ` Wei Yang
2015-09-22 23:07 ` Gavin Shan [this message]
2015-09-25 8:19 ` Wei Yang