From: Xiaofei Tan <tanxiaofei@huawei.com>
To: James Morse <james.morse@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Linuxarm <linuxarm@huawei.com>, Will Deacon <will@kernel.org>,
	Dave Martin <Dave.Martin@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	Shiju Jose <shiju.jose@huawei.com>
Subject: Re: Question about SEA handling process happened in user space
Date: Fri, 10 Apr 2020 17:43:57 +0800
Message-ID: <5E903FDD.4080106@huawei.com>
In-Reply-To: <66db5a6a-e68b-00b7-6a78-2c8cd9e63aab@arm.com>

Hi James,

On 2020/4/9 22:28, James Morse wrote:
> On 09/04/2020 10:17, Xiaofei Tan wrote:
>> On 2020/4/8 0:37, James Morse wrote:
>>> On 02/04/2020 07:35, Xiaofei Tan wrote:
>>>> On 2020/3/31 0:49, James Morse wrote:
>>>>> If the CPU doesn't tell us the address, we can't tell user-space what it is. The
>>>>> alternative is to upgrade to SIGKILL in that case.
>>>>>
>>>>>
>>>>> If you see this instead of the address provided via firmware-first, there is a
>>>>> series to improve that here:
>>>>> https://lore.kernel.org/linux-acpi/20200228174817.74278-1-james.morse@arm.com/
>>>>>
>>>>> (We skip this signal code if APEI promises it did all the work. This lets you
>>>>> take the signal from memory_failure() instead, which may have better information.)
>>>
>>>> There may be a race here.
>>>> APEI runs memory_failure() in a bottom half for memory errors, so it may not have finished
>>>> before the SEA handling here ends, and the application process may resume running.
> 
>>> With that series, it runs in process-context as task-work. memory_failure() needs to
>>> sleep, so it has to run in process-context. 
>>
>>
>>> Doing it as task-work means it runs before the thread returns to user-space.
>>
>> Sorry, I don't understand this. I thought the task-work needed a reschedule, and that the
>> current thread would have returned to user-space before it ran.
> 
> ret_to_user has a loop around do_notify_resume(). If the _TIF_NOTIFY_RESUME flag is set,
> we call tracehook_notify_resume(), which ends up in task_work_run()...
> 
> That TIF flag effectively prevents this thread returning to user-space until that task
> work has run.
> 

Got it. This mechanism is great.
BTW, I have not found where the _TIF_NOTIFY_RESUME flag is set. Is it set by default for each thread?
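
To check my understanding, here is a small user-space model of the loop you describe. It is
only a sketch: every model_* name is made up for illustration and none of this is kernel
code. It assumes the flag is set when work is queued rather than being set by default, which
is my reading of task_work_add() and may be wrong.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; the names only echo the kernel's. */
struct model_work {
	struct model_work *next;
	void (*func)(struct model_work *work);
};

struct model_task {
	bool notify_resume;        /* stands in for _TIF_NOTIFY_RESUME */
	struct model_work *works;  /* pending task work, newest first */
	int returned_to_user;      /* number of completed returns to user space */
};

/* Counts how often the example recovery callback has run. */
static int model_recovery_runs;

/* Example callback, standing in for the deferred memory_failure() work. */
static void model_recovery(struct model_work *work)
{
	(void)work;
	model_recovery_runs++;
}

/* Queue work and set the flag (assumption: this is where the flag is set). */
static void model_task_work_add(struct model_task *task, struct model_work *work)
{
	assert(task != NULL && work != NULL && work->func != NULL);
	work->next = task->works;
	task->works = work;
	task->notify_resume = true;
}

/* Drain the list, calling each queued callback once. */
static void model_task_work_run(struct model_task *task)
{
	while (task->works != NULL) {
		struct model_work *work = task->works;

		task->works = work->next;
		work->func(work);
	}
}

/* Loop while the flag is set; only then count a return to user space. */
static void model_ret_to_user(struct model_task *task)
{
	while (task->notify_resume) {
		task->notify_resume = false;
		model_task_work_run(task);
	}
	task->returned_to_user++;
}
```

In this model a thread with queued work cannot get past model_ret_to_user() until
model_recovery() has run, which is the behaviour I understand the series relies on.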

> 
>> BTW, what context does a synchronous external abort run in? I thought it was process-context.
> 
> It depends what you interrupted.
> 32bit had different CPU modes for different contexts, we don't have that in 64bit. Instead
> we mask asynchronous interrupts, and tinker with the preempt count to track the context.
> Synchronous exceptions can't be masked, so they happen in whatever context you were
> already in.
> This means the exception handlers have to be prepared for each eventuality.
> (which is why that code is starting to look complex)
> 

OK.

> 
>> Because in_interrupt() returns false when called in do_sea().
> 
> If you took the exception from EL0, or EL1 process context, yes. If you took the exception
> from an IRQ handler, in_interrupt() would return true.
> 

Got it.
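
To make sure I follow, here is a tiny user-space model of the idea. It is a sketch under
assumptions: the bit layout and all model_* names are hypothetical (the real layout lives in
include/linux/preempt.h and may differ). It only shows that a synchronous exception does not
change the count, so it observes whatever context it interrupted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit layout, for illustration only. */
#define MODEL_SOFTIRQ_MASK   0x0000ff00u
#define MODEL_HARDIRQ_MASK   0x000f0000u
#define MODEL_HARDIRQ_OFFSET 0x00010000u

/* Stands in for the per-thread preempt count. */
static unsigned int model_preempt_count;

/* Entering an IRQ handler bumps the hardirq field of the count. */
static void model_irq_enter(void)
{
	model_preempt_count += MODEL_HARDIRQ_OFFSET;
}

/* Leaving the IRQ handler drops the hardirq field again. */
static void model_irq_exit(void)
{
	assert((model_preempt_count & MODEL_HARDIRQ_MASK) != 0);
	model_preempt_count -= MODEL_HARDIRQ_OFFSET;
}

/* True if any hardirq or softirq bits are set in the count. */
static bool model_in_interrupt(void)
{
	return (model_preempt_count &
		(MODEL_HARDIRQ_MASK | MODEL_SOFTIRQ_MASK)) != 0;
}

/*
 * A synchronous exception cannot be masked and leaves the count alone,
 * so its handler sees the context that was interrupted.
 */
static bool model_sea_sees_interrupt_context(void)
{
	return model_in_interrupt();
}
```

Calling model_sea_sees_interrupt_context() between model_irq_enter() and model_irq_exit()
returns true, and false otherwise, which matches the two cases you describe.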

> 
>>> If another thread in the same process accesses the affected memory, I'd expect to take a
>>> second external abort. If another process had the page mapped, it could access the
>>> affected memory, again taking an external abort.
> 
>> Yes, it is hard to stop another thread from accessing the affected memory.
>> I just worry that the same thread may access it again.
> 
> This is the race that that series fixes.
> It can't happen with mainline as the arch code unconditionally signals the affected
> process, which was the pre-RAS behaviour.
> 

OK

>>> These two could happen while the first CPU was in firmware generating the CPER records, so
>>> it's not a race we can fix. It should be harmless: the recovery action is the same, it's
>>> just the error counters that count more events than errors. If you actually see it happen,
>>> we can try and make it smaller...
> 
>> Hmm, maybe this double SEA handling is a solution.
> 
> It assumes you get a second external-abort. We know this thread is affected, and will try
> and consume the error again if we restart it. We shouldn't restart it until we've given
> the recovery our best shot.
> Letting it loose is a poor choice if you have any kind of threshold for error-counts. They
> may jump NR_CPUs at a time until every CPU is waiting in memory_failure()...
> 

Got it. Thanks.

> 
> Thanks,
> 
> James
> 
> .
> 

-- 
 thanks
tanxiaofei

