From: Marco Elver <elver@google.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>,
Peter Zijlstra <peterz@infradead.org>,
kasan-dev <kasan-dev@googlegroups.com>, X86 ML <x86@kernel.org>,
open list <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
regressions@lists.linux.dev, lkft-triage@lists.linaro.org,
Andrew Morton <akpm@linux-foundation.org>,
Alexander Potapenko <glider@google.com>
Subject: Re: WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/kfence.h:46 kfence_protect
Date: Fri, 18 Nov 2022 00:23:03 +0100 [thread overview]
Message-ID: <Y3bCV6VckVUEF7Pq@elver.google.com> (raw)
In-Reply-To: <4208866d-338f-4781-7ff9-023f016c5b07@intel.com>
On Thu, Nov 17, 2022 at 06:34AM -0800, Dave Hansen wrote:
> On 11/17/22 05:58, Marco Elver wrote:
> > [ 0.663761] WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/kfence.h:46 kfence_protect+0x7b/0x120
> > [ 0.664033] WARNING: CPU: 0 PID: 0 at mm/kfence/core.c:234 kfence_protect+0x7d/0x120
> > [ 0.664465] kfence: kfence_init failed
>
> Any chance you could add some debugging and figure out what actually
> made kfence call over? Was it the pte or the level?
>
> if (WARN_ON(!pte || level != PG_LEVEL_4K))
> return false;
>
> I can see how the thing you bisected to might lead to a page table not
> being split, which could mess with the 'level' check.
Yes - it's the 'level != PG_LEVEL_4K'.
We do actually try to split the pages in arch_kfence_init_pool() (above
this function) - so with "x86/mm: Inhibit _PAGE_NX changes from
cpa_process_alias()" this somehow fails...
> Also, is there a reason this code is mucking with the page tables
> directly? It seems, uh, rather wonky. This, for instance:
>
> > if (protect)
> > set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> > else
> > set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> >
> > /*
> > * Flush this CPU's TLB, assuming whoever did the allocation/free is
> > * likely to continue running on this CPU.
> > */
> > preempt_disable();
> > flush_tlb_one_kernel(addr);
> > preempt_enable();
>
> Seems rather broken. I assume the preempt_disable() is there to get rid
> of some warnings. But, there is nothing I can see to *keep* the CPU
> that did the free from being different from the one where the TLB flush
> is performed until the preempt_disable(). That makes the
> flush_tlb_one_kernel() mostly useless.
>
> Is there a reason this code isn't using the existing page table
> manipulation functions and tries to code its own? What prevents it from
> using something like the attached patch?
Yes, see the comment below - it's to avoid the IPIs and TLB shoot-downs,
because KFENCE _can_ tolerate the inaccuracy even if we hit the wrong
TLB or other CPUs' TLBs aren't immediately flushed - we trade a few
false negatives for minimizing performance impact.
> diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
> index ff5c7134a37a..5cdb3a1f3995 100644
> --- a/arch/x86/include/asm/kfence.h
> +++ b/arch/x86/include/asm/kfence.h
> @@ -37,34 +37,13 @@ static inline bool arch_kfence_init_pool(void)
> return true;
> }
>
> -/* Protect the given page and flush TLB. */
> static inline bool kfence_protect_page(unsigned long addr, bool protect)
> {
> - unsigned int level;
> - pte_t *pte = lookup_address(addr, &level);
> -
> - if (WARN_ON(!pte || level != PG_LEVEL_4K))
> - return false;
> -
> - /*
> - * We need to avoid IPIs, as we may get KFENCE allocations or faults
> - * with interrupts disabled. Therefore, the below is best-effort, and
> - * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
> - * lazy fault handling takes care of faults after the page is PRESENT.
> - */
> -
^^ See this comment. Additionally there's a real performance concern,
and the inaccuracy is something that we deliberately accept.
> if (protect)
> - set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
> + set_memory_np(addr, addr + PAGE_SIZE);
> else
> - set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
> + set_memory_p(addr, addr + PAGE_SIZE);
Isn't this going to do tons of IPIs and shoot down other CPU's TLBs?
KFENCE shouldn't incur this overhead on large machines with >100 CPUs if
we can avoid it.
What does "x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()" do
that suddenly makes all this fail?
What solution do you prefer that both fixes the issue and avoids the
IPIs?
Thanks,
-- Marco
next prev parent reply other threads:[~2022-11-17 23:23 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-11-17 11:31 WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/kfence.h:46 kfence_protect Naresh Kamboju
2022-11-17 13:58 ` Marco Elver
2022-11-17 14:34 ` Dave Hansen
2022-11-17 23:23 ` Marco Elver [this message]
2022-11-17 23:54 ` Dave Hansen
2022-11-18 9:19 ` Marco Elver
2022-11-18 10:32 ` Peter Zijlstra
2022-11-21 7:28 ` Naresh Kamboju
2022-11-21 8:43 ` Marco Elver
2022-11-21 5:40 ` Naresh Kamboju
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y3bCV6VckVUEF7Pq@elver.google.com \
--to=elver@google.com \
--cc=akpm@linux-foundation.org \
--cc=dave.hansen@intel.com \
--cc=glider@google.com \
--cc=kasan-dev@googlegroups.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lkft-triage@lists.linaro.org \
--cc=naresh.kamboju@linaro.org \
--cc=peterz@infradead.org \
--cc=regressions@lists.linux.dev \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.