linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@kernel.org>
To: "Kani, Toshi" <toshi.kani@hpe.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"gratian.crisan@ni.com" <gratian.crisan@ni.com>,
	"mingo@kernel.org" <mingo@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"julia.cartwright@ni.com" <julia.cartwright@ni.com>,
	"torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"bp@suse.de" <bp@suse.de>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"brgerst@gmail.com" <brgerst@gmail.com>,
	"luto@kernel.org" <luto@kernel.org>,
	"dave.hansen@intel.com" <dave.hansen@intel.com>,
	"dvlasenk@redhat.com" <dvlasenk@redhat.com>,
	"gratian@gmail.com" <gratian@gmail.com>
Subject: Re: Kernel page fault in vmalloc_fault() after a preempted ioremap
Date: Thu, 8 Mar 2018 22:38:32 +0000	[thread overview]
Message-ID: <CALCETrUB0brd92Tuv_cgakTgBo8yXxaAC1eLUMePMNsoWPK+mw@mail.gmail.com> (raw)
In-Reply-To: <1520548101.2693.106.camel@hpe.com>

On Thu, Mar 8, 2018 at 9:43 PM, Kani, Toshi <toshi.kani@hpe.com> wrote:
> On Thu, 2018-03-08 at 14:34 -0600, Gratian Crisan wrote:
>> Hi all,
>>
>> We are seeing kernel page faults happening on module loads with certain
>> drivers like the i915 video driver[1]. This was initially discovered on
>> a 4.9 PREEMPT_RT kernel. It takes 5 days on average to reproduce using a
>> simple reboot loop test. Looking at the code paths involved I believe
>> the issue is still present in the latest vanilla kernel.
>>
>> Some relevant points are:
>>
>>   * x86_64 CPU: Intel Atom E3940
>>
>>   * CONFIG_HUGETLBFS is not set (which also gates CONFIG_HUGETLB_PAGE)
>>
>> Based on function traces I was able to gather the sequence of events is:
>>
>>   1. Driver starts a ioremap operation for a region that is PMD_SIZE in
>>   size (or PUD_SIZE).
>>
>>   2. The ioremap() operation is preempted while it's in the middle of
>>   setting up the page mappings:
>>   ioremap_page_range->...->ioremap_pmd_range->pmd_set_huge <<preempted>>
>>
>>   3. Unrelated tasks run. Traces also include some cross core scheduling
>>   IPI calls.
>>
>>   4. Driver resumes execution finishes the ioremap operation and tries to
>>   access the newly mapped IO region. This triggers a vmalloc fault.
>>
>>   5. The vmalloc_fault() function hits a kernel page fault when trying to
>>   dereference a non-existent *pte_ref.
>>
>> The reason this happens is the code paths called from ioremap_page_range()
>> make different assumptions about when a large page (pud/pmd) mapping can be
>> used versus the code paths in vmalloc_fault().
>>
>> Using the PMD sized ioremap case as an example (the PUD case is similar):
>> ioremap_pmd_range() calls ioremap_pmd_enabled() which is gated by
>> CONFIG_HAVE_ARCH_HUGE_VMAP. On x86_64 this will return true unless the
>> "nohugeiomap" kernel boot parameter is passed in.
>>
>> On the other hand, in the rare case when a page fault happens in the
>> ioremap'ed region, vmalloc_fault() calls the pmd_huge() function to check
>> if a PMD page is marked huge or if it should go on and get a reference to
>> the PTE. However pmd_huge() is conditionally compiled based on the user
>> configured CONFIG_HUGETLB_PAGE selected by CONFIG_HUGETLBFS. If the
>> CONFIG_HUGETLBFS option is not enabled pmd_huge() is always defined to be
>> 0.
>>
>> The end result is an OOPS in vmalloc_fault() when the non-existent pte_ref
>> is dereferenced because the test for pmd_huge() failed.
>>
>> Commit f4eafd8bcd52 ("x86/mm: Fix vmalloc_fault() to handle large pages
>> properly") attempted to fix the mismatch between ioremap() and
>> vmalloc_fault() with regards to huge page handling but it missed this use
>> case.
>>
>> I am working on a simpler reproducing case however so far I've been
>> unsuccessful in re-creating the conditions that trigger the vmalloc fault
>> in the first place. Adding explicit scheduling points in
>> ioremap_pmd_range/pmd_set_huge doesn't seem to be sufficient. Ideas
>> appreciated.
>>
>> Any thoughts on what a correct fix would look like? Should the ioremap
>> code paths respect the HUGETLBFS config or would it be better for the
>> vmalloc fault code paths to match the tests used in ioremap and not rely
>> on the HUGETLBFS option being enabled?
>
> Thanks for the report and analysis!  I believe pud_large() and
> pmd_large() should have been used here.  I will try to reproduce the
> issue and verify the fix.

Indeed.  I find myself wondering why pud_huge() exists at all.

While you're at it, I think there may be more bugs in there.
Specifically, the code walks the reference and current tables at the
same time without any synchronization and without READ_ONCE()
protection.  I think that all of the BUG() calls below the comment:

        /*
         * Below here mismatches are bugs because these lower tables
         * are shared:
         */

are bogus and could be hit due to races.  I also think they're
pointless -- we've already asserted that the reference and loaded
tables are literally the same pointers.  I think the right fix is to
remove pud_ref, pmd_ref and pte_ref entirely and to get rid of those
BUG() calls.

What do you think?

  reply	other threads:[~2018-03-08 22:38 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-08 20:34 Kernel page fault in vmalloc_fault() after a preempted ioremap Gratian Crisan
2018-03-08 21:43 ` Kani, Toshi
2018-03-08 22:38   ` Andy Lutomirski [this message]
2018-03-08 23:08     ` Kani, Toshi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALCETrUB0brd92Tuv_cgakTgBo8yXxaAC1eLUMePMNsoWPK+mw@mail.gmail.com \
    --to=luto@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=bp@suse.de \
    --cc=brgerst@gmail.com \
    --cc=dave.hansen@intel.com \
    --cc=dvlasenk@redhat.com \
    --cc=gratian.crisan@ni.com \
    --cc=gratian@gmail.com \
    --cc=hpa@zytor.com \
    --cc=julia.cartwright@ni.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=toshi.kani@hpe.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).