[PATCH] x86/doc: add PTI description

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Dave Hansen <dave.hansen@linux.intel.com>
To: linux-kernel@vger.kernel.org
Cc: x86@kernel.org, Dave Hansen <dave.hansen@linux.intel.com>,
	moritz.lipp@iaik.tugraz.at, daniel.gruss@iaik.tugraz.at,
	michael.schwarz@iaik.tugraz.at,
	richard.fellner@student.tugraz.at, luto@kernel.org,
	torvalds@linux-foundation.org, keescook@google.com,
	hughd@google.com
Subject: [PATCH] x86/doc: add PTI description
Date: Mon, 18 Dec 2017 14:04:13 -0800	[thread overview]
Message-ID: <20171218220413.67C44542@viggo.jf.intel.com> (raw)


This got kicked out of the PTI set as the implementation diverged
from its contents.  I've updated it so it can hopefully rejoin the
set.

---

From: Dave Hansen <dave.hansen@linux.intel.com>

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'nopti'.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/Documentation/admin-guide/kernel-parameters.txt |    4 
 b/Documentation/x86/pti.txt                       |  182 ++++++++++++++++++++++
 2 files changed, 186 insertions(+)

diff -puN Documentation/admin-guide/kernel-parameters.txt~kpti-doc Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt~kpti-doc	2017-12-18 13:55:59.635504663 -0800
+++ b/Documentation/admin-guide/kernel-parameters.txt	2017-12-18 13:55:59.640504663 -0800
@@ -904,6 +904,10 @@
 	nopku		[X86] Disable Memory Protection Keys CPU feature found
 			in some Intel CPUs.
 
+	nopti		[X86] Disable Page Table Isolation.  Disabling this
+			feature removes hardening, but improves performance
+			of system calls and interrupts.
+
 	module.async_probe [KNL]
 			Enable asynchronous probe on this module.
 
diff -puN /dev/null Documentation/x86/pti.txt
--- /dev/null	2017-12-15 13:48:30.454245127 -0800
+++ b/Documentation/x86/pti.txt	2017-12-18 13:57:29.433504439 -0800
@@ -0,0 +1,182 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on kernel address information.  There
+are at least three existing, published, approaches using the shared
+user/kernel mapping and hardware features to defeat KASLR.  One
+approach referenced in a paper[1] locates the kernel by observing
+differences in page fault timing between present-but-inaccessable
+kernel pages and non-present pages.
+
+To avoid leaking address information, we create an new, independent
+copy of the page tables which are used only when running userspace
+applications.  When the kernel is entered via syscalls, interrupts or
+exceptions, page tables are switched to the full "kernel" copy.  When
+the system switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT).  There are a few unnecessary things that get mapped such as the
+first C function when entering an interrupt (see comments in pti.c).
+
+This approach helps to ensure that side-channel attacks that leverage
+the paging structures do not function when PTI is enabled.  It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' kernel parameter.
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page
+tables.  The first copy is very similar to what would be present
+for a kernel without PTI.  This includes a complete mapping of
+userspace that the kernel can use for things like copy_to_user().
+
+The userspace copy is used when running userspace and mirrors the
+mapping of userspace present in the kernel copy.  It maps a only
+the kernel data needed to enter and exit the kernel.  This data
+is entirely contained in the 'struct cpu_entry_area' structure
+which is placed in the fixmap and thus each CPU's copy of the
+area has a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal.  The only difference is when the kernel
+makes entries in the top (PGD) level.  In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables.  This leaves a single, shared set of
+userspace page tables to manage.  One PTE to lock, one set set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes 4k per process).
+  b. The pre-allocated second-level (p4d or pud) kernel page
+     table pages cost ~1MB of additional memory at boot.  This
+     is not totally wasted because some of these pages would
+     have been needed eventually for normal kernel page tables
+     and things in the vmalloc() area like vmemmap[].
+  c. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+     aligned so that it can be mapped by setting a single PMD
+     entry.  This consumes nearly 2MB of RAM once the kernel
+     is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.)  Moves to CR3 are on the order of a hundred
+     cycles, and are required every at entry and every at exit.
+  b. A "trampoline" must be used for SYSCALL entry.  This
+     trampoline depends on a smaller set of resources than the
+     non-PTI SYSCALL entry code, so requires mapping fewer
+     things into the userspace page tables.  The downside is
+     that stacks must be switched at entry time.
+  d. Global pages are disabled for all kernel structures not
+     mapped in both to kernel and userspace page tables.  This
+     feature of the MMU allows different processes to share TLB
+     entries mapping the kernel.  Losing the feature means more
+     TLB misses after a context switch.  The actual loss of
+     performance is very small, however, never exceeding 1%.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching page
+     tables.  This makes switching the page tables (at context
+     switch, or kernel entry/exit) cheaper.  But, on systems with
+     PCID support, the context switch code must flush both the user
+     and kernel entries out of the TLB.  The user PCID TLB flush is
+     deferred until the exit to userspace, minimizing the cost.
+  e. The userspace page tables must be populated for each new
+     process.  Even without PTI, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process.  But, with PTI, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one for the entry/exit structures.  At fork(), we need to
+     copy both.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the userspace PGD any time a set_pgd() is done
+     on a PGD used to map userspace.  This ensures that the kernel
+     and userspace copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Continue to compress the userspace-mapped data to be mapped
+   together.  Currently, it is using four entries, but could
+   be further minimized.
+3. Allow PTI to enabled/disabled at runtime in addition to the
+   boot-time switching.
+
+Testing
+========
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+   (excluding MPX and protection_keys) in a loop on multiple CPUs for
+   several minutes.  These tests frequently uncover corner cases in the
+   kernel entry code.  In general, old kernels might cause these tests
+   themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+   frequent performance monitoring non-maskable interrupts (see "NMI"
+   in /proc/interrupts).  This exercises the NMI entry/exit code which
+   is known to trigger bugs in code paths that did not expect to be
+   interrupted, including nested NMIs.  Using "-c" boosts the rate of
+   NMIs, and using two -c with separate counters encourages nested NMIs
+   and less deterministic behavior.
+
+	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-pti-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
+1. https://gruss.cc/files/kaiser.pdf
_

next             reply	other threads:[~2017-12-18 22:04 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-18 22:04 Dave Hansen [this message]
2017-12-19  0:09 ` [PATCH] x86/doc: add PTI description Randy Dunlap
2017-12-19  1:13   ` Dave Hansen
  -- strict thread matches above, loose matches on Subject: below --
2018-01-04 20:54 Dave Hansen
2018-01-04 21:46 ` Thomas Gleixner
2018-01-05  0:06 ` Kees Cook
2018-01-05  0:21   ` Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171218220413.67C44542@viggo.jf.intel.com \
    --to=dave.hansen@linux.intel.com \
    --cc=daniel.gruss@iaik.tugraz.at \
    --cc=hughd@google.com \
    --cc=keescook@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=michael.schwarz@iaik.tugraz.at \
    --cc=moritz.lipp@iaik.tugraz.at \
    --cc=richard.fellner@student.tugraz.at \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox