Review of KPTI patchset

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Review of KPTI patchset
@ 2017-12-30 18:45 Mathieu Desnoyers
  2017-12-30 19:58 ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Desnoyers @ 2017-12-30 18:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

Hi Thomas,

Here is some feedback on the KPTI patchset. Sorry for not replying to the
patch, I was not CC'd on the original email, and don't have it in my inbox.

I notice that fill_ldt() sets the desc->type with "|= 1", whereas all
other operations on the desc type are done with a type enum based on
clearly defined bits. Is the hardcoded "1" on purpose ?

arch/x86/include/asm/processor.h:

"+ * With page table isolation enabled, we map the LDT in ... [stay tuned]"

I look forward to publication of the next chapter containing the rest of
this sentence. When is it due ? ;)

arch/x86/include/asm/tlbflush.h:

@@ -135,6 +193,24 @@ struct tlb_state {

bool invalidate_other;

^ I thought using bool in structures was discouraged ? Arguably, there are other
booleans in that structure already.

+static void free_ldt_pgtables(struct mm_struct *mm)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	struct mmu_gather tlb;
+	unsigned long start = LDT_BASE_ADDR;
+	unsigned long end = start + (1UL << PGDIR_SHIFT);
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	tlb_gather_mmu(&tlb, mm, start, end);
+	free_pgd_range(&tlb, start, end, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+#endif

^ AFAIK, the usual approach is to move the #ifdef outside of the function body,
and have one empty function.

@@ -156,6 +271,12 @@ int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
 	finalize_ldt_struct(new_ldt);

+	retval = map_ldt_struct(mm, new_ldt, 0);
+	if (retval) {
+		free_ldt_pgtables(mm);
+		free_ldt_struct(new_ldt);
+		goto out_unlock;
+	}
 	mm->context.ldt = new_ldt;

 out_unlock:

^ I don't get why it does "free_ldt_pgtables(mm)" on the mm argument, but
it's not done in other error paths. Perhaps it's OK, but ownership seems
non-obvious.

@@ -287,6 +413,18 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 	new_ldt->entries[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);

+	/*
+	 * If we are using PTI, map the new LDT into the userspace pagetables.
+	 * If there is already an LDT, use the other slot so that other CPUs
+	 * will continue to use the old LDT until install_ldt() switches
+	 * them over to the new LDT.
+	 */
+	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
+	if (error) {
+		free_ldt_struct(old_ldt);
+		goto out_unlock;
+	}
+

^ is it really "old_ldt" that we want freed on error here ? Or should it be
"new_ldt" ?

--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -38,6 +38,32 @@ cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot)
 		cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot);
 }

+static void percpu_setup_debug_store(int cpu)
+{
+#ifdef CONFIG_CPU_SUP_INTEL

^ This ifdef should ideally be outside of the function, defining
an empty function in the #else.

+	/*
+	 * Force the population of PMDs for not yet allocated per cpu
+	 * memory like debug store buffers.
+	 */
+	npages = sizeof(struct debug_store_buffers) / PAGE_SIZE;
+	for (; npages; npages--, cea += PAGE_SIZE)
+		cea_set_pte(cea, 0, PAGE_NONE);

^ the code above (in percpu_setup_debug_store()) depends on having
struct debug_store_buffers's size being a multiple of PAGE_SIZE. A
comment should be added near the structure declaration to document
this requirement.

+static void __init pti_setup_espfix64(void)
+{
+#ifdef CONFIG_X86_ESPFIX64
+	pti_clone_p4d(ESPFIX_BASE_ADDR);
+#endif
+}

Seeing how this ifdef within function layout is everywhere in the patch,
I start to wonder whether I missed a coding style guideline somewhere... ?

+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_asid_other(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+	if (!static_cpu_has(X86_FEATURE_PTI)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.invalidate_other, false);
+}

Can this be called with preemption enabled ? If so, what happens
if migrated ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Review of KPTI patchset
  2017-12-30 18:45 Review of KPTI patchset Mathieu Desnoyers
@ 2017-12-30 19:58 ` Thomas Gleixner
  2017-12-30 20:43   ` Mathieu Desnoyers
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2017-12-30 19:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:

> Hi Thomas,
> 
> Here is some feedback on the KPTI patchset. Sorry for not replying to the
> patch, I was not CC'd on the original email, and don't have it in my inbox.

I can bounce you 196 versions if you want.

> I notice that fill_ldt() sets the desc->type with "|= 1", whereas all
> other operations on the desc type are done with a type enum based on
> clearly defined bits. Is the hardcoded "1" on purpose ?

I don't understand your question. That code does not have any enum involved
at all:

        desc->type              = (info->read_exec_only ^ 1) << 1;
        desc->type             |= info->contents << 2;
        /* Set the ACCESS bit so it can be mapped RO */
        desc->type             |= 1;

So the |= 1 is completely consistent with the rest of that code.

> arch/x86/include/asm/processor.h:
> 
> "+ * With page table isolation enabled, we map the LDT in ... [stay tuned]"
> 
> I look forward to publication of the next chapter containing the rest of
> this sentence. When is it due ? ;)

Don't know. Lost my crystal ball.

> +static void free_ldt_pgtables(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_PAGE_TABLE_ISOLATION
> +	struct mmu_gather tlb;
> +	unsigned long start = LDT_BASE_ADDR;
> +	unsigned long end = start + (1UL << PGDIR_SHIFT);
> +
> +	if (!static_cpu_has(X86_FEATURE_PTI))
> +		return;
> +
> +	tlb_gather_mmu(&tlb, mm, start, end);
> +	free_pgd_range(&tlb, start, end, start, end);
> +	tlb_finish_mmu(&tlb, start, end);
> +#endif
> 
> ^ AFAIK, the usual approach is to move the #ifdef outside of the function body,
> and have one empty function.

That really depends. If you have several functions that makes sense, if you
have only one, not so much.
 
> @@ -156,6 +271,12 @@ int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
>  	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
>  	finalize_ldt_struct(new_ldt);
>  
> +	retval = map_ldt_struct(mm, new_ldt, 0);
> +	if (retval) {
> +		free_ldt_pgtables(mm);
> +		free_ldt_struct(new_ldt);
> +		goto out_unlock;
> +	}
>  	mm->context.ldt = new_ldt;
>  
>  out_unlock:
> 
> ^ I don't get why it does "free_ldt_pgtables(mm)" on the mm argument, but
> it's not done in other error paths. Perhaps it's OK, but ownership seems
> non-obvious.

The pagetable for LDT is allocated and populated in the user space visible
part of a process PGDIR, which obviously is connected to the mm struct....

Which other error paths are you talking about? 

> @@ -287,6 +413,18 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>  	new_ldt->entries[ldt_info.entry_number] = ldt;
>  	finalize_ldt_struct(new_ldt);
>  
> +	/*
> +	 * If we are using PTI, map the new LDT into the userspace pagetables.
> +	 * If there is already an LDT, use the other slot so that other CPUs
> +	 * will continue to use the old LDT until install_ldt() switches
> +	 * them over to the new LDT.
> +	 */
> +	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
> +	if (error) {
> +		free_ldt_struct(old_ldt);
> +		goto out_unlock;
> +	}
> +
> 
> ^ is it really "old_ldt" that we want freed on error here ? Or should it be
> "new_ldt" ?

Ouch. Yes, that wants to be new_ldt indeed.

> +	/*
> +	 * Force the population of PMDs for not yet allocated per cpu
> +	 * memory like debug store buffers.
> +	 */
> +	npages = sizeof(struct debug_store_buffers) / PAGE_SIZE;
> +	for (; npages; npages--, cea += PAGE_SIZE)
> +		cea_set_pte(cea, 0, PAGE_NONE);
> 
> ^ the code above (in percpu_setup_debug_store()) depends on having
> struct debug_store_buffers's size being a multiple of PAGE_SIZE. A
> comment should be added near the structure declaration to document
> this requirement.

Hmm. There was a build_bug_on() somewhere which ensured that. That must
have been lost in one of the gazillion iterations.

> +static void __init pti_setup_espfix64(void)
> +{
> +#ifdef CONFIG_X86_ESPFIX64
> +	pti_clone_p4d(ESPFIX_BASE_ADDR);
> +#endif
> +}
> 
> Seeing how this ifdef within function layout is everywhere in the patch,
> I start to wonder whether I missed a coding style guideline somewhere... ?

I don't see how extra empty functions would improve that, but that's a
pointless debate.

> +/*
> + * We get here when we do something requiring a TLB invalidation
> + * but could not go invalidate all of the contexts.  We do the
> + * necessary invalidation by clearing out the 'ctx_id' which
> + * forces a TLB flush when the context is loaded.
> + */
> +void clear_asid_other(void)
> +{
> +	u16 asid;
> +
> +	/*
> +	 * This is only expected to be set if we have disabled
> +	 * kernel _PAGE_GLOBAL pages.
> +	 */
> +	if (!static_cpu_has(X86_FEATURE_PTI)) {
> +		WARN_ON_ONCE(1);
> +		return;
> +	}
> +
> +	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
> +		/* Do not need to flush the current asid */
> +		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
> +			continue;
> +		/*
> +		 * Make sure the next time we go to switch to
> +		 * this asid, we do a flush:
> +		 */
> +		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
> +	}
> +	this_cpu_write(cpu_tlbstate.invalidate_other, false);
> +}
> 
> Can this be called with preemption enabled ? If so, what happens
> if migrated ?

No, it can't and if it is then it's a bug and the smp_processor_id() debug
code will yell at you.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Review of KPTI patchset
  2017-12-30 19:58 ` Thomas Gleixner
@ 2017-12-30 20:43   ` Mathieu Desnoyers
  2017-12-30 22:02     ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Desnoyers @ 2017-12-30 20:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

----- On Dec 30, 2017, at 2:58 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:
> 
>> Hi Thomas,
>> 
>> Here is some feedback on the KPTI patchset. Sorry for not replying to the
>> patch, I was not CC'd on the original email, and don't have it in my inbox.
> 
> I can bounce you 196 versions if you want.

Oh no, don't worry about this. I'm happy reviewing the resulting patchset
as it is. :)

> 
>> I notice that fill_ldt() sets the desc->type with "|= 1", whereas all
>> other operations on the desc type are done with a type enum based on
>> clearly defined bits. Is the hardcoded "1" on purpose ?
> 
> I don't understand your question. That code does not have any enum involved
> at all:

I think I got mixed up with other "desc" fields within other structures
of desc_defs.h.

> 
>        desc->type              = (info->read_exec_only ^ 1) << 1;
>        desc->type             |= info->contents << 2;
>        /* Set the ACCESS bit so it can be mapped RO */
>        desc->type             |= 1;
> 
> So the |= 1 is completely consistent with the rest of that code.

It indeed seems consistent with the rest of that code, which could use
more comments and documentation. For instance, x86 desc_defs.h
could benefit from extra comments describing the meaning of each bit
near the "type" field.

I guess a counter-argument is that anyone reading through that code
should look up the "segment descriptor" layout in a x86 manual. Not
ideal though.

> 
>> arch/x86/include/asm/processor.h:
>> 
>> "+ * With page table isolation enabled, we map the LDT in ... [stay tuned]"
>> 
>> I look forward to publication of the next chapter containing the rest of
>> this sentence. When is it due ? ;)
> 
> Don't know. Lost my crystal ball.

Me too :) I would be helpful to complete this comment though.

[...]

>> @@ -156,6 +271,12 @@ int ldt_dup_context(struct mm_struct *old_mm, struct
>> mm_struct *mm)
>>  	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
>>  	finalize_ldt_struct(new_ldt);
>>  
>> +	retval = map_ldt_struct(mm, new_ldt, 0);
>> +	if (retval) {
>> +		free_ldt_pgtables(mm);
>> +		free_ldt_struct(new_ldt);
>> +		goto out_unlock;
>> +	}
>>  	mm->context.ldt = new_ldt;
>>  
>>  out_unlock:
>> 
>> ^ I don't get why it does "free_ldt_pgtables(mm)" on the mm argument, but
>> it's not done in other error paths. Perhaps it's OK, but ownership seems
>> non-obvious.
> 
> The pagetable for LDT is allocated and populated in the user space visible
> part of a process PGDIR, which obviously is connected to the mm struct....
> 
> Which other error paths are you talking about?

Let's look at the entire function:

> /*
>  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
>  * the new task is not running, so nothing can be installed.
>  */
> int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
> {
>       struct ldt_struct *new_ldt;
>       int retval = 0;
>
>       if (!old_mm)
>               return 0;

If old_mm is NULL, free_ldt_pgtables(mm) is not called.

>
>       mutex_lock(&old_mm->context.lock);
>       if (!old_mm->context.ldt)

If old_mm->context.ldt is NULL, free_ldt_pgtables(mm) is not called.

>               goto out_unlock;
>
>       new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
>       if (!new_ldt) {
>               retval = -ENOMEM;

On allocation error, free_ldt_pgtables(mm) is not called.

>               goto out_unlock;
>       }
>
>       memcpy(new_ldt->entries, old_mm->context.ldt->entries,
>              new_ldt->nr_entries * LDT_ENTRY_SIZE);
>       finalize_ldt_struct(new_ldt);
>
>       retval = map_ldt_struct(mm, new_ldt, 0);
>       if (retval) {
>               free_ldt_pgtables(mm);

Here, if we fail to map_ldt_struct, then free_ldt_pgtables(mm) is called.

>               free_ldt_struct(new_ldt);

In addition to call free_ldt_struct(), but map_ldt_struct failed... ?

This lack of symmetry makes me uncomfortable, and it may hint at something
fishy.

>               goto out_unlock;
>       }
>       mm->context.ldt = new_ldt;
>
> out_unlock:
>       mutex_unlock(&old_mm->context.lock);
>       return retval;
> }

[...]

> 
>> +	/*
>> +	 * Force the population of PMDs for not yet allocated per cpu
>> +	 * memory like debug store buffers.
>> +	 */
>> +	npages = sizeof(struct debug_store_buffers) / PAGE_SIZE;
>> +	for (; npages; npages--, cea += PAGE_SIZE)
>> +		cea_set_pte(cea, 0, PAGE_NONE);
>> 
>> ^ the code above (in percpu_setup_debug_store()) depends on having
>> struct debug_store_buffers's size being a multiple of PAGE_SIZE. A
>> comment should be added near the structure declaration to document
>> this requirement.
> 
> Hmm. There was a build_bug_on() somewhere which ensured that. That must
> have been lost in one of the gazillion iterations.

A build bug on would work as documentation indeed.

[...]

> 
>> +/*
>> + * We get here when we do something requiring a TLB invalidation
>> + * but could not go invalidate all of the contexts.  We do the
>> + * necessary invalidation by clearing out the 'ctx_id' which
>> + * forces a TLB flush when the context is loaded.
>> + */
>> +void clear_asid_other(void)
>> +{
>> +	u16 asid;
>> +
>> +	/*
>> +	 * This is only expected to be set if we have disabled
>> +	 * kernel _PAGE_GLOBAL pages.
>> +	 */
>> +	if (!static_cpu_has(X86_FEATURE_PTI)) {
>> +		WARN_ON_ONCE(1);
>> +		return;
>> +	}
>> +
>> +	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
>> +		/* Do not need to flush the current asid */
>> +		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>> +			continue;
>> +		/*
>> +		 * Make sure the next time we go to switch to
>> +		 * this asid, we do a flush:
>> +		 */
>> +		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
>> +	}
>> +	this_cpu_write(cpu_tlbstate.invalidate_other, false);
>> +}
>> 
>> Can this be called with preemption enabled ? If so, what happens
>> if migrated ?
> 
> No, it can't and if it is then it's a bug and the smp_processor_id() debug
> code will yell at you.

I thought the whole point about this_cpu_*() was that it could be called
with preemption enabled, given that it figures out the per-cpu data offset
using a segment selector prefix. How would smp_processor_id() debug code be
involved here ?

Thanks,

Mathieu


> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Review of KPTI patchset
  2017-12-30 20:43   ` Mathieu Desnoyers
@ 2017-12-30 22:02     ` Thomas Gleixner
  2017-12-30 22:45       ` Thomas Gleixner
  2017-12-31 14:09       ` Mathieu Desnoyers
  0 siblings, 2 replies; 7+ messages in thread
From: Thomas Gleixner @ 2017-12-30 22:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:
> ----- On Dec 30, 2017, at 2:58 PM, Thomas Gleixner tglx@linutronix.de wrote:
> > /*
> >  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
> >  * the new task is not running, so nothing can be installed.
> >  */
> > int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
> > {
> >       struct ldt_struct *new_ldt;
> >       int retval = 0;
> >
> >       if (!old_mm)
> >               return 0;
> 
> If old_mm is NULL, free_ldt_pgtables(mm) is not called.

Correct. Why should it be called? Nothing is allocated. You cannot
associate anything to a NULL pointer and you cannot duplicate the LDT
referenced by a NULL pointer, right?

> >       mutex_lock(&old_mm->context.lock);
> >       if (!old_mm->context.ldt)
> 
> If old_mm->context.ldt is NULL, free_ldt_pgtables(mm) is not called.

Again. Why would it be called? There is no page table allocated yet. Care
to look at the calling context or the comment above the function which
explains it?

That's fork: old_mm is the parent mm and mm is the child mm. So how would
the new child mm have that LDT pagetable _before_ it was ever tried to
allocate/map?

> >               goto out_unlock;
> >
> >       new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
> >       if (!new_ldt) {
> >               retval = -ENOMEM;
> 
> On allocation error, free_ldt_pgtables(mm) is not called.

Once more. There is no page table allocated yet.

> >               goto out_unlock;
> >       }
> >
> >       memcpy(new_ldt->entries, old_mm->context.ldt->entries,
> >              new_ldt->nr_entries * LDT_ENTRY_SIZE);
> >       finalize_ldt_struct(new_ldt);
> >
> >       retval = map_ldt_struct(mm, new_ldt, 0);
> >       if (retval) {
> >               free_ldt_pgtables(mm);
> 
> Here, if we fail to map_ldt_struct, then free_ldt_pgtables(mm) is called.
> >               free_ldt_struct(new_ldt);
> 
> In addition to call free_ldt_struct(), but map_ldt_struct failed... ?

Yes, because we failed to map it and free_ldt_pgtables() cleans up the
leftovers of the map function, which can exit with error and a half
populated page table.

free_ldt_struct() must be called otherwise we leak new_ldt which got
allocated above.

> This lack of symmetry makes me uncomfortable, and it may hint at something
> fishy.

Its not asymetric at all.

The only asymetry is in the error path of write_ldt() which can leak a half
allocated page table. But, that's a nasty one because if there is an
existing LDT mapped, then the pagetable cannot be freed. So yes, it's not
nice, but harmless and needs some thought to fix.

in the ldt_dup() case we can remove it right away because nothing has been
exposed at that point.

> >> +	this_cpu_write(cpu_tlbstate.invalidate_other, false);
> >> +}
> >> 
> >> Can this be called with preemption enabled ? If so, what happens
> >> if migrated ?
> > 
> > No, it can't and if it is then it's a bug and the smp_processor_id() debug
> > code will yell at you.
> 
> I thought the whole point about this_cpu_*() was that it could be called
> with preemption enabled, given that it figures out the per-cpu data offset
> using a segment selector prefix. How would smp_processor_id() debug code be
> involved here ?

You're right, the this_cpu_read/write wont. But the this_cpu_ptr()
dereference in one of the invoked functions will. 

Granted, it's not obvious and ideally we convert those this_cpu_read/writes
to __this_cpu_read/writes() to get the immediate fail reported on the first
access.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Review of KPTI patchset
  2017-12-30 22:02     ` Thomas Gleixner
@ 2017-12-30 22:45       ` Thomas Gleixner
  2017-12-31 14:09       ` Mathieu Desnoyers
  1 sibling, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2017-12-30 22:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

On Sat, 30 Dec 2017, Thomas Gleixner wrote:
> On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:
> The only asymetry is in the error path of write_ldt() which can leak a half
> allocated page table. But, that's a nasty one because if there is an
> existing LDT mapped, then the pagetable cannot be freed. So yes, it's not
> nice, but harmless and needs some thought to fix.

In fact it's not a leak. It's just memory waste because the pagetable gets
freed when the process exits.

The memory waste is rather simple to fix. Delta patch below.

Thanks,

	tglx

8<--------------
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -421,6 +421,14 @@ static int write_ldt(void __user *ptr, u
 	 */
 	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
 	if (error) {
+		/*
+		 * Drop potentially half populated page table if the
+		 * mapping code failed and this was the first attempt to
+		 * install a LDT. If there is a LDT installed then the LDT
+		 * pagetable cannot be freed for obvious reasons.
+		 */
+		if (!old_ldt)
+			free_ldt_pgtables(mm);
 		free_ldt_struct(new_ldt);
 		goto out_unlock;
 	}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Review of KPTI patchset
  2017-12-30 22:02     ` Thomas Gleixner
  2017-12-30 22:45       ` Thomas Gleixner
@ 2017-12-31 14:09       ` Mathieu Desnoyers
  2017-12-31 14:14         ` Thomas Gleixner
  1 sibling, 1 reply; 7+ messages in thread
From: Mathieu Desnoyers @ 2017-12-31 14:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

----- On Dec 30, 2017, at 5:02 PM, Thomas Gleixner tglx@linutronix.de wrote:

> On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:
>> ----- On Dec 30, 2017, at 2:58 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> > /*
>> >  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
>> >  * the new task is not running, so nothing can be installed.
>> >  */
>> > int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
>> > {
>> >       struct ldt_struct *new_ldt;
>> >       int retval = 0;
>> >
>> >       if (!old_mm)
>> >               return 0;
>> 
>> If old_mm is NULL, free_ldt_pgtables(mm) is not called.
> 
> Correct. Why should it be called? Nothing is allocated. You cannot
> associate anything to a NULL pointer and you cannot duplicate the LDT
> referenced by a NULL pointer, right?

Right, from a fork() context perspective it's fine. (same for rest of function)

[...]

> 
>> >> +	this_cpu_write(cpu_tlbstate.invalidate_other, false);
>> >> +}
>> >> 
>> >> Can this be called with preemption enabled ? If so, what happens
>> >> if migrated ?
>> > 
>> > No, it can't and if it is then it's a bug and the smp_processor_id() debug
>> > code will yell at you.
>> 
>> I thought the whole point about this_cpu_*() was that it could be called
>> with preemption enabled, given that it figures out the per-cpu data offset
>> using a segment selector prefix. How would smp_processor_id() debug code be
>> involved here ?
> 
> You're right, the this_cpu_read/write wont. But the this_cpu_ptr()
> dereference in one of the invoked functions will.

I don't see any this_cpu_ptr() dereference being invoked either directly
or indirectly from clear_asid_other(). What am I missing ?

> 
> Granted, it's not obvious and ideally we convert those this_cpu_read/writes
> to __this_cpu_read/writes() to get the immediate fail reported on the first
> access.

Indeed, if this function is expected to be called from non-preempt context,
the __this_cpu_*() would be appropriate.

Moreover, clear_asid_other() could use a "static" definition, given that it's only ever
used from arch/x86/mm/tlb.c:choose_new_asid(). Otherwise nothing prevents other users
from unrelated areas of the kernel from calling it, and then we should document that
preemption needs to be disabled around its invocation.

While we are there (discussing preemption and migration), I'm puzzled about the
preempt_disable/enable in __native_flush_tlb(). First, it should cover the
invalidate_user_asid() call too. Also, if the __ prefixed functions in this header
are expected to be called with preemption enabled, then we might want to add a
preempt disable around __native_flush_tlb_single(), which also do a few per-cpu
data accesses, or otherwise document that it needs to be called with preemption
disabled (with a warn-on).

The case of __flush_tlb_one() is also interesting: it invokes __flush_tlb_single()
and then invalidate_other_asid(), which each perform this_cpu_*() operations.
If preemption is not disabled across the entire function, migration could cause
__flush_tlb_single() and invalidate_other_asid() to run on different CPUs. Again,
we should either explicitly disable preemption, or document that it needs to be
invoked with preemption disabled (with a warn-on).

Thanks,

Mathieu

> 
> Thanks,
> 
> 	tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Review of KPTI patchset
  2017-12-31 14:09       ` Mathieu Desnoyers
@ 2017-12-31 14:14         ` Thomas Gleixner
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2017-12-31 14:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Andy Lutomirski, Peter Zijlstra, Borislav Petkov,
	Dave Hansen, Hugh Dickins, Linus Torvalds

On Sun, 31 Dec 2017, Mathieu Desnoyers wrote:
> > Granted, it's not obvious and ideally we convert those this_cpu_read/writes
> > to __this_cpu_read/writes() to get the immediate fail reported on the first
> > access.
> 
> Indeed, if this function is expected to be called from non-preempt context,
> the __this_cpu_*() would be appropriate.
> 
> Moreover, clear_asid_other() could use a "static" definition, given that it's only ever
> used from arch/x86/mm/tlb.c:choose_new_asid(). Otherwise nothing prevents other users
> from unrelated areas of the kernel from calling it, and then we should document that
> preemption needs to be disabled around its invocation.
> 
> While we are there (discussing preemption and migration), I'm puzzled about the
> preempt_disable/enable in __native_flush_tlb(). First, it should cover the
> invalidate_user_asid() call too. Also, if the __ prefixed functions in this header

You're late to that discussion. That has been fixed already....

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-12-31 14:14 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-12-30 18:45 Review of KPTI patchset Mathieu Desnoyers
2017-12-30 19:58 ` Thomas Gleixner
2017-12-30 20:43   ` Mathieu Desnoyers
2017-12-30 22:02     ` Thomas Gleixner
2017-12-30 22:45       ` Thomas Gleixner
2017-12-31 14:09       ` Mathieu Desnoyers
2017-12-31 14:14         ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox