From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932161AbcGAXEL (ORCPT <rfc822;w@1wt.eu>);
	Fri, 1 Jul 2016 19:04:11 -0400
Received: from gate.crashing.org ([63.228.1.57]:34092 "EHLO gate.crashing.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932068AbcGAXEJ (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 1 Jul 2016 19:04:09 -0400
Message-ID: <1467412092.7422.56.camel@kernel.crashing.org>
Subject: Re: [PATCH 0/4] [RFC][v4] Workaround for Xeon Phi PTE A/D bits
 erratum
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Dave Hansen <dave@sr71.net>, linux-kernel@vger.kernel.org
Cc: x86@kernel.org, linux-mm@kvack.org, torvalds@linux-foundation.org,
        akpm@linux-foundation.org, bp@alien8.de, ak@linux.intel.com,
        mhocko@suse.com
Date: Sat, 02 Jul 2016 08:28:12 +1000
In-Reply-To: <20160701174658.6ED27E64@viggo.jf.intel.com>
References: <20160701174658.6ED27E64@viggo.jf.intel.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.20.3 (3.20.3-1.fc24.1) 
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2016-07-01 at 10:46 -0700, Dave Hansen wrote:
> The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
> Landing) has an erratum where a processor thread setting the Accessed
> or Dirty bits may not do so atomically against its checks for the
> Present bit.  This may cause a thread (which is about to page fault)
> to set A and/or D, even though the Present bit had already been
> atomically cleared.

Interesting.... I always wondered where in the Intel docs did it specify
that present was tested atomically with setting of A and D ... I couldn't
find it.

Isn't there a more fundamental issue however that you may actually lose
those bits ? For example if we do an munmap, in zap_pte_range()

We first exchange all the PTEs with 0 with ptep_get_and_clear_full()
and we then transfer D that we just read into the struct page.

We rely on the fact that D will never be set again, what we go it a
"final" D bit. IE. We rely on the fact that a processor either:

   - Has a cached PTE in its TLB with D set, in which case it can still
write to the page until we flush the TLB or

   - Doesn't have a cached PTE in its TLB with D set and so will fail
to do so due to the atomic P check, thus never writing.

With the errata, don't you have a situation where a processor in the second
category will write and set D despite P having been cleared (due to the
race) and thus causing us to miss the transfer of that D to the struct
page and essentially completely miss that the physical page is dirty ?

(Leading to memory corruption).

> If the PTE is used for storing a swap index or a NUMA migration index,
> the A bit could be misinterpreted as part of the swap type.  The stray
> bits being set cause a software-cleared PTE to be interpreted as a
> swap entry.  In some cases (like when the swap index ends up being
> for a non-existent swapfile), the kernel detects the stray value
> and WARN()s about it, but there is no guarantee that the kernel can
> always detect it.
> 
> This patch changes the kernel to attempt to ignore those stray bits
> when they get set.  We do this by making our swap PTE format
> completely ignore the A/D bits, and also by ignoring them in our
> pte_none() checks.
> 
> Andi Kleen wrote the original version of this patch.  Dave Hansen
> wrote the later ones.