All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>,
	kernel test robot <lkp@intel.com>,
	linux-mm@kvack.org, llvm@lists.linux.dev,
	oe-kbuild-all@lists.linux.dev,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	"Mike Rapoport (IBM)" <rppt@kernel.org>,
	Arnd Bergmann <arnd@arndb.de>,
	x86@kernel.org, linux-m68k@lists.linux-m68k.org,
	linux-fsdevel@vger.kernel.org, kasan-dev@googlegroups.com,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	Dimitri Sivanich <dimitri.sivanich@hpe.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Muchun Song <muchun.song@linux.dev>,
	Andrey Ryabinin <ryabinin.a.a@gmail.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [PATCH V2 7/7] mm: Use pgdp_get() for accessing PGD entries
Date: Fri, 20 Sep 2024 10:47:45 +0100	[thread overview]
Message-ID: <Zu1EwTItDrnkTVTB@shell.armlinux.org.uk> (raw)
In-Reply-To: <9e68ffad-8a7e-40d7-a6f3-fa989a834068@arm.com>

On Fri, Sep 20, 2024 at 08:57:23AM +0200, Ryan Roberts wrote:
> On 19/09/2024 21:25, Russell King (Oracle) wrote:
> > On Thu, Sep 19, 2024 at 07:49:09PM +0200, Ryan Roberts wrote:
> >> On 19/09/2024 18:06, Russell King (Oracle) wrote:
> >>> On Thu, Sep 19, 2024 at 05:48:58PM +0200, Ryan Roberts wrote:
> >>>>> 32-bit arm uses, in some circumstances, an array because each level 1
> >>>>> page table entry is actually two descriptors. It needs to be this way
> >>>>> because each level 2 table pointed to by each level 1 entry has 256
> >>>>> entries, meaning it only occupies 1024 bytes in a 4096 byte page.
> >>>>>
> >>>>> In order to cut down on the wastage, treat the level 1 page table as
> >>>>> groups of two entries, which point to two consecutive 1024 byte tables
> >>>>> in the level 2 page.
> >>>>>
> >>>>> The level 2 entry isn't suitable for the kernel's use cases (there are
> >>>>> no bits to represent accessed/dirty and other important stuff that the
> >>>>> Linux MM wants) so we maintain the hardware page tables and a separate
> >>>>> set that Linux uses in the same page. Again, the software tables are
> >>>>> consecutive, so from Linux's perspective, the level 2 page tables
> >>>>> have 512 entries in them and occupy one full page.
> >>>>>
> >>>>> This is documented in arch/arm/include/asm/pgtable-2level.h
> >>>>>
> >>>>> However, what this means is that from the software perspective, the
> >>>>> level 1 page table descriptors are an array of two entries, both of
> >>>>> which need to be setup when creating a level 2 page table, but only
> >>>>> the first one should ever be dereferenced when walking the tables,
> >>>>> otherwise the code that walks the second level of page table entries
> >>>>> will walk off the end of the software table into the actual hardware
> >>>>> descriptors.
> >>>>>
> >>>>> I've no idea what the idea is behind introducing pgd_get() and what
> >>>>> it's semantics are, so I can't comment further.
> >>>>
> >>>> The helper is intended to read the value of the entry pointed to by the passed
> >>>> in pointer. And it shoiuld be read in a "single copy atomic" manner, meaning no
> >>>> tearing. Further, the PTL is expected to be held when calling the getter. If the
> >>>> HW can write to the entry such that its racing with the lock holder (i.e. HW
> >>>> update of access/dirty) then READ_ONCE() should be suitable for most
> >>>> architectures. If there is no possibility of racing (because HW doesn't write to
> >>>> the entry), then a simple dereference would be sufficient, I think (which is
> >>>> what the core code was already doing in most cases).
> >>>
> >>> The core code should be making no access to the PGD entries on 32-bit
> >>> ARM since the PGD level does not exist. Writes are done at PMD level
> >>> in arch code. Reads are done by core code at PMD level.
> >>>
> >>> It feels to me like pgd_get() just doesn't fit the model to which 32-bit
> >>> ARM was designed to use decades ago, so I want full details about what
> >>> pgd_get() is going to be used for and how it is going to be used,
> >>> because I feel completely in the dark over this new development. I fear
> >>> that someone hasn't understood the Linux page table model if they're
> >>> wanting to access stuff at levels that effectively "aren't implemented"
> >>> in the architecture specific kernel model of the page tables.
> >>
> >> This change isn't as big and scary as I think you fear.
> > 
> > The situation is as I state above. Core code must _not_ dereference pgd
> > pointers on 32-bit ARM.
> 
> Let's just rewind a bit. This thread exists because the kernel test robot failed
> to compile pgd_none_or_clear_bad() (a core-mm function) for the arm architecture
> after Anshuman changed the direct pgd dereference to pgdp_get(). The reason
> compilation failed is because arm defines its own pgdp_get() override, but it is
> broken (there is a typo).

Let's not rewind, because had you fully read and digested my reply, you
would have seen why this isn't a problem... but let me spell it out.

> 
> Code before Anshuman's change:
> 
> static inline int pgd_none_or_clear_bad(pgd_t *pgd)
> {
> 	if (pgd_none(*pgd))
> 		return 1;
> 	if (unlikely(pgd_bad(*pgd))) {
> 		pgd_clear_bad(pgd);
> 		return 1;
> 	}
> 	return 0;
> }

This isn't a problem as the code stands. While there is a dereference
in C, that dereference is a simple struct copy, something that we use
everywhere in the kernel. However, that is as far as it goes, because
neither pgd_none() and pgd_bad() make use of their argument, and thus
the compiler will optimise it away, resulting in no actual access to
the page tables - _as_ _intended_.

If these are going to be converted to pgd_get(), then we need pgd_get()
to _also_ be optimised away, and if e.g. this is the only place that
pgd_get() is going to be used, the suggestion I made in my previous
email is entirely reasonable, since we know that the result of pgd_get()
will not actually be used.

> As an aside, the kernel also dereferences p4d, pud, pmd and pte pointers in
> various circumstances.

I already covered these in my previous reply.

> And other changes in this series are also replacing those
> direct dereferences with calls to similar helpers. The fact that these are all
> folded (by a custom arm implementation if I've understood the below correctly)
> just means that each dereference is returning what you would call the pmd from
> the HW perspective, I think?

It'll "return" the first of each pair of level-1 page table entries,
which is pgd[0] or *p4d, *pud, *pmd - but all of these except *pmd
need to be optimised away, so throwing lots of READ_ONCE() around
this code without considering this is certainly the wrong approach.

> >> The core-mm today
> >> dereferences pgd pointers (and p4d, pud, pmd pointers) directly in its code. See
> >> follow_pfnmap_start(),
> > 
> > Doesn't seem to exist at least not in 6.11.
> 
> Appologies, I'm on mm-unstable and that isn't upstream yet. See follow_pte() in
> v6.11 or __apply_to_page_range(), or pgd_none_or_clear_bad() as per above.

Looking at follow_pte(), it's not a problem.

I think we wouldn't be having this conversation before:

commit a32618d28dbe6e9bf8ec508ccbc3561a7d7d32f0
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Tue Nov 22 17:30:28 2011 +0000

    ARM: pgtable: switch to use pgtable-nopud.h

where:
-#define pgd_none(pgd)          (0)
-#define pgd_bad(pgd)           (0)

existed before this commit - and thus the dereference in things like:

	pgd_none(*pgd)

wouldn't even be visible to beyond the preprocessor step.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

  reply	other threads:[~2024-09-20  9:48 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-17  7:31 [PATCH V2 0/7] mm: Use pxdp_get() for accessing page table entries Anshuman Khandual
2024-09-17  7:31 ` [PATCH V2 1/7] m68k/mm: Change pmd_val() Anshuman Khandual
2024-09-17  8:40   ` Ryan Roberts
2024-09-17 10:20   ` David Hildenbrand
2024-09-17 10:27     ` Ryan Roberts
2024-09-17 10:30       ` David Hildenbrand
2024-09-17  7:31 ` [PATCH V2 2/7] x86/mm: Drop page table entry address output from pxd_ERROR() Anshuman Khandual
2024-09-17 10:22   ` David Hildenbrand
2024-09-17 11:19     ` Dave Hansen
2024-09-17 11:25       ` Anshuman Khandual
2024-09-17 11:31       ` David Hildenbrand
2024-09-17  7:31 ` [PATCH V2 3/7] mm: Use ptep_get() for accessing PTE entries Anshuman Khandual
2024-09-17  8:44   ` Ryan Roberts
2024-09-17 10:28   ` David Hildenbrand
2024-09-18  6:32     ` Anshuman Khandual
2024-09-19  8:04       ` David Hildenbrand
2024-09-19  9:20         ` Anshuman Khandual
2024-09-17  7:31 ` [PATCH V2 4/7] mm: Use pmdp_get() for accessing PMD entries Anshuman Khandual
2024-09-17 10:05   ` Ryan Roberts
2024-09-18 18:57   ` kernel test robot
2024-09-19  7:21     ` Anshuman Khandual
2024-09-18 19:07   ` kernel test robot
2024-09-19  7:12     ` Anshuman Khandual
2024-09-17  7:31 ` [PATCH V2 5/7] mm: Use pudp_get() for accessing PUD entries Anshuman Khandual
2024-09-17  7:31 ` [PATCH V2 6/7] mm: Use p4dp_get() for accessing P4D entries Anshuman Khandual
2024-09-17  7:31 ` [PATCH V2 7/7] mm: Use pgdp_get() for accessing PGD entries Anshuman Khandual
2024-09-18 20:30   ` kernel test robot
2024-09-19  7:55     ` Anshuman Khandual
2024-09-19  9:11       ` Russell King (Oracle)
2024-09-19 15:48         ` Ryan Roberts
2024-09-19 17:06           ` Russell King (Oracle)
2024-09-19 17:49             ` Ryan Roberts
2024-09-19 20:25               ` Russell King (Oracle)
2024-09-20  6:57                 ` Ryan Roberts
2024-09-20  9:47                   ` Russell King (Oracle) [this message]
2024-09-23 15:21                     ` Ryan Roberts
2024-09-25 10:05 ` [PATCH V2 0/7] mm: Use pxdp_get() for accessing page table entries Christophe Leroy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zu1EwTItDrnkTVTB@shell.armlinux.org.uk \
    --to=linux@armlinux.org.uk \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=arnd@arndb.de \
    --cc=cl@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=dennis@kernel.org \
    --cc=dimitri.sivanich@hpe.com \
    --cc=hch@infradead.org \
    --cc=kasan-dev@googlegroups.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-m68k@lists.linux-m68k.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=lkp@intel.com \
    --cc=llvm@lists.linux.dev \
    --cc=muchun.song@linux.dev \
    --cc=oe-kbuild-all@lists.linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryabinin.a.a@gmail.com \
    --cc=ryan.roberts@arm.com \
    --cc=tj@kernel.org \
    --cc=urezki@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.