* incoming
@ 2022-02-12 0:27 Andrew Morton
2022-02-12 2:02 ` incoming Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2022-02-12 0:27 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
5 patches, based on f1baf68e1383f6ed93eb9cff2866d46562607a43.
Subsystems affected by this patch series:
binfmt
procfs
mm/vmscan
mm/memcg
mm/kfence
Subsystem: binfmt
Mike Rapoport <rppt@linux.ibm.com>:
fs/binfmt_elf: fix PT_LOAD p_align values for loaders
Subsystem: procfs
Yang Shi <shy828301@gmail.com>:
fs/proc: task_mmu.c: don't read mapcount for migration entry
Subsystem: mm/vmscan
Mel Gorman <mgorman@suse.de>:
mm: vmscan: remove deadlock due to throttling failing to make progress
Subsystem: mm/memcg
Roman Gushchin <guro@fb.com>:
mm: memcg: synchronize objcg lists with a dedicated spinlock
Subsystem: mm/kfence
Peng Liu <liupeng256@huawei.com>:
kfence: make test case compatible with run time set sample interval
fs/binfmt_elf.c | 2 +-
fs/proc/task_mmu.c | 40 +++++++++++++++++++++++++++++++---------
include/linux/kfence.h | 2 ++
include/linux/memcontrol.h | 5 +++--
mm/kfence/core.c | 3 ++-
mm/kfence/kfence_test.c | 8 ++++----
mm/memcontrol.c | 10 +++++-----
mm/vmscan.c | 4 +++-
8 files changed, 51 insertions(+), 23 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: incoming
2022-02-12 0:27 incoming Andrew Morton
@ 2022-02-12 2:02 ` Linus Torvalds
2022-02-12 5:24 ` incoming Andrew Morton
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-02-12 2:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux-MM, mm-commits, patches
On Fri, Feb 11, 2022 at 4:27 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> 5 patches, based on f1baf68e1383f6ed93eb9cff2866d46562607a43.
So this *completely* flummoxed 'b4', because you first sent the wrong
series, and then sent the right one in the same thread.
I fetched the emails manually, but honestly, this was confusing even
then, with two "[PATCH x/5]" series where the only way to tell the
right one was basically by date of email. They did arrive in the same
order in my mailbox, but even that wouldn't have been guaranteed if
there had been some mailer delays somewhere..
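(For readers who don't use it: 'b4' gathers a whole patch series from
the lore archive by thread, so one command fetches and validates all N
patches before git applies them. A minimal sketch of that flow, with a
placeholder message-id rather than the real one from this thread:

    b4 am -o /tmp 'cover-letter-msgid@example.org'   # gather the series into an mbox
    git am /tmp/*.mbx                                # apply it to the tree

With two same-numbered "[PATCH x/5]" series in a single thread, that
gathering step has no reliable way to pick out the intended five.)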
So next time when you mess up, resend it all as a completely new
series and completely new threading - so with a new header email too.
Please?
And since I'm here, let me just verify that yes, the series you
actually want me to apply is this one (as described by the head
email):
Subject: [patch 1/5] fs/binfmt_elf: fix PT_LOAD p_align values ..
Subject: [patch 2/5] fs/proc: task_mmu.c: don't read mapcount f..
Subject: [patch 3/5] mm: vmscan: remove deadlock due to throttl..
Subject: [patch 4/5] mm: memcg: synchronize objcg lists with a ..
Subject: [patch 5/5] kfence: make test case compatible with run..
and not the other one with GUP patches?
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: incoming
2022-02-12 2:02 ` incoming Linus Torvalds
@ 2022-02-12 5:24 ` Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-02-12 5:24 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Linux-MM, mm-commits, patches
On Fri, 11 Feb 2022 18:02:53 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Feb 11, 2022 at 4:27 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > 5 patches, based on f1baf68e1383f6ed93eb9cff2866d46562607a43.
>
> So this *completely* flummoxed 'b4', because you first sent the wrong
> series, and then sent the right one in the same thread.
>
> I fetched the emails manually, but honestly, this was confusing even
> then, with two "[PATCH x/5]" series where the only way to tell the
> right one was basically by date of email. They did arrive in the same
> order in my mailbox, but even that wouldn't have been guaranteed if
> there had been some mailer delays somewhere..
Yes, I wondered. Sorry about that.
> So next time when you mess up, resend it all as a completely new
> series and completely new threading - so with a new header email too.
> Please?
Wilco.
> And since I'm here, let me just verify that yes, the series you
> actually want me to apply is this one (as described by the head
> email):
>
> Subject: [patch 1/5] fs/binfmt_elf: fix PT_LOAD p_align values ..
> Subject: [patch 2/5] fs/proc: task_mmu.c: don't read mapcount f..
> Subject: [patch 3/5] mm: vmscan: remove deadlock due to throttl..
> Subject: [patch 4/5] mm: memcg: synchronize objcg lists with a ..
> Subject: [patch 5/5] kfence: make test case compatible with run..
>
> and not the other one with GUP patches?
Those are the ones. Five fixes, three with cc:stable.
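(Context for "cc:stable": a mainline fix whose commit message carries
the stable trailer is picked up automatically for the stable trees. A
generic illustration of such a trailer, with a placeholder hash and
subject line:

    Fixes: 0123456789ab ("subsys: the change that introduced the bug")
    Cc: <stable@vger.kernel.org>

So "three with cc:stable" means three of these five fixes should flow
into the stable kernels as well.)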
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-02-26 3:10 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-02-26 3:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mm-commits, linux-mm, patches
12 patches, based on c47658311d60be064b839f329c0e4d34f5f0735b.
Subsystems affected by this patch series:
MAINTAINERS
mm/hugetlb
mm/kasan
mm/hugetlbfs
mm/pagemap
mm/selftests
mm/memcg
mm/slab
mailmap
memfd
Subsystem: MAINTAINERS
Luis Chamberlain <mcgrof@kernel.org>:
MAINTAINERS: add sysctl-next git tree
Subsystem: mm/hugetlb
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
mm/hugetlb: fix kernel crash with hugetlb mremap
Subsystem: mm/kasan
Andrey Konovalov <andreyknvl@google.com>:
kasan: test: prevent cache merging in kmem_cache_double_destroy
Subsystem: mm/hugetlbfs
Liu Yuntao <liuyuntao10@huawei.com>:
hugetlbfs: fix a truncation issue in hugepages parameter
Subsystem: mm/pagemap
Suren Baghdasaryan <surenb@google.com>:
mm: fix use-after-free bug when mm->mmap is reused after being freed
Subsystem: mm/selftests
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
selftest/vm: fix map_fixed_noreplace test failure
Subsystem: mm/memcg
Roman Gushchin <roman.gushchin@linux.dev>:
MAINTAINERS: add Roman as a memcg co-maintainer
Vladimir Davydov <vdavydov.dev@gmail.com>:
MAINTAINERS: remove Vladimir from memcg maintainers
Shakeel Butt <shakeelb@google.com>:
MAINTAINERS: add Shakeel as a memcg co-maintainer
Subsystem: mm/slab
Vlastimil Babka <vbabka@suse.cz>:
MAINTAINERS, SLAB: add Roman as reviewer, git tree
Subsystem: mailmap
Roman Gushchin <roman.gushchin@linux.dev>:
mailmap: update Roman Gushchin's email
Subsystem: memfd
Mike Kravetz <mike.kravetz@oracle.com>:
selftests/memfd: clean up mapping in mfd_fail_write
.mailmap | 3 +
MAINTAINERS | 6 ++
lib/test_kasan.c | 5 +-
mm/hugetlb.c | 11 ++---
mm/mmap.c | 1
tools/testing/selftests/memfd/memfd_test.c | 1
tools/testing/selftests/vm/map_fixed_noreplace.c | 49 +++++++++++++++++------
7 files changed, 56 insertions(+), 20 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-03-05 4:28 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-03-05 4:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mm-commits, linux-mm, patches
8 patches, based on 07ebd38a0da24d2534da57b4841346379db9f354.
Subsystems affected by this patch series:
mm/hugetlb
mm/pagemap
memfd
selftests
mm/userfaultfd
kconfig
Subsystem: mm/hugetlb
Mike Kravetz <mike.kravetz@oracle.com>:
selftests/vm: cleanup hugetlb file after mremap test
Subsystem: mm/pagemap
Suren Baghdasaryan <surenb@google.com>:
mm: refactor vm_area_struct::anon_vma_name usage code
mm: prevent vm_area_struct::anon_name refcount saturation
mm: fix use-after-free when anon vma name is used after vma is freed
Subsystem: memfd
Hugh Dickins <hughd@google.com>:
memfd: fix F_SEAL_WRITE after shmem huge page allocated
Subsystem: selftests
Chengming Zhou <zhouchengming@bytedance.com>:
kselftest/vm: fix tests build with old libc
Subsystem: mm/userfaultfd
Yun Zhou <yun.zhou@windriver.com>:
proc: fix documentation and description of pagemap
Subsystem: kconfig
Qian Cai <quic_qiancai@quicinc.com>:
configs/debug: set CONFIG_DEBUG_INFO=y properly
Documentation/admin-guide/mm/pagemap.rst | 2
fs/proc/task_mmu.c | 9 +-
fs/userfaultfd.c | 6 -
include/linux/mm.h | 7 +
include/linux/mm_inline.h | 105 ++++++++++++++++++---------
include/linux/mm_types.h | 5 +
kernel/configs/debug.config | 2
kernel/fork.c | 4 -
kernel/sys.c | 19 +++-
mm/madvise.c | 98 +++++++++----------------
mm/memfd.c | 40 +++++++---
mm/mempolicy.c | 2
mm/mlock.c | 2
mm/mmap.c | 12 +--
mm/mprotect.c | 2
tools/testing/selftests/vm/hugepage-mremap.c | 26 ++++--
tools/testing/selftests/vm/run_vmtests.sh | 3
tools/testing/selftests/vm/userfaultfd.c | 1
18 files changed, 201 insertions(+), 144 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-03-16 23:14 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-03-16 23:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mm-commits, linux-mm, patches
4 patches, based on 56e337f2cf1326323844927a04e9dbce9a244835.
Subsystems affected by this patch series:
mm/swap
kconfig
ocfs2
selftests
Subsystem: mm/swap
Guo Ziliang <guo.ziliang@zte.com.cn>:
mm: swap: get rid of deadloop in swapin readahead
Subsystem: kconfig
Qian Cai <quic_qiancai@quicinc.com>:
configs/debug: restore DEBUG_INFO=y for overriding
Subsystem: ocfs2
Joseph Qi <joseph.qi@linux.alibaba.com>:
ocfs2: fix crash when initialize filecheck kobj fails
Subsystem: selftests
Yosry Ahmed <yosryahmed@google.com>:
selftests: vm: fix clang build error multiple output files
fs/ocfs2/super.c | 22 +++++++++++-----------
kernel/configs/debug.config | 1 +
mm/swap_state.c | 2 +-
tools/testing/selftests/vm/Makefile | 6 ++----
4 files changed, 15 insertions(+), 16 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-03-22 21:38 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-03-22 21:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
- A few misc subsystems
- There is a lot of MM material in Willy's tree. Folio work and
non-folio patches which depended on that work.
Here I send almost all the MM patches which precede the patches in
Willy's tree. The remaining ~100 MM patches are staged on Willy's
tree and I'll send those along once Willy is merged up.
I tried this batch against your current tree (as of
51912904076680281) and a couple needed some extra persuasion to apply,
but all looks OK otherwise.
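("Extra persuasion" in practice usually means letting git fall back to
a three-way merge when a patch no longer applies cleanly; a minimal
sketch, assuming the batch has been saved as series.mbox:

    git am -3 series.mbox   # -3: retry failed patches as a three-way merge

with the occasional hand-fix followed by "git am --continue".)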
227 patches, based on f443e374ae131c168a065ea1748feac6b2e76613
Subsystems affected by this patch series:
kthread
scripts
ntfs
ocfs2
block
vfs
mm/kasan
mm/pagecache
mm/gup
mm/swap
mm/shmem
mm/memcg
mm/selftests
mm/pagemap
mm/mremap
mm/sparsemem
mm/vmalloc
mm/pagealloc
mm/memory-failure
mm/mlock
mm/hugetlb
mm/userfaultfd
mm/vmscan
mm/compaction
mm/mempolicy
mm/oom-kill
mm/migration
mm/thp
mm/cma
mm/autonuma
mm/psi
mm/ksm
mm/page-poison
mm/madvise
mm/memory-hotplug
mm/rmap
mm/zswap
mm/uaccess
mm/ioremap
mm/highmem
mm/cleanups
mm/kfence
mm/hmm
mm/damon
Subsystem: kthread
Rasmus Villemoes <linux@rasmusvillemoes.dk>:
linux/kthread.h: remove unused macros
Subsystem: scripts
Colin Ian King <colin.i.king@gmail.com>:
scripts/spelling.txt: add more spellings to spelling.txt
Subsystem: ntfs
Dongliang Mu <mudongliangabcd@gmail.com>:
ntfs: add sanity check on allocation size
Subsystem: ocfs2
Joseph Qi <joseph.qi@linux.alibaba.com>:
ocfs2: cleanup some return variables
hongnanli <hongnan.li@linux.alibaba.com>:
fs/ocfs2: fix comments mentioning i_mutex
Subsystem: block
NeilBrown <neilb@suse.de>:
Patch series "Remove remaining parts of congestion tracking code", v2:
doc: convert 'subsection' to 'section' in gfp.h
mm: document and polish read-ahead code
mm: improve cleanup when ->readpages doesn't process all pages
fuse: remove reliance on bdi congestion
nfs: remove reliance on bdi congestion
ceph: remove reliance on bdi congestion
remove inode_congested()
remove bdi_congested() and wb_congested() and related functions
f2fs: replace congestion_wait() calls with io_schedule_timeout()
block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"
remove congestion tracking framework
Subsystem: vfs
Anthony Iliopoulos <ailiop@suse.com>:
mount: warn only once about timestamp range expiration
Subsystem: mm/kasan
Miaohe Lin <linmiaohe@huawei.com>:
mm/memremap: avoid calling kasan_remove_zero_shadow() for device private memory
Subsystem: mm/pagecache
Miaohe Lin <linmiaohe@huawei.com>:
filemap: remove find_get_pages()
mm/writeback: minor clean up for highmem_dirtyable_memory
Minchan Kim <minchan@kernel.org>:
mm: fs: fix lru_cache_disabled race in bh_lru
Subsystem: mm/gup
Peter Xu <peterx@redhat.com>:
Patch series "mm/gup: some cleanups", v5:
mm: fix invalid page pointer returned with FOLL_PIN gups
John Hubbard <jhubbard@nvidia.com>:
mm/gup: follow_pfn_pte(): -EEXIST cleanup
mm/gup: remove unused pin_user_pages_locked()
mm: change lookup_node() to use get_user_pages_fast()
mm/gup: remove unused get_user_pages_locked()
Subsystem: mm/swap
Bang Li <libang.linuxer@gmail.com>:
mm/swap: fix confusing comment in folio_mark_accessed
Subsystem: mm/shmem
Xavier Roche <xavier.roche@algolia.com>:
tmpfs: support for file creation time
Hugh Dickins <hughd@google.com>:
shmem: mapping_set_exiting() to help mapped resilience
tmpfs: do not allocate pages on read
Miaohe Lin <linmiaohe@huawei.com>:
mm: shmem: use helper macro __ATTR_RW
Subsystem: mm/memcg
Shakeel Butt <shakeelb@google.com>:
memcg: replace in_interrupt() with !in_task()
Yosry Ahmed <yosryahmed@google.com>:
memcg: add per-memcg total kernel memory stat
Wei Yang <richard.weiyang@gmail.com>:
mm/memcg: mem_cgroup_per_node is already set to 0 on allocation
mm/memcg: retrieve parent memcg from css.parent
Shakeel Butt <shakeelb@google.com>:
Patch series "memcg: robust enforcement of memory.high", v2:
memcg: refactor mem_cgroup_oom
memcg: unify force charging conditions
selftests: memcg: test high limit for single entry allocation
memcg: synchronously enforce memory.high for large overcharges
Randy Dunlap <rdunlap@infradead.org>:
mm/memcontrol: return 1 from cgroup.memory __setup() handler
Michal Hocko <mhocko@suse.com>:
Patch series "mm/memcg: Address PREEMPT_RT problems instead of disabling it", v5:
mm/memcg: revert ("mm/memcg: optimize user context object stock access")
Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
mm/memcg: disable threshold event handlers on PREEMPT_RT
mm/memcg: protect per-CPU counter by disabling preemption on PREEMPT_RT where needed.
Johannes Weiner <hannes@cmpxchg.org>:
mm/memcg: opencode the inner part of obj_cgroup_uncharge_pages() in drain_obj_stock()
Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
mm/memcg: protect memcg_stock with a local_lock_t
mm/memcg: disable migration instead of preemption in drain_all_stock().
Muchun Song <songmuchun@bytedance.com>:
Patch series "Optimize list lru memory consumption", v6:
mm: list_lru: transpose the array of per-node per-memcg lru lists
mm: introduce kmem_cache_alloc_lru
fs: introduce alloc_inode_sb() to allocate filesystems specific inode
fs: allocate inode by using alloc_inode_sb()
f2fs: allocate inode by using alloc_inode_sb()
mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
xarray: use kmem_cache_alloc_lru to allocate xa_node
mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
mm: list_lru: allocate list_lru_one only when needed
mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus
mm: list_lru: replace linear array with xarray
mm: memcontrol: reuse memory cgroup ID for kmem ID
mm: memcontrol: fix cannot alloc the maximum memcg ID
mm: list_lru: rename list_lru_per_memcg to list_lru_memcg
mm: memcontrol: rename memcg_cache_id to memcg_kmem_id
Vasily Averin <vvs@virtuozzo.com>:
memcg: enable accounting for tty-related objects
Subsystem: mm/selftests
Guillaume Tucker <guillaume.tucker@collabora.com>:
selftests, x86: fix how check_cc.sh is being invoked
Subsystem: mm/pagemap
Anshuman Khandual <anshuman.khandual@arm.com>:
mm: merge pte_mkhuge() call into arch_make_huge_pte()
Stafford Horne <shorne@gmail.com>:
mm: remove mmu_gathers storage from remaining architectures
Muchun Song <songmuchun@bytedance.com>:
Patch series "Fix some cache flush bugs", v5:
mm: thp: fix wrong cache flush in remove_migration_pmd()
mm: fix missing cache flush for all tail pages of compound page
mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
mm: hugetlb: fix missing cache flush in hugetlb_mcopy_atomic_pte()
mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte()
mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()
mm: replace multiple dcache flush with flush_dcache_folio()
Peter Xu <peterx@redhat.com>:
Patch series "mm: Rework zap ptes on swap entries", v5:
mm: don't skip swap entry even if zap_details specified
mm: rename zap_skip_check_mapping() to should_zap_page()
mm: change zap_details.zap_mapping into even_cows
mm: rework swap handling of zap_pte_range
Randy Dunlap <rdunlap@infradead.org>:
mm/mmap: return 1 from stack_guard_gap __setup() handler
Miaohe Lin <linmiaohe@huawei.com>:
mm/memory.c: use helper function range_in_vma()
mm/memory.c: use helper macro min and max in unmap_mapping_range_tree()
Hugh Dickins <hughd@google.com>:
mm: _install_special_mapping() apply VM_LOCKED_CLEAR_MASK
Miaohe Lin <linmiaohe@huawei.com>:
mm/mmap: remove obsolete comment in ksys_mmap_pgoff
Subsystem: mm/mremap
Miaohe Lin <linmiaohe@huawei.com>:
mm/mremap: use vma_lookup() instead of find_vma()
Subsystem: mm/sparsemem
Miaohe Lin <linmiaohe@huawei.com>:
mm/sparse: make mminit_validate_memmodel_limits() static
Subsystem: mm/vmalloc
Miaohe Lin <linmiaohe@huawei.com>:
mm/vmalloc: remove unneeded function forward declaration
"Uladzislau Rezki (Sony)" <urezki@gmail.com>:
mm/vmalloc: Move draining areas out of caller context
Uladzislau Rezki <uladzislau.rezki@sony.com>:
mm/vmalloc: add adjust_search_size parameter
"Uladzislau Rezki (Sony)" <urezki@gmail.com>:
mm/vmalloc: eliminate an extra orig_gfp_mask
Jiapeng Chong <jiapeng.chong@linux.alibaba.com>:
mm/vmalloc.c: fix "unused function" warning
Bang Li <libang.linuxer@gmail.com>:
mm/vmalloc: fix comments about vmap_area struct
Subsystem: mm/pagealloc
Zi Yan <ziy@nvidia.com>:
mm: page_alloc: avoid merging non-fallbackable pageblocks with others
Peter Collingbourne <pcc@google.com>:
mm/mmzone.c: use try_cmpxchg() in page_cpupid_xchg_last()
Miaohe Lin <linmiaohe@huawei.com>:
mm/mmzone.h: remove unused macros
Nicolas Saenz Julienne <nsaenzju@redhat.com>:
mm/page_alloc: don't pass pfn to free_unref_page_commit()
David Hildenbrand <david@redhat.com>:
Patch series "mm: enforce pageblock_order < MAX_ORDER":
cma: factor out minimum alignment requirement
mm: enforce pageblock_order < MAX_ORDER
Nathan Chancellor <nathan@kernel.org>:
mm/page_alloc: mark pagesets as __maybe_unused
Alistair Popple <apopple@nvidia.com>:
mm/page_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
Mel Gorman <mgorman@techsingularity.net>:
Patch series "Follow-up on high-order PCP caching", v2:
mm/page_alloc: fetch the correct pcp buddy during bulk free
mm/page_alloc: track range of active PCP lists during bulk free
mm/page_alloc: simplify how many pages are selected per pcp list during bulk free
mm/page_alloc: drain the requested list first during bulk free
mm/page_alloc: free pages in a single pass during bulk free
mm/page_alloc: limit number of high-order pages on PCP during bulk free
mm/page_alloc: do not prefetch buddies during bulk free
Oscar Salvador <osalvador@suse.de>:
arch/x86/mm/numa: Do not initialize nodes twice
Suren Baghdasaryan <surenb@google.com>:
mm: count time in drain_all_pages during direct reclaim as memory pressure
Eric Dumazet <edumazet@google.com>:
mm/page_alloc: call check_new_pages() while zone spinlock is not held
Mel Gorman <mgorman@techsingularity.net>:
mm/page_alloc: check high-order pages for corruption during PCP operations
Subsystem: mm/memory-failure
Naoya Horiguchi <naoya.horiguchi@nec.com>:
mm/memory-failure.c: remove obsolete comment
mm/hwpoison: fix error page recovered but reported "not recovered"
Rik van Riel <riel@surriel.com>:
mm: invalidate hwpoison page cache page in fault path
Miaohe Lin <linmiaohe@huawei.com>:
Patch series "A few cleanup and fixup patches for memory failure", v3:
mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap
mm/memory-failure.c: catch unexpected -EFAULT from vma_address()
mm/memory-failure.c: rework the signaling logic in kill_proc
mm/memory-failure.c: fix race with changing page more robustly
mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev
mm/memory-failure.c: rework the try_to_unmap logic in hwpoison_user_mappings()
mm/memory-failure.c: remove obsolete comment in __soft_offline_page
mm/memory-failure.c: remove unnecessary PageTransTail check
mm/hwpoison-inject: support injecting hwpoison to free page
luofei <luofei@unicloud.com>:
mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
mm/hwpoison: add in-use hugepage hwpoison filter judgement
Miaohe Lin <linmiaohe@huawei.com>:
Patch series "A few fixup patches for memory failure", v2:
mm/memory-failure.c: fix race with changing page compound again
mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages
mm/memory-failure.c: make non-LRU movable pages unhandlable
Vlastimil Babka <vbabka@suse.cz>:
mm, fault-injection: declare should_fail_alloc_page()
Subsystem: mm/mlock
Miaohe Lin <linmiaohe@huawei.com>:
mm/mlock: fix potential imbalanced rlimit ucounts adjustment
Subsystem: mm/hugetlb
Muchun Song <songmuchun@bytedance.com>:
Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7:
mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page
mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key
mm: sparsemem: use page table lock to protect kernel pmd operations
selftests: vm: add a hugetlb test case
mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
Anshuman Khandual <anshuman.khandual@arm.com>:
mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLB
Mike Kravetz <mike.kravetz@oracle.com>:
hugetlb: clean up potential spectre issue warnings
Miaohe Lin <linmiaohe@huawei.com>:
mm/hugetlb: use helper macro __ATTR_RW
David Howells <dhowells@redhat.com>:
mm/hugetlb.c: export PageHeadHuge()
Miaohe Lin <linmiaohe@huawei.com>:
mm: remove unneeded local variable follflags
Subsystem: mm/userfaultfd
Nadav Amit <namit@vmware.com>:
userfaultfd: provide unmasked address on page-fault
Guo Zhengkui <guozhengkui@vivo.com>:
userfaultfd/selftests: fix uninitialized_var.cocci warning
Subsystem: mm/vmscan
Hugh Dickins <hughd@google.com>:
mm/fs: delete PF_SWAPWRITE
mm: __isolate_lru_page_prepare() in isolate_migratepages_block()
Waiman Long <longman@redhat.com>:
mm/list_lru: optimize memcg_reparent_list_lru_node()
Marcelo Tosatti <mtosatti@redhat.com>:
mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu
Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
mm: workingset: replace IRQ-off check with a lockdep assert.
Charan Teja Kalla <quic_charante@quicinc.com>:
mm: vmscan: fix documentation for page_check_references()
Subsystem: mm/compaction
Baolin Wang <baolin.wang@linux.alibaba.com>:
mm: compaction: cleanup the compaction trace events
Subsystem: mm/mempolicy
Hugh Dickins <hughd@google.com>:
mempolicy: mbind_range() set_policy() after vma_merge()
Subsystem: mm/oom-kill
Miaohe Lin <linmiaohe@huawei.com>:
mm/oom_kill: remove unneeded is_memcg_oom check
Subsystem: mm/migration
Huang Ying <ying.huang@intel.com>:
mm,migrate: fix establishing demotion target
"andrew.yang" <andrew.yang@mediatek.com>:
mm/migrate: fix race between lock page and clear PG_Isolated
Subsystem: mm/thp
Hugh Dickins <hughd@google.com>:
mm/thp: refix __split_huge_pmd_locked() for migration PMD
Subsystem: mm/cma
Hari Bathini <hbathini@linux.ibm.com>:
Patch series "powerpc/fadump: handle CMA activation failure appropriately", v3:
mm/cma: provide option to opt out from exposing pages on activation failure
powerpc/fadump: opt out from freeing pages on cma activation failure
Subsystem: mm/autonuma
Huang Ying <ying.huang@intel.com>:
Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13:
NUMA Balancing: add page promotion counter
NUMA balancing: optimize page placement for memory tiering system
memory tiering: skip to scan fast memory
Subsystem: mm/psi
Johannes Weiner <hannes@cmpxchg.org>:
mm: page_io: fix psi memory pressure error on cold swapins
Subsystem: mm/ksm
Yang Yang <yang.yang29@zte.com.cn>:
mm/vmstat: add event for ksm swapping in copy
Miaohe Lin <linmiaohe@huawei.com>:
mm/ksm: use helper macro __ATTR_RW
Subsystem: mm/page-poison
"Matthew Wilcox (Oracle)" <willy@infradead.org>:
mm/hwpoison: check the subpage, not the head page
Subsystem: mm/madvise
Miaohe Lin <linmiaohe@huawei.com>:
mm/madvise: use vma_lookup() instead of find_vma()
Charan Teja Kalla <quic_charante@quicinc.com>:
Patch series "mm: madvise: return correct bytes processed with:
mm: madvise: return correct bytes advised with process_madvise
mm: madvise: skip unmapped vma holes passed to process_madvise
Subsystem: mm/memory-hotplug
Michal Hocko <mhocko@suse.com>:
Patch series "mm, memory_hotplug: handle unitialized numa node gracefully":
mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG
mm: handle uninitialized numa nodes gracefully
mm, memory_hotplug: drop arch_free_nodedata
mm, memory_hotplug: reorganize new pgdat initialization
mm: make free_area_init_node aware of memory less nodes
Wei Yang <richard.weiyang@gmail.com>:
memcg: do not tweak node in alloc_mem_cgroup_per_node_info
David Hildenbrand <david@redhat.com>:
drivers/base/memory: add memory block to memory group after registration succeeded
drivers/base/node: consolidate node device subsystem initialization in node_dev_init()
Miaohe Lin <linmiaohe@huawei.com>:
Patch series "A few cleanup patches around memory_hotplug":
mm/memory_hotplug: remove obsolete comment of __add_pages
mm/memory_hotplug: avoid calling zone_intersects() for ZONE_NORMAL
mm/memory_hotplug: clean up try_offline_node
mm/memory_hotplug: fix misplaced comment in offline_pages
David Hildenbrand <david@redhat.com>:
Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2:
drivers/base/node: rename link_mem_sections() to register_memory_block_under_node()
drivers/base/memory: determine and store zone for single-zone memory blocks
drivers/base/memory: clarify adding and removing of memory blocks
Oscar Salvador <osalvador@suse.de>:
mm: only re-generate demotion targets when a numa node changes its N_CPU state
Subsystem: mm/rmap
Hugh Dickins <hughd@google.com>:
mm/thp: ClearPageDoubleMap in first page_add_file_rmap()
Subsystem: mm/zswap
"Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>:
mm/zswap.c: allow handling just same-value filled pages
Subsystem: mm/uaccess
Christophe Leroy <christophe.leroy@csgroup.eu>:
mm: remove usercopy_warn()
mm: uninline copy_overflow()
Randy Dunlap <rdunlap@infradead.org>:
mm/usercopy: return 1 from hardened_usercopy __setup() handler
Subsystem: mm/ioremap
Vlastimil Babka <vbabka@suse.cz>:
mm/early_ioremap: declare early_memremap_pgprot_adjust()
Subsystem: mm/highmem
Ira Weiny <ira.weiny@intel.com>:
highmem: document kunmap_local()
Miaohe Lin <linmiaohe@huawei.com>:
mm/highmem: remove unnecessary done label
Subsystem: mm/cleanups
"Dr. David Alan Gilbert" <linux@treblig.org>:
mm/page_table_check.c: use strtobool for param parsing
Subsystem: mm/kfence
tangmeng <tangmeng@uniontech.com>:
mm/kfence: remove unnecessary CONFIG_KFENCE option
Tianchen Ding <dtcccc@linux.alibaba.com>:
Patch series "provide the flexibility to enable KFENCE", v3:
kfence: allow re-enabling KFENCE after system startup
kfence: alloc kfence_pool after system startup
Peng Liu <liupeng256@huawei.com>:
Patch series "kunit: fix a UAF bug and do some optimization", v2:
kunit: fix UAF when running kfence test case test_gfpzero
kunit: make kunit_test_timeout compatible with comment
kfence: test: try to avoid test_gfpzero triggering rcu_stall
Marco Elver <elver@google.com>:
kfence: allow use of a deferrable timer
Subsystem: mm/hmm
Miaohe Lin <linmiaohe@huawei.com>:
mm/hmm.c: remove unneeded local variable ret
Subsystem: mm/damon
SeongJae Park <sj@kernel.org>:
Patch series "Remove the type-unclear target id concept":
mm/damon/dbgfs/init_regions: use target index instead of target id
Docs/admin-guide/mm/damon/usage: update for changed init_regions file input
mm/damon/core: move damon_set_targets() into dbgfs
mm/damon: remove the target id concept
Baolin Wang <baolin.wang@linux.alibaba.com>:
mm/damon: remove redundant page validation
SeongJae Park <sj@kernel.org>:
Patch series "Allow DAMON user code independent of monitoring primitives":
mm/damon: rename damon_primitives to damon_operations
mm/damon: let monitoring operations be registered and selected
mm/damon/paddr,vaddr: register themselves to DAMON in subsys_initcall
mm/damon/reclaim: use damon_select_ops() instead of damon_{v,p}a_set_operations()
mm/damon/dbgfs: use damon_select_ops() instead of damon_{v,p}a_set_operations()
mm/damon/dbgfs: use operations id for knowing if the target has pid
mm/damon/dbgfs-test: fix is_target_id() change
mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
tangmeng <tangmeng@uniontech.com>:
mm/damon: remove unnecessary CONFIG_DAMON option
SeongJae Park <sj@kernel.org>:
Patch series "Docs/damon: Update documents for better consistency":
Docs/vm/damon: call low level monitoring primitives the operations
Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
Docs/damon: update outdated term 'regions update interval'
Patch series "Introduce DAMON sysfs interface", v3:
mm/damon/core: allow non-exclusive DAMON start/stop
mm/damon/core: add number of each enum type values
mm/damon: implement a minimal stub for sysfs-based DAMON interface
mm/damon/sysfs: link DAMON for virtual address spaces monitoring
mm/damon/sysfs: support the physical address space monitoring
mm/damon/sysfs: support DAMON-based Operation Schemes
mm/damon/sysfs: support DAMOS quotas
mm/damon/sysfs: support schemes prioritization
mm/damon/sysfs: support DAMOS watermarks
mm/damon/sysfs: support DAMOS stats
selftests/damon: add a test for DAMON sysfs interface
Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
Docs/ABI/testing: add DAMON sysfs interface ABI document
Xin Hao <xhao@linux.alibaba.com>:
mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
Documentation/ABI/testing/sysfs-kernel-mm-damon | 274 ++
Documentation/admin-guide/cgroup-v1/memory.rst | 2
Documentation/admin-guide/cgroup-v2.rst | 5
Documentation/admin-guide/kernel-parameters.txt | 2
Documentation/admin-guide/mm/damon/usage.rst | 380 +++
Documentation/admin-guide/mm/zswap.rst | 22
Documentation/admin-guide/sysctl/kernel.rst | 31
Documentation/core-api/mm-api.rst | 19
Documentation/dev-tools/kfence.rst | 12
Documentation/filesystems/porting.rst | 6
Documentation/filesystems/vfs.rst | 16
Documentation/vm/damon/design.rst | 43
Documentation/vm/damon/faq.rst | 2
MAINTAINERS | 1
arch/arm/Kconfig | 4
arch/arm64/kernel/setup.c | 3
arch/arm64/mm/hugetlbpage.c | 1
arch/hexagon/mm/init.c | 2
arch/ia64/kernel/topology.c | 10
arch/ia64/mm/discontig.c | 11
arch/mips/kernel/topology.c | 5
arch/nds32/mm/init.c | 1
arch/openrisc/mm/init.c | 2
arch/powerpc/include/asm/fadump-internal.h | 5
arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 4
arch/powerpc/kernel/fadump.c | 8
arch/powerpc/kernel/sysfs.c | 17
arch/riscv/Kconfig | 4
arch/riscv/kernel/setup.c | 3
arch/s390/kernel/numa.c | 7
arch/sh/kernel/topology.c | 5
arch/sparc/kernel/sysfs.c | 12
arch/sparc/mm/hugetlbpage.c | 1
arch/x86/Kconfig | 4
arch/x86/kernel/cpu/mce/core.c | 8
arch/x86/kernel/topology.c | 5
arch/x86/mm/numa.c | 33
block/bdev.c | 2
block/bfq-iosched.c | 2
drivers/base/init.c | 1
drivers/base/memory.c | 149 +
drivers/base/node.c | 48
drivers/block/drbd/drbd_int.h | 3
drivers/block/drbd/drbd_req.c | 3
drivers/dax/super.c | 2
drivers/of/of_reserved_mem.c | 9
drivers/tty/tty_io.c | 2
drivers/virtio/virtio_mem.c | 9
fs/9p/vfs_inode.c | 2
fs/adfs/super.c | 2
fs/affs/super.c | 2
fs/afs/super.c | 2
fs/befs/linuxvfs.c | 2
fs/bfs/inode.c | 2
fs/btrfs/inode.c | 2
fs/buffer.c | 8
fs/ceph/addr.c | 22
fs/ceph/inode.c | 2
fs/ceph/super.c | 1
fs/ceph/super.h | 1
fs/cifs/cifsfs.c | 2
fs/coda/inode.c | 2
fs/dcache.c | 3
fs/ecryptfs/super.c | 2
fs/efs/super.c | 2
fs/erofs/super.c | 2
fs/exfat/super.c | 2
fs/ext2/ialloc.c | 5
fs/ext2/super.c | 2
fs/ext4/super.c | 2
fs/f2fs/compress.c | 4
fs/f2fs/data.c | 3
fs/f2fs/f2fs.h | 6
fs/f2fs/segment.c | 8
fs/f2fs/super.c | 14
fs/fat/inode.c | 2
fs/freevxfs/vxfs_super.c | 2
fs/fs-writeback.c | 40
fs/fuse/control.c | 17
fs/fuse/dev.c | 8
fs/fuse/file.c | 17
fs/fuse/inode.c | 2
fs/gfs2/super.c | 2
fs/hfs/super.c | 2
fs/hfsplus/super.c | 2
fs/hostfs/hostfs_kern.c | 2
fs/hpfs/super.c | 2
fs/hugetlbfs/inode.c | 2
fs/inode.c | 2
fs/isofs/inode.c | 2
fs/jffs2/super.c | 2
fs/jfs/super.c | 2
fs/minix/inode.c | 2
fs/namespace.c | 2
fs/nfs/inode.c | 2
fs/nfs/write.c | 14
fs/nilfs2/segbuf.c | 16
fs/nilfs2/super.c | 2
fs/ntfs/inode.c | 6
fs/ntfs3/super.c | 2
fs/ocfs2/alloc.c | 2
fs/ocfs2/aops.c | 2
fs/ocfs2/cluster/nodemanager.c | 2
fs/ocfs2/dir.c | 4
fs/ocfs2/dlmfs/dlmfs.c | 2
fs/ocfs2/file.c | 13
fs/ocfs2/inode.c | 2
fs/ocfs2/localalloc.c | 6
fs/ocfs2/namei.c | 2
fs/ocfs2/ocfs2.h | 4
fs/ocfs2/quota_global.c | 2
fs/ocfs2/stack_user.c | 18
fs/ocfs2/super.c | 2
fs/ocfs2/xattr.c | 2
fs/openpromfs/inode.c | 2
fs/orangefs/super.c | 2
fs/overlayfs/super.c | 2
fs/proc/inode.c | 2
fs/qnx4/inode.c | 2
fs/qnx6/inode.c | 2
fs/reiserfs/super.c | 2
fs/romfs/super.c | 2
fs/squashfs/super.c | 2
fs/sysv/inode.c | 2
fs/ubifs/super.c | 2
fs/udf/super.c | 2
fs/ufs/super.c | 2
fs/userfaultfd.c | 5
fs/vboxsf/super.c | 2
fs/xfs/libxfs/xfs_btree.c | 2
fs/xfs/xfs_buf.c | 3
fs/xfs/xfs_icache.c | 2
fs/zonefs/super.c | 2
include/linux/backing-dev-defs.h | 8
include/linux/backing-dev.h | 50
include/linux/cma.h | 14
include/linux/damon.h | 95
include/linux/fault-inject.h | 2
include/linux/fs.h | 21
include/linux/gfp.h | 10
include/linux/highmem-internal.h | 10
include/linux/hugetlb.h | 8
include/linux/kthread.h | 22
include/linux/list_lru.h | 45
include/linux/memcontrol.h | 46
include/linux/memory.h | 12
include/linux/memory_hotplug.h | 132 -
include/linux/migrate.h | 8
include/linux/mm.h | 11
include/linux/mmzone.h | 22
include/linux/nfs_fs_sb.h | 1
include/linux/node.h | 25
include/linux/page-flags.h | 96
include/linux/pageblock-flags.h | 7
include/linux/pagemap.h | 7
include/linux/sched.h | 1
include/linux/sched/sysctl.h | 10
include/linux/shmem_fs.h | 1
include/linux/slab.h | 3
include/linux/swap.h | 6
include/linux/thread_info.h | 5
include/linux/uaccess.h | 2
include/linux/vm_event_item.h | 3
include/linux/vmalloc.h | 4
include/linux/xarray.h | 9
include/ras/ras_event.h | 1
include/trace/events/compaction.h | 26
include/trace/events/writeback.h | 28
include/uapi/linux/userfaultfd.h | 8
ipc/mqueue.c | 2
kernel/dma/contiguous.c | 4
kernel/sched/core.c | 21
kernel/sysctl.c | 2
lib/Kconfig.kfence | 12
lib/kunit/try-catch.c | 3
lib/xarray.c | 10
mm/Kconfig | 6
mm/backing-dev.c | 57
mm/cma.c | 31
mm/cma.h | 1
mm/compaction.c | 60
mm/damon/Kconfig | 19
mm/damon/Makefile | 7
mm/damon/core-test.h | 23
mm/damon/core.c | 190 +
mm/damon/dbgfs-test.h | 103
mm/damon/dbgfs.c | 264 +-
mm/damon/ops-common.c | 133 +
mm/damon/ops-common.h | 16
mm/damon/paddr.c | 62
mm/damon/prmtv-common.c | 133 -
mm/damon/prmtv-common.h | 16
mm/damon/reclaim.c | 11
mm/damon/sysfs.c | 2632 ++++++++++++++++++++++-
mm/damon/vaddr-test.h | 8
mm/damon/vaddr.c | 67
mm/early_ioremap.c | 1
mm/fadvise.c | 5
mm/filemap.c | 17
mm/gup.c | 103
mm/highmem.c | 9
mm/hmm.c | 3
mm/huge_memory.c | 41
mm/hugetlb.c | 23
mm/hugetlb_vmemmap.c | 74
mm/hwpoison-inject.c | 7
mm/internal.h | 19
mm/kfence/Makefile | 2
mm/kfence/core.c | 147 +
mm/kfence/kfence_test.c | 3
mm/ksm.c | 6
mm/list_lru.c | 690 ++----
mm/maccess.c | 6
mm/madvise.c | 18
mm/memcontrol.c | 549 ++--
mm/memory-failure.c | 148 -
mm/memory.c | 116 -
mm/memory_hotplug.c | 136 -
mm/mempolicy.c | 29
mm/memremap.c | 3
mm/migrate.c | 128 -
mm/mlock.c | 1
mm/mmap.c | 5
mm/mmzone.c | 7
mm/mprotect.c | 13
mm/mremap.c | 4
mm/oom_kill.c | 3
mm/page-writeback.c | 12
mm/page_alloc.c | 429 +--
mm/page_io.c | 7
mm/page_table_check.c | 10
mm/ptdump.c | 16
mm/readahead.c | 124 +
mm/rmap.c | 15
mm/shmem.c | 46
mm/slab.c | 39
mm/slab.h | 25
mm/slob.c | 6
mm/slub.c | 42
mm/sparse-vmemmap.c | 70
mm/sparse.c | 2
mm/swap.c | 25
mm/swapfile.c | 1
mm/usercopy.c | 16
mm/userfaultfd.c | 3
mm/vmalloc.c | 102
mm/vmscan.c | 138 -
mm/vmstat.c | 19
mm/workingset.c | 7
mm/zswap.c | 15
net/socket.c | 2
net/sunrpc/rpc_pipe.c | 2
scripts/spelling.txt | 16
tools/testing/selftests/cgroup/cgroup_util.c | 15
tools/testing/selftests/cgroup/cgroup_util.h | 1
tools/testing/selftests/cgroup/test_memcontrol.c | 78
tools/testing/selftests/damon/Makefile | 1
tools/testing/selftests/damon/sysfs.sh | 306 ++
tools/testing/selftests/vm/.gitignore | 1
tools/testing/selftests/vm/Makefile | 7
tools/testing/selftests/vm/hugepage-vmemmap.c | 144 +
tools/testing/selftests/vm/run_vmtests.sh | 11
tools/testing/selftests/vm/userfaultfd.c | 2
tools/testing/selftests/x86/Makefile | 6
264 files changed, 7205 insertions(+), 3090 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-03-23 23:04 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-03-23 23:04 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mm-commits, linux-mm, patches
Various misc subsystems, before getting into the post-linux-next material.
This is all based on v5.17. I tested applying and compiling against
today's 1bc191051dca28fa6. One patch required an extra whack; all
looks good.
41 patches, based on f443e374ae131c168a065ea1748feac6b2e76613.
Subsystems affected by this patch series:
procfs
misc
core-kernel
lib
checkpatch
init
pipe
minix
fat
cgroups
kexec
kdump
taskstats
panic
kcov
resource
ubsan
Subsystem: procfs
Hao Lee <haolee.swjtu@gmail.com>:
proc: alloc PATH_MAX bytes for /proc/${pid}/fd/ symlinks
David Hildenbrand <david@redhat.com>:
proc/vmcore: fix possible deadlock on concurrent mmap and read
Yang Li <yang.lee@linux.alibaba.com>:
proc/vmcore: fix vmcore_alloc_buf() kernel-doc comment
Subsystem: misc
Bjorn Helgaas <bhelgaas@google.com>:
linux/types.h: remove unnecessary __bitwise__
Documentation/sparse: add hints about __CHECKER__
Subsystem: core-kernel
Miaohe Lin <linmiaohe@huawei.com>:
kernel/ksysfs.c: use helper macro __ATTR_RW
Subsystem: lib
Kees Cook <keescook@chromium.org>:
Kconfig.debug: make DEBUG_INFO selectable from a choice
Rasmus Villemoes <linux@rasmusvillemoes.dk>:
include: drop pointless __compiler_offsetof indirection
Christophe Leroy <christophe.leroy@csgroup.eu>:
ilog2: force inlining of __ilog2_u32() and __ilog2_u64()
Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
bitfield: add explicit inclusions to the example
Feng Tang <feng.tang@intel.com>:
lib/Kconfig.debug: add ARCH dependency for FUNCTION_ALIGN option
Randy Dunlap <rdunlap@infradead.org>:
lib: bitmap: fix many kernel-doc warnings
Subsystem: checkpatch
Joe Perches <joe@perches.com>:
checkpatch: prefer MODULE_LICENSE("GPL") over MODULE_LICENSE("GPL v2")
checkpatch: add --fix option for some TRAILING_STATEMENTS
checkpatch: add early_param exception to blank line after struct/function test
Sagar Patel <sagarmp@cs.unc.edu>:
checkpatch: use python3 to find codespell dictionary
Subsystem: init
Mark-PK Tsai <mark-pk.tsai@mediatek.com>:
init: use ktime_us_delta() to make initcall_debug log more precise
Randy Dunlap <rdunlap@infradead.org>:
init.h: improve __setup and early_param documentation
init/main.c: return 1 from handled __setup() functions
Subsystem: pipe
Andrei Vagin <avagin@gmail.com>:
fs/pipe: use kvcalloc to allocate a pipe_buffer array
fs/pipe.c: local vars have to match types of proper pipe_inode_info fields
Subsystem: minix
Qinghua Jin <qhjin.dev@gmail.com>:
minix: fix bug when opening a file with O_DIRECT
Subsystem: fat
Helge Deller <deller@gmx.de>:
fat: use pointer to simple type in put_user()
Subsystem: cgroups
Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
cgroup: use irqsave in cgroup_rstat_flush_locked().
cgroup: add a comment to cgroup_rstat_flush_locked().
Subsystem: kexec
Jisheng Zhang <jszhang@kernel.org>:
Patch series "kexec: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef", v2:
kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
Subsystem: kdump
Tiezhu Yang <yangtiezhu@loongson.cn>:
Patch series "Update doc and fix some issues about kdump", v2:
docs: kdump: update description about sysfs file system support
docs: kdump: add scp example to write out the dump file
panic: unset panic_on_warn inside panic()
ubsan: no need to unset panic_on_warn in ubsan_epilogue()
kasan: no need to unset panic_on_warn in end_report()
Subsystem: taskstats
Lukas Bulwahn <lukas.bulwahn@gmail.com>:
taskstats: remove unneeded dead assignment
Subsystem: panic
"Guilherme G. Piccoli" <gpiccoli@igalia.com>:
Patch series "Some improvements on panic_print":
docs: sysctl/kernel: add missing bit to panic_print
panic: add option to dump all CPUs backtraces in panic_print
panic: move panic_print before kmsg dumpers
Subsystem: kcov
Aleksandr Nogikh <nogikh@google.com>:
Patch series "kcov: improve mmap processing", v3:
kcov: split ioctl handling into locked and unlocked parts
kcov: properly handle subsequent mmap calls
Subsystem: resource
Miaohe Lin <linmiaohe@huawei.com>:
kernel/resource: fix kfree() of bootmem memory again
Subsystem: ubsan
Marco Elver <elver@google.com>:
Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
Documentation/admin-guide/kdump/kdump.rst | 10 +
Documentation/admin-guide/kernel-parameters.txt | 5
Documentation/admin-guide/sysctl/kernel.rst | 2
Documentation/dev-tools/sparse.rst | 2
arch/arm64/mm/init.c | 9 -
arch/riscv/mm/init.c | 6 -
arch/x86/kernel/setup.c | 10 -
fs/fat/dir.c | 2
fs/minix/inode.c | 3
fs/pipe.c | 13 +-
fs/proc/base.c | 8 -
fs/proc/vmcore.c | 43 +++----
include/linux/bitfield.h | 3
include/linux/compiler_types.h | 3
include/linux/init.h | 11 +
include/linux/kexec.h | 12 +-
include/linux/log2.h | 4
include/linux/stddef.h | 6 -
include/uapi/linux/types.h | 6 -
init/main.c | 14 +-
kernel/cgroup/rstat.c | 13 +-
kernel/kcov.c | 102 ++++++++---------
kernel/ksysfs.c | 3
kernel/panic.c | 37 ++++--
kernel/resource.c | 41 +-----
kernel/taskstats.c | 5
lib/Kconfig.debug | 142 ++++++++++++------------
lib/Kconfig.kcsan | 11 -
lib/Kconfig.ubsan | 12 --
lib/bitmap.c | 24 ++--
lib/ubsan.c | 10 -
mm/kasan/report.c | 10 -
scripts/checkpatch.pl | 31 ++++-
tools/include/linux/types.h | 5
34 files changed, 313 insertions(+), 305 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-03-25 1:07 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-03-25 1:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mm-commits, linux-mm, patches
This is the material which was staged after willystuff in linux-next.
Everything applied seamlessly on your latest; all looks well.
114 patches, based on 52deda9551a01879b3562e7b41748e85c591f14c.
Subsystems affected by this patch series:
mm/debug
mm/selftests
mm/pagecache
mm/thp
mm/rmap
mm/migration
mm/kasan
mm/hugetlb
mm/pagemap
mm/madvise
selftests
Subsystem: mm/debug
Sean Anderson <seanga2@gmail.com>:
tools/vm/page_owner_sort.c: sort by stacktrace before culling
tools/vm/page_owner_sort.c: support sorting by stack trace
Yinan Zhang <zhangyinan2019@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: add switch between culling by stacktrace and txt
Chongxi Zhao <zhaochongxi2019@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: support sorting pid and time
Shenghong Han <hanshenghong2019@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: two trivial fixes
Yixuan Cao <caoyixuan2019@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: delete invalid duplicate code
Shenghong Han <hanshenghong2019@email.szu.edu.cn>:
Documentation/vm/page_owner.rst: update the documentation
Shuah Khan <skhan@linuxfoundation.org>:
Documentation/vm/page_owner.rst: fix unexpected indentation warns
Waiman Long <longman@redhat.com>:
Patch series "mm/page_owner: Extend page_owner to show memcg information", v4:
lib/vsprintf: avoid redundant work with 0 size
mm/page_owner: use scnprintf() to avoid excessive buffer overrun check
mm/page_owner: print memcg information
mm/page_owner: record task command name
Yixuan Cao <caoyixuan2019@email.szu.edu.cn>:
mm/page_owner.c: record tgid
tools/vm/page_owner_sort.c: fix the instructions for use
Jiajian Ye <yejiajian2018@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: fix comments
tools/vm/page_owner_sort.c: add a security check
tools/vm/page_owner_sort.c: support sorting by tgid and update documentation
tools/vm/page_owner_sort: fix three trivial places
tools/vm/page_owner_sort: support for sorting by task command name
tools/vm/page_owner_sort.c: support for selecting by PID, TGID or task command name
tools/vm/page_owner_sort.c: support for user-defined culling rules
Christoph Hellwig <hch@lst.de>:
mm: unexport page_init_poison
Subsystem: mm/selftests
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
selftest/vm: add util.h and and move helper functions there
Mike Rapoport <rppt@kernel.org>:
selftest/vm: add helpers to detect PAGE_SIZE and PAGE_SHIFT
Subsystem: mm/pagecache
Hugh Dickins <hughd@google.com>:
mm: delete __ClearPageWaiters()
mm: filemap_unaccount_folio() large skip mapcount fixup
Subsystem: mm/thp
Hugh Dickins <hughd@google.com>:
mm/thp: fix NR_FILE_MAPPED accounting in page_*_file_rmap()
Subsystem: mm/rmap
Subsystem: mm/migration
Anshuman Khandual <anshuman.khandual@arm.com>:
Patch series "mm/migration: Add trace events", v3:
mm/migration: add trace events for THP migrations
mm/migration: add trace events for base page and HugeTLB migrations
Subsystem: mm/kasan
Andrey Konovalov <andreyknvl@google.com>:
Patch series "kasan, vmalloc, arm64: add vmalloc tagging support for SW/HW_TAGS", v6:
kasan, page_alloc: deduplicate should_skip_kasan_poison
kasan, page_alloc: move tag_clear_highpage out of kernel_init_free_pages
kasan, page_alloc: merge kasan_free_pages into free_pages_prepare
kasan, page_alloc: simplify kasan_poison_pages call site
kasan, page_alloc: init memory of skipped pages on free
kasan: drop skip_kasan_poison variable in free_pages_prepare
mm: clarify __GFP_ZEROTAGS comment
kasan: only apply __GFP_ZEROTAGS when memory is zeroed
kasan, page_alloc: refactor init checks in post_alloc_hook
kasan, page_alloc: merge kasan_alloc_pages into post_alloc_hook
kasan, page_alloc: combine tag_clear_highpage calls in post_alloc_hook
kasan, page_alloc: move SetPageSkipKASanPoison in post_alloc_hook
kasan, page_alloc: move kernel_init_free_pages in post_alloc_hook
kasan, page_alloc: rework kasan_unpoison_pages call site
kasan: clean up metadata byte definitions
kasan: define KASAN_VMALLOC_INVALID for SW_TAGS
kasan, x86, arm64, s390: rename functions for modules shadow
kasan, vmalloc: drop outdated VM_KASAN comment
kasan: reorder vmalloc hooks
kasan: add wrappers for vmalloc hooks
kasan, vmalloc: reset tags in vmalloc functions
kasan, fork: reset pointer tags of vmapped stacks
kasan, arm64: reset pointer tags of vmapped stacks
kasan, vmalloc: add vmalloc tagging for SW_TAGS
kasan, vmalloc, arm64: mark vmalloc mappings as pgprot_tagged
kasan, vmalloc: unpoison VM_ALLOC pages after mapping
kasan, mm: only define ___GFP_SKIP_KASAN_POISON with HW_TAGS
kasan, page_alloc: allow skipping unpoisoning for HW_TAGS
kasan, page_alloc: allow skipping memory init for HW_TAGS
kasan, vmalloc: add vmalloc tagging for HW_TAGS
kasan, vmalloc: only tag normal vmalloc allocations
kasan, arm64: don't tag executable vmalloc allocations
kasan: mark kasan_arg_stacktrace as __initdata
kasan: clean up feature flags for HW_TAGS mode
kasan: add kasan.vmalloc command line flag
kasan: allow enabling KASAN_VMALLOC and SW/HW_TAGS
arm64: select KASAN_VMALLOC for SW/HW_TAGS modes
kasan: documentation updates
kasan: improve vmalloc tests
kasan: test: support async (again) and asymm modes for HW_TAGS
tangmeng <tangmeng@uniontech.com>:
mm/kasan: remove unnecessary CONFIG_KASAN option
Peter Collingbourne <pcc@google.com>:
kasan: update function name in comments
Andrey Konovalov <andreyknvl@google.com>:
kasan: print virtual mapping info in reports
Patch series "kasan: report clean-ups and improvements":
kasan: drop addr check from describe_object_addr
kasan: more line breaks in reports
kasan: rearrange stack frame info in reports
kasan: improve stack frame info in reports
kasan: print basic stack frame info for SW_TAGS
kasan: simplify async check in end_report()
kasan: simplify kasan_update_kunit_status() and call sites
kasan: check CONFIG_KASAN_KUNIT_TEST instead of CONFIG_KUNIT
kasan: move update_kunit_status to start_report
kasan: move disable_trace_on_warning to start_report
kasan: split out print_report from __kasan_report
kasan: simplify kasan_find_first_bad_addr call sites
kasan: restructure kasan_report
kasan: merge __kasan_report into kasan_report
kasan: call print_report from kasan_report_invalid_free
kasan: move and simplify kasan_report_async
kasan: rename kasan_access_info to kasan_report_info
kasan: add comment about UACCESS regions to kasan_report
kasan: respect KASAN_BIT_REPORTED in all reporting routines
kasan: reorder reporting functions
kasan: move and hide kasan_save_enable/restore_multi_shot
kasan: disable LOCKDEP when printing reports
Subsystem: mm/hugetlb
Mike Kravetz <mike.kravetz@oracle.com>:
Patch series "Add hugetlb MADV_DONTNEED support", v3:
mm: enable MADV_DONTNEED for hugetlb mappings
selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
userfaultfd/selftests: enable hugetlb remap and remove event testing
Miaohe Lin <linmiaohe@huawei.com>:
mm/huge_memory: make is_transparent_hugepage() static
Subsystem: mm/pagemap
David Hildenbrand <david@redhat.com>:
Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3:
mm: optimize do_wp_page() for exclusive pages in the swapcache
mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
mm: slightly clarify KSM logic in do_swap_page()
mm: streamline COW logic in do_swap_page()
mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
mm/khugepaged: remove reuse_swap_page() usage
mm/swapfile: remove stale reuse_swap_page()
mm/huge_memory: remove stale page_trans_huge_mapcount()
mm/huge_memory: remove stale locking logic from __split_huge_pmd()
Hugh Dickins <hughd@google.com>:
mm: warn on deleting redirtied only if accounted
mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
Anshuman Khandual <anshuman.khandual@arm.com>:
mm: generalize ARCH_HAS_FILTER_PGPROT
Subsystem: mm/madvise
Mauricio Faria de Oliveira <mfo@canonical.com>:
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Johannes Weiner <hannes@cmpxchg.org>:
mm: madvise: MADV_DONTNEED_LOCKED
Subsystem: selftests
Muhammad Usama Anjum <usama.anjum@collabora.com>:
selftests: vm: remove dependency from internal kernel macros
Kees Cook <keescook@chromium.org>:
selftests: kselftest framework: provide "finished" helper
Documentation/dev-tools/kasan.rst | 17
Documentation/vm/page_owner.rst | 72 ++
arch/alpha/include/uapi/asm/mman.h | 2
arch/arm64/Kconfig | 2
arch/arm64/include/asm/vmalloc.h | 6
arch/arm64/include/asm/vmap_stack.h | 5
arch/arm64/kernel/module.c | 5
arch/arm64/mm/pageattr.c | 2
arch/arm64/net/bpf_jit_comp.c | 3
arch/mips/include/uapi/asm/mman.h | 2
arch/parisc/include/uapi/asm/mman.h | 2
arch/powerpc/mm/book3s64/trace.c | 1
arch/s390/kernel/module.c | 2
arch/x86/Kconfig | 3
arch/x86/kernel/module.c | 2
arch/x86/mm/init.c | 1
arch/xtensa/include/uapi/asm/mman.h | 2
include/linux/gfp.h | 53 +-
include/linux/huge_mm.h | 6
include/linux/kasan.h | 136 +++--
include/linux/mm.h | 5
include/linux/page-flags.h | 2
include/linux/pagemap.h | 3
include/linux/swap.h | 4
include/linux/vmalloc.h | 18
include/trace/events/huge_memory.h | 1
include/trace/events/migrate.h | 31 +
include/trace/events/mmflags.h | 18
include/trace/events/thp.h | 27 +
include/uapi/asm-generic/mman-common.h | 2
kernel/fork.c | 13
kernel/scs.c | 16
lib/Kconfig.kasan | 18
lib/test_kasan.c | 239 ++++++++-
lib/vsprintf.c | 8
mm/Kconfig | 3
mm/debug.c | 1
mm/filemap.c | 63 +-
mm/huge_memory.c | 109 ----
mm/kasan/Makefile | 2
mm/kasan/common.c | 4
mm/kasan/hw_tags.c | 243 +++++++---
mm/kasan/kasan.h | 76 ++-
mm/kasan/report.c | 516 +++++++++++----------
mm/kasan/report_generic.c | 34 -
mm/kasan/report_hw_tags.c | 1
mm/kasan/report_sw_tags.c | 16
mm/kasan/report_tags.c | 2
mm/kasan/shadow.c | 76 +--
mm/khugepaged.c | 11
mm/madvise.c | 57 +-
mm/memory.c | 129 +++--
mm/memremap.c | 2
mm/migrate.c | 4
mm/page-writeback.c | 18
mm/page_alloc.c | 270 ++++++-----
mm/page_owner.c | 86 ++-
mm/rmap.c | 62 +-
mm/swap.c | 4
mm/swapfile.c | 104 ----
mm/vmalloc.c | 167 ++++--
tools/testing/selftests/kselftest.h | 10
tools/testing/selftests/vm/.gitignore | 1
tools/testing/selftests/vm/Makefile | 1
tools/testing/selftests/vm/gup_test.c | 3
tools/testing/selftests/vm/hugetlb-madvise.c | 410 ++++++++++++++++
tools/testing/selftests/vm/ksm_tests.c | 38 -
tools/testing/selftests/vm/memfd_secret.c | 2
tools/testing/selftests/vm/run_vmtests.sh | 15
tools/testing/selftests/vm/transhuge-stress.c | 41 -
tools/testing/selftests/vm/userfaultfd.c | 72 +-
tools/testing/selftests/vm/util.h | 75 ++-
tools/vm/page_owner_sort.c | 628 +++++++++++++++++++++-----
73 files changed, 2797 insertions(+), 1288 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-04-01 18:20 Andrew Morton
2022-04-01 18:27 ` incoming Andrew Morton
0 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2022-04-01 18:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
16 patches, based on e8b767f5e04097aaedcd6e06e2270f9fe5282696.
Subsystems affected by this patch series:
mm/madvise
ocfs2
nilfs2
mm/mlock
mm/kfence
mailmap
mm/memory-failure
mm/kasan
mm/debug
mm/kmemleak
mm/damon
Subsystem: mm/madvise
Charan Teja Kalla <quic_charante@quicinc.com>:
Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
Subsystem: ocfs2
Joseph Qi <joseph.qi@linux.alibaba.com>:
ocfs2: fix crash when mount with quota enabled
Subsystem: nilfs2
Ryusuke Konishi <konishi.ryusuke@gmail.com>:
Patch series "nilfs2 lockdep warning fixes":
nilfs2: fix lockdep warnings in page operations for btree nodes
nilfs2: fix lockdep warnings during disk space reclamation
nilfs2: get rid of nilfs_mapping_init()
Subsystem: mm/mlock
Hugh Dickins <hughd@google.com>:
mm/munlock: add lru_add_drain() to fix memcg_stat_test
mm/munlock: update Documentation/vm/unevictable-lru.rst
Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
mm/munlock: protect the per-CPU pagevec by a local_lock_t
Subsystem: mm/kfence
Muchun Song <songmuchun@bytedance.com>:
mm: kfence: fix objcgs vector allocation
Subsystem: mailmap
Kirill Tkhai <kirill.tkhai@openvz.org>:
mailmap: update Kirill's email
Subsystem: mm/memory-failure
Rik van Riel <riel@surriel.com>:
mm,hwpoison: unmap poisoned page before invalidation
Subsystem: mm/kasan
Andrey Konovalov <andreyknvl@google.com>:
mm, kasan: fix __GFP_BITS_SHIFT definition breaking LOCKDEP
Subsystem: mm/debug
Yinan Zhang <zhangyinan2019@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: remove -c option
doc/vm/page_owner.rst: remove content related to -c option
Subsystem: mm/kmemleak
Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>:
mm/kmemleak: reset tag when comparing object pointers
Subsystem: mm/damon
Jonghyeon Kim <tome01@ajou.ac.kr>:
mm/damon: prevent activated scheme from sleeping by deactivated schemes
.mailmap | 1
Documentation/vm/page_owner.rst | 1
Documentation/vm/unevictable-lru.rst | 473 +++++++++++++++--------------------
fs/nilfs2/btnode.c | 23 +
fs/nilfs2/btnode.h | 1
fs/nilfs2/btree.c | 27 +
fs/nilfs2/dat.c | 4
fs/nilfs2/gcinode.c | 7
fs/nilfs2/inode.c | 167 +++++++++++-
fs/nilfs2/mdt.c | 45 ++-
fs/nilfs2/mdt.h | 6
fs/nilfs2/nilfs.h | 16 -
fs/nilfs2/page.c | 16 -
fs/nilfs2/page.h | 1
fs/nilfs2/segment.c | 9
fs/nilfs2/super.c | 5
fs/ocfs2/quota_global.c | 23 -
fs/ocfs2/quota_local.c | 2
include/linux/gfp.h | 4
mm/damon/core.c | 5
mm/gup.c | 10
mm/internal.h | 6
mm/kfence/core.c | 11
mm/kfence/kfence.h | 3
mm/kmemleak.c | 9
mm/madvise.c | 9
mm/memory.c | 12
mm/migrate.c | 2
mm/mlock.c | 46 ++-
mm/page_alloc.c | 1
mm/rmap.c | 4
mm/swap.c | 4
tools/vm/page_owner_sort.c | 6
33 files changed, 560 insertions(+), 399 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: incoming
2022-04-01 18:20 incoming Andrew Morton
@ 2022-04-01 18:27 ` Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-01 18:27 UTC (permalink / raw)
To: Linus Torvalds, linux-mm, mm-commits, patches
Argh, messed up in-reply-to. Let me redo...
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-04-01 18:27 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-01 18:27 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
16 patches, based on e8b767f5e04097aaedcd6e06e2270f9fe5282696.
Subsystems affected by this patch series:
mm/madvise
ocfs2
nilfs2
mm/mlock
mm/kfence
mailmap
mm/memory-failure
mm/kasan
mm/debug
mm/kmemleak
mm/damon
Subsystem: mm/madvise
Charan Teja Kalla <quic_charante@quicinc.com>:
Revert "mm: madvise: skip unmapped vma holes passed to process_madvise"
Subsystem: ocfs2
Joseph Qi <joseph.qi@linux.alibaba.com>:
ocfs2: fix crash when mount with quota enabled
Subsystem: nilfs2
Ryusuke Konishi <konishi.ryusuke@gmail.com>:
Patch series "nilfs2 lockdep warning fixes":
nilfs2: fix lockdep warnings in page operations for btree nodes
nilfs2: fix lockdep warnings during disk space reclamation
nilfs2: get rid of nilfs_mapping_init()
Subsystem: mm/mlock
Hugh Dickins <hughd@google.com>:
mm/munlock: add lru_add_drain() to fix memcg_stat_test
mm/munlock: update Documentation/vm/unevictable-lru.rst
Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
mm/munlock: protect the per-CPU pagevec by a local_lock_t
Subsystem: mm/kfence
Muchun Song <songmuchun@bytedance.com>:
mm: kfence: fix objcgs vector allocation
Subsystem: mailmap
Kirill Tkhai <kirill.tkhai@openvz.org>:
mailmap: update Kirill's email
Subsystem: mm/memory-failure
Rik van Riel <riel@surriel.com>:
mm,hwpoison: unmap poisoned page before invalidation
Subsystem: mm/kasan
Andrey Konovalov <andreyknvl@google.com>:
mm, kasan: fix __GFP_BITS_SHIFT definition breaking LOCKDEP
Subsystem: mm/debug
Yinan Zhang <zhangyinan2019@email.szu.edu.cn>:
tools/vm/page_owner_sort.c: remove -c option
doc/vm/page_owner.rst: remove content related to -c option
Subsystem: mm/kmemleak
Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>:
mm/kmemleak: reset tag when compare object pointer
Subsystem: mm/damon
Jonghyeon Kim <tome01@ajou.ac.kr>:
mm/damon: prevent activated scheme from sleeping by deactivated schemes
.mailmap | 1
Documentation/vm/page_owner.rst | 1
Documentation/vm/unevictable-lru.rst | 473 +++++++++++++++--------------------
fs/nilfs2/btnode.c | 23 +
fs/nilfs2/btnode.h | 1
fs/nilfs2/btree.c | 27 +
fs/nilfs2/dat.c | 4
fs/nilfs2/gcinode.c | 7
fs/nilfs2/inode.c | 167 +++++++++++-
fs/nilfs2/mdt.c | 45 ++-
fs/nilfs2/mdt.h | 6
fs/nilfs2/nilfs.h | 16 -
fs/nilfs2/page.c | 16 -
fs/nilfs2/page.h | 1
fs/nilfs2/segment.c | 9
fs/nilfs2/super.c | 5
fs/ocfs2/quota_global.c | 23 -
fs/ocfs2/quota_local.c | 2
include/linux/gfp.h | 4
mm/damon/core.c | 5
mm/gup.c | 10
mm/internal.h | 6
mm/kfence/core.c | 11
mm/kfence/kfence.h | 3
mm/kmemleak.c | 9
mm/madvise.c | 9
mm/memory.c | 12
mm/migrate.c | 2
mm/mlock.c | 46 ++-
mm/page_alloc.c | 1
mm/rmap.c | 4
mm/swap.c | 4
tools/vm/page_owner_sort.c | 6
33 files changed, 560 insertions(+), 399 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-04-08 20:08 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-08 20:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
9 patches, based on d00c50b35101b862c3db270ffeba53a63a1063d9.
Subsystems affected by this patch series:
mm/migration
mm/highmem
lz4
mm/sparsemem
mm/mremap
mm/mempolicy
mailmap
mm/memcg
MAINTAINERS
Subsystem: mm/migration
Zi Yan <ziy@nvidia.com>:
mm: migrate: use thp_order instead of HPAGE_PMD_ORDER for new page allocation.
Subsystem: mm/highmem
Max Filippov <jcmvbkbc@gmail.com>:
highmem: fix checks in __kmap_local_sched_{in,out}
Subsystem: lz4
Guo Xuenan <guoxuenan@huawei.com>:
lz4: fix LZ4_decompress_safe_partial read out of bound
Subsystem: mm/sparsemem
Waiman Long <longman@redhat.com>:
mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
Subsystem: mm/mremap
Paolo Bonzini <pbonzini@redhat.com>:
mm/mremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
Subsystem: mm/mempolicy
Miaohe Lin <linmiaohe@huawei.com>:
mm/mempolicy: fix mpol_new leak in shared_policy_replace
Subsystem: mailmap
Vasily Averin <vasily.averin@linux.dev>:
mailmap: update Vasily Averin's email address
Subsystem: mm/memcg
Andrew Morton <akpm@linux-foundation.org>:
mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
Subsystem: MAINTAINERS
Tom Rix <trix@redhat.com>:
MAINTAINERS: add Tom as clang reviewer
.mailmap | 4 ++++
MAINTAINERS | 1 +
include/linux/mmzone.h | 11 +++++++----
lib/lz4/lz4_decompress.c | 8 ++++++--
mm/highmem.c | 4 ++--
mm/list_lru.c | 6 ------
mm/mempolicy.c | 3 ++-
mm/migrate.c | 2 +-
mm/mremap.c | 3 +++
9 files changed, 26 insertions(+), 16 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-04-15 2:12 Andrew Morton
2022-04-15 2:13 ` [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers Andrew Morton
` (13 more replies)
0 siblings, 14 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:12 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
14 patches, based on 115acbb56978941bb7537a97dfc303da286106c1.
Subsystems affected by this patch series:
MAINTAINERS
mm/tmpfs
mm/secretmem
mm/kasan
mm/kfence
mm/pagealloc
mm/zram
mm/compaction
mm/hugetlb
binfmt
mm/vmalloc
mm/kmemleak
Subsystem: MAINTAINERS
Joe Perches <joe@perches.com>:
MAINTAINERS: Broadcom internal lists aren't maintainers
Subsystem: mm/tmpfs
Hugh Dickins <hughd@google.com>:
tmpfs: fix regressions from wider use of ZERO_PAGE
Subsystem: mm/secretmem
Axel Rasmussen <axelrasmussen@google.com>:
mm/secretmem: fix panic when growing a memfd_secret
Subsystem: mm/kasan
Zqiang <qiang1.zhang@intel.com>:
irq_work: use kasan_record_aux_stack_noalloc() record callstack
Vincenzo Frascino <vincenzo.frascino@arm.com>:
kasan: fix hw tags enablement when KUNIT tests are disabled
Subsystem: mm/kfence
Marco Elver <elver@google.com>:
mm, kfence: support kmem_dump_obj() for KFENCE objects
Subsystem: mm/pagealloc
Juergen Gross <jgross@suse.com>:
mm, page_alloc: fix build_zonerefs_node()
Subsystem: mm/zram
Minchan Kim <minchan@kernel.org>:
mm: fix unexpected zeroed page mapping with zram swap
Subsystem: mm/compaction
Charan Teja Kalla <quic_charante@quicinc.com>:
mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
Subsystem: mm/hugetlb
Mike Kravetz <mike.kravetz@oracle.com>:
hugetlb: do not demote poisoned hugetlb pages
Subsystem: binfmt
Andrew Morton <akpm@linux-foundation.org>:
revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
Subsystem: mm/vmalloc
Omar Sandoval <osandov@fb.com>:
mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
Subsystem: mm/kmemleak
Patrick Wang <patrick.wang.shcn@gmail.com>:
mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
MAINTAINERS | 64 ++++++++++++++++++++--------------------
arch/x86/include/asm/io.h | 2 -
arch/x86/kernel/crash_dump_64.c | 1
fs/binfmt_elf.c | 6 +--
include/linux/kfence.h | 24 +++++++++++++++
kernel/irq_work.c | 2 -
mm/compaction.c | 10 +++---
mm/filemap.c | 6 ---
mm/hugetlb.c | 17 ++++++----
mm/kasan/hw_tags.c | 5 +--
mm/kasan/kasan.h | 10 +++---
mm/kfence/core.c | 21 -------------
mm/kfence/kfence.h | 21 +++++++++++++
mm/kfence/report.c | 47 +++++++++++++++++++++++++++++
mm/kmemleak.c | 8 ++---
mm/page_alloc.c | 2 -
mm/page_io.c | 54 ---------------------------------
mm/secretmem.c | 17 ++++++++++
mm/shmem.c | 31 ++++++++++++-------
mm/slab.c | 2 -
mm/slab.h | 2 -
mm/slab_common.c | 9 +++++
mm/slob.c | 2 -
mm/slub.c | 2 -
mm/vmalloc.c | 11 ------
25 files changed, 207 insertions(+), 169 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers
2022-04-15 2:12 incoming Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE Andrew Morton
` (12 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: joe, akpm, patches, linux-mm, mm-commits, torvalds, akpm
From: Joe Perches <joe@perches.com>
Subject: MAINTAINERS: Broadcom internal lists aren't maintainers
Convert the broadcom internal list M: and L: entries to R: as
exploder email addresses are neither maintainers nor mailing lists.
Reorder the entries as necessary.
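For reference, in MAINTAINERS syntax M: marks a maintainer, L: a mailing
list, and R: a designated reviewer; with a made-up entry (the real changes
are in the diff below), the shape of the conversion is:
    BROADCOM EXAMPLE DRIVER
    M: A Maintainer <a@example.org>
    -L: bcm-kernel-feedback-list@broadcom.com
    +R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
    S: Maintained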
Link: https://lkml.kernel.org/r/04eb301f5b3adbefdd78e76657eff0acb3e3d87f.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
MAINTAINERS | 64 +++++++++++++++++++++++++-------------------------
1 file changed, 32 insertions(+), 32 deletions(-)
--- a/MAINTAINERS~maintainers-broadcom-internal-lists-arent-maintainers
+++ a/MAINTAINERS
@@ -3743,7 +3743,7 @@ F: include/linux/platform_data/b53.h
BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE
M: Nicolas Saenz Julienne <nsaenz@kernel.org>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-rpi-kernel@lists.infradead.org (moderated for non-subscribers)
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
@@ -3758,7 +3758,7 @@ BROADCOM BCM281XX/BCM11XXX/BCM216XX ARM
M: Florian Fainelli <f.fainelli@gmail.com>
M: Ray Jui <rjui@broadcom.com>
M: Scott Branden <sbranden@broadcom.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
S: Maintained
T: git git://github.com/broadcom/mach-bcm
F: arch/arm/mach-bcm/
@@ -3778,7 +3778,7 @@ F: arch/mips/include/asm/mach-bcm47xx/*
BROADCOM BCM4908 ETHERNET DRIVER
M: Rafał Miłecki <rafal@milecki.pl>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: netdev@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/brcm,bcm4908-enet.yaml
@@ -3787,7 +3787,7 @@ F: drivers/net/ethernet/broadcom/unimac.
BROADCOM BCM4908 PINMUX DRIVER
M: Rafał Miłecki <rafal@milecki.pl>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-gpio@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/pinctrl/brcm,bcm4908-pinctrl.yaml
@@ -3797,7 +3797,7 @@ BROADCOM BCM5301X ARM ARCHITECTURE
M: Florian Fainelli <f.fainelli@gmail.com>
M: Hauke Mehrtens <hauke@hauke-m.de>
M: Rafał Miłecki <zajec5@gmail.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
F: arch/arm/boot/dts/bcm470*
@@ -3808,7 +3808,7 @@ F: arch/arm/mach-bcm/bcm_5301x.c
BROADCOM BCM53573 ARM ARCHITECTURE
M: Florian Fainelli <f.fainelli@gmail.com>
M: Rafał Miłecki <rafal@milecki.pl>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
F: arch/arm/boot/dts/bcm47189*
@@ -3816,7 +3816,7 @@ F: arch/arm/boot/dts/bcm53573*
BROADCOM BCM63XX ARM ARCHITECTURE
M: Florian Fainelli <f.fainelli@gmail.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
T: git git://github.com/broadcom/stblinux.git
@@ -3830,7 +3830,7 @@ F: drivers/usb/gadget/udc/bcm63xx_udc.*
BROADCOM BCM7XXX ARM ARCHITECTURE
M: Florian Fainelli <f.fainelli@gmail.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
T: git git://github.com/broadcom/stblinux.git
@@ -3848,21 +3848,21 @@ N: bcm7120
BROADCOM BDC DRIVER
M: Al Cooper <alcooperx@gmail.com>
L: linux-usb@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
S: Maintained
F: Documentation/devicetree/bindings/usb/brcm,bdc.yaml
F: drivers/usb/gadget/udc/bdc/
BROADCOM BMIPS CPUFREQ DRIVER
M: Markus Mayer <mmayer@broadcom.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-pm@vger.kernel.org
S: Maintained
F: drivers/cpufreq/bmips-cpufreq.c
BROADCOM BMIPS MIPS ARCHITECTURE
M: Florian Fainelli <f.fainelli@gmail.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-mips@vger.kernel.org
S: Maintained
T: git git://github.com/broadcom/stblinux.git
@@ -3928,53 +3928,53 @@ F: drivers/net/wireless/broadcom/brcm802
BROADCOM BRCMSTB GPIO DRIVER
M: Doug Berger <opendmb@gmail.com>
M: Florian Fainelli <f.fainelli@gmail.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
S: Supported
F: Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.yaml
F: drivers/gpio/gpio-brcmstb.c
BROADCOM BRCMSTB I2C DRIVER
M: Kamal Dasu <kdasu.kdev@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-i2c@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Supported
F: Documentation/devicetree/bindings/i2c/brcm,brcmstb-i2c.yaml
F: drivers/i2c/busses/i2c-brcmstb.c
BROADCOM BRCMSTB UART DRIVER
M: Al Cooper <alcooperx@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-serial@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Maintained
F: Documentation/devicetree/bindings/serial/brcm,bcm7271-uart.yaml
F: drivers/tty/serial/8250/8250_bcm7271.c
BROADCOM BRCMSTB USB EHCI DRIVER
M: Al Cooper <alcooperx@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-usb@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Maintained
F: Documentation/devicetree/bindings/usb/brcm,bcm7445-ehci.yaml
F: drivers/usb/host/ehci-brcm.*
BROADCOM BRCMSTB USB PIN MAP DRIVER
M: Al Cooper <alcooperx@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-usb@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Maintained
F: Documentation/devicetree/bindings/usb/brcm,usb-pinmap.yaml
F: drivers/usb/misc/brcmstb-usb-pinmap.c
BROADCOM BRCMSTB USB2 and USB3 PHY DRIVER
M: Al Cooper <alcooperx@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-kernel@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Maintained
F: drivers/phy/broadcom/phy-brcm-usb*
BROADCOM ETHERNET PHY DRIVERS
M: Florian Fainelli <f.fainelli@gmail.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: netdev@vger.kernel.org
S: Supported
F: Documentation/devicetree/bindings/net/broadcom-bcm87xx.txt
@@ -3985,7 +3985,7 @@ F: include/linux/brcmphy.h
BROADCOM GENET ETHERNET DRIVER
M: Doug Berger <opendmb@gmail.com>
M: Florian Fainelli <f.fainelli@gmail.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: netdev@vger.kernel.org
S: Supported
F: Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml
@@ -3999,7 +3999,7 @@ F: include/linux/platform_data/mdio-bcm-
BROADCOM IPROC ARM ARCHITECTURE
M: Ray Jui <rjui@broadcom.com>
M: Scott Branden <sbranden@broadcom.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
T: git git://github.com/broadcom/stblinux.git
@@ -4027,7 +4027,7 @@ N: stingray
BROADCOM IPROC GBIT ETHERNET DRIVER
M: Rafał Miłecki <rafal@milecki.pl>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: netdev@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/net/brcm,amac.yaml
@@ -4036,7 +4036,7 @@ F: drivers/net/ethernet/broadcom/unimac.
BROADCOM KONA GPIO DRIVER
M: Ray Jui <rjui@broadcom.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
S: Supported
F: Documentation/devicetree/bindings/gpio/brcm,kona-gpio.txt
F: drivers/gpio/gpio-bcm-kona.c
@@ -4069,7 +4069,7 @@ F: drivers/firmware/broadcom/*
BROADCOM PMB (POWER MANAGEMENT BUS) DRIVER
M: Rafał Miłecki <rafal@milecki.pl>
M: Florian Fainelli <f.fainelli@gmail.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-pm@vger.kernel.org
S: Maintained
T: git git://github.com/broadcom/stblinux.git
@@ -4085,7 +4085,7 @@ F: include/linux/bcma/
BROADCOM SPI DRIVER
M: Kamal Dasu <kdasu.kdev@gmail.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
S: Maintained
F: Documentation/devicetree/bindings/spi/brcm,spi-bcm-qspi.yaml
F: drivers/spi/spi-bcm-qspi.*
@@ -4094,7 +4094,7 @@ F: drivers/spi/spi-iproc-qspi.c
BROADCOM STB AVS CPUFREQ DRIVER
M: Markus Mayer <mmayer@broadcom.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-pm@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt
@@ -4102,7 +4102,7 @@ F: drivers/cpufreq/brcmstb*
BROADCOM STB AVS TMON DRIVER
M: Markus Mayer <mmayer@broadcom.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-pm@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/thermal/brcm,avs-tmon.yaml
@@ -4110,7 +4110,7 @@ F: drivers/thermal/broadcom/brcmstb*
BROADCOM STB DPFE DRIVER
M: Markus Mayer <mmayer@broadcom.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
S: Maintained
F: Documentation/devicetree/bindings/memory-controllers/brcm,dpfe-cpu.yaml
@@ -4119,8 +4119,8 @@ F: drivers/memory/brcmstb_dpfe.c
BROADCOM STB NAND FLASH DRIVER
M: Brian Norris <computersforpeace@gmail.com>
M: Kamal Dasu <kdasu.kdev@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-mtd@lists.infradead.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Maintained
F: drivers/mtd/nand/raw/brcmnand/
F: include/linux/platform_data/brcmnand.h
@@ -4129,7 +4129,7 @@ BROADCOM STB PCIE DRIVER
M: Jim Quinlan <jim2101024@gmail.com>
M: Nicolas Saenz Julienne <nsaenz@kernel.org>
M: Florian Fainelli <f.fainelli@gmail.com>
-M: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-pci@vger.kernel.org
S: Maintained
F: Documentation/devicetree/bindings/pci/brcm,stb-pcie.yaml
@@ -4137,7 +4137,7 @@ F: drivers/pci/controller/pcie-brcmstb.c
BROADCOM SYSTEMPORT ETHERNET DRIVER
M: Florian Fainelli <f.fainelli@gmail.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: netdev@vger.kernel.org
S: Supported
F: drivers/net/ethernet/broadcom/bcmsysport.*
@@ -4154,7 +4154,7 @@ F: drivers/net/ethernet/broadcom/tg3.*
BROADCOM VK DRIVER
M: Scott Branden <scott.branden@broadcom.com>
-L: bcm-kernel-feedback-list@broadcom.com
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
S: Supported
F: drivers/misc/bcm-vk/
F: include/uapi/linux/misc/bcm_vk.h
@@ -17648,8 +17648,8 @@ K: \bTIF_SECCOMP\b
SECURE DIGITAL HOST CONTROLLER INTERFACE (SDHCI) Broadcom BRCMSTB DRIVER
M: Al Cooper <alcooperx@gmail.com>
+R: Broadcom Kernel Team <bcm-kernel-feedback-list@broadcom.com>
L: linux-mmc@vger.kernel.org
-L: bcm-kernel-feedback-list@broadcom.com
S: Maintained
F: drivers/mmc/host/sdhci-brcmstb*
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-15 2:12 incoming Andrew Morton
2022-04-15 2:13 ` [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 22:10 ` Linus Torvalds
2022-04-15 2:13 ` [patch 03/14] mm/secretmem: fix panic when growing a memfd_secret Andrew Morton
` (11 subsequent siblings)
13 siblings, 1 reply; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: patrice.chotard, mpatocka, markhemm, lczerner, hch, djwong,
chuck.lever, hughd, akpm, patches, linux-mm, mm-commits, torvalds,
akpm
From: Hugh Dickins <hughd@google.com>
Subject: tmpfs: fix regressions from wider use of ZERO_PAGE
Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing
when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.
Whilst nfsd_splice_action() does contain some questionable handling of
repeated pages, and Chuck was able to work around there, history from
Mark Hemment makes clear that there might be similar dangers elsewhere:
it was not a good idea for me to pass ZERO_PAGE down to unknown actors.
Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
instead of copy_page_to_iter(). We would use iov_iter_zero() throughout,
but the x86 clear_user() is not nearly so well optimized as copy to user
(dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)),
which had caused a boot failure on arm noMMU STM32F7 and STM32H7 boards.
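As a rough sketch, the resulting hole-read logic in shmem_file_read_iter()
becomes (simplified from the diff below, not verbatim kernel source):
    if (page) {
        /* up-to-date page cache page: copy it to the iterator */
        ret = copy_page_to_iter(page, offset, nr, to);
        put_page(page);
    } else if (iter_is_iovec(to)) {
        /* plain read(): copying ZERO_PAGE beats clear_user() on x86 */
        ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
    } else {
        /* splice/pipes: repeating ZERO_PAGE can confuse consumers */
        ret = iov_iter_zero(nr, to);
    }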
Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
Fixes: 56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever III <chuck.lever@oracle.com>
Cc: Mark Hemment <markhemm@googlemail.com>
Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/filemap.c | 6 ------
mm/shmem.c | 31 ++++++++++++++++++++-----------
2 files changed, 20 insertions(+), 17 deletions(-)
--- a/mm/filemap.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/filemap.c
@@ -1063,12 +1063,6 @@ void __init pagecache_init(void)
init_waitqueue_head(&folio_wait_table[i]);
page_writeback_init();
-
- /*
- * tmpfs uses the ZERO_PAGE for reading holes: it is up-to-date,
- * and splice's page_cache_pipe_buf_confirm() needs to see that.
- */
- SetPageUptodate(ZERO_PAGE(0));
}
/*
--- a/mm/shmem.c~tmpfs-fix-regressions-from-wider-use-of-zero_page
+++ a/mm/shmem.c
@@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(stru
pgoff_t end_index;
unsigned long nr, ret;
loff_t i_size = i_size_read(inode);
- bool got_page;
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
@@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(stru
*/
if (!offset)
mark_page_accessed(page);
- got_page = true;
+ /*
+ * Ok, we have the page, and it's up-to-date, so
+ * now we can copy it to user space...
+ */
+ ret = copy_page_to_iter(page, offset, nr, to);
+ put_page(page);
+
+ } else if (iter_is_iovec(to)) {
+ /*
+ * Copy to user tends to be so well optimized, but
+ * clear_user() not so much, that it is noticeably
+ * faster to copy the zero page instead of clearing.
+ */
+ ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
} else {
- page = ZERO_PAGE(0);
- got_page = false;
+ /*
+ * But submitting the same page twice in a row to
+ * splice() - or others? - can result in confusion:
+ * so don't attempt that optimization on pipes etc.
+ */
+ ret = iov_iter_zero(nr, to);
}
- /*
- * Ok, we have the page, and it's up-to-date, so
- * now we can copy it to user space...
- */
- ret = copy_page_to_iter(page, offset, nr, to);
retval += ret;
offset += ret;
index += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;
- if (got_page)
- put_page(page);
if (!iov_iter_count(to))
break;
if (ret < nr) {
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 03/14] mm/secretmem: fix panic when growing a memfd_secret
2022-04-15 2:12 incoming Andrew Morton
2022-04-15 2:13 ` [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers Andrew Morton
2022-04-15 2:13 ` [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 04/14] irq_work: use kasan_record_aux_stack_noalloc() record callstack Andrew Morton
` (10 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: willy, stable, rppt, lkp, axelrasmussen, akpm, patches, linux-mm,
mm-commits, torvalds, akpm
From: Axel Rasmussen <axelrasmussen@google.com>
Subject: mm/secretmem: fix panic when growing a memfd_secret
When one tries to grow an existing memfd_secret with ftruncate, one gets a
panic [1]. For example, doing the following reliably induces the panic:
fd = memfd_secret();
ftruncate(fd, 10);
ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
strcpy(ptr, "123456789");
munmap(ptr, 10);
ftruncate(fd, 20);
The basic reason for this is that, when we grow with ftruncate, we call down
into simple_setattr, and then truncate_inode_pages_range, and eventually
we try to zero part of the memory. The normal truncation code does this
via the direct map (i.e., it calls page_address() and hands that to
memset()).
For memfd_secret though, we specifically don't map our pages via the
direct map (i.e. we call set_direct_map_invalid_noflush() on every
fault). So the address returned by page_address() isn't useful, and when
we try to memset() with it we panic.
This patch avoids the panic by implementing a custom setattr for
memfd_secret, which detects resizes specifically (setting the size for the
first time works just fine, since there are no existing pages to try to
zero), and rejects them with EINVAL.
One could argue growing should be supported, but I think that will require
a significantly more lengthy change. So, I propose a minimal fix for the
benefit of stable kernels, and then perhaps to extend memfd_secret to
support growing in a separate patch.
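For reference, a self-contained version of the reproducer might look like
the sketch below (assumptions: an x86-64 kernel booted with secretmem
enabled, and headers recent enough to define SYS_memfd_secret; there is no
libc wrapper for this syscall):
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* no libc wrapper; SYS_memfd_secret is 447 on x86-64 */
        int fd = syscall(SYS_memfd_secret, 0);
        if (fd < 0) {
            perror("memfd_secret");
            return 1;
        }
        if (ftruncate(fd, 10)) {
            perror("ftruncate");
            return 1;
        }
        char *ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        strcpy(ptr, "123456789");   /* fault the page in */
        munmap(ptr, 10);
        /* growing the existing memfd_secret triggers the panic (pre-fix) */
        if (ftruncate(fd, 20))
            perror("ftruncate (grow)");
        return 0;
    }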
[1]:
[ 774.320433] BUG: unable to handle page fault for address: ffffa0a889277028
[ 774.322297] #PF: supervisor write access in kernel mode
[ 774.323306] #PF: error_code(0x0002) - not-present page
[ 774.324296] PGD afa01067 P4D afa01067 PUD 83f909067 PMD 83f8bf067 PTE 800ffffef6d88060
[ 774.325841] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[ 774.326934] CPU: 0 PID: 281 Comm: repro Not tainted 5.17.0-dbg-DEV #1
[ 774.328074] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 774.329732] RIP: 0010:memset_erms+0x9/0x10
[ 774.330474] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
[ 774.333543] RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
[ 774.334404] RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
[ 774.335545] RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
[ 774.336685] RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
[ 774.337929] R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
[ 774.339236] R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
[ 774.340356] FS: 00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
[ 774.341635] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 774.342535] CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
[ 774.343651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 774.344780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 774.345938] Call Trace:
[ 774.346334] <TASK>
[ 774.346671] ? zero_user_segments+0x82/0x190
[ 774.347346] truncate_inode_partial_folio+0xd4/0x2a0
[ 774.348128] truncate_inode_pages_range+0x380/0x830
[ 774.348904] truncate_setsize+0x63/0x80
[ 774.349530] simple_setattr+0x37/0x60
[ 774.350102] notify_change+0x3d8/0x4d0
[ 774.350681] do_sys_ftruncate+0x162/0x1d0
[ 774.351302] __x64_sys_ftruncate+0x1c/0x20
[ 774.351936] do_syscall_64+0x44/0xa0
[ 774.352486] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 774.353284] RIP: 0033:0x7f72947c392b
[ 774.354001] Code: 77 05 c3 0f 1f 40 00 48 8b 15 41 85 0c 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 11 85 0c 00 f7 d8
[ 774.357938] RSP: 002b:00007ffcad62a1a8 EFLAGS: 00000202 ORIG_RAX: 000000000000004d
[ 774.359116] RAX: ffffffffffffffda RBX: 000055f47662b440 RCX: 00007f72947c392b
[ 774.360186] RDX: 0000000000000028 RSI: 0000000000000028 RDI: 0000000000000003
[ 774.361246] RBP: 00007ffcad62a1c0 R08: 0000000000000003 R09: 0000000000000000
[ 774.362324] R10: 00007f72946dc230 R11: 0000000000000202 R12: 000055f47662b0e0
[ 774.363393] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 774.364470] </TASK>
[ 774.364807] Modules linked in: xhci_pci xhci_hcd virtio_net net_failover failover virtio_blk virtio_balloon uhci_hcd ohci_pci ohci_hcd evdev ehci_pci ehci_hcd 9pnet_virtio 9p netfs 9pnet
[ 774.367325] CR2: ffffa0a889277028
[ 774.367838] ---[ end trace 0000000000000000 ]---
[ 774.368543] RIP: 0010:memset_erms+0x9/0x10
[ 774.369187] Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
[ 774.372282] RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
[ 774.373372] RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
[ 774.374814] RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
[ 774.376248] RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
[ 774.377687] R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
[ 774.379135] R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
[ 774.380550] FS: 00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
[ 774.382177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 774.383329] CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
[ 774.384763] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 774.386229] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 774.387664] Kernel panic - not syncing: Fatal exception
[ 774.388863] Kernel Offset: 0x8000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 774.391014] ---[ end Kernel panic - not syncing: Fatal exception ]---
[lkp@intel.com: secretmem_iops can be static]
Signed-off-by: kernel test robot <lkp@intel.com>
[axelrasmussen@google.com: return EINVAL]
Link: https://lkml.kernel.org/r/20220412193023.279320-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220324210909.1843814-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/secretmem.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
--- a/mm/secretmem.c~mm-secretmem-fix-panic-when-growing-a-memfd_secret
+++ a/mm/secretmem.c
@@ -158,6 +158,22 @@ const struct address_space_operations se
.isolate_page = secretmem_isolate_page,
};
+static int secretmem_setattr(struct user_namespace *mnt_userns,
+ struct dentry *dentry, struct iattr *iattr)
+{
+ struct inode *inode = d_inode(dentry);
+ unsigned int ia_valid = iattr->ia_valid;
+
+ if ((ia_valid & ATTR_SIZE) && inode->i_size)
+ return -EINVAL;
+
+ return simple_setattr(mnt_userns, dentry, iattr);
+}
+
+static const struct inode_operations secretmem_iops = {
+ .setattr = secretmem_setattr,
+};
+
static struct vfsmount *secretmem_mnt;
static struct file *secretmem_file_create(unsigned long flags)
@@ -177,6 +193,7 @@ static struct file *secretmem_file_creat
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
mapping_set_unevictable(inode->i_mapping);
+ inode->i_op = &secretmem_iops;
inode->i_mapping->a_ops = &secretmem_aops;
/* pretend we are a normal file with zero size */
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 04/14] irq_work: use kasan_record_aux_stack_noalloc() record callstack
2022-04-15 2:12 incoming Andrew Morton
` (2 preceding siblings ...)
2022-04-15 2:13 ` [patch 03/14] mm/secretmem: fix panic when growing a memfd_secret Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 05/14] kasan: fix hw tags enablement when KUNIT tests are disabled Andrew Morton
` (9 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: ryabinin.a.a, glider, dvyukov, andreyknvl, qiang1.zhang, akpm,
patches, linux-mm, mm-commits, torvalds, akpm
From: Zqiang <qiang1.zhang@intel.com>
Subject: irq_work: use kasan_record_aux_stack_noalloc() record callstack
[ 4.113128] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
[ 4.113132] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd
[ 4.113149] Preemption disabled at:
[ 4.113149] [<ffffffffbab1a531>] rt_mutex_slowunlock+0xa1/0x4e0
[ 4.113154] CPU: 3 PID: 239 Comm: bootlogd Tainted: G W
5.17.1-rt17-yocto-preempt-rt+ #105
[ 4.113157] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
[ 4.113159] Call Trace:
[ 4.113160] <TASK>
[ 4.113161] dump_stack_lvl+0x60/0x8c
[ 4.113165] dump_stack+0x10/0x12
[ 4.113167] __might_resched.cold+0x13b/0x173
[ 4.113172] rt_spin_lock+0x5b/0xf0
[ 4.113179] get_page_from_freelist+0x20c/0x1610
[ 4.113208] __alloc_pages+0x25e/0x5e0
[ 4.113222] __stack_depot_save+0x3c0/0x4a0
[ 4.113228] kasan_save_stack+0x3a/0x50
[ 4.113322] __kasan_record_aux_stack+0xb6/0xc0
[ 4.113326] kasan_record_aux_stack+0xe/0x10
[ 4.113329] irq_work_queue_on+0x6a/0x1c0
[ 4.113333] pull_rt_task+0x631/0x6b0
[ 4.113343] do_balance_callbacks+0x56/0x80
[ 4.113346] __balance_callbacks+0x63/0x90
[ 4.113350] rt_mutex_setprio+0x349/0x880
[ 4.113366] rt_mutex_slowunlock+0x22a/0x4e0
[ 4.113377] rt_spin_unlock+0x49/0x80
[ 4.113380] uart_write+0x186/0x2b0
[ 4.113385] do_output_char+0x2e9/0x3a0
[ 4.113389] n_tty_write+0x306/0x800
[ 4.113413] file_tty_write.isra.0+0x2af/0x450
[ 4.113422] tty_write+0x22/0x30
[ 4.113425] new_sync_write+0x27c/0x3a0
[ 4.113446] vfs_write+0x3f7/0x5d0
[ 4.113451] ksys_write+0xd9/0x180
[ 4.113463] __x64_sys_write+0x43/0x50
[ 4.113466] do_syscall_64+0x44/0x90
[ 4.113469] entry_SYSCALL_64_after_hwframe+0x44/0xae
On a PREEMPT_RT kernel with KASAN enabled, kasan_record_aux_stack()
may call alloc_pages(), which acquires an rt-spinlock; if that happens
in atomic context, it triggers the warning above. Fix it by using
kasan_record_aux_stack_noalloc() to avoid calling alloc_pages().
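For reference, the two helpers share a signature; only the allocation
behavior differs (a sketch of the include/linux/kasan.h declarations):
    /* record a call stack in the object's aux stack slot;
     * may allocate new stack depot pool pages */
    void kasan_record_aux_stack(void *ptr);

    /* same, but forbid stack depot from allocating, which makes it
     * safe to call from atomic context on PREEMPT_RT */
    void kasan_record_aux_stack_noalloc(void *ptr);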
Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/irq_work.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/irq_work.c~irq_work-use-kasan_record_aux_stack_noalloc-record-callstack
+++ a/kernel/irq_work.c
@@ -137,7 +137,7 @@ bool irq_work_queue_on(struct irq_work *
if (!irq_work_claim(work))
return false;
- kasan_record_aux_stack(work);
+ kasan_record_aux_stack_noalloc(work);
preempt_disable();
if (cpu != smp_processor_id()) {
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 05/14] kasan: fix hw tags enablement when KUNIT tests are disabled
2022-04-15 2:12 incoming Andrew Morton
` (3 preceding siblings ...)
2022-04-15 2:13 ` [patch 04/14] irq_work: use kasan_record_aux_stack_noalloc() record callstack Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 06/14] mm, kfence: support kmem_dump_obj() for KFENCE objects Andrew Morton
` (8 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: will, ryabinin.a.a, glider, dvyukov, catalin.marinas, andreyknvl,
vincenzo.frascino, akpm, patches, linux-mm, mm-commits, torvalds,
akpm
From: Vincenzo Frascino <vincenzo.frascino@arm.com>
Subject: kasan: fix hw tags enablement when KUNIT tests are disabled
KASAN enables hw tags via kasan_enable_tagging(), which selects the
correct hw backend based on the mode passed via the kernel command line.
kasan_enable_tagging() is meant to be invoked indirectly via the cpu
features framework of the architectures that support these backends.
Currently the invocation of this function is guarded by
CONFIG_KASAN_KUNIT_TEST which allows the enablement of the correct backend
only when KUNIT tests are enabled in the kernel.
This inconsistency was introduced in commit:
ed6d74446cbf ("kasan: test: support async (again) and asymm modes for HW_TAGS")
... and prevents enabling MTE on arm64 when KUNIT tests for kasan hw_tags
are disabled.
Fix the issue making sure that the CONFIG_KASAN_KUNIT_TEST guard does not
prevent the correct invocation of kasan_enable_tagging().
Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
Fixes: ed6d74446cbf ("kasan: test: support async (again) and asymm modes for HW_TAGS")
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/kasan/hw_tags.c | 5 +++--
mm/kasan/kasan.h | 10 ++++++----
2 files changed, 9 insertions(+), 6 deletions(-)
--- a/mm/kasan/hw_tags.c~kasan-fix-hw-tags-enablement-when-kunit-tests-are-disabled
+++ a/mm/kasan/hw_tags.c
@@ -336,8 +336,6 @@ void __kasan_poison_vmalloc(const void *
#endif
-#if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
-
void kasan_enable_tagging(void)
{
if (kasan_arg_mode == KASAN_ARG_MODE_ASYNC)
@@ -347,6 +345,9 @@ void kasan_enable_tagging(void)
else
hw_enable_tagging_sync();
}
+
+#if IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
+
EXPORT_SYMBOL_GPL(kasan_enable_tagging);
void kasan_force_async_fault(void)
--- a/mm/kasan/kasan.h~kasan-fix-hw-tags-enablement-when-kunit-tests-are-disabled
+++ a/mm/kasan/kasan.h
@@ -355,25 +355,27 @@ static inline const void *arch_kasan_set
#define hw_set_mem_tag_range(addr, size, tag, init) \
arch_set_mem_tag_range((addr), (size), (tag), (init))
+void kasan_enable_tagging(void);
+
#else /* CONFIG_KASAN_HW_TAGS */
#define hw_enable_tagging_sync()
#define hw_enable_tagging_async()
#define hw_enable_tagging_asymm()
+static inline void kasan_enable_tagging(void) { }
+
#endif /* CONFIG_KASAN_HW_TAGS */
#if defined(CONFIG_KASAN_HW_TAGS) && IS_ENABLED(CONFIG_KASAN_KUNIT_TEST)
-void kasan_enable_tagging(void);
void kasan_force_async_fault(void);
-#else /* CONFIG_KASAN_HW_TAGS || CONFIG_KASAN_KUNIT_TEST */
+#else /* CONFIG_KASAN_HW_TAGS && CONFIG_KASAN_KUNIT_TEST */
-static inline void kasan_enable_tagging(void) { }
static inline void kasan_force_async_fault(void) { }
-#endif /* CONFIG_KASAN_HW_TAGS || CONFIG_KASAN_KUNIT_TEST */
+#endif /* CONFIG_KASAN_HW_TAGS && CONFIG_KASAN_KUNIT_TEST */
#ifdef CONFIG_KASAN_SW_TAGS
u8 kasan_random_tag(void);
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 06/14] mm, kfence: support kmem_dump_obj() for KFENCE objects
2022-04-15 2:12 incoming Andrew Morton
` (4 preceding siblings ...)
2022-04-15 2:13 ` [patch 05/14] kasan: fix hw tags enablement when KUNIT tests are disabled Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 07/14] mm, page_alloc: fix build_zonerefs_node() Andrew Morton
` (7 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: vbabka, oliver.sang, 42.hyeyoo, elver, akpm, patches, linux-mm,
mm-commits, torvalds, akpm
From: Marco Elver <elver@google.com>
Subject: mm, kfence: support kmem_dump_obj() for KFENCE objects
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
producing garbage data due to the object not actually being maintained by
SLAB or SLUB.
Fix this by implementing __kfence_obj_info() that copies relevant
information to struct kmem_obj_info when the object was allocated by
KFENCE; this is called by a common kmem_obj_info(), which also calls the
slab/slub/slob specific variant now called __kmem_obj_info().
For completeness, kmem_dump_obj() now displays if the object was allocated
by KFENCE.
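As a usage sketch ('suspect' is a hypothetical pointer under investigation;
both helpers are the existing entry points in mm/slab_common.c):
    if (kmem_valid_obj(suspect))
        kmem_dump_obj(suspect);  /* now reports " (kfence)" when applicable */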
Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB")
Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/kfence.h | 24 +++++++++++++++++++
mm/kfence/core.c | 21 -----------------
mm/kfence/kfence.h | 21 +++++++++++++++++
mm/kfence/report.c | 47 +++++++++++++++++++++++++++++++++++++++
mm/slab.c | 2 -
mm/slab.h | 2 -
mm/slab_common.c | 9 +++++++
mm/slob.c | 2 -
mm/slub.c | 2 -
9 files changed, 105 insertions(+), 25 deletions(-)
--- a/include/linux/kfence.h~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/include/linux/kfence.h
@@ -204,6 +204,22 @@ static __always_inline __must_check bool
*/
bool __must_check kfence_handle_page_fault(unsigned long addr, bool is_write, struct pt_regs *regs);
+#ifdef CONFIG_PRINTK
+struct kmem_obj_info;
+/**
+ * __kfence_obj_info() - fill kmem_obj_info struct
+ * @kpp: kmem_obj_info to be filled
+ * @object: the object
+ *
+ * Return:
+ * * false - not a KFENCE object
+ * * true - a KFENCE object, filled @kpp
+ *
+ * Copies information to @kpp for KFENCE objects.
+ */
+bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
+#endif
+
#else /* CONFIG_KFENCE */
static inline bool is_kfence_address(const void *addr) { return false; }
@@ -221,6 +237,14 @@ static inline bool __must_check kfence_h
return false;
}
+#ifdef CONFIG_PRINTK
+struct kmem_obj_info;
+static inline bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+{
+ return false;
+}
+#endif
+
#endif
#endif /* _LINUX_KFENCE_H */
--- a/mm/kfence/core.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/kfence/core.c
@@ -231,27 +231,6 @@ static bool kfence_unprotect(unsigned lo
return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), false));
}
-static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
-{
- long index;
-
- /* The checks do not affect performance; only called from slow-paths. */
-
- if (!is_kfence_address((void *)addr))
- return NULL;
-
- /*
- * May be an invalid index if called with an address at the edge of
- * __kfence_pool, in which case we would report an "invalid access"
- * error.
- */
- index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
- if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
- return NULL;
-
- return &kfence_metadata[index];
-}
-
static inline unsigned long metadata_to_pageaddr(const struct kfence_metadata *meta)
{
unsigned long offset = (meta - kfence_metadata + 1) * PAGE_SIZE * 2;
--- a/mm/kfence/kfence.h~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/kfence/kfence.h
@@ -96,6 +96,27 @@ struct kfence_metadata {
extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
+static inline struct kfence_metadata *addr_to_metadata(unsigned long addr)
+{
+ long index;
+
+ /* The checks do not affect performance; only called from slow-paths. */
+
+ if (!is_kfence_address((void *)addr))
+ return NULL;
+
+ /*
+ * May be an invalid index if called with an address at the edge of
+ * __kfence_pool, in which case we would report an "invalid access"
+ * error.
+ */
+ index = (addr - (unsigned long)__kfence_pool) / (PAGE_SIZE * 2) - 1;
+ if (index < 0 || index >= CONFIG_KFENCE_NUM_OBJECTS)
+ return NULL;
+
+ return &kfence_metadata[index];
+}
+
/* KFENCE error types for report generation. */
enum kfence_error_type {
KFENCE_ERROR_OOB, /* Detected a out-of-bounds access. */
--- a/mm/kfence/report.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/kfence/report.c
@@ -273,3 +273,50 @@ void kfence_report_error(unsigned long a
/* We encountered a memory safety error, taint the kernel! */
add_taint(TAINT_BAD_PAGE, LOCKDEP_STILL_OK);
}
+
+#ifdef CONFIG_PRINTK
+static void kfence_to_kp_stack(const struct kfence_track *track, void **kp_stack)
+{
+ int i, j;
+
+ i = get_stack_skipnr(track->stack_entries, track->num_stack_entries, NULL);
+ for (j = 0; i < track->num_stack_entries && j < KS_ADDRS_COUNT; ++i, ++j)
+ kp_stack[j] = (void *)track->stack_entries[i];
+ if (j < KS_ADDRS_COUNT)
+ kp_stack[j] = NULL;
+}
+
+bool __kfence_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+{
+ struct kfence_metadata *meta = addr_to_metadata((unsigned long)object);
+ unsigned long flags;
+
+ if (!meta)
+ return false;
+
+ /*
+ * If state is UNUSED at least show the pointer requested; the rest
+ * would be garbage data.
+ */
+ kpp->kp_ptr = object;
+
+ /* Requesting info on a never-used object is almost certainly a bug. */
+ if (WARN_ON(meta->state == KFENCE_OBJECT_UNUSED))
+ return true;
+
+ raw_spin_lock_irqsave(&meta->lock, flags);
+
+ kpp->kp_slab = slab;
+ kpp->kp_slab_cache = meta->cache;
+ kpp->kp_objp = (void *)meta->addr;
+ kfence_to_kp_stack(&meta->alloc_track, kpp->kp_stack);
+ if (meta->state == KFENCE_OBJECT_FREED)
+ kfence_to_kp_stack(&meta->free_track, kpp->kp_free_stack);
+ /* get_stack_skipnr() ensures the first entry is outside allocator. */
+ kpp->kp_ret = kpp->kp_stack[0];
+
+ raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+ return true;
+}
+#endif
--- a/mm/slab.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slab.c
@@ -3665,7 +3665,7 @@ EXPORT_SYMBOL(__kmalloc_node_track_calle
#endif /* CONFIG_NUMA */
#ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
{
struct kmem_cache *cachep;
unsigned int objnr;
--- a/mm/slab_common.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slab_common.c
@@ -555,6 +555,13 @@ bool kmem_valid_obj(void *object)
}
EXPORT_SYMBOL_GPL(kmem_valid_obj);
+static void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+{
+ if (__kfence_obj_info(kpp, object, slab))
+ return;
+ __kmem_obj_info(kpp, object, slab);
+}
+
/**
* kmem_dump_obj - Print available slab provenance information
* @object: slab object for which to find provenance information.
@@ -590,6 +597,8 @@ void kmem_dump_obj(void *object)
pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name);
else
pr_cont(" slab%s", cp);
+ if (is_kfence_address(object))
+ pr_cont(" (kfence)");
if (kp.kp_objp)
pr_cont(" start %px", kp.kp_objp);
if (kp.kp_data_offset)
--- a/mm/slab.h~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slab.h
@@ -868,7 +868,7 @@ struct kmem_obj_info {
void *kp_stack[KS_ADDRS_COUNT];
void *kp_free_stack[KS_ADDRS_COUNT];
};
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab);
#endif
#ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
--- a/mm/slob.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slob.c
@@ -463,7 +463,7 @@ out:
}
#ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
{
kpp->kp_ptr = object;
kpp->kp_slab = slab;
--- a/mm/slub.c~mm-kfence-support-kmem_dump_obj-for-kfence-objects
+++ a/mm/slub.c
@@ -4312,7 +4312,7 @@ int __kmem_cache_shutdown(struct kmem_ca
}
#ifdef CONFIG_PRINTK
-void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
+void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
{
void *base;
int __maybe_unused i;
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 07/14] mm, page_alloc: fix build_zonerefs_node()
2022-04-15 2:12 incoming Andrew Morton
` (5 preceding siblings ...)
2022-04-15 2:13 ` [patch 06/14] mm, kfence: support kmem_dump_obj() for KFENCE objects Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 08/14] mm: fix unexpected zeroed page mapping with zram swap Andrew Morton
` (6 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: stable, richard.weiyang, mhocko, marmarek, david, jgross, akpm,
patches, linux-mm, mm-commits, torvalds, akpm
From: Juergen Gross <jgross@suse.com>
Subject: mm, page_alloc: fix build_zonerefs_node()
Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from
zones with pages managed by the buddy allocator") only zones with free
memory are included in a built zonelist. This is problematic when e.g.
all memory of a zone has been ballooned out when zonelists are being
rebuilt.
The decision whether to rebuild the zonelists when onlining new memory is
done based on populated_zone() returning 0 for the zone the memory will be
added to. The new zone is added to the zonelists only, if it has free
memory pages (managed_zone() returns a non-zero value) after the memory
has been onlined. This implies, that onlining memory will always free the
added pages to the allocator immediately, but this is not true in all
cases: when e.g. running as a Xen guest the onlined new memory will be
added only to the ballooned memory list, it will be freed only when the
guest is being ballooned up afterwards.
Another problem with using managed_zone() to decide whether a zone is
added to the zonelists is that a zone with all of its memory in use will
in fact be removed from all zonelists in case the zonelists happen to be
rebuilt.
Use populated_zone() when building a zonelist, as was done before that
commit.
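For context, the two predicates differ roughly as follows (a simplified
sketch of the include/linux/mmzone.h helpers, not verbatim source):
    static inline bool managed_zone(struct zone *zone)
    {
        /* pages actually handed over to the buddy allocator */
        return zone_managed_pages(zone) > 0;
    }

    static inline bool populated_zone(struct zone *zone)
    {
        /* physical pages present, managed by the buddy or not */
        return zone->present_pages > 0;
    }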
There was a report that QubesOS (based on Xen) is hitting this problem.
Xen switched to using the zone device functionality in kernel 5.9,
and QubesOS wants to use memory hotplugging for guests in order to be
able to start a guest with minimal memory and expand it as needed.
This was the report leading to the patch.
Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/page_alloc.c~mm-page_alloc-fix-build_zonerefs_node
+++ a/mm/page_alloc.c
@@ -6131,7 +6131,7 @@ static int build_zonerefs_node(pg_data_t
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
- if (managed_zone(zone)) {
+ if (populated_zone(zone)) {
zoneref_set_zone(zone, &zonerefs[nr_zones++]);
check_highest_zone(zone_type);
}
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 08/14] mm: fix unexpected zeroed page mapping with zram swap
2022-04-15 2:12 incoming Andrew Morton
` (6 preceding siblings ...)
2022-04-15 2:13 ` [patch 07/14] mm, page_alloc: fix build_zonerefs_node() Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 09/14] mm: compaction: fix compiler warning when CONFIG_COMPACTION=n Andrew Morton
` (5 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: stable, senozhatsky, ngupta, ivan, david, axboe, minchan, akpm,
patches, linux-mm, mm-commits, torvalds, akpm
From: Minchan Kim <minchan@kernel.org>
Subject: mm: fix unexpected zeroed page mapping with zram swap
With two processes cloned under CLONE_VM, a user process can be
corrupted by unexpectedly seeing a zeroed page.
CPU A                         CPU B
do_swap_page                  do_swap_page
  SWP_SYNCHRONOUS_IO path       SWP_SYNCHRONOUS_IO path
  swap_readpage valid data
    swap_slot_free_notify
      delete zram entry
                                swap_readpage zeroed(invalid) data
                                pte_lock
                                map the *zero data* to userspace
                                pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return and next refault will
read zeroed data
swap_slot_free_notify() is bogus for the CLONE_VM case: the swap slot's
refcount is not increased at copy_mm() time, so it cannot tell whether
it is safe to discard the data from the backing device. In that case,
the only lock it could rely on to synchronize swap slot freeing is the
page table lock. Thus, this patch gets rid of the swap_slot_free_notify
function. With this patch, CPU A will see correct data.
CPU A                         CPU B
do_swap_page                  do_swap_page
  SWP_SYNCHRONOUS_IO path       SWP_SYNCHRONOUS_IO path
  swap_readpage original data
  pte_lock
  map the original data
  swap_free
    swap_range_free
      bd_disk->fops->swap_slot_free_notify
                                swap_readpage read zeroed data
  pte_unlock
                                pte_lock
                                if (!pte_same)
                                  goto out_nomap;
                                pte_unlock
                                return
                                on next refault will see mapped data by CPU B
A concern with this patch is increased memory consumption, since it can
keep wasted memory in compressed form in zram as well as in uncompressed
form in the address space. However, most zram configurations use no
readahead, and do_swap_page() is followed by swap_free(), so the
compressed form in zram is freed quickly.
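(For reference, swap_slot_free_notify is the optional hook in struct
block_device_operations that zram implements; the member is sketched here
in isolation, slightly simplified from include/linux/blkdev.h:)
    /* tell the backing device that a swap slot was freed */
    void (*swap_slot_free_notify)(struct block_device *bdev,
                                  unsigned long offset);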
Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_io.c | 54 -------------------------------------------------
1 file changed, 54 deletions(-)
--- a/mm/page_io.c~mm-fix-unexpected-zeroed-page-mapping-with-zram-swap
+++ a/mm/page_io.c
@@ -51,54 +51,6 @@ void end_swap_bio_write(struct bio *bio)
bio_put(bio);
}
-static void swap_slot_free_notify(struct page *page)
-{
- struct swap_info_struct *sis;
- struct gendisk *disk;
- swp_entry_t entry;
-
- /*
- * There is no guarantee that the page is in swap cache - the software
- * suspend code (at least) uses end_swap_bio_read() against a non-
- * swapcache page. So we must check PG_swapcache before proceeding with
- * this optimization.
- */
- if (unlikely(!PageSwapCache(page)))
- return;
-
- sis = page_swap_info(page);
- if (data_race(!(sis->flags & SWP_BLKDEV)))
- return;
-
- /*
- * The swap subsystem performs lazy swap slot freeing,
- * expecting that the page will be swapped out again.
- * So we can avoid an unnecessary write if the page
- * isn't redirtied.
- * This is good for real swap storage because we can
- * reduce unnecessary I/O and enhance wear-leveling
- * if an SSD is used as the as swap device.
- * But if in-memory swap device (eg zram) is used,
- * this causes a duplicated copy between uncompressed
- * data in VM-owned memory and compressed data in
- * zram-owned memory. So let's free zram-owned memory
- * and make the VM-owned decompressed page *dirty*,
- * so the page should be swapped out somewhere again if
- * we again wish to reclaim it.
- */
- disk = sis->bdev->bd_disk;
- entry.val = page_private(page);
- if (disk->fops->swap_slot_free_notify && __swap_count(entry) == 1) {
- unsigned long offset;
-
- offset = swp_offset(entry);
-
- SetPageDirty(page);
- disk->fops->swap_slot_free_notify(sis->bdev,
- offset);
- }
-}
-
static void end_swap_bio_read(struct bio *bio)
{
struct page *page = bio_first_page_all(bio);
@@ -114,7 +66,6 @@ static void end_swap_bio_read(struct bio
}
SetPageUptodate(page);
- swap_slot_free_notify(page);
out:
unlock_page(page);
WRITE_ONCE(bio->bi_private, NULL);
@@ -394,11 +345,6 @@ int swap_readpage(struct page *page, boo
if (sis->flags & SWP_SYNCHRONOUS_IO) {
ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
if (!ret) {
- if (trylock_page(page)) {
- swap_slot_free_notify(page);
- unlock_page(page);
- }
-
count_vm_event(PSWPIN);
goto out;
}
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 09/14] mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
2022-04-15 2:12 incoming Andrew Morton
` (7 preceding siblings ...)
2022-04-15 2:13 ` [patch 08/14] mm: fix unexpected zeroed page mapping with zram swap Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 10/14] hugetlb: do not demote poisoned hugetlb pages Andrew Morton
` (4 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: vbabka, nigupta, lkp, quic_charante, akpm, patches, linux-mm,
mm-commits, torvalds, akpm
From: Charan Teja Kalla <quic_charante@quicinc.com>
Subject: mm: compaction: fix compiler warning when CONFIG_COMPACTION=n
The below warning is reported when CONFIG_COMPACTION=n:
mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' defined but not used [-Wunused-const-variable=]
   56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
      |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under the
CONFIG_COMPACTION #ifdef.  Also, since this is just a 'static const
unsigned int', use a #define for it.
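The warning is easy to reproduce outside the kernel.  A minimal sketch
(the file and identifier names here are made up for illustration):
        /* unused-const.c: gcc -Wunused-const-variable=1 -c unused-const.c */
        static const unsigned int CHECK_INTERVAL_MSEC = 500;

        #if 0   /* stands in for CONFIG_COMPACTION=n */
        unsigned int next_check(void)
        {
                return CHECK_INTERVAL_MSEC;     /* the only user is compiled out */
        }
        #endif

        /*
         * gcc warns: 'CHECK_INTERVAL_MSEC' defined but not used.  A
         * #define leaves no object behind, so it cannot trigger this.
         */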
Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/compaction.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/mm/compaction.c~mm-compaction-fix-compiler-warning-when-config_compaction=n
+++ a/mm/compaction.c
@@ -26,6 +26,11 @@
#include "internal.h"
#ifdef CONFIG_COMPACTION
+/*
+ * Fragmentation score check interval for proactive compaction purposes.
+ */
+#define HPAGE_FRAG_CHECK_INTERVAL_MSEC (500)
+
static inline void count_compact_event(enum vm_event_item item)
{
count_vm_event(item);
@@ -51,11 +56,6 @@ static inline void count_compact_events(
#define pageblock_end_pfn(pfn) block_end_pfn(pfn, pageblock_order)
/*
- * Fragmentation score check interval for proactive compaction purposes.
- */
-static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
-
-/*
* Page order with-respect-to which proactive compaction
* calculates external fragmentation, which is used as
* the "fragmentation score" of a node/zone.
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 10/14] hugetlb: do not demote poisoned hugetlb pages
2022-04-15 2:12 incoming Andrew Morton
` (8 preceding siblings ...)
2022-04-15 2:13 ` [patch 09/14] mm: compaction: fix compiler warning when CONFIG_COMPACTION=n Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 11/14] revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" Andrew Morton
` (3 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: stable, naoya.horiguchi, mike.kravetz, akpm, patches, linux-mm,
mm-commits, torvalds, akpm
From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: do not demote poisoned hugetlb pages
It is possible for poisoned hugetlb pages to reside on the free lists.
The huge page allocation routines which dequeue entries from the free
lists make a point of avoiding poisoned pages. There is no such check and
avoidance in the demote code path.
If a hugetlb page is on a free list, poison will only be set in the
head page rather than in the page with the actual error.  If such a
page is demoted, the poison flag may follow the wrong page: a page
without the error could have poison set, and the page with the actual
error could end up without the flag.
Check for poison before attempting to demote a hugetlb page. Also, return
-EBUSY to the caller if only poisoned pages are on the free list.
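In caller terms, the new contract is roughly (a sketch, simplified from
the sysfs demote handler):
        /* hedged sketch: the demote loop stops retrying on error */
        err = demote_pool_huge_page(h, nodes_allowed);
        if (err)
                break;  /* e.g. -EBUSY: only poisoned pages on the free lists */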
Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/hugetlb.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
--- a/mm/hugetlb.c~hugetlb-do-not-demote-poisoned-hugetlb-pages
+++ a/mm/hugetlb.c
@@ -3475,7 +3475,6 @@ static int demote_pool_huge_page(struct
{
int nr_nodes, node;
struct page *page;
- int rc = 0;
lockdep_assert_held(&hugetlb_lock);
@@ -3486,15 +3485,19 @@ static int demote_pool_huge_page(struct
}
for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
- if (!list_empty(&h->hugepage_freelists[node])) {
- page = list_entry(h->hugepage_freelists[node].next,
- struct page, lru);
- rc = demote_free_huge_page(h, page);
- break;
+ list_for_each_entry(page, &h->hugepage_freelists[node], lru) {
+ if (PageHWPoison(page))
+ continue;
+
+ return demote_free_huge_page(h, page);
}
}
- return rc;
+ /*
+ * Only way to get here is if all pages on free lists are poisoned.
+ * Return -EBUSY so that caller will not retry.
+ */
+ return -EBUSY;
}
#define HSTATE_ATTR_RO(_name) \
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 11/14] revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
2022-04-15 2:12 incoming Andrew Morton
` (9 preceding siblings ...)
2022-04-15 2:13 ` [patch 10/14] hugetlb: do not demote poisoned hugetlb pages Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:13 ` [patch 12/14] revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" Andrew Morton
` (2 subsequent siblings)
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: viro, surenb, stable, sspatil, songliubraving, shuah, rppt,
rientjes, regressions, ndesaulniers, mike.kravetz, maskray,
kirill.shutemov, irogers, hughd, hjl.tools, ckennelly, adobriyan,
akpm, patches, linux-mm, mm-commits, torvalds, akpm
From: Andrew Morton <akpm@linux-foundation.org>
Subject: revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders"
925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
is an attempt to fix regressions due to 9630f0d60fec5f ("fs/binfmt_elf:
use PT_LOAD p_align values for static PIE").
But regressions continue to be reported:
https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
https://bugzilla.kernel.org/show_bug.cgi?id=215720
https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info
This patch reverts the fix, so the original can also be reverted.
Fixes: 925346c129da11 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/binfmt_elf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/binfmt_elf.c~revert-fs-binfmt_elf-fix-pt_load-p_align-values-for-loaders
+++ a/fs/binfmt_elf.c
@@ -1118,7 +1118,7 @@ out_free_interp:
* without MAP_FIXED nor MAP_FIXED_NOREPLACE).
*/
alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
- if (interpreter || alignment > ELF_MIN_ALIGN) {
+ if (alignment > ELF_MIN_ALIGN) {
load_bias = ELF_ET_DYN_BASE;
if (current->flags & PF_RANDOMIZE)
load_bias += arch_mmap_rnd();
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 12/14] revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
2022-04-15 2:12 incoming Andrew Morton
` (10 preceding siblings ...)
2022-04-15 2:13 ` [patch 11/14] revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" Andrew Morton
@ 2022-04-15 2:13 ` Andrew Morton
2022-04-15 2:14 ` [patch 13/14] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore Andrew Morton
2022-04-15 2:14 ` [patch 14/14] mm: kmemleak: take a full lowmem check in kmemleak_*_phys() Andrew Morton
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:13 UTC (permalink / raw)
To: viro, surenb, stable, sspatil, songliubraving, shuah, rppt,
rientjes, regressions, ndesaulniers, mike.kravetz, maskray,
kirill.shutemov, irogers, hughd, hjl.tools, ckennelly, adobriyan,
akpm, patches, linux-mm, mm-commits, torvalds, akpm
From: Andrew Morton <akpm@linux-foundation.org>
Subject: revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"
Despite Mike's attempted fix (925346c129da117122), regression reports
continue:
https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
https://bugzilla.kernel.org/show_bug.cgi?id=215720
https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info
So revert this patch.
Fixes: 9630f0d60fec ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/binfmt_elf.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/fs/binfmt_elf.c~revert-fs-binfmt_elf-use-pt_load-p_align-values-for-static-pie
+++ a/fs/binfmt_elf.c
@@ -1117,11 +1117,11 @@ out_free_interp:
* independently randomized mmap region (0 load_bias
* without MAP_FIXED nor MAP_FIXED_NOREPLACE).
*/
- alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
- if (alignment > ELF_MIN_ALIGN) {
+ if (interpreter) {
load_bias = ELF_ET_DYN_BASE;
if (current->flags & PF_RANDOMIZE)
load_bias += arch_mmap_rnd();
+ alignment = maximum_alignment(elf_phdata, elf_ex->e_phnum);
if (alignment)
load_bias &= ~(alignment - 1);
elf_flags |= MAP_FIXED_NOREPLACE;
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 13/14] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
2022-04-15 2:12 incoming Andrew Morton
` (11 preceding siblings ...)
2022-04-15 2:13 ` [patch 12/14] revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" Andrew Morton
@ 2022-04-15 2:14 ` Andrew Morton
2022-04-15 2:14 ` [patch 14/14] mm: kmemleak: take a full lowmem check in kmemleak_*_phys() Andrew Morton
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:14 UTC (permalink / raw)
To: urezki, hch, chris, bhe, osandov, akpm, patches, linux-mm,
mm-commits, torvalds, akpm
From: Omar Sandoval <osandov@fb.com>
Subject: mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore
Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge
the vmap areas instead of doing it lazily.
Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever). For example, consider the following scenario:
1. Thread reads from /proc/vmcore. This eventually calls
__copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
vmap_lazy_nr to lazy_max_pages() + 1.
2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
pages (one page plus the guard page) to the purge list and
vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
drain_vmap_work is scheduled.
3. Thread returns from the kernel and is scheduled out.
4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
frees the 2 pages on the purge list. vmap_lazy_nr is now
lazy_max_pages() + 1.
5. This is still over the threshold, so it tries to purge areas again,
but doesn't find anything.
6. Repeat 5.
If the system is running with only one CPU (which is typical for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs. If there is more than
one CPU or preemption is enabled, then the worker thread will spin forever
in the background. (Note that if there were already pages to be purged at
the time that set_iounmap_nonlazy() was called, this bug is avoided.)
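The worker's structure makes the spin easy to see.  Roughly (a
simplified sketch of drain_vmap_area_work(), not the verbatim source):
        static void drain_vmap_area_work(struct work_struct *work)
        {
                unsigned long nr_lazy;

                do {
                        mutex_lock(&vmap_purge_lock);
                        __purge_vmap_area_lazy(ULONG_MAX, 0);   /* may free nothing */
                        mutex_unlock(&vmap_purge_lock);
                        nr_lazy = atomic_long_read(&vmap_lazy_nr);
                } while (nr_lazy > lazy_max_pages());   /* stays true forever here */
        }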
This can be reproduced with anything that reads from /proc/vmcore multiple
times. E.g., vmcore-dmesg /proc/vmcore.
It turns out that improvements to vmap() over the years have obsoleted the
need for this "optimization". I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:
    | 5.17   | 5.18+fix | 5.18+removal
 4k | 40.86s | 40.09s   | 26.73s
 1M | 24.47s | 23.98s   | 21.84s
The removal was the fastest (by a wide margin with 4k reads). This patch
removes set_iounmap_nonlazy().
Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/x86/include/asm/io.h | 2 --
arch/x86/kernel/crash_dump_64.c | 1 -
mm/vmalloc.c | 11 -----------
3 files changed, 14 deletions(-)
--- a/arch/x86/include/asm/io.h~mm-vmalloc-fix-spinning-drain_vmap_work-after-reading-from-proc-vmcore
+++ a/arch/x86/include/asm/io.h
@@ -210,8 +210,6 @@ void __iomem *ioremap(resource_size_t of
extern void iounmap(volatile void __iomem *addr);
#define iounmap iounmap
-extern void set_iounmap_nonlazy(void);
-
#ifdef __KERNEL__
void memcpy_fromio(void *, const volatile void __iomem *, size_t);
--- a/arch/x86/kernel/crash_dump_64.c~mm-vmalloc-fix-spinning-drain_vmap_work-after-reading-from-proc-vmcore
+++ a/arch/x86/kernel/crash_dump_64.c
@@ -37,7 +37,6 @@ static ssize_t __copy_oldmem_page(unsign
} else
memcpy(buf, vaddr + offset, csize);
- set_iounmap_nonlazy();
iounmap((void __iomem *)vaddr);
return csize;
}
--- a/mm/vmalloc.c~mm-vmalloc-fix-spinning-drain_vmap_work-after-reading-from-proc-vmcore
+++ a/mm/vmalloc.c
@@ -1671,17 +1671,6 @@ static DEFINE_MUTEX(vmap_purge_lock);
/* for per-CPU blocks */
static void purge_fragmented_blocks_allcpus(void);
-#ifdef CONFIG_X86_64
-/*
- * called before a call to iounmap() if the caller wants vm_area_struct's
- * immediately freed.
- */
-void set_iounmap_nonlazy(void)
-{
- atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
-}
-#endif /* CONFIG_X86_64 */
-
/*
* Purges all lazily-freed vmap areas.
*/
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* [patch 14/14] mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
2022-04-15 2:12 incoming Andrew Morton
` (12 preceding siblings ...)
2022-04-15 2:14 ` [patch 13/14] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore Andrew Morton
@ 2022-04-15 2:14 ` Andrew Morton
13 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-15 2:14 UTC (permalink / raw)
To: stable, catalin.marinas, patrick.wang.shcn, akpm, patches,
linux-mm, mm-commits, torvalds, akpm
From: Patrick Wang <patrick.wang.shcn@gmail.com>
Subject: mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
The kmemleak_*_phys() APIs do not check the address against lowmem's
min boundary, while the caller may pass an address below lowmem, which
will trigger an oops:
# echo scan > /sys/kernel/debug/kmemleak
[ 54.888353] Unable to handle kernel paging request at virtual address ff5fffffffe00000
[ 54.888932] Oops [#1]
[ 54.889102] Modules linked in:
[ 54.889326] CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33
[ 54.889620] Hardware name: riscv-virtio,qemu (DT)
[ 54.889901] epc : scan_block+0x74/0x15c
[ 54.890215] ra : scan_block+0x72/0x15c
[ 54.890390] epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30
[ 54.890607] gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200
[ 54.890835] t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90
[ 54.891024] s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000
[ 54.891201] a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001
[ 54.891377] a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005
[ 54.891552] s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90
[ 54.891727] s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0
[ 54.891903] s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000
[ 54.892078] s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f
[ 54.892271] t5 : 0000000000000001 t6 : 0000000000000000
[ 54.892408] status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d
[ 54.892643] [<ffffffff801e5a1c>] scan_gray_list+0x12e/0x1a6
[ 54.892824] [<ffffffff801e5d3e>] kmemleak_scan+0x2aa/0x57e
[ 54.892961] [<ffffffff801e633c>] kmemleak_write+0x32a/0x40c
[ 54.893096] [<ffffffff803915ac>] full_proxy_write+0x56/0x82
[ 54.893235] [<ffffffff801ef456>] vfs_write+0xa6/0x2a6
[ 54.893362] [<ffffffff801ef880>] ksys_write+0x6c/0xe2
[ 54.893487] [<ffffffff801ef918>] sys_write+0x22/0x2a
[ 54.893609] [<ffffffff8000397c>] ret_from_syscall+0x0/0x2
[ 54.894183] ---[ end trace 0000000000000000 ]---
The callers may not quite know the actual address they pass (e.g. when
it comes from the devicetree).  So the kmemleak_*_phys() APIs should
guarantee that the address they finally use is in the lowmem range, by
checking the address against lowmem's min boundary.
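The check itself is small.  Written as a helper for clarity (a sketch;
the actual patch open-codes the comparison at each call site, and the
helper name here is made up):
        /* accept only physical addresses whose pfn lies within lowmem */
        static bool phys_addr_in_lowmem(phys_addr_t phys)
        {
                unsigned long pfn = PHYS_PFN(phys);

                return pfn >= min_low_pfn && pfn < max_low_pfn;
        }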
Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com
Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/kmemleak.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/kmemleak.c~mm-kmemleak-take-a-full-lowmem-check-in-kmemleak__phys
+++ a/mm/kmemleak.c
@@ -1132,7 +1132,7 @@ EXPORT_SYMBOL(kmemleak_no_scan);
void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, int min_count,
gfp_t gfp)
{
- if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+ if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
kmemleak_alloc(__va(phys), size, min_count, gfp);
}
EXPORT_SYMBOL(kmemleak_alloc_phys);
@@ -1146,7 +1146,7 @@ EXPORT_SYMBOL(kmemleak_alloc_phys);
*/
void __ref kmemleak_free_part_phys(phys_addr_t phys, size_t size)
{
- if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+ if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
kmemleak_free_part(__va(phys), size);
}
EXPORT_SYMBOL(kmemleak_free_part_phys);
@@ -1158,7 +1158,7 @@ EXPORT_SYMBOL(kmemleak_free_part_phys);
*/
void __ref kmemleak_not_leak_phys(phys_addr_t phys)
{
- if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+ if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
kmemleak_not_leak(__va(phys));
}
EXPORT_SYMBOL(kmemleak_not_leak_phys);
@@ -1170,7 +1170,7 @@ EXPORT_SYMBOL(kmemleak_not_leak_phys);
*/
void __ref kmemleak_ignore_phys(phys_addr_t phys)
{
- if (!IS_ENABLED(CONFIG_HIGHMEM) || PHYS_PFN(phys) < max_low_pfn)
+ if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn)
kmemleak_ignore(__va(phys));
}
EXPORT_SYMBOL(kmemleak_ignore_phys);
_
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-15 2:13 ` [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE Andrew Morton
@ 2022-04-15 22:10 ` Linus Torvalds
2022-04-15 22:21 ` Matthew Wilcox
` (2 more replies)
0 siblings, 3 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-04-15 22:10 UTC (permalink / raw)
To: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
Borislav Petkov
Cc: patrice.chotard, Mikulas Patocka, markhemm, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> instead of copy_page_to_iter(). We would use iov_iter_zero() throughout,
> but the x86 clear_user() is not nearly so well optimized as copy to user
> (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
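For reference, the dispatch being described is roughly the following (a
sketch simplified from the patch, not its verbatim text):
        if (page) {
                ret = copy_page_to_iter(page, offset, nr, to);
        } else if (iter_is_iovec(to)) {
                /* user iovec: x86 copy-to-user is much better tuned
                 * than clear_user(), so copying the zero page wins */
                ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
        } else {
                /* pipes, splice and friends: zero the iterator directly */
                ret = iov_iter_zero(nr, to);
        }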
Ugh.
I've applied this patch, but honestly, the proper course of action
should just be to improve on clear_user().
If it really is important enough that we should care about that
performance, then we just should fix clear_user().
It's a very odd special thing right now (at least on x86-64) using
some strange handcrafted inline asm code.
I assume that 'rep stosb' is the fastest way to clear things on modern
CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
"rep stos" except for small areas, and probably that "store zeros by
hand" for older CPUs).
Adding PeterZ and Borislav (who seem to be the last ones to have
worked on the copy and clear_page stuff respectively) and the x86
maintainers in case somebody gets the urge to just fix this.
Because memory clearing should be faster than copying, and the thing
that makes copying fast is that FSRM and ERMS logic (the whole
"manually unrolled copy" is hopefully mostly a thing of the past and
we can consider it legacy)
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-15 22:10 ` Linus Torvalds
@ 2022-04-15 22:21 ` Matthew Wilcox
2022-04-15 22:41 ` Hugh Dickins
2022-04-16 6:36 ` Borislav Petkov
2 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2022-04-15 22:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
Borislav Petkov, patrice.chotard, Mikulas Patocka, markhemm,
Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Chuck Lever,
Hugh Dickins, patches, Linux-MM, mm-commits
On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> > iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> > instead of copy_page_to_iter(). We would use iov_iter_zero() throughout,
> > but the x86 clear_user() is not nearly so well optimized as copy to user
> > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
>
> Ugh.
>
> I've applied this patch, but honestly, the proper course of action
> should just be to improve on clear_user().
>
> If it really is important enough that we should care about that
> performance, then we just should fix clear_user().
>
> It's a very odd special thing right now (at least on x86-64) using
> some strange handcrafted inline asm code.
>
> I assume that 'rep stosb' is the fastest way to clear things on modern
> CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
> "rep stos" except for small areas, and probably that "store zeros by
> hand" for older CPUs).
>
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.
Perhaps the x86 maintainers would like to start from
https://lore.kernel.org/lkml/20210523180423.108087-1-sneves@dei.uc.pt/
instead of pushing that work off on the submitter.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-15 22:10 ` Linus Torvalds
2022-04-15 22:21 ` Matthew Wilcox
@ 2022-04-15 22:41 ` Hugh Dickins
2022-04-16 6:36 ` Borislav Petkov
2 siblings, 0 replies; 82+ messages in thread
From: Hugh Dickins @ 2022-04-15 22:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
Borislav Petkov, patrice.chotard, Mikulas Patocka, markhemm,
Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Chuck Lever,
Hugh Dickins, patches, Linux-MM, mm-commits
On Fri, 15 Apr 2022, Linus Torvalds wrote:
> On Thu, Apr 14, 2022 at 7:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
> > iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
> > instead of copy_page_to_iter(). We would use iov_iter_zero() throughout,
> > but the x86 clear_user() is not nearly so well optimized as copy to user
> > (dd of 1T sparse tmpfs file takes 57 seconds rather than 44 seconds).
>
> Ugh.
>
> I've applied this patch,
Phew, thanks.
> but honestly, the proper course of action
> should just be to improve on clear_user().
You'll find no disagreement here: we've all been saying the same.
It's just that that work is yet to be done (or yet to be accepted).
>
> If it really is important enough that we should care about that
> performance, then we just should fix clear_user().
>
> It's a very odd special thing right now (at least on x86-64) using
> some strange handcrafted inline asm code.
>
> I assume that 'rep stosb' is the fastest way to clear things on modern
> CPU's that have FSRM, and then we have the usual fallbacks (ie ERMS ->
> "rep stos" except for small areas, and probably that "store zeros by
> hand" for older CPUs).
>
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.
Yes, it was exactly Borislav and PeterZ whom I first approached too,
link 3 in the commit message of the patch that this one is fixing,
https://lore.kernel.org/lkml/2f5ca5e4-e250-a41c-11fb-a7f4ebc7e1c9@google.com/
Borislav wants a thorough good patch, and I don't blame him for that!
Hugh
>
> Because memory clearing should be faster than copying, and the thing
> that makes copying fast is that FSRM and ERMS logic (the whole
> "manually unrolled copy" is hopefully mostly a thing of the past and
> we can consider it legacy)
>
> Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-15 22:10 ` Linus Torvalds
2022-04-15 22:21 ` Matthew Wilcox
2022-04-15 22:41 ` Hugh Dickins
@ 2022-04-16 6:36 ` Borislav Petkov
2022-04-16 14:07 ` Mark Hemment
2 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-16 6:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, the arch/x86 maintainers, Peter Zijlstra,
patrice.chotard, Mikulas Patocka, markhemm, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> Adding PeterZ and Borislav (who seem to be the last ones to have
> worked on the copy and clear_page stuff respectively) and the x86
> maintainers in case somebody gets the urge to just fix this.
I guess if enough people ask and keep asking, some people at least try
to move...
> Because memory clearing should be faster than copying, and the thing
> that makes copying fast is that FSRM and ERMS logic (the whole
> "manually unrolled copy" is hopefully mostly a thing of the past and
> we can consider it legacy)
So I did give it a look and it seems to me, if we want to do the
alternatives thing here, it will have to look something like
arch/x86/lib/copy_user_64.S.
I.e., the current __clear_user() will have to become the "handle_tail"
thing there which deals with uncopied rest-bytes at the end and the new
fsrm/erms/rep_good variants will then be alternative_call_2 or _3.
The fsrm thing will have only the handle_tail thing at the end when size
!= 0.
The others - erms and rep_good - will have to check for sizes smaller
than, say a cacheline, and for those call the handle_tail thing directly
instead of going into a REP loop. The current __clear_user() is still a
lot better than that copy_user_generic_unrolled() abomination. And it's
not like old CPUs would get any perf penalty - they'll simply use the
same code.
And then you need the labels for _ASM_EXTABLE_UA() exception handling.
Anyway, something along those lines.
And then we'll need to benchmark this on a bunch of current machines to
make sure there's no funny surprises, perf-wise.
I can get cracking on this but I would advise people not to hold their
breaths. :)
Unless someone has a better idea or is itching to get hands dirty
her-/himself.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-16 6:36 ` Borislav Petkov
@ 2022-04-16 14:07 ` Mark Hemment
2022-04-16 17:28 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Mark Hemment @ 2022-04-16 14:07 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, markhemm,
Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Chuck Lever,
Hugh Dickins, patches, Linux-MM, mm-commits
On Sat, 16 Apr 2022, Borislav Petkov wrote:
> On Fri, Apr 15, 2022 at 03:10:51PM -0700, Linus Torvalds wrote:
> > Adding PeterZ and Borislav (who seem to be the last ones to have
> > worked on the copy and clear_page stuff respectively) and the x86
> > maintainers in case somebody gets the urge to just fix this.
>
> I guess if enough people ask and keep asking, some people at least try
> to move...
>
> > Because memory clearing should be faster than copying, and the thing
> > that makes copying fast is that FSRM and ERMS logic (the whole
> > "manually unrolled copy" is hopefully mostly a thing of the past and
> > we can consider it legacy)
>
> So I did give it a look and it seems to me, if we want to do the
> alternatives thing here, it will have to look something like
> arch/x86/lib/copy_user_64.S.
>
> I.e., the current __clear_user() will have to become the "handle_tail"
> thing there which deals with uncopied rest-bytes at the end and the new
> fsrm/erms/rep_good variants will then be alternative_call_2 or _3.
>
> The fsrm thing will have only the handle_tail thing at the end when size
> != 0.
>
> The others - erms and rep_good - will have to check for sizes smaller
> than, say a cacheline, and for those call the handle_tail thing directly
> instead of going into a REP loop. The current __clear_user() is still a
> lot better than that copy_user_generic_unrolled() abomination. And it's
> not like old CPUs would get any perf penalty - they'll simply use the
> same code.
>
> And then you need the labels for _ASM_EXTABLE_UA() exception handling.
>
> Anyway, something along those lines.
>
> And then we'll need to benchmark this on a bunch of current machines to
> make sure there's no funny surprises, perf-wise.
>
> I can get cracking on this but I would advise people not to hold their
> breaths. :)
>
> Unless someone has a better idea or is itching to get hands dirty
> her-/himself.
I've done a skeleton implementation of alternative __clear_user() based on
CPU features.
It has three versions of __clear_user();
o __clear_user_original() - similar to the 'standard' __clear_user()
o __clear_user_rep_good() - using 'rep stos{qb}' when CPU has 'rep_good'
o __clear_user_erms() - using 'rep stosb' when CPU has 'erms'
Not claiming the implementation is ideal, but might be a useful starting
point for someone.
Patch is against 5.18.0-rc2.
Only basic sanity testing done.
Simple performance testing done for large sizes, on a system (Intel E8400)
which has rep_good but not erms;
# dd if=/dev/zero of=/dev/null bs=16384 count=10000
o *_original() - ~14.2GB/s. Same as the 'standard' __clear_user().
o *_rep_good() - same throughput as *_original().
o *_erms() - ~12.2GB/s (expected on a system without erms).
No performance testing done for zeroing small sizes.
Cheers,
Mark
Signed-off-by: Mark Hemment <markhemm@googlemail.com>
---
arch/x86/include/asm/asm.h | 39 +++++++++++++++
arch/x86/include/asm/uaccess_64.h | 36 ++++++++++++++
arch/x86/lib/clear_page_64.S | 100 ++++++++++++++++++++++++++++++++++++++
arch/x86/lib/usercopy_64.c | 32 ------------
4 files changed, 175 insertions(+), 32 deletions(-)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fbcfec4dc4cc..373ed6be7a8d 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -132,6 +132,35 @@
/* Exception table entry */
#ifdef __ASSEMBLY__
+# define UNDEFINE_EXTABLE_TYPE_REG \
+ .purgem extable_type_reg ;
+
+# define DEFINE_EXTABLE_TYPE_REG \
+ .macro extable_type_reg type:req reg:req ; \
+ .set .Lfound, 0 ; \
+ .set .Lregnr, 0 ; \
+ .irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13, \
+ r14,r15 ; \
+ .ifc \reg, %\rs ; \
+ .set .Lfound, .Lfound+1 ; \
+ .long \type + (.Lregnr << 8) ; \
+ .endif ; \
+ .set .Lregnr, .Lregnr+1 ; \
+ .endr ; \
+ .set .Lregnr, 0 ; \
+ .irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d, \
+ r13d,r14d,r15d ; \
+ .ifc \reg, %\rs ; \
+ .set .Lfound, .Lfound+1 ; \
+ .long \type + (.Lregnr << 8) ; \
+ .endif ; \
+ .set .Lregnr, .Lregnr+1 ; \
+ .endr ; \
+ .if (.Lfound != 1) ; \
+ .error "extable_type_reg: bad register argument" ; \
+ .endif ; \
+ .endm ;
+
# define _ASM_EXTABLE_TYPE(from, to, type) \
.pushsection "__ex_table","a" ; \
.balign 4 ; \
@@ -140,6 +169,16 @@
.long type ; \
.popsection
+# define _ASM_EXTABLE_TYPE_REG(from, to, type1, reg1) \
+ .pushsection "__ex_table","a" ; \
+ .balign 4 ; \
+ .long (from) - . ; \
+ .long (to) - . ; \
+ DEFINE_EXTABLE_TYPE_REG \
+ extable_type_reg reg=reg1, type=type1 ; \
+ UNDEFINE_EXTABLE_TYPE_REG \
+ .popsection
+
# ifdef CONFIG_KPROBES
# define _ASM_NOKPROBE(entry) \
.pushsection "_kprobe_blacklist","aw" ; \
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..6a4995e4cfae 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,40 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
kasan_check_write(dst, size);
return __copy_user_flushcache(dst, src, size);
}
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long
+___clear_user(void __user *addr, unsigned long len)
+{
+ unsigned long ret;
+
+ /*
+ * No memory constraint because it doesn't change any memory gcc
+ * knows about.
+ */
+
+ might_fault();
+ alternative_call_2(
+ clear_user_original,
+ clear_user_rep_good,
+ X86_FEATURE_REP_GOOD,
+ clear_user_erms,
+ X86_FEATURE_ERMS,
+ ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
+ "1" (addr), "2" (len)
+ : "%rdx", "cc");
+ return ret;
+}
+
+#define __clear_user(d, n) ___clear_user(d, n)
#endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..abe1f44ea422 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#include <linux/linkage.h>
+#include <asm/asm.h>
+#include <asm/smap.h>
#include <asm/export.h>
/*
@@ -50,3 +52,101 @@ SYM_FUNC_START(clear_page_erms)
RET
SYM_FUNC_END(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_original)
+ ASM_STAC
+ movq %rcx,%rax
+ shrq $3,%rcx
+ andq $7,%rax
+ testq %rcx,%rcx
+ jz 1f
+
+ .p2align 4
+0: movq $0,(%rdi)
+ leaq 8(%rdi),%rdi
+ decq %rcx
+ jnz 0b
+
+1: movq %rax,%rcx
+ testq %rcx,%rcx
+ jz 3f
+
+2: movb $0,(%rdi)
+ incq %rdi
+ decl %ecx
+ jnz 2b
+
+3: ASM_CLAC
+ movq %rcx,%rax
+ RET
+
+ _ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
+ _ASM_EXTABLE_UA(2b, 3b)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_rep_good)
+ ASM_STAC
+ movq %rcx,%rdx
+ xorq %rax,%rax
+ shrq $3,%rcx
+ andq $7,%rdx
+
+0: rep stosq
+ movq %rdx,%rcx
+
+1: rep stosb
+
+3: ASM_CLAC
+ movq %rcx,%rax
+ RET
+
+ _ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rdx)
+ _ASM_EXTABLE_UA(1b, 3b)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rax uncopied bytes or 0 if successful.
+ */
+
+SYM_FUNC_START(clear_user_erms)
+ xorq %rax,%rax
+ ASM_STAC
+
+0: rep stosb
+
+3: ASM_CLAC
+ movq %rcx,%rax
+ RET
+
+ _ASM_EXTABLE_UA(0b, 3b)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0402a749f3a0..3a2872c9c4a9 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,38 +14,6 @@
* Zero Userspace
*/
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
- long __d0;
- might_fault();
- /* no memory constraint because it doesn't change any memory gcc knows
- about */
- stac();
- asm volatile(
- " testq %[size8],%[size8]\n"
- " jz 4f\n"
- " .align 16\n"
- "0: movq $0,(%[dst])\n"
- " addq $8,%[dst]\n"
- " decl %%ecx ; jnz 0b\n"
- "4: movq %[size1],%%rcx\n"
- " testl %%ecx,%%ecx\n"
- " jz 2f\n"
- "1: movb $0,(%[dst])\n"
- " incq %[dst]\n"
- " decl %%ecx ; jnz 1b\n"
- "2:\n"
-
- _ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
- _ASM_EXTABLE_UA(1b, 2b)
-
- : [size8] "=&c"(size), [dst] "=&D" (__d0)
- : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
- clac();
- return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
unsigned long clear_user(void __user *to, unsigned long n)
{
if (access_ok(to, n))
^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-16 14:07 ` Mark Hemment
@ 2022-04-16 17:28 ` Borislav Petkov
2022-04-16 17:42 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-16 17:28 UTC (permalink / raw)
To: Mark Hemment
Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sat, Apr 16, 2022 at 03:07:47PM +0100, Mark Hemment wrote:
> I've done a skeleton implementation of alternative __clear_user() based on
> CPU features.
Cool!
Just a couple of quick notes - more indepth look next week.
> It has three versions of __clear_user();
> o __clear_user_original() - similar to the 'standard' __clear_user()
> o __clear_user_rep_good() - using 'rep stos{qb}' when CPU has 'rep_good'
> o __clear_user_erms() - using 'rep stosb' when CPU has 'erms'
you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
should simply do rep; stosb regardless of the size. For that you can
define an alternative_call_3 similar to how the _2 variant is defined.
> Not claiming the implementation is ideal, but might be a useful starting
> point for someone.
> Patch is against 5.18.0-rc2.
> Only basic sanity testing done.
>
> Simple performance testing done for large sizes, on a system (Intel E8400)
> which has rep_good but not erms;
> # dd if=/dev/zero of=/dev/null bs=16384 count=10000
> o *_original() - ~14.2GB/s. Same as the 'standard' __clear_user().
> o *_rep_good() - same throughput as *_original().
> o *_erms() - ~12.2GB/s (expected on a system without erms).
Right.
I have a couple of boxes too - I can run the benchmarks on them too so
we should have enough perf data points eventually.
> diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
> index fbcfec4dc4cc..373ed6be7a8d 100644
> --- a/arch/x86/include/asm/asm.h
> +++ b/arch/x86/include/asm/asm.h
> @@ -132,6 +132,35 @@
> /* Exception table entry */
> #ifdef __ASSEMBLY__
>
> +# define UNDEFINE_EXTABLE_TYPE_REG \
> + .purgem extable_type_reg ;
> +
> +# define DEFINE_EXTABLE_TYPE_REG \
> + .macro extable_type_reg type:req reg:req ; \
I love that macro PeterZ!</sarcasm>
> + .set .Lfound, 0 ; \
> + .set .Lregnr, 0 ; \
> + .irp rs,rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13, \
> + r14,r15 ; \
> + .ifc \reg, %\rs ; \
> + .set .Lfound, .Lfound+1 ; \
> + .long \type + (.Lregnr << 8) ; \
> + .endif ; \
> + .set .Lregnr, .Lregnr+1 ; \
> + .endr ; \
> + .set .Lregnr, 0 ; \
> + .irp rs,eax,ecx,edx,ebx,esp,ebp,esi,edi,r8d,r9d,r10d,r11d,r12d, \
> + r13d,r14d,r15d ; \
> + .ifc \reg, %\rs ; \
> + .set .Lfound, .Lfound+1 ; \
> + .long \type + (.Lregnr << 8) ; \
> + .endif ; \
> + .set .Lregnr, .Lregnr+1 ; \
> + .endr ; \
> + .if (.Lfound != 1) ; \
> + .error "extable_type_reg: bad register argument" ; \
> + .endif ; \
> + .endm ;
> +
> # define _ASM_EXTABLE_TYPE(from, to, type) \
> .pushsection "__ex_table","a" ; \
> .balign 4 ; \
> @@ -140,6 +169,16 @@
> .long type ; \
> .popsection
>
> +# define _ASM_EXTABLE_TYPE_REG(from, to, type1, reg1) \
> + .pushsection "__ex_table","a" ; \
> + .balign 4 ; \
> + .long (from) - . ; \
> + .long (to) - . ; \
> + DEFINE_EXTABLE_TYPE_REG \
> + extable_type_reg reg=reg1, type=type1 ; \
> + UNDEFINE_EXTABLE_TYPE_REG \
> + .popsection
> +
> # ifdef CONFIG_KPROBES
> # define _ASM_NOKPROBE(entry) \
> .pushsection "_kprobe_blacklist","aw" ; \
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index 45697e04d771..6a4995e4cfae 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -79,4 +79,40 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
> kasan_check_write(dst, size);
> return __copy_user_flushcache(dst, src, size);
> }
> +
> +/*
> + * Zero Userspace.
> + */
> +
> +__must_check unsigned long
> +clear_user_original(void __user *addr, unsigned long len);
> +__must_check unsigned long
> +clear_user_rep_good(void __user *addr, unsigned long len);
> +__must_check unsigned long
> +clear_user_erms(void __user *addr, unsigned long len);
> +
> +static __always_inline __must_check unsigned long
> +___clear_user(void __user *addr, unsigned long len)
> +{
> + unsigned long ret;
> +
> + /*
> + * No memory constraint because it doesn't change any memory gcc
> + * knows about.
> + */
> +
> + might_fault();
I think you could do the stac(); clac() sandwich around the
alternative_call_2 here and remove the respective calls from the asm.
> + alternative_call_2(
> + clear_user_original,
> + clear_user_rep_good,
> + X86_FEATURE_REP_GOOD,
> + clear_user_erms,
> + X86_FEATURE_ERMS,
> + ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
You can do here:
: "+&c" (size), "+&D" (addr)
:: "eax");
because size and addr need to be earlyclobbers:
e0a96129db57 ("x86: use early clobbers in usercopy*.c")
and then simply return size;
The asm modifies it anyway - no need for ret.
In any case, thanks for doing that!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-16 17:28 ` Borislav Petkov
@ 2022-04-16 17:42 ` Linus Torvalds
2022-04-16 21:15 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-16 17:42 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sat, Apr 16, 2022 at 10:28 AM Borislav Petkov <bp@alien8.de> wrote:
>
> you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
> should simply do rep; stosb regardless of the size. For that you can
> define an alternative_call_3 similar to how the _2 variant is defined.
Honestly, my personal preference would be that with FSRM, we'd have an
alternative that looks something like
asm volatile(
"1:"
ALTERNATIVE("call __stosb_user", "rep movsb", X86_FEATURE_FSRM)
"2:"
_ASM_EXTABLE_UA(1b, 2b)
:"=c" (count), "=D" (dest),ASM_CALL_CONSTRAINT
:"0" (count), "1" (dest), "a" (0)
:"memory");
iow, the 'rep stosb' case would be inline.
Note that the above would have a few things to look out for:
- special 'stosb' calling convention:
%rax/%rcx/%rdi as inputs
%rcx as "bytes not copied" return value
%rdi can be clobbered
so the actual functions would look a bit odd and would need to
save/restore some registers, but they'd basically just emulate "rep
stosb".
- since the whole point is that the "rep stosb" is inlined, it also
means that the "call __stosb_user" is done within the STAC/CLAC
region, so objdump would have to be taught that's ok
but wouldn't it be lovely if we could start moving towards a model
where we can just inline 'memset' and 'memcpy' like this?
NOTE! The above asm has not been tested. I wrote it in this email. I'm
sure I messed something up.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-16 17:42 ` Linus Torvalds
@ 2022-04-16 21:15 ` Borislav Petkov
2022-04-17 19:41 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-16 21:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sat, Apr 16, 2022 at 10:42:22AM -0700, Linus Torvalds wrote:
> On Sat, Apr 16, 2022 at 10:28 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > you also need a _fsrm() one which checks X86_FEATURE_FSRM. That one
> > should simply do rep; stosb regardless of the size. For that you can
> > define an alternative_call_3 similar to how the _2 variant is defined.
>
> Honestly, my personal preference would be that with FSRM, we'd have an
> alternative that looks something like
>
> asm volatile(
> "1:"
> ALTERNATIVE("call __stosb_user", "rep movsb", X86_FEATURE_FSRM)
> "2:"
> _ASM_EXTABLE_UA(1b, 2b)
> :"=c" (count), "=D" (dest),ASM_CALL_CONSTRAINT
> :"0" (count), "1" (dest), "a" (0)
> :"memory");
>
> iow, the 'rep stosb' case would be inline.
I knew you were gonna say that - we have talked about this in the past.
And I'll do you one better -- we have the patch-if-bit-not-set thing now
too, so I think it should work if we did:
alternative_call_3(__clear_user_fsrm,
__clear_user_erms, ALT_NOT(X86_FEATURE_FSRM),
__clear_user_string, ALT_NOT(X86_FEATURE_ERMS),
__clear_user_orig, ALT_NOT(X86_FEATURE_REP_GOOD),
: "+&c" (size), "+&D" (addr)
:: "eax");
and yeah, you wanna get rid of the CALL even and I guess that could
be made to work - I just need to play with it a bit to hammer out the
details.
I.e., it would be most optimal if it ended up being
ALTERNATIVE_3("rep stosb",
"call ... ", ALT_NOT(X86_FEATURE_FSRM),
...
> Note that the above would have a few things to look out for:
>
> - special 'stosb' calling convention:
>
> %rax/%rcx/%rdi as inputs
> %rcx as "bytes not copied" return value
> %rdi can be clobbered
>
> so the actual functions would look a bit odd and would need to
> save/restore some registers, but they'd basically just emulate "rep
> stosb".
Right.
> - since the whole point is that the "rep stosb" is inlined, it also
> means that the "call __stosb_user" is done within the STAC/CLAC
> region, so objdump would have to be taught that's ok
>
> but wouldn't it be lovely if we could start moving towards a model
> where we can just inline 'memset' and 'memcpy' like this?
Yeah, inlined insns without even a CALL insn would be the most optimal
thing to do.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-16 21:15 ` Borislav Petkov
@ 2022-04-17 19:41 ` Borislav Petkov
2022-04-17 20:56 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-17 19:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sat, Apr 16, 2022 at 11:15:16PM +0200, Borislav Petkov wrote:
> > but wouldn't it be lovely if we could start moving towards a model
> > where we can just inline 'memset' and 'memcpy' like this?
>
> Yeah, inlined insns without even a CALL insn would be the most optimal
> thing to do.
Here's a diff on top of Mark's patch; it inlines clear_user()
everywhere and it seems to patch properly.
vmlinux has:
ffffffff812ba53e: 0f 87 f8 fd ff ff ja ffffffff812ba33c <load_elf_binary+0x60c>
ffffffff812ba544: 90 nop
ffffffff812ba545: 90 nop
ffffffff812ba546: 90 nop
ffffffff812ba547: 4c 89 e7 mov %r12,%rdi
ffffffff812ba54a: f3 aa rep stos %al,%es:(%rdi)
ffffffff812ba54c: 90 nop
ffffffff812ba54d: 90 nop
ffffffff812ba54e: 90 nop
ffffffff812ba54f: 90 nop
ffffffff812ba550: 90 nop
ffffffff812ba551: 90 nop
which is the rep; stosb, and when I boot my guest, it has been patched to:
0xffffffff812ba544 <+2068>: 0f 01 cb stac
0xffffffff812ba547 <+2071>: 4c 89 e7 mov %r12,%rdi
0xffffffff812ba54a <+2074>: e8 71 9a 15 00 call 0xffffffff81413fc0 <clear_user_rep_good>
0xffffffff812ba54f <+2079>: 0f 01 ca clac
(stac and clac are also alternative-patched). And there's the call to
rep_good because the guest doesn't advertise FSRM nor ERMS.
Anyway, more playing with this later to make sure it really does what it
should.
---
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index f78e2b3501a1..335d571e8a79 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -405,9 +405,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
extern __must_check long strnlen_user(const char __user *str, long n);
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
#ifdef CONFIG_ARCH_HAS_COPY_MC
unsigned long __must_check
copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -429,6 +426,8 @@ extern struct movsl_mask {
#define ARCH_HAS_NOCACHE_UACCESS 1
#ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
# include <asm/uaccess_32.h>
#else
# include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 6a4995e4cfae..dc0f2bfc80a4 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -91,28 +91,34 @@ clear_user_rep_good(void __user *addr, unsigned long len);
__must_check unsigned long
clear_user_erms(void __user *addr, unsigned long len);
-static __always_inline __must_check unsigned long
-___clear_user(void __user *addr, unsigned long len)
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
{
- unsigned long ret;
-
/*
* No memory constraint because it doesn't change any memory gcc
* knows about.
*/
might_fault();
- alternative_call_2(
- clear_user_original,
- clear_user_rep_good,
- X86_FEATURE_REP_GOOD,
- clear_user_erms,
- X86_FEATURE_ERMS,
- ASM_OUTPUT2("=a" (ret), "=D" (addr), "=c" (len)),
- "1" (addr), "2" (len)
- : "%rdx", "cc");
- return ret;
+ stac();
+ asm_inline volatile(
+ "1:"
+ ALTERNATIVE_3("rep stosb",
+ "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
+ "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+ "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+ "2:"
+ _ASM_EXTABLE_UA(1b, 2b)
+ : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+ : "a" (0));
+
+ clac();
+ return size;
}
-#define __clear_user(d, n) ___clear_user(d, n)
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+ if (access_ok(to, n))
+ return __clear_user(to, n);
+ return n;
+}
#endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index abe1f44ea422..b80f710a7f30 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -62,9 +62,7 @@ EXPORT_SYMBOL_GPL(clear_page_erms)
* Output:
* rax uncopied bytes or 0 if successful.
*/
-
SYM_FUNC_START(clear_user_original)
- ASM_STAC
movq %rcx,%rax
shrq $3,%rcx
andq $7,%rax
@@ -86,7 +84,7 @@ SYM_FUNC_START(clear_user_original)
decl %ecx
jnz 2b
-3: ASM_CLAC
+3:
movq %rcx,%rax
RET
@@ -105,11 +103,8 @@ EXPORT_SYMBOL(clear_user_original)
* Output:
* rax uncopied bytes or 0 if successful.
*/
-
SYM_FUNC_START(clear_user_rep_good)
- ASM_STAC
movq %rcx,%rdx
- xorq %rax,%rax
shrq $3,%rcx
andq $7,%rdx
@@ -118,7 +113,7 @@ SYM_FUNC_START(clear_user_rep_good)
1: rep stosb
-3: ASM_CLAC
+3:
movq %rcx,%rax
RET
@@ -135,15 +130,13 @@ EXPORT_SYMBOL(clear_user_rep_good)
*
* Output:
* rax uncopied bytes or 0 if successful.
+ *
+ * XXX: check for small sizes and call the original version.
+ * Benchmark it first though.
*/
-
SYM_FUNC_START(clear_user_erms)
- xorq %rax,%rax
- ASM_STAC
-
0: rep stosb
-
-3: ASM_CLAC
+3:
movq %rcx,%rax
RET
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 3a2872c9c4a9..8bf04d95dc04 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,14 +14,6 @@
* Zero Userspace
*/
-unsigned long clear_user(void __user *to, unsigned long n)
-{
- if (access_ok(to, n))
- return __clear_user(to, n);
- return n;
-}
-EXPORT_SYMBOL(clear_user);
-
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
/**
* clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index bd0c2c828940..ea48ee11521f 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1019,6 +1019,9 @@ static const char *uaccess_safe_builtin[] = {
"copy_mc_fragile_handle_tail",
"copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+ "clear_user_erms",
+ "clear_user_rep_good",
+ "clear_user_original",
NULL
};
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-17 19:41 ` Borislav Petkov
@ 2022-04-17 20:56 ` Linus Torvalds
2022-04-18 10:15 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-17 20:56 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sun, Apr 17, 2022 at 12:41 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Anyway, more playing with this later to make sure it really does what it
> should.
I think the special calling conventions have tripped you up:
> SYM_FUNC_START(clear_user_original)
> - ASM_STAC
> movq %rcx,%rax
> shrq $3,%rcx
> andq $7,%rax
> @@ -86,7 +84,7 @@ SYM_FUNC_START(clear_user_original)
> decl %ecx
> jnz 2b
>
> -3: ASM_CLAC
> +3:
> movq %rcx,%rax
> RET
That 'movq %rcx,%rax' can't be right. The caller expects it to be zero
on input and stay zero on output.
But I think "xorl %eax,%eax" is good, since %eax was used as a
temporary in that function.
And the comment above the function should be fixed too.
> SYM_FUNC_START(clear_user_rep_good)
> - ASM_STAC
> movq %rcx,%rdx
> - xorq %rax,%rax
> shrq $3,%rcx
> andq $7,%rdx
>
> @@ -118,7 +113,7 @@ SYM_FUNC_START(clear_user_rep_good)
>
> 1: rep stosb
>
> -3: ASM_CLAC
> +3:
> movq %rcx,%rax
> RET
Same issue here.
Probably nothing notices, since %rcx *does* end up containing the
right value, and it's _likely_ that the compiler doesn't actually end
up re-using the zero value in %rax after the inline asm (that this bug
has corrupted), but if the compiler ever goes "Oh, I put zero in %rax,
so I'll just use that afterwards", this is going to blow up very
spectacularly and be very hard to debug.
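To illustrate the class of bug being described, here is a small hypothetical
user-space sketch (not the kernel code): the "a" (0) input tells the compiler
that %rax still holds zero after the asm, so a body that writes %rax - as
clear_user_original did via 'movq %rcx,%rax' - silently breaks that promise.
The fix is to make %rax an output or a clobber.

#include <stdio.h>

/*
 * Hypothetical demo: the constraints promise the compiler that %rax
 * still contains the 0 passed in via "a" (0UL) - there is no output or
 * clobber for it - but the body overwrites %rax, just like the
 * clear_user_original ending discussed above. A compiler that reuses
 * its "known zero" in %rax after this asm would read garbage.
 */
static unsigned long buggy_clear_len(unsigned long len)
{
	asm volatile("movq %%rcx,%%rax"		/* BUG: clobbers %rax */
		     : "+c" (len)
		     : "a" (0UL));
	return len;
}

int main(void)
{
	printf("%lu\n", buggy_clear_len(17));
	return 0;
}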
> @@ -135,15 +130,13 @@ EXPORT_SYMBOL(clear_user_rep_good)
> *
> * Output:
> * rax uncopied bytes or 0 if successful.
> + *
> + * XXX: check for small sizes and call the original version.
> + * Benchmark it first though.
> */
> -
> SYM_FUNC_START(clear_user_erms)
> - xorq %rax,%rax
> - ASM_STAC
> -
> 0: rep stosb
> -
> -3: ASM_CLAC
> +3:
> movq %rcx,%rax
> RET
.. and one more time.
Also, I do think that the rep_good and erms cases should probably
check for small copies, and use the clear_user_original thing for %rcx
< 64 or something like that.
It's what we do on the memcpy side - and more importantly, it's the
only difference between "erms" and FSRM. If the ERMS code doesn't do
anything different for small copies, why have it at all?
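A hedged C sketch of that dispatch idea - the helper names and the 64-byte
cutoff are illustrative placeholders, not kernel API, and as noted the
threshold would need benchmarking first:

/*
 * Sketch of the small-size cutoff being suggested: short clears take a
 * simple byte path, long ones use "rep stosb".
 */
static unsigned long clear_small(unsigned char *p, unsigned long n)
{
	while (n--)
		*p++ = 0;
	return 0;
}

static unsigned long clear_rep_stosb(unsigned char *p, unsigned long n)
{
	asm volatile("rep stosb"
		     : "+D" (p), "+c" (n)
		     : "a" (0)
		     : "memory");
	return n;		/* 0 unless it faulted (not handled here) */
}

static unsigned long clear_user_sized(unsigned char *p, unsigned long n)
{
	if (n < 64)		/* illustrative threshold - benchmark it */
		return clear_small(p, n);
	return clear_rep_stosb(p, n);
}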
But other than these small issues, it looks good to me.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-17 20:56 ` Linus Torvalds
@ 2022-04-18 10:15 ` Borislav Petkov
2022-04-18 17:10 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-18 10:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sun, Apr 17, 2022 at 01:56:25PM -0700, Linus Torvalds wrote:
> That 'movq %rcx,%rax' can't be right. The caller expects it to be zero
> on input and stay zero on output.
>
> But I think "xorl %eax,%eax" is good, since %eax was used as a
> temporary in that function.
Yah, wanted to singlestep that whole asm anyway to make sure it is good.
And just started going through it - I think it can even be optimized a
bit to use %rax for the rest bytes and decrement it to 0 eventually.
The "xorl %eax,%eax" is still there, though, in case we fault on the
user access, so that we can clear it to match the compiler's expectation.
I've added comments too so that it is clear what happens at a quick glance.
SYM_FUNC_START(clear_user_original)
mov %rcx,%rax
shr $3,%rcx # qwords
and $7,%rax # rest bytes
test %rcx,%rcx
jz 1f
# do the qwords first
.p2align 4
0: movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz 0b
1: test %rax,%rax
jz 3f
# now do the rest bytes
2: movb $0,(%rdi)
inc %rdi
decl %eax
jnz 2b
3:
xorl %eax,%eax
RET
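For reference, a C rendering of what that asm does in the non-faulting case
(a sketch only; the interesting part of the real function is the fault
fixup, which plain C can't express):

#include <stdint.h>

/*
 * Non-faulting C equivalent of clear_user_original above: clear
 * qword-sized chunks first, then the remaining 0-7 bytes. Returns the
 * number of bytes left uncleared (always 0 here; the asm version can
 * return non-zero when it faults mid-way).
 */
static unsigned long clear_sketch(void *to, unsigned long n)
{
	uint64_t *q = to;
	unsigned long qwords = n >> 3;
	unsigned long rest = n & 7;

	while (qwords--)
		*q++ = 0;

	unsigned char *b = (unsigned char *)q;
	while (rest--)
		*b++ = 0;

	return 0;
}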
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-18 10:15 ` Borislav Petkov
@ 2022-04-18 17:10 ` Linus Torvalds
2022-04-19 9:17 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-18 17:10 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Mon, Apr 18, 2022 at 3:15 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yah, wanted to singlestep that whole asm anyway to make sure it is good.
> And just started going through it - I think it can be even optimized a
> bit to use %rax for the rest bytes and decrement it into 0 eventually.
Ugh. If you do this, you need to have a big comment about how that
%rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
up doing "%rcx = %rcx*8+%rax" in ex_handler_ucopy_len() for the
exception case).
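A worked example of that fixup, with illustrative numbers: faulting with 2
qwords still to clear and 5 rest bytes pending yields 2*8 + 5 = 21 bytes
uncleared. As a model in C:

/*
 * Illustrative model of the EX_TYPE_UCOPY_LEN8 fixup: at fault time
 * %rcx holds the qwords still to clear and %rax the 0-7 rest bytes,
 * and the handler reconstructs the byte count the caller expects.
 */
static unsigned long ucopy_len8_fixup(unsigned long rcx_qwords,
				      unsigned long rax_rest)
{
	return rcx_qwords * 8 + rax_rest;	/* e.g. 2*8 + 5 = 21 bytes left */
}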
Plus you need to explain the xorl here:
> 3:
> xorl %eax,%eax
> RET
because with your re-written function it *looks* like %eax is already
zero, so - once again - this is about how the exception cases work and
get here magically.
So that needs some big comment about "if an exception happens, we jump
to label '3', and the exception handler will fix up %rcx, but we'll
need to clear %rax".
Or something like that.
But yes, that does look like it will work, it's just really subtle how
%rcx is zero for the 'rest bytes', and %rax is the byte count
remaining in addition.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-18 17:10 ` Linus Torvalds
@ 2022-04-19 9:17 ` Borislav Petkov
2022-04-19 16:41 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-19 9:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Mon, Apr 18, 2022 at 10:10:42AM -0700, Linus Torvalds wrote:
> Ugh. If you do this, you need to have a big comment about how that
> %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
> up doing "%rcx = %rcx*8+%rax" in ex_handler_ucopy_len() for the
> exception case).
Yap, and I reused your text and expanded it. You made me look at that
crazy DEFINE_EXTABLE_TYPE_REG macro finally so that I know what it does
in detail.
So I have the below now, it boots in the guest so it must be perfect.
---
/*
* Default clear user-space.
* Input:
* rdi destination
* rcx count
*
* Output:
* rcx uncopied bytes or 0 if successful.
*/
SYM_FUNC_START(clear_user_original)
mov %rcx,%rax
shr $3,%rcx # qwords
and $7,%rax # rest bytes
test %rcx,%rcx
jz 1f
# do the qwords first
.p2align 4
0: movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz 0b
1: test %rax,%rax
jz 3f
# now do the rest bytes
2: movb $0,(%rdi)
inc %rdi
decl %eax
jnz 2b
3:
xorl %eax,%eax
RET
_ASM_EXTABLE_UA(0b, 3b)
/*
* The %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
* up doing "%rcx = %rcx*8 + %rax" in ex_handler_ucopy_len() for the exception
* case). That is, we use %rax above at label 2: for simpler asm but the number
* of uncleared bytes will land in %rcx, as expected by the caller.
*
* %rax at label 3: still needs to be cleared in the exception case because this
* is called from inline asm and the compiler expects %rax to be zero when exiting
* the inline asm, in case it might reuse it somewhere.
*/
_ASM_EXTABLE_TYPE_REG(2b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
SYM_FUNC_END(clear_user_original)
EXPORT_SYMBOL(clear_user_original)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-19 9:17 ` Borislav Petkov
@ 2022-04-19 16:41 ` Linus Torvalds
2022-04-19 17:48 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-19 16:41 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Tue, Apr 19, 2022 at 2:17 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Yap, and I reused your text and expanded it. You made me look at that
> crazy DEFINE_EXTABLE_TYPE_REG macro finally so that I know what it does
> in detail.
>
> So I have the below now, it boots in the guest so it must be perfect.
This looks fine to me.
Although honestly, I'd be even happier without those fancy exception
table tricks. I actually think things would be more legible if we had
explicit error return points that did the
err8:
shlq $3,%rcx
addq %rax,%rcx
err1:
xorl %eax,%eax
RET
things explicitly.
That's perhaps especially true since this whole thing now added a new
- and even more complex - error case with that _ASM_EXTABLE_TYPE_REG.
But I'm ok with the complex version too, I guess.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-19 16:41 ` Linus Torvalds
@ 2022-04-19 17:48 ` Borislav Petkov
2022-04-21 15:06 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-19 17:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Tue, Apr 19, 2022 at 09:41:42AM -0700, Linus Torvalds wrote:
> But I'm ok with the complex version too, I guess.
I feel ya. I myself am, more often than not, averse to such complex
things if it can be solved in a simpler way but we have those fancy
exception handlers tricks everywhere and if we did it here differently,
it would completely not fit the mental pattern and we would wonder why
is this special...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-19 17:48 ` Borislav Petkov
@ 2022-04-21 15:06 ` Borislav Petkov
2022-04-21 16:50 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-21 15:06 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
Some numbers with the microbenchmark:
$ for i in $(seq 1 100); do dd if=/dev/zero of=/dev/null bs=16K count=10000 2>&1 | grep GB; done > /tmp/w.log
$ awk 'BEGIN { sum = 0 } { sum +=$10 } END { print sum/100 }' /tmp/w.log
on AMD zen3
original: 20.11 GB/s
rep_good: 34.662 GB/s
erms: 36.378 GB/s
fsrm: 36.398 GB/s
I'll run some real benchmarks later, but these numbers kinda speak for
themselves, so it's unlikely that a real, process-intensive benchmark
would show different perf...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-21 15:06 ` Borislav Petkov
@ 2022-04-21 16:50 ` Linus Torvalds
2022-04-21 17:22 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-21 16:50 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
[-- Attachment #1: Type: text/plain, Size: 1159 bytes --]
On Thu, Apr 21, 2022 at 8:06 AM Borislav Petkov <bp@alien8.de> wrote:
>
> on AMD zen3
>
> original: 20.11 GB/s
> rep_good: 34.662 GB/s
> erms: 36.378 GB/s
> fsrm: 36.398 GB/s
Looks good.
Of course, the interesting cases are the "took a page fault in the middle" ones.
A very simple basic test is something like the attached.
It does no error checking or anything else, but doing a 'strace
./a.out' should give you something like
...
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x7f10ddfd0000
munmap(0x7f10ddfe0000, 65536) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
exit_group(16) = ?
where that "read(..) = 16" is the important part. It correctly figured
out that it can only do 16 bytes (ok, 17, but we've always allowed the
user accessor functions to block).
With erms/fsrm, presumably you get that optimal "read(..) = 17".
I'm sure we have a test-case for this somewhere, but it was easier for
me to write a few lines of (bad) code than try to find it.
Linus
[-- Attachment #2: t.c --]
[-- Type: text/x-csrc, Size: 397 bytes --]
#include <stddef.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
// Whatever
#define PAGE_SIZE 65536
int main(int argc, char **argv)
{
int fd;
void *map;
fd = open("/dev/zero", O_RDONLY);
map = mmap(NULL, 3*PAGE_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
munmap(map + PAGE_SIZE, PAGE_SIZE);
return read(fd, map + PAGE_SIZE - 17, PAGE_SIZE);
}
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-21 16:50 ` Linus Torvalds
@ 2022-04-21 17:22 ` Linus Torvalds
2022-04-24 19:37 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-21 17:22 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Thu, Apr 21, 2022 at 9:50 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> where that "read(..) = 16" is the important part. It correctly figured
> out that it can only do 16 bytes (ok, 17, but we've always allowed the
> user accessor functions to block).
Bad choice of words - by "block" I meant "doing the accesses in
blocks", in this case 64-bit words.
Obviously the user accessors _also_ "block" in the sense of having to
wait for page faults and IO.
I think 'copy_{to,from}_user()' actually does go to the effort to try
to do byte-exact results, though.
In particular, see copy_user_handle_tail in arch/x86/lib/copy_user_64.S.
But I think that we long ago ended up deciding it really wasn't worth
doing it, and x86 ends up just going to unnecessary lengths for this
case.
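Conceptually, the byte-exact tail being discussed is just a retry of the
failed region one byte at a time, so the copy stops exactly at the faulting
byte. A hedged sketch of the idea, not the actual copy_user_64.S code -
access_byte_ok() is a hypothetical stand-in for the per-byte exception
handling the kernel does:

extern int access_byte_ok(const char *src);	/* hypothetical */

/*
 * After a word-sized copy faults, retry the remainder byte by byte so
 * the reported "uncopied" count is exact.
 */
static unsigned long copy_tail_exact(char *to, const char *from,
				     unsigned long len)
{
	while (len) {
		if (!access_byte_ok(from))
			break;		/* fault: stop at the exact byte */
		*to++ = *from++;
		len--;
	}
	return len;			/* bytes not copied */
}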
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-04-21 23:35 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-21 23:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: mm-commits, linux-mm, patches
13 patches, based on b253435746d9a4a701b5f09211b9c14d3370d0da.
Subsystems affected by this patch series:
mm/memory-failure
mm/memcg
mm/userfaultfd
mm/hugetlbfs
mm/mremap
mm/oom-kill
mm/kasan
kcov
mm/hmm
Subsystem: mm/memory-failure
Naoya Horiguchi <naoya.horiguchi@nec.com>:
mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()
Xu Yu <xuyu@linux.alibaba.com>:
mm/memory-failure.c: skip huge_zero_page in memory_failure()
Subsystem: mm/memcg
Shakeel Butt <shakeelb@google.com>:
memcg: sync flush only if periodic flush is delayed
Subsystem: mm/userfaultfd
Nadav Amit <namit@vmware.com>:
userfaultfd: mark uffd_wp regardless of VM_WRITE flag
Subsystem: mm/hugetlbfs
Christophe Leroy <christophe.leroy@csgroup.eu>:
mm, hugetlb: allow for "high" userspace addresses
Subsystem: mm/mremap
Sidhartha Kumar <sidhartha.kumar@oracle.com>:
selftest/vm: verify mmap addr in mremap_test
selftest/vm: verify remap destination address in mremap_test
selftest/vm: support xfail in mremap_test
selftest/vm: add skip support to mremap_test
Subsystem: mm/oom-kill
Nico Pache <npache@redhat.com>:
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
Subsystem: mm/kasan
Vincenzo Frascino <vincenzo.frascino@arm.com>:
MAINTAINERS: add Vincenzo Frascino to KASAN reviewers
Subsystem: kcov
Aleksandr Nogikh <nogikh@google.com>:
kcov: don't generate a warning on vm_insert_page()'s failure
Subsystem: mm/hmm
Alistair Popple <apopple@nvidia.com>:
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
MAINTAINERS | 1
fs/hugetlbfs/inode.c | 9 -
include/linux/hugetlb.h | 6 +
include/linux/memcontrol.h | 5
include/linux/mm.h | 8 +
include/linux/sched.h | 1
include/linux/sched/mm.h | 8 +
kernel/kcov.c | 7 -
mm/hugetlb.c | 10 +
mm/memcontrol.c | 12 ++
mm/memory-failure.c | 158 ++++++++++++++++++++++--------
mm/mmap.c | 8 -
mm/mmu_notifier.c | 14 ++
mm/oom_kill.c | 54 +++++++---
mm/userfaultfd.c | 15 +-
mm/workingset.c | 2
tools/testing/selftests/vm/mremap_test.c | 85 +++++++++++++++-
tools/testing/selftests/vm/run_vmtests.sh | 11 +-
18 files changed, 327 insertions(+), 87 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-21 17:22 ` Linus Torvalds
@ 2022-04-24 19:37 ` Borislav Petkov
2022-04-24 19:54 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-24 19:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Thu, Apr 21, 2022 at 10:22:32AM -0700, Linus Torvalds wrote:
> I think 'copy_{to,from}_user()' actually does go to the effort to try
> to do byte-exact results, though.
Yeah, we have had headaches with this byte-exact copying, wrt MCEs.
> In particular, see copy_user_handle_tail in arch/x86/lib/copy_user_64.S.
>
> But I think that we long ago ended up deciding it really wasn't worth
> doing it, and x86 ends up just going to unnecessary lengths for this
> case.
You could give me some more details, but AFAIU you mean that the
fallback to byte-sized reads is unnecessary and I can get rid of
copy_user_handle_tail? Because that would be a nice cleanup.
Anyway, I ran your short prog and it all looks like you predicted it:
fsrm:
----
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fafd78fe000
munmap(0x7fafd790e000, 65536) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17
exit_group(17) = ?
+++ exited with 17 +++
erms:
-----
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe0b5321000
munmap(0x7fe0b5331000, 65536) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17
exit_group(17) = ?
+++ exited with 17 +++
rep_good:
---------
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0b5f0c7000
munmap(0x7f0b5f0d7000, 65536) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
exit_group(16) = ?
+++ exited with 16 +++
original:
---------
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3ff61c6000
munmap(0x7f3ff61d6000, 65536) = 0
read(3, strace: umoven: short read (17 < 33) @0x7f3ff61d5fef
0x7f3ff61d5fef, 65536) = 3586
exit_group(3586) = ?
+++ exited with 2 +++
that "umoven: short read" is strace spitting out something about the
address space of the tracee being inaccessible. But the 17 bytes short
read is still there.
From strace sources:
/* legacy method of copying from tracee */
static int
umoven_peekdata(const int pid, kernel_ulong_t addr, unsigned int len,
void *laddr)
{
...
switch (errno) {
case EFAULT: case EIO: case EPERM:
/* address space is inaccessible */
if (nread) {
perror_msg("umoven: short read (%u < %u) @0x%" PRI_klx,
nread, nread + len, addr - nread);
}
return -1;
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-24 19:37 ` Borislav Petkov
@ 2022-04-24 19:54 ` Linus Torvalds
2022-04-24 20:24 ` Linus Torvalds
2022-04-27 0:14 ` Borislav Petkov
0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-04-24 19:54 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sun, Apr 24, 2022 at 12:37 PM Borislav Petkov <bp@alien8.de> wrote:
>
> You could give me some more details, but AFAIU you mean that the
> fallback to byte-sized reads is unnecessary and I can get rid of
> copy_user_handle_tail? Because that would be a nice cleanup.
Yeah, we already don't have that fallback in many other places.
For example: traditionally we implemented fixed-size small
copy_to/from_user() with get/put_user().
We don't do that any more, but see commit 4b842e4e25b1 ("x86: get rid
of small constant size cases in raw_copy_{to,from}_user()") and
realize how it historically never did the byte-loop fallback.
The "clear_user()" case is another example of something that was never
byte-exact.
And last time we discussed this, Al was looking at making it
byte-exact, and I'm pretty sure he noted that other architectures
already didn't always do it.
Let me go try to find it.
> Anyway, I ran your short prog and it all looks like you predicted it:
Well, this actually shows a bug.
> fsrm:
> ----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17
The above is good.
> erms:
> -----
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 17
This is good too: erms should do small copies with the expanded case,
but since it *thinks* it's a large copy, it will just use "rep stosb"
and be byte-exact.
> rep_good:
> ---------
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
This is good: it starts off with "rep movsq", does two iterations, and
then fails on the third word, so it only succeeds for 16 bytes.
> original:
> ---------
> read(3, strace: umoven: short read (17 < 33) @0x7f3ff61d5fef
> 0x7f3ff61d5fef, 65536) = 3586
This is broken.
And I'm *not* talking about the strace warning. The strace warning is
actually just a *result* of the kernel bug.
Look at that return value. It returns '3586'. That's just pure garbage.
So that 'original' routine simply returns the wrong value.
I suspect it's a %rax vs %rcx confusion again, but with your "patch on
top of earlier patch" I didn't go and sort it out.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-24 19:54 ` Linus Torvalds
@ 2022-04-24 20:24 ` Linus Torvalds
2022-04-27 0:14 ` Borislav Petkov
1 sibling, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-04-24 20:24 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sun, Apr 24, 2022 at 12:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And last time we discussed this, Al was looking at making it
> byte-exact, and I'm pretty sure he noted that other architectures
> already didn't always do it.
>
> Let me go try to find it.
Hmnm. I may have mis-remembered the details. The thread I was thinking
of was this:
https://lore.kernel.org/all/20200719031733.GI2786714@ZenIV.linux.org.uk/
and while Al was arguing for not enforcing the exact byte count, he
still suggested that it must make *some* progress. But note the whole
"There are two problems with that. First of all, the promise was
bogus - there are architectures where it is simply not true. E.g. ppc
(or alpha, or sparc, or...) can have copy_from_user() called with source
one word prior to an unmapped page, fetch that word, fail to fetch the next
one and bugger off without doing any stores"
ie it's simply never been the case in general, and Al mentions ppc,
alpha and sparc as examples of architectures where it has not been
true.
(arm and arm64, btw, do seem to have the "try harder" byte copy loop
at the end, like x86 does).
And that's when I argued that we should just accept that the byte
exact behavior simply has never been reality, and we shouldn't even
try to make it be reality.
NOTE! We almost certainly do want to have some limit of how much off
we can be, though. I do *not* think we can just unroll the loop a ton,
and say "hey, we're doing copies in chunks of 16 words, so now we're
off by up to 128 bytes".
I'd suggest making it clear that being "off" by a word is fine,
because that's the natural thing for any architecture that needs to do
a "load low/high word" followed by "store aligned word" due to not
handling unaligned faults well (eg the whole traditional RISC thing).
And yes, I think it's actually somewhat detrimental to our test
coverage that x86 does the byte-exact thing, because it means that
*if* we have any code that depends on it, it will just happen to work
on x86, but then fail on architectures that don't get nearly the same
test coverage.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-24 19:54 ` Linus Torvalds
2022-04-24 20:24 ` Linus Torvalds
@ 2022-04-27 0:14 ` Borislav Petkov
2022-04-27 1:29 ` Linus Torvalds
1 sibling, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-27 0:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Sun, Apr 24, 2022 at 12:54:57PM -0700, Linus Torvalds wrote:
> I suspect it's a %rax vs %rcx confusion again, but with your "patch on
> top of earlier patch" I didn't go and sort it out.
Finally had some quiet time to stare at this.
So when we enter the function, we shift %rcx to get the number of
qword-sized quantities to zero:
SYM_FUNC_START(clear_user_original)
mov %rcx,%rax
shr $3,%rcx # qwords <---
and we zero in qword quantities merrily:
# do the qwords first
.p2align 4
0: movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz 0b
but when we encounter the fault here, we return *%rcx* - not %rcx << 3
- the latter being the *bytes* left over, which we *actually* need to return
when we encounter the #PF.
So, we need to shift back when we fault during the qword-sized zeroing,
i.e., full function below, see label 3 there.
With that, strace looks good too:
openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
mmap(NULL, 196608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7dc5000
munmap(0x7ffff7dd5000, 65536) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 65536) = 16
exit_group(16) = ?
+++ exited with 16 +++
As to the byte-exact deal, I'll put it on my TODO to play with it later
and see how much asm we can shed from this simplification, so thanks for
the pointers!
/*
* Default clear user-space.
* Input:
* rdi destination
* rcx count
*
* Output:
* rcx uncleared bytes or 0 if successful.
*/
SYM_FUNC_START(clear_user_original)
mov %rcx,%rax
shr $3,%rcx # qwords
and $7,%rax # rest bytes
test %rcx,%rcx
jz 1f
# do the qwords first
.p2align 4
0: movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz 0b
1: test %rax,%rax
jz 3f
# now do the rest bytes
2: movb $0,(%rdi)
inc %rdi
decl %eax
jnz 2b
3:
# convert qwords back into bytes to return to caller
shl $3, %rcx
4:
xorl %eax,%eax
RET
_ASM_EXTABLE_UA(0b, 3b)
/*
* The %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
* up doing "%rcx = %rcx*8 + %rax" in ex_handler_ucopy_len() for the exception
* case). That is, we use %rax above at label 2: for simpler asm but the number
* of uncleared bytes will land in %rcx, as expected by the caller.
*
* %rax at label 3: still needs to be cleared in the exception case because this
* is called from inline asm and the compiler expects %rax to be zero when exiting
* the inline asm, in case it might reuse it somewhere.
*/
_ASM_EXTABLE_TYPE_REG(2b, 4b, EX_TYPE_UCOPY_LEN8, %rax)
Btw, I'm wondering if using descriptive label names would make this function even more
understandable:
/*
* Default clear user-space.
* Input:
* rdi destination
* rcx count
*
* Output:
* rcx uncleared bytes or 0 if successful.
*/
SYM_FUNC_START(clear_user_original)
mov %rcx,%rax
shr $3,%rcx # qwords
and $7,%rax # rest bytes
test %rcx,%rcx
jz .Lrest_bytes
# do the qwords first
.p2align 4
.Lqwords:
movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz .Lqwords
.Lrest_bytes:
test %rax,%rax
jz .Lexit
# now do the rest bytes
.Lbytes:
movb $0,(%rdi)
inc %rdi
decl %eax
jnz .Lbytes
.Lqwords_exit:
# convert qwords back into bytes to return to caller
shl $3, %rcx
.Lexit:
xorl %eax,%eax
RET
_ASM_EXTABLE_UA(.Lqwords, .Lqwords_exit)
/*
* The %rcx value gets fixed up with EX_TYPE_UCOPY_LEN (which basically ends
* up doing "%rcx = %rcx*8 + %rax" in ex_handler_ucopy_len() for the exception
* case). That is, we use %rax above at .Lbytes for simpler asm but the number
* of uncleared bytes will land in %rcx, as expected by the caller.
*
* %rax at .Lexit still needs to be cleared in the exception case because this
* is called from inline asm and the compiler expects %rax to be zero when exiting
* the inline asm, in case it might reuse it somewhere.
*/
_ASM_EXTABLE_TYPE_REG(.Lbytes, .Lexit, EX_TYPE_UCOPY_LEN8, %rax)
SYM_FUNC_END(clear_user_original)
EXPORT_SYMBOL(clear_user_original)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-27 0:14 ` Borislav Petkov
@ 2022-04-27 1:29 ` Linus Torvalds
2022-04-27 10:41 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-27 1:29 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Tue, Apr 26, 2022 at 5:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> So when we enter the function, we shift %rcx to get the number of
> qword-sized quantities to zero:
>
> SYM_FUNC_START(clear_user_original)
> mov %rcx,%rax
> shr $3,%rcx # qwords <---
Yes.
But that's what we do for "rep stosq" too, for all the same reasons.
> but when we encounter the fault here, we return *%rcx* - not %rcx << 3
> - latter being the *bytes* leftover which we *actually* need to return
> when we encounter the #PF.
Yes, but:
> So, we need to shift back when we fault during the qword-sized zeroing,
> i.e., full function below, see label 3 there.
No.
The problem is that you're using the wrong exception type.
Thanks for posting the whole thing, because that makes it much more obvious.
You have the exception table entries switched.
You should have
_ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
_ASM_EXTABLE_UA(2b, 3b)
and not need that label '4' at all.
Note how that "_ASM_EXTABLE_TYPE_REG" thing is literally designed to do
%rcx = 8*%rcx+%rax
in the exception handler.
Of course, to get that regular
_ASM_EXTABLE_UA(2b, 3b)
to work, you need to have the final byte count in %rcx, not in %rax so
that means that the "now do the rest of the bytes" case should have
done something like
movl %eax,%ecx
2: movb $0,(%rdi)
inc %rdi
decl %ecx
jnz 2b
instead.
Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for
the second exception point, and keep %rcx as zero, and keep it in
%eax, and depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, and
knowing that if an exception does *not* happen, %rcx will be zero from
the word-size loop.
But that really seems much too subtle for me - why not just keep
things in %rcx, and try to make this look as much as possible like the
"rep stosq + rep stosb" case?
And finally: I still think that those fancy exception table things are
*much* too fancy, and much too subtle, and much too complicated.
So I'd actually prefer to get rid of them entirely, and make the code
use the "no register changes" exception, and make the exception
handler do a proper site-specific fixup. At that point, you can get
rid of all the "mask bits early" logic, get rid of all the extraneous
'test' instructions, and make it all look something like below.
NOTE! I've intentionally kept the %eax thing using 32-bit instructions
- smaller encoding, and only the low three bits matter, so why
move/mask full 64 bits?
NOTE2! Entirely untested. But I tried to make the normal code do
minimal work, and then fix things up in the exception case more. So it
just keeps the original count in the 32 bits in %eax until it wants to
test it, and then uses the 'andl' to both mask and test. And the
exception case knows that, so it masks there too.
I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and
EX_TYPE_UCOPY_LEN8 was unnecessary.
Linus
SYM_FUNC_START(clear_user_original)
movl %ecx,%eax
shrq $3,%rcx # qwords
jz .Lrest_bytes
# do the qwords first
.p2align 4
.Lqwords:
movq $0,(%rdi)
lea 8(%rdi),%rdi
dec %rcx
jnz .Lqwords
.Lrest_bytes:
andl $7,%eax # rest bytes
jz .Lexit
# now do the rest bytes
.Lbytes:
movb $0,(%rdi)
inc %rdi
decl %eax
jnz .Lbytes
.Lexit:
xorl %eax,%eax
RET
.Lqwords_exception:
# convert qwords back into bytes to return to caller
shlq $3, %rcx
andl $7, %eax
addq %rax,%rcx
jmp .Lexit
.Lbytes_exception:
movl %eax,%ecx
jmp .Lexit
_ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
_ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-27 1:29 ` Linus Torvalds
@ 2022-04-27 10:41 ` Borislav Petkov
2022-04-27 16:00 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-04-27 10:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Tue, Apr 26, 2022 at 06:29:16PM -0700, Linus Torvalds wrote:
> Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for
> the second exception point, and keep %rcx as zero, and keep it in
> %eax, and depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, and
> knowing that if an exception does *not* happen, %rcx will be zero from
> the word-size loop.
Exactly!
> But that really seems much too subtle for me - why not just keep
> things in %rcx, and try to make this look as much as possible like the
> "rep stosq + rep stosb" case?
>
> And finally: I still think that those fancy exception table things are
> *much* too fancy, and much too subtle, and much too complicated.
This whole confusion goes to show that this really is too subtle. ;-\
I mean, it is probably fine for some simple asm where you want to have
the register fixup out of the way. But for something like that where
I wanted the asm to be optimal but not tricky at the same time, that
tricky "hidden fixup" might turn out to be not such a good idea.
And sure, yeah, we can make this work this way now and all but after
time passes, swapping all that tricky stuff back in is going to be a
pain. That's why I'm leaving breadcrumbs everywhere I can.
> So I'd actually prefer to get rid of them entirely, and make the code
> use the "no register changes" exception, and make the exception
> handler do a proper site-specific fixup. At that point, you can get
> rid of all the "mask bits early" logic, get rid of all the extraneous
> 'test' instructions, and make it all look something like below.
Yap, and the named labels make it even clearer.
> NOTE! I've intentionally kept the %eax thing using 32-bit instructions
> - smaller encoding, and only the low three bits matter, so why
> move/mask full 64 bits?
Some notes below.
> NOTE2! Entirely untested.
I'll poke at it.
> But I tried to make the normal code do minimal work, and then fix
> things up in the exception case more.
Yap.
> So it just keeps the original count in the 32 bits in %eax until it
> wants to test it, and then uses the 'andl' to both mask and test. And
> the exception case knows that, so it masks there too.
Sure.
> I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and
> EX_TYPE_UCOPY_LEN8 was unnecessary.
I've pointed this out to Mr. Z. He likes them as complex as possible.
:-P
> SYM_FUNC_START(clear_user_original)
> movl %ecx,%eax
I'll add a comment here along the lines of us only needing the 32-bit
quantity of the size as we're dealing with the rest bytes and so we save
a REX byte in the opcode.
> shrq $3,%rcx # qwords
Any particular reason for the explicit "q" suffix? The register already
denotes the opsize and the generated opcode is the same.
> jz .Lrest_bytes
>
> # do the qwords first
> .p2align 4
> .Lqwords:
> movq $0,(%rdi)
> lea 8(%rdi),%rdi
> dec %rcx
Like here, for example.
I guess you wanna be explicit as to the operation width...
> jnz .Lqwords
>
> .Lrest_bytes:
> andl $7,%eax # rest bytes
> jz .Lexit
>
> # now do the rest bytes
> .Lbytes:
> movb $0,(%rdi)
> inc %rdi
> decl %eax
> jnz .Lbytes
>
> .Lexit:
> xorl %eax,%eax
> RET
>
> .Lqwords_exception:
> # convert qwords back into bytes to return to caller
> shlq $3, %rcx
> andl $7, %eax
> addq %rax,%rcx
> jmp .Lexit
>
> .Lbytes_exception:
> movl %eax,%ecx
> jmp .Lexit
>
> _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
> _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
Yap, I definitely like this one more. Straightforward.
Thx!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-27 10:41 ` Borislav Petkov
@ 2022-04-27 16:00 ` Linus Torvalds
2022-05-04 18:56 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-04-27 16:00 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Wed, Apr 27, 2022 at 3:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Any particular reason for the explicit "q" suffix? The register already
> denotes the opsize and the generated opcode is the same.
No, I just added the 'q' suffix to distinguish it from the 'l'
instruction working on the same register one line earlier.
And then I wasn't consistent throughout, because it was really just
me thinking "here we can use 32-bit ops, and here we are working with
the full possible 64-bit range".
So take that code as what it is - an untested and slightly edited
version of the code you sent me, meant for "how about like this"
purposes rather than anything else.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* incoming
@ 2022-04-27 19:41 Andrew Morton
0 siblings, 0 replies; 82+ messages in thread
From: Andrew Morton @ 2022-04-27 19:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-mm, mm-commits, patches
2 patches, based on d615b5416f8a1afeb82d13b238f8152c572d59c0.
Subsystems affected by this patch series:
mm/kasan
mm/debug
Subsystem: mm/kasan
Zqiang <qiang1.zhang@intel.com>:
kasan: prevent cpu_quarantine corruption when CPU offline and cache shrink occur at same time
Subsystem: mm/debug
Akira Yokosawa <akiyks@gmail.com>:
docs: vm/page_owner: use literal blocks for param description
Documentation/vm/page_owner.rst | 5 +++--
mm/kasan/quarantine.c | 7 +++++++
2 files changed, 10 insertions(+), 2 deletions(-)
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-04-27 16:00 ` Linus Torvalds
@ 2022-05-04 18:56 ` Borislav Petkov
2022-05-04 19:22 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-04 18:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
Just to update folks here: I haven't forgotten about this - Mel and I
are running some benchmarks first and staring at results to see whether
all the hoopla is even worth it.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-05-04 18:56 ` Borislav Petkov
@ 2022-05-04 19:22 ` Linus Torvalds
2022-05-04 20:18 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-04 19:22 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits
On Wed, May 4, 2022 at 11:56 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Just to update folks here: I haven't forgotten about this - Mel and I
> are running some benchmarks first and staring at results to see whether
> all the hoopla is even worth it.
Side note: the "do FSRM inline" would likely be a really good thing
for "copy_to_user()", more so than the silly "clear_user()" that we
realistically do almost nowhere.
I doubt you can find "clear_user()" outside of benchmarks (but hey,
people do odd things).
But "copy_to_user()" is everywhere, and the I$ advantage of inlining
it might be noticeable on some real loads.
I remember some git profiles having copy_to_user very high due to
fstat(), for example - cp_new_stat64 and friends.
Of course, I haven't profiled git in ages, but I doubt that has
changed. Many of those kinds of loads are all about name lookup and
stat (basic things like "make" would be that too, if it weren't for
the fact that it spends a _lot_ of its time in user space string
handling).
The inlining advantage would obviously only show up on CPUs that
actually do FSRM. Which I think is currently only Ice Lake. I don't
have access to one.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-05-04 19:22 ` Linus Torvalds
@ 2022-05-04 20:18 ` Borislav Petkov
2022-05-04 20:40 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-04 20:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, May 04, 2022 at 12:22:34PM -0700, Linus Torvalds wrote:
> Side note: the "do FSRM inline" would likely be a really good thing
> for "copy_to_user()", more so than the silly "clear_user()" that we
> realistically do almost nowhere.
Right, that would be my next project.
>
> I doubt you can find "clear_user()" outside of benchmarks (but hey,
> people do odd things).
Well, see preview below.
> But "copy_to_user()" is everywhere, and the I$ advantage of inlining
> it might be noticeable on some real loads.
>
> I remember some git profiles having copy_to_user very high due to
> fstat(), for example - cp_new_stat64 and friends.
>
> Of course, I haven't profiled git in ages, but I doubt that has
Yeah, see below.
> changed. Many of those kinds of loads are all about name lookup and
> stat (basic things like "make" would be that too, if it weren't for
> the fact that it spends a _lot_ of its time in user space string
> handling).
>
> The inlining advantage would obviously only show up on CPUs that
> actually do FSRM. Which I think is currently only Ice Lake. I don't
> have access to one.
Zen3 has FSRM.
So below's the git test suite with clear_user on Zen3. It creates a lot
of processes, so we get to clear_user a bunch and that's the inlined rep
stosb.
You can see some small but noticeable improvement:
gitsource
                              rc5                 clear_user
Min User 196.65 ( 0.00%) 193.16 ( 1.77%)
Min System 57.20 ( 0.00%) 55.89 ( 2.29%)
Min Elapsed 270.27 ( 0.00%) 266.09 ( 1.55%)
Min CPU 93.00 ( 0.00%) 93.00 ( 0.00%)
Amean User 197.05 ( 0.00%) 194.14 * 1.48%*
Amean System 57.41 ( 0.00%) 56.35 * 1.83%*
Amean Elapsed 270.97 ( 0.00%) 266.90 * 1.50%*
Amean CPU 93.00 ( 0.00%) 93.00 ( 0.00%)
Stddev User 0.25 ( 0.00%) 0.64 (-151.28%)
Stddev System 0.24 ( 0.00%) 0.31 ( -28.73%)
Stddev Elapsed 0.56 ( 0.00%) 0.62 ( -10.17%)
Stddev CPU 0.00 ( 0.00%) 0.00 ( 0.00%)
CoeffVar User 0.13 ( 0.00%) 0.33 (-155.05%)
CoeffVar System 0.41 ( 0.00%) 0.54 ( -31.13%)
CoeffVar Elapsed 0.21 ( 0.00%) 0.23 ( -11.85%)
CoeffVar CPU 0.00 ( 0.00%) 0.00 ( 0.00%)
Max User 197.35 ( 0.00%) 194.92 ( 1.23%)
Max System 57.75 ( 0.00%) 56.64 ( 1.92%)
Max Elapsed 271.66 ( 0.00%) 267.60 ( 1.49%)
Max CPU 93.00 ( 0.00%) 93.00 ( 0.00%)
BAmean-50 User 196.85 ( 0.00%) 193.60 ( 1.65%)
BAmean-50 System 57.20 ( 0.00%) 56.05 ( 2.01%)
BAmean-50 Elapsed 270.40 ( 0.00%) 266.29 ( 1.52%)
BAmean-50 CPU 93.00 ( 0.00%) 93.00 ( 0.00%)
BAmean-95 User 196.98 ( 0.00%) 193.94 ( 1.54%)
BAmean-95 System 57.32 ( 0.00%) 56.28 ( 1.81%)
BAmean-95 Elapsed 270.79 ( 0.00%) 266.72 ( 1.50%)
BAmean-95 CPU 93.00 ( 0.00%) 93.00 ( 0.00%)
BAmean-99 User 196.98 ( 0.00%) 193.94 ( 1.54%)
BAmean-99 System 57.32 ( 0.00%) 56.28 ( 1.81%)
BAmean-99 Elapsed 270.79 ( 0.00%) 266.72 ( 1.50%)
BAmean-99 CPU 93.00 ( 0.00%) 93.00 ( 0.00%)
                              rc5                 clear_user
Duration User 1182.22 1165.67
Duration System 345.58 338.46
Duration Elapsed 1626.80 1602.99
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-05-04 20:18 ` Borislav Petkov
@ 2022-05-04 20:40 ` Linus Torvalds
2022-05-04 21:01 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-04 20:40 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, May 4, 2022 at 1:18 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Zen3 has FSRM.
Sadly, I'm on Zen2 with my 3970X, and the Zen 3 threadrippers seem to
be basically impossible to get.
> So below's the git test suite with clear_user on Zen3. It creates a lot
> of processes so we get to clear_user a bunch and that's the inlined rep
> movsb.
Oh, the clear_user() in the ELF loader? I wouldn't have expected that
to be noticeable.
Now, clear_page() is *very* noticeable, but that has its own special
magic and doesn't use clear_user().
Maybe there is some other clear_user() case I never thought of. My dim
memories of profiles definitely had copy_to_user, clear_page and
copy_page being some of the top copy loops.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-05-04 20:40 ` Linus Torvalds
@ 2022-05-04 21:01 ` Borislav Petkov
2022-05-04 21:09 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-04 21:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, May 04, 2022 at 01:40:07PM -0700, Linus Torvalds wrote:
> Sadly, I'm on Zen2 with my 3970X, and the Zen 3 threadrippers seem to
> be basically impossible to get.
Yeah, what we did is get a gigabyte board:
[ 0.000000] DMI: GIGABYTE MZ32-AR0-00/MZ32-AR0-00, BIOS M06 07/10/2021
and stick a server CPU in it:
[ 2.352371] smpboot: CPU0: AMD EPYC 7313 16-Core Processor (family: 0x19, model: 0x1, stepping: 0x1)
so that we can have the memory encryption stuff. 32 threads is fairly
decent and kernel builds are fast enough, at least for me. If you need
to do a lot of allmodconfigs, you probably need something bigger though.
> Oh, the clear_user() in the ELF loader? I wouldn't have expected that
> to be noticeable.
Yah, that guy in load_elf_binary(). At least it did hit my breakpoint
fairly often, so I thought: what is a benchmark that creates a lot of
processes...
> Now, clear_page() is *very* noticeable, but that has its own special
> magic and doesn't use clear_user().
Yeah, that's with the alternative CALL thing. Could be useful to try to
see how much the inlined rep; movsb would bring...
> Maybe there is some other clear_user() case I never thought of. My
> dim memories of profiles definitely had copy_to_user, clear_page and
> copy_page being some of the top copy loops.
I could try to do a perf probe or whatever fancy new thing we do now on
clear_user to get some numbers of how many times it gets called during
the benchmark run. Or do you wanna know the callers too?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE
2022-05-04 21:01 ` Borislav Petkov
@ 2022-05-04 21:09 ` Linus Torvalds
2022-05-10 9:31 ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-04 21:09 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, May 4, 2022 at 2:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> I could try to do a perf probe or whatever fancy new thing we do now on
> clear_user to get some numbers of how many times it gets called during
> the benchmark run. Or do you wanna know the callers too?
One of the non-performance reasons I like inlined memcpy is actually
that when you do a regular 'perf record' run, the cost of the memcpy
gets associated with the call-site.
Which is universally what I want for those things. I used to love our
inlined spinlocks for the same reason back when we did them.
Yeah, yeah, you can do it with callchain magic, but then you get it
all - and I really consider memcpy/memset to be a special case.
Normally I want the "oh, that leaf function is expensive", but not for
memcpy and memset (and not for spinlocks, but we'll never go back to
the old trivial spinlocks).
I don't tend to particularly care about "how many times has this been
called" kind of trace profiles. It's the actual expense in CPU cycles
I tend to care about.
That said, I cared deeply about those kinds of CPU profiles when I was
working with Al on the RCU path lookup code and looking for where the
problem spots were.
That was years ago.
I haven't really done serious profiling work for a while (which is
just as well, because it's one of the things that went backwards when
I switched to the Zen 2 threadripper for my main machine).
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
2022-05-04 21:09 ` Linus Torvalds
@ 2022-05-10 9:31 ` Borislav Petkov
2022-05-10 17:17 ` Linus Torvalds
2022-05-10 17:28 ` Linus Torvalds
0 siblings, 2 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-10 9:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
Lemme fix that subject so that I can find it easier in my avalanche mbox...
On Wed, May 04, 2022 at 02:09:52PM -0700, Linus Torvalds wrote:
> I don't tend to particularly care about "how many times has this been
> called" kind of trace profiles. It's the actual expense in CPU cycles
> I tend to care about.
Yeah, but I wanted to measure how much perf improvement that would
bring with the git test suite and wanted to know how often clear_user()
is called in conjunction with it. Because the benchmarks I ran would
show very small improvements and a PF benchmark would even show weird
things like slowdowns with higher core counts.
So for a ~6m running test suite, the function gets called under 700K
times, all from padzero:
<...>-2536 [006] ..... 261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
<...>-2536 [006] ..... 261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
<...>-2537 [008] ..... 261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
<...>-2537 [008] ..... 261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
...
which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.
So Mel gave me the idea to simply measure how fast the function becomes. I.e.:
start = rdtsc_ordered();
ret = __clear_user(to, n);
end = rdtsc_ordered();
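For reference, a user-space analogue of that measurement pattern - a sketch
where __rdtscp() stands in for the kernel's rdtsc_ordered() and memset()
stands in for __clear_user():

#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

int main(void)
{
	static char buf[4096];
	unsigned int aux;
	unsigned long long start, end;

	start = __rdtscp(&aux);		/* ordered TSC read */
	memset(buf, 0, sizeof(buf));	/* stand-in for __clear_user() */
	end = __rdtscp(&aux);

	/* read buf so the memset is not optimized away as a dead store */
	printf("cycles: %llu (buf[0]=%d)\n", end - start, buf[0]);
	return 0;
}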
Computing the mean of all the samples collected during the test
suite run then shows some improvement:
clear_user_original:
Amean: 9219.71 (Sum: 6340154910, samples: 687674)
fsrm:
Amean: 8030.63 (Sum: 5522277720, samples: 687652)
That's on Zen3.
I'll run this on Ice Lake now too.
> I haven't really done serious profiling work for a while (which is
> just as well, because it's one of the things that went backwards when
> I switched to the Zen 2 threadripper for my main machine).
Because of the not as advanced perf support there? Any pain points I can
forward?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
2022-05-10 9:31 ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
@ 2022-05-10 17:17 ` Linus Torvalds
2022-05-10 17:28 ` Linus Torvalds
1 sibling, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2022-05-10 17:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> > I haven't really done serious profiling work for a while (which is
> > just as well, because it's one of the things that went backwards when
> > I switched to the Zen 2 threadripper for my main machine)
>
> Because of the not as advanced perf support there? Any pain points I can
> forward?
It's not anything fancy, and it's not anything new - you've been cc'd
on me talking about it before.
As mentioned, I don't actually do anything fancy with profiling - I
basically almost always just want to do a simple
perf record -e cycles:pp
so that I get reasonable instruction attribution for what the
expensive part actually is (where "actually is" is obviously always
just an approximation, I'm not claiming anything else - but I just
don't want to have to try to figure out some huge instruction skid
issue). And then (because I only tend to care about the kernel, and
don't care about _who_ is doing things), I just do
perf report --sort=dso,symbol
and start looking at the kernel side of things. I then occasionally
enable -g, but I hate doing it, and it's usually because I see "oh
damn, some spinlock slowpath, let's see what the callers are" just to
figure out which spinlock it is.
VERY rudimentary, in other words. It's the "I don't know where the
time is going, so let's find out".
And that simple thing doesn't work, because Zen 2 doesn't like the
per-thread profiling. So I can be root and use '-a', and it works
fine, except I don't want to do things as root just for profiling.
Plus I don't actually want to see the ACPI "CPU idle" things.
I'm told the issue is that Zen 2 is not stable with IBS (aka "PEBS for
AMD"), which is all kinds of sad, but there it is.
As also mentioned, it's not actually a huge deal for me, because all I
do is read email and do "git pull". And the times when I used
profiling to find things git could need improvement on (usually
pathname lookup for "git diff") are long long gone.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
2022-05-10 9:31 ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
2022-05-10 17:17 ` Linus Torvalds
@ 2022-05-10 17:28 ` Linus Torvalds
2022-05-10 18:10 ` Borislav Petkov
1 sibling, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-10 17:28 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 10, 2022 at 2:31 AM Borislav Petkov <bp@alien8.de> wrote:
>
> clear_user_original:
> Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
> fsrm:
> Amean: 8030.63 (Sum: 5522277720, samples: 687652)
Well, that's pretty conclusive.
I'm obviously very happy with fsrm. I've been pushing for that thing
for probably over two decades by now, because I absolutely detest
uarch optimizations for memset/memcpy that can never be done well in
software anyway (because it depends not just on cache organization,
but on cache sizes and dynamic cache hit/miss behavior of the load).
And one of the things I always wanted to do was to just have
memcpy/memset entirely inlined.
In fact, if you go back to the 0.01 linux kernel sources, you'll see
that they only compile with my bastardized version of gcc-1.40,
because I made the compiler inline those things with 'rep movs/stos',
and there was no other implementation of memcpy/memset at all.
That was a bit optimistic at the time, but here we are, 30+ years
later and it is finally looking possible, at least on some uarchs.
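(For the curious, "entirely inlined" means the call site boils down to
something like this sketch - not the kernel's actual code - where the
compiler emits the string instruction directly:)

#include <stddef.h>

/* minimal sketch of an inlined memset: a bare 'rep stosb', with the
 * count in rcx, the destination in rdi and the fill byte in al */
static inline void *memset_rep_stosb(void *dst, int c, size_t n)
{
        void *d = dst;

        asm volatile("rep stosb"
                     : "+D" (d), "+c" (n)
                     : "a" (c)
                     : "memory");
        return dst;
}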
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
2022-05-10 17:28 ` Linus Torvalds
@ 2022-05-10 18:10 ` Borislav Petkov
2022-05-10 18:57 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-10 18:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 10, 2022 at 10:28:28AM -0700, Linus Torvalds wrote:
> Well, that's pretty conclusive.
Yap. It appears I don't have a production-type Icelake so I probably
can't show the numbers there but at least I can check whether there's an
improvement too.
> I'm obviously very happy with fsrm. I've been pushing for that thing
> for probably over two decades by now,
The time sounds about right - I'm closing in on two decades poking at
the kernel myself and I've yet to see a more complex feature I've been
advocating for, materialize.
> because I absolutely detest uarch optimizations for memset/memcpy that
> can never be done well in software anyway (because it depends not just
> on cache organization, but on cache sizes and dynamic cache hit/miss
> behavior of the load).
Yeah, you want all that cacheline aggregation to happen underneath where
it can do all the checks etc.
> And one of the things I always wanted to do was to just have
> memcpy/memset entirely inlined.
>
> In fact, if you go back to the 0.01 linux kernel sources, you'll see
LOL, I think I've seen those sources printed out on a wall somewhere.
:-)
> that they only compile with my bastardized version of gcc-1.40,
> because I made the compiler inline those things with 'rep movs/stos',
> and there was no other implementation of memcpy/memset at all.
Yeah, I have it on my todo to look at inlining the other primitives
too, and see whether that brings any improvements. Now our patching
infrastructure is nicely mature too so that we can be very creative
there.
> That was a bit optimistic at the time, but here we are, 30+ years
> later and it is finally looking possible, at least on some uarchs.
Yap, it takes "only" 30+ years. :-\
And when you think of all the crap stuff that got added in silicon
and *removed* *again* in the meantime... but I'm optimistic now that
Murphy's Law is not going to hold true anymore, we will finally start
optimizing hardware *and* software.
:-)))
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE)
2022-05-10 18:10 ` Borislav Petkov
@ 2022-05-10 18:57 ` Borislav Petkov
2022-05-24 12:32 ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-05-10 18:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 10, 2022 at 08:10:14PM +0200, Borislav Petkov wrote:
> And when you think of all the crap stuff that got added in silicon
> and *removed* *again* in the meantime... but I'm optimistic now that
> Murphy's Law is not going to hold true anymore, we will finally start
^^^^^^
Dunno if this was a Freudian slip... :-)
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH] x86/clear_user: Make it faster
2022-05-10 18:57 ` Borislav Petkov
@ 2022-05-24 12:32 ` Borislav Petkov
2022-05-24 16:51 ` Linus Torvalds
` (3 more replies)
0 siblings, 4 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-24 12:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
Ok,
finally a somewhat final version, lightly tested.
I still need to run it on production Icelake and that is kinda being
delayed due to server room cooling issues (don't ask ;-\).
---
From: Borislav Petkov <bp@suse.de>
Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.
The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there - if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is taking place.
The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.
So for a ~6-minute run of the git test suite, the function gets called under
700K times, all from padzero():
<...>-2536 [006] ..... 261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
<...>-2536 [006] ..... 261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
<...>-2537 [008] ..... 261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
<...>-2537 [008] ..... 261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
...
which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.
So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:
start = rdtsc_ordered();
ret = __clear_user(to, n);
end = rdtsc_ordered();
Computing the mean average of all the samples collected during the test
suite run then shows some improvement:
clear_user_original:
Amean: 9219.71 (Sum: 6340154910, samples: 687674)
fsrm:
Amean: 8030.63 (Sum: 5522277720, samples: 687652)
That's on Zen3.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
arch/x86/include/asm/uaccess.h | 5 +-
arch/x86/include/asm/uaccess_64.h | 45 ++++++++++
arch/x86/lib/clear_page_64.S | 142 ++++++++++++++++++++++++++++++
arch/x86/lib/usercopy_64.c | 40 ---------
tools/objtool/check.c | 3 +
5 files changed, 192 insertions(+), 43 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index f78e2b3501a1..335d571e8a79 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -405,9 +405,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
extern __must_check long strnlen_user(const char __user *str, long n);
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
#ifdef CONFIG_ARCH_HAS_COPY_MC
unsigned long __must_check
copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -429,6 +426,8 @@ extern struct movsl_mask {
#define ARCH_HAS_NOCACHE_UACCESS 1
#ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
# include <asm/uaccess_32.h>
#else
# include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..4ffefb4bb844 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
kasan_check_write(dst, size);
return __copy_user_flushcache(dst, src, size);
}
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+ might_fault();
+ stac();
+
+ /*
+ * No memory constraint because it doesn't change any memory gcc
+ * knows about.
+ */
+ asm volatile(
+ "1:\n\t"
+ ALTERNATIVE_3("rep stosb",
+ "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
+ "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+ "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+ "2:\n"
+ _ASM_EXTABLE_UA(1b, 2b)
+ : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+ : "a" (0)
+ /* rep_good clobbers %rdx */
+ : "rdx");
+
+ clac();
+
+ return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+ if (access_ok(to, n))
+ return __clear_user(to, n);
+ return n;
+}
#endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..6dd6c7fd1eb8 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#include <linux/linkage.h>
+#include <asm/asm.h>
#include <asm/export.h>
/*
@@ -50,3 +51,144 @@ SYM_FUNC_START(clear_page_erms)
RET
SYM_FUNC_END(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+ /*
+ * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+ * i.e., no need for a 'q' suffix and thus a REX prefix.
+ */
+ mov %ecx,%eax
+ shr $3,%rcx
+ jz .Lrest_bytes
+
+ # do the qwords first
+ .p2align 4
+.Lqwords:
+ movq $0,(%rdi)
+ lea 8(%rdi),%rdi
+ dec %rcx
+ jnz .Lqwords
+
+.Lrest_bytes:
+ and $7, %eax
+ jz .Lexit
+
+ # now do the rest bytes
+.Lbytes:
+ movb $0,(%rdi)
+ inc %rdi
+ dec %eax
+ jnz .Lbytes
+
+.Lexit:
+ /*
+ * %rax still needs to be cleared in the exception case because this function is called
+ * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+ * in case it might reuse it somewhere.
+ */
+ xor %eax,%eax
+ RET
+
+.Lqwords_exception:
+ # convert remaining qwords back into bytes to return to caller
+ shl $3, %rcx
+ and $7, %eax
+ add %rax,%rcx
+ jmp .Lexit
+
+.Lbytes_exception:
+ mov %eax,%ecx
+ jmp .Lexit
+
+ _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+ _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+ # call the original thing for less than a cacheline
+ cmp $64, %rcx
+ ja .Lprep
+ call clear_user_original
+ RET
+
+.Lprep:
+ # copy lower 32-bits for rest bytes
+ mov %ecx, %edx
+ shr $3, %rcx
+ jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+ rep stosq
+
+.Lrep_good_rest_bytes:
+ and $7, %edx
+ jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+ mov %edx, %ecx
+ rep stosb
+
+.Lrep_good_exit:
+ # see .Lexit comment above
+ xor %eax, %eax
+ RET
+
+.Lrep_good_qwords_exception:
+ # convert remaining qwords back into bytes to return to caller
+ shl $3, %rcx
+ and $7, %edx
+ add %rdx, %rcx
+ jmp .Lrep_good_exit
+
+ _ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+ _ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+ # call the original thing for less than a cacheline
+ cmp $64, %rcx
+ ja .Lerms_bytes
+ call clear_user_original
+ RET
+
+.Lerms_bytes:
+ rep stosb
+
+.Lerms_exit:
+ xorl %eax,%eax
+ RET
+
+ _ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
* Zero Userspace
*/
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
- long __d0;
- might_fault();
- /* no memory constraint because it doesn't change any memory gcc knows
- about */
- stac();
- asm volatile(
- " testq %[size8],%[size8]\n"
- " jz 4f\n"
- " .align 16\n"
- "0: movq $0,(%[dst])\n"
- " addq $8,%[dst]\n"
- " decl %%ecx ; jnz 0b\n"
- "4: movq %[size1],%%rcx\n"
- " testl %%ecx,%%ecx\n"
- " jz 2f\n"
- "1: movb $0,(%[dst])\n"
- " incq %[dst]\n"
- " decl %%ecx ; jnz 1b\n"
- "2:\n"
-
- _ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
- _ASM_EXTABLE_UA(1b, 2b)
-
- : [size8] "=&c"(size), [dst] "=&D" (__d0)
- : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
- clac();
- return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
- if (access_ok(to, n))
- return __clear_user(to, n);
- return n;
-}
-EXPORT_SYMBOL(clear_user);
-
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
/**
* clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index ca5b74603008..e460aa004b88 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
"copy_mc_fragile_handle_tail",
"copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+ "clear_user_erms",
+ "clear_user_rep_good",
+ "clear_user_original",
NULL
};
--
2.35.1
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-05-24 12:32 ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
@ 2022-05-24 16:51 ` Linus Torvalds
2022-05-24 17:30 ` Borislav Petkov
2022-05-25 12:11 ` Mark Hemment
` (2 subsequent siblings)
3 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-05-24 16:51 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 24, 2022 at 5:32 AM Borislav Petkov <bp@alien8.de> wrote:
>
> finally a somewhat final version, lightly tested.
I can't find anything wrong with this, but who knows what
patch-blindness I have from looking at a few different versions of it.
Maybe my eyes just skim over it now.
I do note that the clearing of %rax here:
> +.Lerms_exit:
> + xorl %eax,%eax
> + RET
seems to be unnecessary, since %rax is never modified in the path
leading to this. But maybe just as well just for consistency with the
cases where it *is* used as a temporary.
And I still suspect that "copy_to_user()" is *much* more interesting
than "clear_user()", but I guess we can't inline it anyway due to all
the other overhead (ie access_ok() and stac/clac).
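(The shape of that wrapper, as a simplified sketch rather than the exact
kernel code, is:)

/* the overhead in question: the range check plus the SMAP open/close
 * brackets around the raw copy */
static inline unsigned long
copy_to_user_sketch(void __user *to, const void *from, unsigned long n)
{
        unsigned long ret = n;

        if (access_ok(to, n)) {
                stac();
                ret = raw_copy_to_user(to, from, n);
                clac();
        }
        return ret;
}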
And for a plain "call memcpy/memset", we'd need compiler help to do
this (at a minimum, we'd have to have the compiler use the 'rep
movs/stos' register logic, and then we could patch things in place
afterwards, with objtool creating the alternatives section or
something).
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-05-24 16:51 ` Linus Torvalds
@ 2022-05-24 17:30 ` Borislav Petkov
0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-24 17:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 24, 2022 at 09:51:56AM -0700, Linus Torvalds wrote:
> I can't find anything wrong with this, but who knows what
> patch-blindness I have from looking at a few different versions of it.
> Maybe my eyes just skim over it now.
Same here - I can't look at that code anymore. I'll try to gain some
distance and look at it again later, and do some more extensive testing
too.
> I do note that the clearing of %rax here:
>
> > +.Lerms_exit:
> > + xorl %eax,%eax
> > + RET
>
> seems to be unnecessary, since %rax is never modified in the path
> leading to this. But maybe just as well just for consistency with the
> cases where it *is* used as a temporary.
Yeah.
> And I still suspect that "copy_to_user()" is *much* more interesting
> than "clear_user()", but I guess we can't inline it anyway due to all
> the other overhead (ie access_ok() and stac/clac).
>
> And for a plain "call memcpy/memset", we'd need compiler help to do
> this (at a minimum, we'd have to have the compiler use the 'rep
> movs/stos' register logic, and then we could patch things in place
> afterwards, with objtool creating the alternatives section or
> something).
Yeah, I have this on my todo to research them properly. Will report when
I have something.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-05-24 12:32 ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
2022-05-24 16:51 ` Linus Torvalds
@ 2022-05-25 12:11 ` Mark Hemment
2022-05-27 11:28 ` Borislav Petkov
2022-05-27 11:10 ` Ingo Molnar
2022-06-22 14:21 ` Borislav Petkov
3 siblings, 1 reply; 82+ messages in thread
From: Mark Hemment @ 2022-05-25 12:11 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, 24 May 2022 at 13:32, Borislav Petkov <bp@alien8.de> wrote:
>
> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
>
> ---
> From: Borislav Petkov <bp@suse.de>
>
> Based on a patch by Mark Hemment <markhemm@googlemail.com> and
> incorporating very sane suggestions from Linus.
>
> The point here is to have the default case with FSRM - which is supposed
> to be the majority of x86 hw out there - if not now then soon - be
> directly inlined into the instruction stream so that no function call
> overhead is taking place.
>
> The benchmarks I ran would show very small improvements and a PF
> benchmark would even show weird things like slowdowns with higher core
> counts.
>
> So for a ~6-minute run of the git test suite, the function gets called under
> 700K times, all from padzero():
>
> <...>-2536 [006] ..... 261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
> <...>-2536 [006] ..... 261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
> <...>-2537 [008] ..... 261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
> <...>-2537 [008] ..... 261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
> ...
>
> which is around 1%-ish of the total time and which is consistent with
> the benchmark numbers.
>
> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
> start = rdtsc_ordered();
> ret = __clear_user(to, n);
> end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
> clear_user_original:
> Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
> fsrm:
> Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.
>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
> ---
> arch/x86/include/asm/uaccess.h | 5 +-
> arch/x86/include/asm/uaccess_64.h | 45 ++++++++++
> arch/x86/lib/clear_page_64.S | 142 ++++++++++++++++++++++++++++++
> arch/x86/lib/usercopy_64.c | 40 ---------
> tools/objtool/check.c | 3 +
> 5 files changed, 192 insertions(+), 43 deletions(-)
[...snip...]
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index fe59b8ac4fcc..6dd6c7fd1eb8 100644
...
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
> + * present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + */
> +SYM_FUNC_START(clear_user_rep_good)
> + # call the original thing for less than a cacheline
> + cmp $64, %rcx
> + ja .Lprep
> + call clear_user_original
> + RET
> +
> +.Lprep:
> + # copy lower 32-bits for rest bytes
> + mov %ecx, %edx
> + shr $3, %rcx
> + jz .Lrep_good_rest_bytes
A slight doubt here; the comment says "less than a cacheline", but the code
is using 'ja' (jump if above) - so it calls 'clear_user_original' for a
'len' less than or equal to 64.
Not sure of the intended behaviour for 64 bytes here, but
'copy_user_enhanced_fast_string' uses the slow method for lengths less
than 64. So, should this be coded as:
cmp $64,%rcx
jb clear_user_original
?
'clear_user_erms' has similar logic which might also need reworking.
> +
> +.Lrep_good_qwords:
> + rep stosq
> +
> +.Lrep_good_rest_bytes:
> + and $7, %edx
> + jz .Lrep_good_exit
> +
> +.Lrep_good_bytes:
> + mov %edx, %ecx
> + rep stosb
> +
> +.Lrep_good_exit:
> + # see .Lexit comment above
> + xor %eax, %eax
> + RET
> +
> +.Lrep_good_qwords_exception:
> + # convert remaining qwords back into bytes to return to caller
> + shl $3, %rcx
> + and $7, %edx
> + add %rdx, %rcx
> + jmp .Lrep_good_exit
> +
> + _ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
> + _ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
> +SYM_FUNC_END(clear_user_rep_good)
> +EXPORT_SYMBOL(clear_user_rep_good)
> +
> +/*
> + * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
> + * Input:
> + * rdi destination
> + * rcx count
> + *
> + * Output:
> + * rcx: uncleared bytes or 0 if successful.
> + *
> + */
> +SYM_FUNC_START(clear_user_erms)
> + # call the original thing for less than a cacheline
> + cmp $64, %rcx
> + ja .Lerms_bytes
> + call clear_user_original
> + RET
> +
> +.Lerms_bytes:
> + rep stosb
> +
> +.Lerms_exit:
> + xorl %eax,%eax
> + RET
> +
> + _ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
> +SYM_FUNC_END(clear_user_erms)
> +EXPORT_SYMBOL(clear_user_erms)
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 0ae6cf804197..6c1f8ac5e721 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -14,46 +14,6 @@
> * Zero Userspace
> */
>
> -unsigned long __clear_user(void __user *addr, unsigned long size)
> -{
> - long __d0;
> - might_fault();
> - /* no memory constraint because it doesn't change any memory gcc knows
> - about */
> - stac();
> - asm volatile(
> - " testq %[size8],%[size8]\n"
> - " jz 4f\n"
> - " .align 16\n"
> - "0: movq $0,(%[dst])\n"
> - " addq $8,%[dst]\n"
> - " decl %%ecx ; jnz 0b\n"
> - "4: movq %[size1],%%rcx\n"
> - " testl %%ecx,%%ecx\n"
> - " jz 2f\n"
> - "1: movb $0,(%[dst])\n"
> - " incq %[dst]\n"
> - " decl %%ecx ; jnz 1b\n"
> - "2:\n"
> -
> - _ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
> - _ASM_EXTABLE_UA(1b, 2b)
> -
> - : [size8] "=&c"(size), [dst] "=&D" (__d0)
> - : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
> - clac();
> - return size;
> -}
> -EXPORT_SYMBOL(__clear_user);
> -
> -unsigned long clear_user(void __user *to, unsigned long n)
> -{
> - if (access_ok(to, n))
> - return __clear_user(to, n);
> - return n;
> -}
> -EXPORT_SYMBOL(clear_user);
> -
> #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> /**
> * clean_cache_range - write back a cache range with CLWB
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index ca5b74603008..e460aa004b88 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -1020,6 +1020,9 @@ static const char *uaccess_safe_builtin[] = {
> "copy_mc_fragile_handle_tail",
> "copy_mc_enhanced_fast_string",
> "ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
> + "clear_user_erms",
> + "clear_user_rep_good",
> + "clear_user_original",
> NULL
> };
>
> --
> 2.35.1
>
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Cheers,
Mark
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-05-24 12:32 ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
2022-05-24 16:51 ` Linus Torvalds
2022-05-25 12:11 ` Mark Hemment
@ 2022-05-27 11:10 ` Ingo Molnar
2022-06-22 14:21 ` Borislav Petkov
3 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2022-05-27 11:10 UTC (permalink / raw)
To: Borislav Petkov, Arnaldo Carvalho de Melo
Cc: Linus Torvalds, Mark Hemment, Andrew Morton,
the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
mm-commits, Mel Gorman
* Borislav Petkov <bp@alien8.de> wrote:
> Ok,
>
> finally a somewhat final version, lightly tested.
>
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
> So Mel gave me the idea to simply measure how fast the function becomes.
> I.e.:
>
> start = rdtsc_ordered();
> ret = __clear_user(to, n);
> end = rdtsc_ordered();
>
> Computing the mean average of all the samples collected during the test
> suite run then shows some improvement:
>
> clear_user_original:
> Amean: 9219.71 (Sum: 6340154910, samples: 687674)
>
> fsrm:
> Amean: 8030.63 (Sum: 5522277720, samples: 687652)
>
> That's on Zen3.
As a side note, there's some rudimentary perf tooling that allows the
user-space testing of kernel-space x86 memcpy and memset implementations:
$ perf bench mem memcpy
# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...
42.459239 GB/sec
# function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
23.818598 GB/sec
# function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
10.172526 GB/sec
# function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
# Copying 1MB bytes ...
10.614810 GB/sec
Note how the actual implementation in arch/x86/lib/memcpy_64.S was used to
build a user-space test into 'perf bench'.
For copy_user() & clear_user() some additional wrappery would be needed I
guess, to wrap away stac()/clac()/might_sleep(), etc. ...
[ Plus it could all be improved to measure cache hot & cache cold
performance, to use different sizes, etc. ]
Even with the limitation that it's not 100% equivalent to the kernel-space
thing, especially for very short buffers, having the whole perf side
benchmarking, profiling & statistics machinery available is a plus I think.
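(A first stab at that wrappery would presumably just stub out the
kernel-only bits - the macro names below are the kernel's, the stubs
themselves are hypothetical:)

/* user-space stand-ins so the .S implementations could be linked into
 * a perf bench harness; SMAP and fault handling don't apply there */
#define stac()                  do { } while (0)
#define clac()                  do { } while (0)
#define might_sleep()           do { } while (0)
#define access_ok(addr, len)    (1)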
Thanks,
Ingo
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-05-25 12:11 ` Mark Hemment
@ 2022-05-27 11:28 ` Borislav Petkov
0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-05-27 11:28 UTC (permalink / raw)
To: Mark Hemment
Cc: Linus Torvalds, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, May 25, 2022 at 01:11:17PM +0100, Mark Hemment wrote:
> A slight doubt here; comment says "less than a cachline", but the code
> is using 'ja' (jump if above) - so calls 'clear_user_original' for a
> 'len' less than or equal to 64.
> Not sure of the intended behaviour for 64 bytes here, but
> 'copy_user_enhanced_fast_string' uses the slow-method for lengths less
> than 64. So, should this be coded as;
> cmp $64,%rcx
> jb clear_user_original
> ?
Yeah, it probably doesn't matter whether you clear a cacheline the "old"
way or with some of the new ones. clear_user() performance matters only
in microbenchmarks, as I've come to realize.
But your suggestion simplifies the code so lemme do that.
Thx!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-05-24 12:32 ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
` (2 preceding siblings ...)
2022-05-27 11:10 ` Ingo Molnar
@ 2022-06-22 14:21 ` Borislav Petkov
2022-06-22 15:06 ` Linus Torvalds
3 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-06-22 14:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Tue, May 24, 2022 at 02:32:36PM +0200, Borislav Petkov wrote:
> I still need to run it on production Icelake and that is kinda being
> delayed due to server room cooling issues (don't ask ;-\).
So I finally got a production level ICL-X:
[ 0.220822] smpboot: CPU0: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz (family: 0x6, model: 0x6a, stepping: 0x6)
and frankly, this looks really weird:
clear_user_original:
Amean: 19679.4 (Sum: 13652560764, samples: 693750)
Amean: 19743.7 (Sum: 13693470604, samples: 693562)
(I ran it twice just to be sure.)
ERMS:
Amean: 20374.3 (Sum: 13910601024, samples: 682752)
Amean: 20453.7 (Sum: 14186223606, samples: 693576)
FSRM:
Amean: 20458.2 (Sum: 13918381386, samples: 680331)
so either that particular box is weird or Icelake really is slower wrt
FSRM or I've fat-fingered it somewhere.
But my measuring is as trivial as:
static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
{
if (access_ok(to, n)) {
unsigned long start, end, ret;
start = rdtsc_ordered();
ret = __clear_user(to, n);
end = rdtsc_ordered();
trace_printk("to: 0x%lx, size: %ld, cycles: %lu\n", (unsigned long)to, n, end - start);
return ret;
}
return n;
}
so if anything I don't see what the problem could be...
Hmm.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-06-22 14:21 ` Borislav Petkov
@ 2022-06-22 15:06 ` Linus Torvalds
2022-06-22 20:14 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-06-22 15:06 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, Jun 22, 2022 at 9:21 AM Borislav Petkov <bp@alien8.de> wrote:
>
> and frankly, this looks really weird:
I'm not sure how valid the TSC thing is, with the extra
synchronization maybe interacting with the whole microcode engine
startup/stop thing.
I'm also not sure the rdtsc is doing the same thing on your AMD tests
vs your Intel tests - I suspect you end up both using 'rdtscp' (as
opposed to the 'lsync' variant we also have), but I don't think the
ordering really is all that well defined architecturally, so AMD may
have very different serialization rules than Intel does.
.. and that serialization may well be different wrt normal load/stores
and microcode.
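(For a user-space reproduction, an ordered TSC read would presumably be
approximated like this - assuming LFENCE is a strong enough barrier on
both vendors, which is exactly the open question:)

#include <stdint.h>
#include <x86intrin.h>

/* sketch of a serialized TSC read; the kernel's rdtsc_ordered() picks
 * its barrier per-CPU via alternatives, this simply assumes LFENCE */
static inline uint64_t tsc_ordered(void)
{
        _mm_lfence();   /* order the rdtsc after preceding instructions */
        return __rdtsc();
}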
So those numbers look like they have a 3% difference, but I'm not 100%
convinced it might not be due to measuring artifacts. The fact that it
worked well for you on your AMD platform doesn't necessarily mean that
it has to work on icelake-x.
But it could equally easily be that "rep stosb" really just isn't any
better on that platform, and the numbers are just giving the plain
reality.
Or it could mean that it makes some cache access decision ("this is
big enough that let's not pollute L1 caches, do stores directly to
L2") that might be better for actual performance afterwards, but that
makes that clearing itself that bit slower.
IOW, I do think that microbenchmarks are kind of suspect to begin
with, and the rdtsc thing in particular may work better on some
microarchitectures than it does others.
Very hard to make a judgment call - I think the only thing that really
ends up mattering is the macro-benchmarks, but I think when you tried
that it was way too noisy to actually show any real signal.
That is, of course, a problem with memcpy and memset in general. It's
easy to do microbenchmarks for them, it's just not clear whether said
microbenchmarks give numbers that are actually meaningful, exactly
because of things like cache replacement policy etc.
And finally, I will repeat that this particular code probably just
isn't that important. The memory clearing for page allocation and
regular memcpy is where most of the real time is spent, so I don't
think that you should necessarily worry too much about this special
case.
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-06-22 15:06 ` Linus Torvalds
@ 2022-06-22 20:14 ` Borislav Petkov
2022-06-22 21:07 ` Linus Torvalds
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-06-22 20:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, Jun 22, 2022 at 10:06:42AM -0500, Linus Torvalds wrote:
> I'm not sure how valid the TSC thing is, with the extra
> synchronization maybe interacting with the whole microcode engine
> startup/stop thing.
Very possible.
So I went and did the original microbenchmark which started people
looking into this in the first place and with it, it looks very
good:
before:
$ dd if=/dev/zero of=/dev/null bs=1024k status=progress
400823418880 bytes (401 GB, 373 GiB) copied, 17 s, 23.6 GB/s
after:
$ dd if=/dev/zero of=/dev/null bs=1024k status=progress
2696274771968 bytes (2.7 TB, 2.5 TiB) copied, 50 s, 53.9 GB/s
So that's very persuasive in my book.
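(Worth spelling out why dd from /dev/zero is a clear_user() benchmark at
all: the char driver's read side is essentially a loop of clear_user()
calls over the user buffer, roughly along these lines - a sketch, not
the exact drivers/char/mem.c code:)

static ssize_t read_zero_sketch(char __user *buf, size_t count)
{
        size_t cleared = 0;

        while (count) {
                size_t chunk = min_t(size_t, count, PAGE_SIZE);

                /* each chunk ends up in __clear_user() */
                if (clear_user(buf + cleared, chunk))
                        break;
                cleared += chunk;
                count -= chunk;
        }
        return cleared;
}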
> I'm also not sure the rdtsc is doing the same thing on your AMD tests
> vs your Intel tests - I suspect you end up both using 'rdtscp' (as
That is correct.
> opposed to the 'lsync' variant we also have), but I don't think the
> ordering really is all that well defined architecturally, so AMD may
> have very different serialization rules than Intel does.
>
> .. and that serialization may well be different wrt normal load/stores
> and microcode.
Well, if it is that, hw people have always been telling me to use RDTSC
to measure stuff but I will object next time.
> So those numbers look like they have a 3% difference, but I'm not 100%
> convinced it might not be due to measuring artifacts. The fact that it
> worked well for you on your AMD platform doesn't necessarily mean that
> it has to work on icelake-x.
Well, it certainly is something uarch-specific because that machine
had an eval Icelake sample before and it would show the same minute
slowdown too. I attributed it to the CPU being an eval sample but I
guess uarch-wise it didn't matter.
> But it could equally easily be that "rep stosb" really just isn't any
> better on that platform, and the numbers are just giving the plain
> reality.
Right.
> Or it could mean that it makes some cache access decision ("this is
> big enough that let's not pollute L1 caches, do stores directly to
> L2") that might be better for actual performance afterwards, but that
> makes that clearing itself that bit slower.
There's that too.
> IOW, I do think that microbenchmarks are kind of suspect to begin
> with, and the rdtsc thing in particular may work better on some
> microarchitectures than it does others.
>
> Very hard to make a judgment call - I think the only thing that really
> ends up mattering is the macro-benchmarks, but I think when you tried
> that it was way too noisy to actually show any real signal.
Yap, that was the reason why I went down to the microbenchmarks. But
even with the real benchmark, it would show slightly bad numbers on
ICL which got me scratching my head as to why that is...
> That is, of course, a problem with memcpy and memset in general. It's
> easy to do microbenchmarks for them, it's just not clear whether said
> microbenchmarks give numbers that are actually meaningful, exactly
> because of things like cache replacement policy etc.
>
> And finally, I will repeat that this particular code probably just
> isn't that important. The memory clearing for page allocation and
> regular memcpy is where most of the real time is spent, so I don't
> think that you should necessarily worry too much about this special
> case.
Yeah, I started poking at this because people would come with patches or
complain about stuff being slow but then no one would actually sit down
and do the measurements...
Oh well, anyway, I still think we should take that because that dd thing
above is pretty good-lookin' even on ICL. And we now have a good example
of how all this patching thing should work - have the insns patched
in and only replace with calls to other variants on the minority of
machines.
And the ICL slowdown is small enough and kinda hard to measure...
Thoughts?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-06-22 20:14 ` Borislav Petkov
@ 2022-06-22 21:07 ` Linus Torvalds
2022-06-23 9:41 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2022-06-22 21:07 UTC (permalink / raw)
To: Borislav Petkov
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, Jun 22, 2022 at 3:14 PM Borislav Petkov <bp@alien8.de> wrote:
>
> before:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 400823418880 bytes (401 GB, 373 GiB) copied, 17 s, 23.6 GB/s
>
> after:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 2696274771968 bytes (2.7 TB, 2.5 TiB) copied, 50 s, 53.9 GB/s
>
> So that's very persuasive in my book.
Heh. Your numbers are very confusing, because apparently you just ^C'd
the thing randomly and they do different sizes (and the GB/s number is
what matters).
Might I suggest just using "count=XYZ" to make the sizes the same and
the numbers a bit more comparable? Because when I first looked at the
numbers I was like "oh, the first one finished in 17s, the second one
was three times slower!"
But yes, apparently that "rep stos" is *much* better with that /dev/zero test.
That does imply that what it does is to avoid polluting some cache
hierarchy, since your 'dd' test case doesn't actually ever *use* the
end result of the zeroing.
So yeah, memset and memcpy are just fundamentally hard to benchmark,
because what matters more than the cost of the op itself is often how
the end result interacts with the code around it.
For example, one of the things that I hope FSRM really does well is
when small copies (or memsets) are then used immediately afterwards -
does the just stored data by the microcode get nicely forwarded from
the store buffers (like it would if it was a loop of stores) or does
it mean that the store buffer is bypassed and subsequent loads will
then hit the L1 cache?
That is *not* an issue in this situation, since any clear_user() won't
be immediately loaded just a few instructions later, but it's
traditionally an issue for the "small memset/memcpy" case, where the
memset/memcpy destination is possibly accessed immediately afterwards
(either to make further modifications, or to just be read).
In a perfect world, you get all the memory forwarding logic kicking
in, which can really shortcircuit things on an OoO core and take the
memory pipeline out of the critical path, which then helps IPC.
And that's an area that legacy microcoded 'rep stosb' has not been
good at. Whether FSRM is quite there yet, I don't know.
(Somebody could test: do a 'store register to memory', then do a
'memcpy()' of that memory to another memory area, and then do a
register load from that new area - at least in _theory_ a very
aggressive microarchitecture could actually do that whole forwarding,
and make the latency from the original memory store to the final
memory load be zero cycles. I know AMD was supposedly doing that for
some of the simpler cases, and it *does* actually matter for real
world loads, because that memory indirection is often due to passing
data in structures as function arguments. So it sounds stupid to store
to memory and then immediately load it again, but it actually happens
_all_the_time_ even for smart software).
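(A crude user-space version of that experiment - sizes and names picked
arbitrarily, and built with optimizations off or with barriers so the
compiler doesn't collapse the chain - might look like:)

#include <stdint.h>
#include <string.h>
#include <x86intrin.h>

volatile uint64_t sink;

/* time the store -> memcpy -> load chain described above; whether the
 * final load is forwarded without a cache round trip is the question */
static uint64_t store_copy_load_cycles(void)
{
        uint64_t src[8] = { 0 }, dst[8];
        unsigned int aux;
        uint64_t t0, t1;

        t0 = __rdtscp(&aux);
        src[0] = t0;                    /* store register to memory */
        memcpy(dst, src, sizeof(src));  /* copy it to another area */
        sink = dst[0];                  /* load from the new area */
        t1 = __rdtscp(&aux);

        return t1 - t0;
}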
Linus
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH] x86/clear_user: Make it faster
2022-06-22 21:07 ` Linus Torvalds
@ 2022-06-23 9:41 ` Borislav Petkov
2022-07-05 17:01 ` [PATCH -final] " Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-06-23 9:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
On Wed, Jun 22, 2022 at 04:07:19PM -0500, Linus Torvalds wrote:
> Might I suggest just using "count=XYZ" to make the sizes the same and
> the numbers a but more comparable? Because when I first looked at the
> numbers I was like "oh, the first one finished in 17s, the second one
> was three times slower!
Yah, I got confused too but then I looked at the rate...
But it looks like even this microbenchmark is hm, well, showing that
there's more than meets the eye. Look at the rates:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s
Now the same thing with smaller buffers:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s
So it is all magical alignment, microcode "activity" and other
planets-alignment things of the uarch.
It is not getting even close to 50GB/s with larger block sizes - 4M in
this case:
dd if=/dev/zero of=/dev/null bs=4M status=progress count=65536
249334595584 bytes (249 GB, 232 GiB) copied, 10 s, 24.9 GB/s
65536+0 records in
65536+0 records out
274877906944 bytes (275 GB, 256 GiB) copied, 10.9976 s, 25.0 GB/s
so it is all a bit: "yes, we can go faster, but <do all those
requirements first>" :-)
> But yes, apparently that "rep stos" is *much* better with that /dev/zero test.
>
> That does imply that what it does is to avoid polluting some cache
> hierarchy, since your 'dd' test case doesn't actually ever *use* the
> end result of the zeroing.
>
> So yeah, memset and memcpy are just fundamentally hard to benchmark,
> because what matters more than the cost of the op itself is often how
> the end result interacts with the code around it.
Yap, and this discarding of the end result is silly but well...
> For example, one of the things that I hope FSRM really does well is
> when small copies (or memsets) are then used immediately afterwards -
> does the just stored data by the microcode get nicely forwarded from
> the store buffers (like it would if it was a loop of stores) or does
> it mean that the store buffer is bypassed and subsequent loads will
> then hit the L1 cache?
>
> That is *not* an issue in this situation, since any clear_user() won't
> be immediately loaded just a few instructions later, but it's
> traditionally an issue for the "small memset/memcpy" case, where the
> memset/memcpy destination is possibly accessed immediately afterwards
> (either to make further modifications, or to just be read).
Right.
> In a perfect world, you get all the memory forwarding logic kicking
> in, which can really shortcircuit things on an OoO core and take the
> memory pipeline out of the critical path, which then helps IPC.
>
> And that's an area that legacy microcoded 'rep stosb' has not been
> good at. Whether FSRM is quite there yet, I don't know.
Right.
> (Somebody could test: do a 'store register to memory', then to a
> 'memcpy()' of that memory to another memory area, and then do a
> register load from that new area - at least in _theory_ a very
> aggressive microarchitecture could actually do that whole forwarding,
> and make the latency from the original memory store to the final
> memory load be zero cycles.
Ha, that would be an interesting exercise.
Hmm, but but, how would the hardware recognize it is the same data it
has in the cache at that new virtual address?
I presume it needs some smart tracking of cachelines. But smart tracking
costs so it needs to be something that happens a lot in all the insn
traces hw guys look at when thinking of new "shortcuts" to raise IPC. :)
> I know AMD was supposedly doing that for some of the simpler cases,
Yap, the simpler cases are probably easy to track and I guess that's
what the hw does properly and does the forwarding there while for the
more complex ones, it simply does the whole run-around at least to a
lower-level cache if not to mem.
> and it *does* actually matter for real world loads, because that
> memory indirection is often due to passing data in structures as
> function arguments. So it sounds stupid to store to memory and then
> immediately load it again, but it actually happens _all_the_time_ even
> for smart software).
Right.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 82+ messages in thread
* [PATCH -final] x86/clear_user: Make it faster
2022-06-23 9:41 ` Borislav Petkov
@ 2022-07-05 17:01 ` Borislav Petkov
2022-07-06 9:24 ` Alexey Dobriyan
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-07-05 17:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Hemment, Andrew Morton, the arch/x86 maintainers,
Peter Zijlstra, patrice.chotard, Mikulas Patocka, Lukas Czerner,
Christoph Hellwig, Darrick J. Wong, Chuck Lever, Hugh Dickins,
patches, Linux-MM, mm-commits, Mel Gorman
Ok,
here's the final version with the Intel findings added.
I think I'm going to queue it - unless someone handwaves really loudly
- because it is a net code simplification, shows a good example of how
to inline insns instead of function calls - something which can be
used for converting other primitives later - and the so-called perf
impact on Intel is only minute because you can only see it in
a microbenchmark. Oh, and it'll make CPU guys speed up their FSRM
microcode. :-P
Thx to all for the very helpful hints and ideas!
---
From: Borislav Petkov <bp@suse.de>
Date: Tue, 24 May 2022 11:01:18 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.
The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there - if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is taking place.
The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.
So for a ~6-minute run of the git test suite, the function gets called under
700K times, all from padzero():
<...>-2536 [006] ..... 261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
<...>-2536 [006] ..... 261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
<...>-2537 [008] ..... 261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
<...>-2537 [008] ..... 261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
...
which is around 1%-ish of the total time and which is consistent with
the benchmark numbers.
So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:
start = rdtsc_ordered();
ret = __clear_user(to, n);
end = rdtsc_ordered();
Computing the mean average of all the samples collected during the test
suite run then shows some improvement:
clear_user_original:
Amean: 9219.71 (Sum: 6340154910, samples: 687674)
fsrm:
Amean: 8030.63 (Sum: 5522277720, samples: 687652)
That's on Zen3.
The situation looks a lot more confusing on Intel:
Icelake:
clear_user_original:
Amean: 19679.4 (Sum: 13652560764, samples: 693750)
Amean: 19743.7 (Sum: 13693470604, samples: 693562)
(I ran it twice just to be sure.)
ERMS:
Amean: 20374.3 (Sum: 13910601024, samples: 682752)
Amean: 20453.7 (Sum: 14186223606, samples: 693576)
FSRM:
Amean: 20458.2 (Sum: 13918381386, samples: 680331)
The original microbenchmark which people were complaining about:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s
Now the same thing with smaller buffers:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s
is also not conclusive because it all depends on the buffer sizes, their
alignment, and on when the microcode detects that cachelines can be
aggregated properly and copied in bigger chunks.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
arch/x86/include/asm/uaccess.h | 5 +-
arch/x86/include/asm/uaccess_64.h | 45 ++++++++++
arch/x86/lib/clear_page_64.S | 138 ++++++++++++++++++++++++++++++
arch/x86/lib/usercopy_64.c | 40 ---------
tools/objtool/check.c | 3 +
5 files changed, 188 insertions(+), 43 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..c46207946e05 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -502,9 +502,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
extern __must_check long strnlen_user(const char __user *str, long n);
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
#ifdef CONFIG_ARCH_HAS_COPY_MC
unsigned long __must_check
copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -526,6 +523,8 @@ extern struct movsl_mask {
#define ARCH_HAS_NOCACHE_UACCESS 1
#ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
# include <asm/uaccess_32.h>
#else
# include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..4ffefb4bb844 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
kasan_check_write(dst, size);
return __copy_user_flushcache(dst, src, size);
}
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+ might_fault();
+ stac();
+
+ /*
+ * No memory constraint because it doesn't change any memory gcc
+ * knows about.
+ */
+ asm volatile(
+ "1:\n\t"
+ ALTERNATIVE_3("rep stosb",
+ "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
+ "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+ "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+ "2:\n"
+ _ASM_EXTABLE_UA(1b, 2b)
+ : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
+ : "a" (0)
+ /* rep_good clobbers %rdx */
+ : "rdx");
+
+ clac();
+
+ return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+ if (access_ok(to, n))
+ return __clear_user(to, n);
+ return n;
+}
#endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..ecbfb4dd3b01 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#include <linux/linkage.h>
+#include <asm/asm.h>
#include <asm/export.h>
/*
@@ -50,3 +51,140 @@ SYM_FUNC_START(clear_page_erms)
RET
SYM_FUNC_END(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+ /*
+ * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+ * i.e., no need for a 'q' suffix and thus a REX prefix.
+ */
+ mov %ecx,%eax
+ shr $3,%rcx
+ jz .Lrest_bytes
+
+ # do the qwords first
+ .p2align 4
+.Lqwords:
+ movq $0,(%rdi)
+ lea 8(%rdi),%rdi
+ dec %rcx
+ jnz .Lqwords
+
+.Lrest_bytes:
+ and $7, %eax
+ jz .Lexit
+
+ # now do the rest bytes
+.Lbytes:
+ movb $0,(%rdi)
+ inc %rdi
+ dec %eax
+ jnz .Lbytes
+
+.Lexit:
+ /*
+ * %rax still needs to be cleared in the exception case because this function is called
+ * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+ * in case it might reuse it somewhere.
+ */
+ xor %eax,%eax
+ RET
+
+.Lqwords_exception:
+ # convert remaining qwords back into bytes to return to caller
+ shl $3, %rcx
+ and $7, %eax
+ add %rax,%rcx
+ jmp .Lexit
+
+.Lbytes_exception:
+ mov %eax,%ecx
+ jmp .Lexit
+
+ _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+ _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+ # call the original thing for less than a cacheline
+ cmp $64, %rcx
+ jb clear_user_original
+
+.Lprep:
+ # copy lower 32-bits for rest bytes
+ mov %ecx, %edx
+ shr $3, %rcx
+ jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+ rep stosq
+
+.Lrep_good_rest_bytes:
+ and $7, %edx
+ jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+ mov %edx, %ecx
+ rep stosb
+
+.Lrep_good_exit:
+ # see .Lexit comment above
+ xor %eax, %eax
+ RET
+
+.Lrep_good_qwords_exception:
+ # convert remaining qwords back into bytes to return to caller
+ shl $3, %rcx
+ and $7, %edx
+ add %rdx, %rcx
+ jmp .Lrep_good_exit
+
+ _ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+ _ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+ # call the original thing for less than a cacheline
+ cmp $64, %rcx
+ jb clear_user_original
+
+.Lerms_bytes:
+ rep stosb
+
+.Lerms_exit:
+ xorl %eax,%eax
+ RET
+
+ _ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
* Zero Userspace
*/
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
- long __d0;
- might_fault();
- /* no memory constraint because it doesn't change any memory gcc knows
- about */
- stac();
- asm volatile(
- " testq %[size8],%[size8]\n"
- " jz 4f\n"
- " .align 16\n"
- "0: movq $0,(%[dst])\n"
- " addq $8,%[dst]\n"
- " decl %%ecx ; jnz 0b\n"
- "4: movq %[size1],%%rcx\n"
- " testl %%ecx,%%ecx\n"
- " jz 2f\n"
- "1: movb $0,(%[dst])\n"
- " incq %[dst]\n"
- " decl %%ecx ; jnz 1b\n"
- "2:\n"
-
- _ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
- _ASM_EXTABLE_UA(1b, 2b)
-
- : [size8] "=&c"(size), [dst] "=&D" (__d0)
- : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
- clac();
- return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
- if (access_ok(to, n))
- return __clear_user(to, n);
- return n;
-}
-EXPORT_SYMBOL(clear_user);
-
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
/**
* clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 864bb9dd3584..98598f3c74c3 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1024,6 +1024,9 @@ static const char *uaccess_safe_builtin[] = {
"copy_mc_fragile_handle_tail",
"copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+ "clear_user_erms",
+ "clear_user_rep_good",
+ "clear_user_original",
NULL
};
--
2.35.1
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
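(Editorial aside: a minimal usage sketch of the interface being tuned
here, modeled loosely on the /dev/zero read path; the function name and
context are illustrative, not from this thread. clear_user() returns the
number of bytes it could NOT zero, so 0 means success:)

#include <linux/fs.h>
#include <linux/uaccess.h>

static ssize_t zero_fill_read(struct file *file, char __user *buf,
			      size_t count, loff_t *ppos)
{
	/*
	 * A non-zero return from clear_user() means a fault occurred
	 * while writing to the user buffer.
	 */
	if (clear_user(buf, count))
		return -EFAULT;

	*ppos += count;
	return count;
}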
* Re: [PATCH -final] x86/clear_user: Make it faster
2022-07-05 17:01 ` [PATCH -final] " Borislav Petkov
@ 2022-07-06 9:24 ` Alexey Dobriyan
2022-07-11 10:33 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Alexey Dobriyan @ 2022-07-06 9:24 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Mark Hemment, Andrew Morton,
the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
mm-commits, Mel Gorman
On Tue, Jul 05, 2022 at 07:01:06PM +0200, Borislav Petkov wrote:
> + asm volatile(
> + "1:\n\t"
> + ALTERNATIVE_3("rep stosb",
> + "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
> + "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
> + "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
> + "2:\n"
> + _ASM_EXTABLE_UA(1b, 2b)
> + : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
> + : "a" (0)
> + /* rep_good clobbers %rdx */
> + : "rdx");
"+c" and "+D" should be enough for 1 instruction assembly?
* Re: [PATCH -final] x86/clear_user: Make it faster
2022-07-06 9:24 ` Alexey Dobriyan
@ 2022-07-11 10:33 ` Borislav Petkov
2022-07-12 12:32 ` Alexey Dobriyan
0 siblings, 1 reply; 82+ messages in thread
From: Borislav Petkov @ 2022-07-11 10:33 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: linux-kernel, Linus Torvalds, Mark Hemment, Andrew Morton,
the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
mm-commits, Mel Gorman
On Wed, Jul 06, 2022 at 12:24:12PM +0300, Alexey Dobriyan wrote:
> On Tue, Jul 05, 2022 at 07:01:06PM +0200, Borislav Petkov wrote:
>
> > + asm volatile(
> > + "1:\n\t"
> > + ALTERNATIVE_3("rep stosb",
> > + "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
> > + "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
> > + "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
> > + "2:\n"
> > + _ASM_EXTABLE_UA(1b, 2b)
> > + : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
> > + : "a" (0)
> > + /* rep_good clobbers %rdx */
> > + : "rdx");
>
> "+c" and "+D" should be enough for 1 instruction assembly?
I'm looking at
e0a96129db57 ("x86: use early clobbers in usercopy*.c")
which introduced the early clobbers and I'm thinking we want them
because "this operand is an earlyclobber operand, which is written
before the instruction is finished using the input operands" and we have
exception handling.
But maybe you need to be more verbose as to what you mean exactly...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH -final] x86/clear_user: Make it faster
2022-07-11 10:33 ` Borislav Petkov
@ 2022-07-12 12:32 ` Alexey Dobriyan
2022-08-06 12:49 ` Borislav Petkov
0 siblings, 1 reply; 82+ messages in thread
From: Alexey Dobriyan @ 2022-07-12 12:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: linux-kernel, Linus Torvalds, Mark Hemment, Andrew Morton,
the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
mm-commits, Mel Gorman
On Mon, Jul 11, 2022 at 12:33:20PM +0200, Borislav Petkov wrote:
> On Wed, Jul 06, 2022 at 12:24:12PM +0300, Alexey Dobriyan wrote:
> > On Tue, Jul 05, 2022 at 07:01:06PM +0200, Borislav Petkov wrote:
> >
> > > + asm volatile(
> > > + "1:\n\t"
> > > + ALTERNATIVE_3("rep stosb",
> > > + "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
> > > + "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
> > > + "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
> > > + "2:\n"
> > > + _ASM_EXTABLE_UA(1b, 2b)
> > > + : "+&c" (size), "+&D" (addr), ASM_CALL_CONSTRAINT
> > > + : "a" (0)
> > > + /* rep_good clobbers %rdx */
> > > + : "rdx");
> >
> > "+c" and "+D" should be enough for 1 instruction assembly?
>
> I'm looking at
>
> e0a96129db57 ("x86: use early clobbers in usercopy*.c")
>
> which introduced the early clobbers and I'm thinking we want them
> because "this operand is an earlyclobber operand, which is written
> before the instruction is finished using the input operands" and we have
> exception handling.
>
> But maybe you need to be more verbose as to what you mean exactly...
This is the original code:
-#define __do_strncpy_from_user(dst,src,count,res) \
-do { \
- long __d0, __d1, __d2; \
- might_fault(); \
- __asm__ __volatile__( \
- " testq %1,%1\n" \
- " jz 2f\n" \
- "0: lodsb\n" \
- " stosb\n" \
- " testb %%al,%%al\n" \
- " jz 1f\n" \
- " decq %1\n" \
- " jnz 0b\n" \
- "1: subq %1,%0\n" \
- "2:\n" \
- ".section .fixup,\"ax\"\n" \
- "3: movq %5,%0\n" \
- " jmp 2b\n" \
- ".previous\n" \
- _ASM_EXTABLE(0b,3b) \
- : "=&r"(res), "=&c"(count), "=&a" (__d0), "=&S" (__d1), \
- "=&D" (__d2) \
- : "i"(-EFAULT), "0"(count), "1"(count), "3"(src), "4"(dst) \
- : "memory"); \
-} while (0)
I meant to say that the earlyclobber is necessary only because the asm
body is more than one instruction, so there is a possibility of writing
to some outputs before all inputs are consumed.
If the asm body is a single insn, there is no such possibility at all.
Now "rep stosb" is one instruction, and the alternative functions
masquerade as a single instruction.
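(To make the distinction concrete - a toy sketch, not from the patch:
with a multi-instruction body an output can be written while an input
register is still live, so the output must be marked earlyclobber; a
single-instruction body reads all its inputs before, or as, it writes
its outputs, so a plain constraint suffices:)

static unsigned long demo(unsigned long a, unsigned long b)
{
	unsigned long dst;

	/*
	 * Two-insn body: the mov writes %0 while %2 (b) is still needed
	 * by the add, so without "=&r" the compiler may assign dst and b
	 * the same register and the add would read the wrong value.
	 */
	asm("mov %1, %0\n\t"
	    "add %2, %0"
	    : "=&r" (dst)
	    : "r" (a), "r" (b));

	/*
	 * One-insn body: inputs are consumed by the same instruction
	 * that writes the output, so no earlyclobber is needed.
	 */
	asm("add %1, %0"
	    : "+r" (dst)
	    : "r" (b));

	return dst;
}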
* Re: [PATCH -final] x86/clear_user: Make it faster
2022-07-12 12:32 ` Alexey Dobriyan
@ 2022-08-06 12:49 ` Borislav Petkov
0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2022-08-06 12:49 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: linux-kernel, Linus Torvalds, Mark Hemment, Andrew Morton,
the arch/x86 maintainers, Peter Zijlstra, patrice.chotard,
Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
Darrick J. Wong, Chuck Lever, Hugh Dickins, patches, Linux-MM,
mm-commits, Mel Gorman
On Tue, Jul 12, 2022 at 03:32:27PM +0300, Alexey Dobriyan wrote:
> I meant to say that earlyclobber is necessary only because the asm body
> is more than 1 instruction so there is possibility of writing to some
> outputs before all inputs are consumed.
Ok, thanks for clarifying. Below is a lightly tested version with the
earlyclobbers removed.
---
From 1587f56520fb9faa55ebb0cf02d69bdea0e40170 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Tue, 24 May 2022 11:01:18 +0200
Subject: [PATCH] x86/clear_user: Make it faster
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Based on a patch by Mark Hemment <markhemm@googlemail.com> and
incorporating very sane suggestions from Linus.
The point here is to have the default case with FSRM - which is supposed
to be the majority of x86 hw out there, if not now then soon - be
directly inlined into the instruction stream so that no function call
overhead is incurred.
Drop the early clobbers from the @size and @addr operands as those are
not needed anymore since we have single instruction alternatives.
The benchmarks I ran would show very small improvements and a PF
benchmark would even show weird things like slowdowns with higher core
counts.
So for a ~6-minute run of the git test suite, the function gets called
just under 700K times, all from padzero():
<...>-2536 [006] ..... 261.208801: padzero: to: 0x55b0663ed214, size: 3564, cycles: 21900
<...>-2536 [006] ..... 261.208819: padzero: to: 0x7f061adca078, size: 3976, cycles: 17160
<...>-2537 [008] ..... 261.211027: padzero: to: 0x5572d019e240, size: 3520, cycles: 23850
<...>-2537 [008] ..... 261.211049: padzero: to: 0x7f1288dc9078, size: 3976, cycles: 15900
...
which is around 1% of the total time and consistent with the benchmark
numbers.
So Mel gave me the idea to simply measure how fast the function becomes.
I.e.:
start = rdtsc_ordered();
ret = __clear_user(to, n);
end = rdtsc_ordered();
Computing the mean of all the samples collected during the test suite
run then shows some improvement:
clear_user_original:
Amean: 9219.71 (Sum: 6340154910, samples: 687674)
fsrm:
Amean: 8030.63 (Sum: 5522277720, samples: 687652)
That's on Zen3.
The situation looks a lot more confusing on Intel:
Icelake:
clear_user_original:
Amean: 19679.4 (Sum: 13652560764, samples: 693750)
Amean: 19743.7 (Sum: 13693470604, samples: 693562)
(I ran it twice just to be sure.)
ERMS:
Amean: 20374.3 (Sum: 13910601024, samples: 682752)
Amean: 20453.7 (Sum: 14186223606, samples: 693576)
FSRM:
Amean: 20458.2 (Sum: 13918381386, samples: 680331)
The original microbenchmark which people were complaining about:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=65536; done 2>&1 | grep copied
32207011840 bytes (32 GB, 30 GiB) copied, 1 s, 32.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.93069 s, 35.6 GB/s
37597741056 bytes (38 GB, 35 GiB) copied, 1 s, 37.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.78017 s, 38.6 GB/s
62020124672 bytes (62 GB, 58 GiB) copied, 2 s, 31.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 2.13716 s, 32.2 GB/s
60010004480 bytes (60 GB, 56 GiB) copied, 1 s, 60.0 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.14129 s, 60.2 GB/s
53212086272 bytes (53 GB, 50 GiB) copied, 1 s, 53.2 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.28398 s, 53.5 GB/s
55698259968 bytes (56 GB, 52 GiB) copied, 1 s, 55.7 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.22507 s, 56.1 GB/s
55306092544 bytes (55 GB, 52 GiB) copied, 1 s, 55.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.23647 s, 55.6 GB/s
54387539968 bytes (54 GB, 51 GiB) copied, 1 s, 54.4 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.25693 s, 54.7 GB/s
50566529024 bytes (51 GB, 47 GiB) copied, 1 s, 50.6 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.35096 s, 50.9 GB/s
58308165632 bytes (58 GB, 54 GiB) copied, 1 s, 58.3 GB/s
68719476736 bytes (69 GB, 64 GiB) copied, 1.17394 s, 58.5 GB/s
Now the same thing with smaller buffers:
for i in $(seq 1 10); do dd if=/dev/zero of=/dev/null bs=1M status=progress count=8192; done 2>&1 | grep copied
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28485 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276112 s, 31.1 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.29136 s, 29.5 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.283803 s, 30.3 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.306503 s, 28.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.349169 s, 24.6 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.276912 s, 31.0 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.265356 s, 32.4 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.28464 s, 30.2 GB/s
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 0.242998 s, 35.3 GB/s
is also not conclusive because it all depends on the buffer sizes, their
alignment, and on when the microcode detects that cachelines can be
aggregated properly and copied in bigger chunks.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/CAHk-=wh=Mu_EYhtOmPn6AxoQZyEh-4fo2Zx3G7rBv1g7vwoKiw@mail.gmail.com
---
arch/x86/include/asm/uaccess.h | 5 +-
arch/x86/include/asm/uaccess_64.h | 45 ++++++++++
arch/x86/lib/clear_page_64.S | 138 ++++++++++++++++++++++++++++++
arch/x86/lib/usercopy_64.c | 40 ---------
tools/objtool/check.c | 3 +
5 files changed, 188 insertions(+), 43 deletions(-)
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..c46207946e05 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -502,9 +502,6 @@ strncpy_from_user(char *dst, const char __user *src, long count);
extern __must_check long strnlen_user(const char __user *str, long n);
-unsigned long __must_check clear_user(void __user *mem, unsigned long len);
-unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
-
#ifdef CONFIG_ARCH_HAS_COPY_MC
unsigned long __must_check
copy_mc_to_kernel(void *to, const void *from, unsigned len);
@@ -526,6 +523,8 @@ extern struct movsl_mask {
#define ARCH_HAS_NOCACHE_UACCESS 1
#ifdef CONFIG_X86_32
+unsigned long __must_check clear_user(void __user *mem, unsigned long len);
+unsigned long __must_check __clear_user(void __user *mem, unsigned long len);
# include <asm/uaccess_32.h>
#else
# include <asm/uaccess_64.h>
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..d13d71af5cf6 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,49 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
kasan_check_write(dst, size);
return __copy_user_flushcache(dst, src, size);
}
+
+/*
+ * Zero Userspace.
+ */
+
+__must_check unsigned long
+clear_user_original(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_rep_good(void __user *addr, unsigned long len);
+__must_check unsigned long
+clear_user_erms(void __user *addr, unsigned long len);
+
+static __always_inline __must_check unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+ might_fault();
+ stac();
+
+ /*
+ * No memory constraint because it doesn't change any memory gcc
+ * knows about.
+ */
+ asm volatile(
+ "1:\n\t"
+ ALTERNATIVE_3("rep stosb",
+ "call clear_user_erms", ALT_NOT(X86_FEATURE_FSRM),
+ "call clear_user_rep_good", ALT_NOT(X86_FEATURE_ERMS),
+ "call clear_user_original", ALT_NOT(X86_FEATURE_REP_GOOD))
+ "2:\n"
+ _ASM_EXTABLE_UA(1b, 2b)
+ : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
+ : "a" (0)
+ /* rep_good clobbers %rdx */
+ : "rdx");
+
+ clac();
+
+ return size;
+}
+
+static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
+{
+ if (access_ok(to, n))
+ return __clear_user(to, n);
+ return n;
+}
#endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index fe59b8ac4fcc..ecbfb4dd3b01 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -1,5 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#include <linux/linkage.h>
+#include <asm/asm.h>
#include <asm/export.h>
/*
@@ -50,3 +51,140 @@ SYM_FUNC_START(clear_page_erms)
RET
SYM_FUNC_END(clear_page_erms)
EXPORT_SYMBOL_GPL(clear_page_erms)
+
+/*
+ * Default clear user-space.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_original)
+ /*
+ * Copy only the lower 32 bits of size as that is enough to handle the rest bytes,
+ * i.e., no need for a 'q' suffix and thus a REX prefix.
+ */
+ mov %ecx,%eax
+ shr $3,%rcx
+ jz .Lrest_bytes
+
+ # do the qwords first
+ .p2align 4
+.Lqwords:
+ movq $0,(%rdi)
+ lea 8(%rdi),%rdi
+ dec %rcx
+ jnz .Lqwords
+
+.Lrest_bytes:
+ and $7, %eax
+ jz .Lexit
+
+ # now do the rest bytes
+.Lbytes:
+ movb $0,(%rdi)
+ inc %rdi
+ dec %eax
+ jnz .Lbytes
+
+.Lexit:
+ /*
+ * %rax still needs to be cleared in the exception case because this function is called
+ * from inline asm and the compiler expects %rax to be zero when exiting the inline asm,
+ * in case it might reuse it somewhere.
+ */
+ xor %eax,%eax
+ RET
+
+.Lqwords_exception:
+ # convert remaining qwords back into bytes to return to caller
+ shl $3, %rcx
+ and $7, %eax
+ add %rax,%rcx
+ jmp .Lexit
+
+.Lbytes_exception:
+ mov %eax,%ecx
+ jmp .Lexit
+
+ _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
+ _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
+SYM_FUNC_END(clear_user_original)
+EXPORT_SYMBOL(clear_user_original)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_REP_GOOD is
+ * present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ */
+SYM_FUNC_START(clear_user_rep_good)
+ # call the original thing for less than a cacheline
+ cmp $64, %rcx
+ jb clear_user_original
+
+.Lprep:
+ # copy lower 32-bits for rest bytes
+ mov %ecx, %edx
+ shr $3, %rcx
+ jz .Lrep_good_rest_bytes
+
+.Lrep_good_qwords:
+ rep stosq
+
+.Lrep_good_rest_bytes:
+ and $7, %edx
+ jz .Lrep_good_exit
+
+.Lrep_good_bytes:
+ mov %edx, %ecx
+ rep stosb
+
+.Lrep_good_exit:
+ # see .Lexit comment above
+ xor %eax, %eax
+ RET
+
+.Lrep_good_qwords_exception:
+ # convert remaining qwords back into bytes to return to caller
+ shl $3, %rcx
+ and $7, %edx
+ add %rdx, %rcx
+ jmp .Lrep_good_exit
+
+ _ASM_EXTABLE_UA(.Lrep_good_qwords, .Lrep_good_qwords_exception)
+ _ASM_EXTABLE_UA(.Lrep_good_bytes, .Lrep_good_exit)
+SYM_FUNC_END(clear_user_rep_good)
+EXPORT_SYMBOL(clear_user_rep_good)
+
+/*
+ * Alternative clear user-space when CPU feature X86_FEATURE_ERMS is present.
+ * Input:
+ * rdi destination
+ * rcx count
+ *
+ * Output:
+ * rcx: uncleared bytes or 0 if successful.
+ *
+ */
+SYM_FUNC_START(clear_user_erms)
+ # call the original thing for less than a cacheline
+ cmp $64, %rcx
+ jb clear_user_original
+
+.Lerms_bytes:
+ rep stosb
+
+.Lerms_exit:
+ xorl %eax,%eax
+ RET
+
+ _ASM_EXTABLE_UA(.Lerms_bytes, .Lerms_exit)
+SYM_FUNC_END(clear_user_erms)
+EXPORT_SYMBOL(clear_user_erms)
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 0ae6cf804197..6c1f8ac5e721 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -14,46 +14,6 @@
* Zero Userspace
*/
-unsigned long __clear_user(void __user *addr, unsigned long size)
-{
- long __d0;
- might_fault();
- /* no memory constraint because it doesn't change any memory gcc knows
- about */
- stac();
- asm volatile(
- " testq %[size8],%[size8]\n"
- " jz 4f\n"
- " .align 16\n"
- "0: movq $0,(%[dst])\n"
- " addq $8,%[dst]\n"
- " decl %%ecx ; jnz 0b\n"
- "4: movq %[size1],%%rcx\n"
- " testl %%ecx,%%ecx\n"
- " jz 2f\n"
- "1: movb $0,(%[dst])\n"
- " incq %[dst]\n"
- " decl %%ecx ; jnz 1b\n"
- "2:\n"
-
- _ASM_EXTABLE_TYPE_REG(0b, 2b, EX_TYPE_UCOPY_LEN8, %[size1])
- _ASM_EXTABLE_UA(1b, 2b)
-
- : [size8] "=&c"(size), [dst] "=&D" (__d0)
- : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
- clac();
- return size;
-}
-EXPORT_SYMBOL(__clear_user);
-
-unsigned long clear_user(void __user *to, unsigned long n)
-{
- if (access_ok(to, n))
- return __clear_user(to, n);
- return n;
-}
-EXPORT_SYMBOL(clear_user);
-
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
/**
* clean_cache_range - write back a cache range with CLWB
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0cec74da7ffe..4b2e11726f4e 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1071,6 +1071,9 @@ static const char *uaccess_safe_builtin[] = {
"copy_mc_fragile_handle_tail",
"copy_mc_enhanced_fast_string",
"ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
+ "clear_user_erms",
+ "clear_user_rep_good",
+ "clear_user_original",
NULL
};
--
2.35.1
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
end of thread, other threads: [~2022-08-06 12:50 UTC | newest]
Thread overview: 82+ messages
2022-04-15 2:12 incoming Andrew Morton
2022-04-15 2:13 ` [patch 01/14] MAINTAINERS: Broadcom internal lists aren't maintainers Andrew Morton
2022-04-15 2:13 ` [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE Andrew Morton
2022-04-15 22:10 ` Linus Torvalds
2022-04-15 22:21 ` Matthew Wilcox
2022-04-15 22:41 ` Hugh Dickins
2022-04-16 6:36 ` Borislav Petkov
2022-04-16 14:07 ` Mark Hemment
2022-04-16 17:28 ` Borislav Petkov
2022-04-16 17:42 ` Linus Torvalds
2022-04-16 21:15 ` Borislav Petkov
2022-04-17 19:41 ` Borislav Petkov
2022-04-17 20:56 ` Linus Torvalds
2022-04-18 10:15 ` Borislav Petkov
2022-04-18 17:10 ` Linus Torvalds
2022-04-19 9:17 ` Borislav Petkov
2022-04-19 16:41 ` Linus Torvalds
2022-04-19 17:48 ` Borislav Petkov
2022-04-21 15:06 ` Borislav Petkov
2022-04-21 16:50 ` Linus Torvalds
2022-04-21 17:22 ` Linus Torvalds
2022-04-24 19:37 ` Borislav Petkov
2022-04-24 19:54 ` Linus Torvalds
2022-04-24 20:24 ` Linus Torvalds
2022-04-27 0:14 ` Borislav Petkov
2022-04-27 1:29 ` Linus Torvalds
2022-04-27 10:41 ` Borislav Petkov
2022-04-27 16:00 ` Linus Torvalds
2022-05-04 18:56 ` Borislav Petkov
2022-05-04 19:22 ` Linus Torvalds
2022-05-04 20:18 ` Borislav Petkov
2022-05-04 20:40 ` Linus Torvalds
2022-05-04 21:01 ` Borislav Petkov
2022-05-04 21:09 ` Linus Torvalds
2022-05-10 9:31 ` clear_user (was: [patch 02/14] tmpfs: fix regressions from wider use of ZERO_PAGE) Borislav Petkov
2022-05-10 17:17 ` Linus Torvalds
2022-05-10 17:28 ` Linus Torvalds
2022-05-10 18:10 ` Borislav Petkov
2022-05-10 18:57 ` Borislav Petkov
2022-05-24 12:32 ` [PATCH] x86/clear_user: Make it faster Borislav Petkov
2022-05-24 16:51 ` Linus Torvalds
2022-05-24 17:30 ` Borislav Petkov
2022-05-25 12:11 ` Mark Hemment
2022-05-27 11:28 ` Borislav Petkov
2022-05-27 11:10 ` Ingo Molnar
2022-06-22 14:21 ` Borislav Petkov
2022-06-22 15:06 ` Linus Torvalds
2022-06-22 20:14 ` Borislav Petkov
2022-06-22 21:07 ` Linus Torvalds
2022-06-23 9:41 ` Borislav Petkov
2022-07-05 17:01 ` [PATCH -final] " Borislav Petkov
2022-07-06 9:24 ` Alexey Dobriyan
2022-07-11 10:33 ` Borislav Petkov
2022-07-12 12:32 ` Alexey Dobriyan
2022-08-06 12:49 ` Borislav Petkov
2022-04-15 2:13 ` [patch 03/14] mm/secretmem: fix panic when growing a memfd_secret Andrew Morton
2022-04-15 2:13 ` [patch 04/14] irq_work: use kasan_record_aux_stack_noalloc() record callstack Andrew Morton
2022-04-15 2:13 ` [patch 05/14] kasan: fix hw tags enablement when KUNIT tests are disabled Andrew Morton
2022-04-15 2:13 ` [patch 06/14] mm, kfence: support kmem_dump_obj() for KFENCE objects Andrew Morton
2022-04-15 2:13 ` [patch 07/14] mm, page_alloc: fix build_zonerefs_node() Andrew Morton
2022-04-15 2:13 ` [patch 08/14] mm: fix unexpected zeroed page mapping with zram swap Andrew Morton
2022-04-15 2:13 ` [patch 09/14] mm: compaction: fix compiler warning when CONFIG_COMPACTION=n Andrew Morton
2022-04-15 2:13 ` [patch 10/14] hugetlb: do not demote poisoned hugetlb pages Andrew Morton
2022-04-15 2:13 ` [patch 11/14] revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" Andrew Morton
2022-04-15 2:13 ` [patch 12/14] revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" Andrew Morton
2022-04-15 2:14 ` [patch 13/14] mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore Andrew Morton
2022-04-15 2:14 ` [patch 14/14] mm: kmemleak: take a full lowmem check in kmemleak_*_phys() Andrew Morton
-- strict thread matches above, loose matches on Subject: below --
2022-04-27 19:41 incoming Andrew Morton
2022-04-21 23:35 incoming Andrew Morton
2022-04-08 20:08 incoming Andrew Morton
2022-04-01 18:27 incoming Andrew Morton
2022-04-01 18:20 incoming Andrew Morton
2022-04-01 18:27 ` incoming Andrew Morton
2022-03-25 1:07 incoming Andrew Morton
2022-03-23 23:04 incoming Andrew Morton
2022-03-22 21:38 incoming Andrew Morton
2022-03-16 23:14 incoming Andrew Morton
2022-03-05 4:28 incoming Andrew Morton
2022-02-26 3:10 incoming Andrew Morton
2022-02-12 0:27 incoming Andrew Morton
2022-02-12 2:02 ` incoming Linus Torvalds
2022-02-12 5:24 ` incoming Andrew Morton