From: tip-bot for Rik van Riel <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: hpa@zytor.com, linux-kernel@vger.kernel.org, efault@gmx.de,
peterz@infradead.org, tglx@linutronix.de,
torvalds@linux-foundation.org, riel@surriel.com,
songliubraving@fb.com, dave.hansen@intel.com, mingo@kernel.org
Subject: [tip:x86/mm] mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
Date: Tue, 17 Jul 2018 02:33:35 -0700 [thread overview]
Message-ID: <tip-c1a2f7f0c06454387c2cd7b93ff1491c715a8c69@git.kernel.org> (raw)
In-Reply-To: <20180716190337.26133-2-riel@surriel.com>
Commit-ID: c1a2f7f0c06454387c2cd7b93ff1491c715a8c69
Gitweb: https://git.kernel.org/tip/c1a2f7f0c06454387c2cd7b93ff1491c715a8c69
Author: Rik van Riel <riel@surriel.com>
AuthorDate: Mon, 16 Jul 2018 15:03:31 -0400
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 17 Jul 2018 09:35:30 +0200
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.
This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.
The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the CPU bitmap) and the size the cpumask.
For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.
Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
drivers/firmware/efi/efi.c | 1 +
include/linux/mm_types.h | 241 +++++++++++++++++++++++----------------------
kernel/fork.c | 15 +--
mm/init-mm.c | 11 +++
4 files changed, 145 insertions(+), 123 deletions(-)
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..7f0b19410a95 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
.mmap_sem = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
+ .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
};
static bool disable_runtime;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..efdc24dd9e97 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,183 @@ struct core_state {
struct kioctx_table;
struct mm_struct {
- struct vm_area_struct *mmap; /* list of VMAs */
- struct rb_root mm_rb;
- u32 vmacache_seqnum; /* per-thread vmacache */
+ struct {
+ struct vm_area_struct *mmap; /* list of VMAs */
+ struct rb_root mm_rb;
+ u32 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
- unsigned long (*get_unmapped_area) (struct file *filp,
+ unsigned long (*get_unmapped_area) (struct file *filp,
unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags);
#endif
- unsigned long mmap_base; /* base of mmap area */
- unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
+ unsigned long mmap_base; /* base of mmap area */
+ unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
- /* Base adresses for compatible mmap() */
- unsigned long mmap_compat_base;
- unsigned long mmap_compat_legacy_base;
+ /* Base adresses for compatible mmap() */
+ unsigned long mmap_compat_base;
+ unsigned long mmap_compat_legacy_base;
#endif
- unsigned long task_size; /* size of task vm space */
- unsigned long highest_vm_end; /* highest vma end address */
- pgd_t * pgd;
-
- /**
- * @mm_users: The number of users including userspace.
- *
- * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
- * to 0 (i.e. when the task exits and there are no other temporary
- * reference holders), we also release a reference on @mm_count
- * (which may then free the &struct mm_struct if @mm_count also
- * drops to 0).
- */
- atomic_t mm_users;
-
- /**
- * @mm_count: The number of references to &struct mm_struct
- * (@mm_users count as 1).
- *
- * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
- * &struct mm_struct is freed.
- */
- atomic_t mm_count;
+ unsigned long task_size; /* size of task vm space */
+ unsigned long highest_vm_end; /* highest vma end address */
+ pgd_t * pgd;
+
+ /**
+ * @mm_users: The number of users including userspace.
+ *
+ * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+ * drops to 0 (i.e. when the task exits and there are no other
+ * temporary reference holders), we also release a reference on
+ * @mm_count (which may then free the &struct mm_struct if
+ * @mm_count also drops to 0).
+ */
+ atomic_t mm_users;
+
+ /**
+ * @mm_count: The number of references to &struct mm_struct
+ * (@mm_users count as 1).
+ *
+ * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+ * &struct mm_struct is freed.
+ */
+ atomic_t mm_count;
#ifdef CONFIG_MMU
- atomic_long_t pgtables_bytes; /* PTE page table pages */
+ atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
- int map_count; /* number of VMAs */
+ int map_count; /* number of VMAs */
- spinlock_t page_table_lock; /* Protects page tables and some counters */
- struct rw_semaphore mmap_sem;
+ spinlock_t page_table_lock; /* Protects page tables and some
+ * counters
+ */
+ struct rw_semaphore mmap_sem;
- struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
- * together off init_mm.mmlist, and are protected
- * by mmlist_lock
- */
+ struct list_head mmlist; /* List of maybe swapped mm's. These
+ * are globally strung together off
+ * init_mm.mmlist, and are protected
+ * by mmlist_lock
+ */
- unsigned long hiwater_rss; /* High-watermark of RSS usage */
- unsigned long hiwater_vm; /* High-water virtual memory usage */
+ unsigned long hiwater_rss; /* High-watermark of RSS usage */
+ unsigned long hiwater_vm; /* High-water virtual memory usage */
- unsigned long total_vm; /* Total pages mapped */
- unsigned long locked_vm; /* Pages that have PG_mlocked set */
- unsigned long pinned_vm; /* Refcount permanently increased */
- unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
- unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
- unsigned long stack_vm; /* VM_STACK */
- unsigned long def_flags;
+ unsigned long total_vm; /* Total pages mapped */
+ unsigned long locked_vm; /* Pages that have PG_mlocked set */
+ unsigned long pinned_vm; /* Refcount permanently increased */
+ unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+ unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+ unsigned long stack_vm; /* VM_STACK */
+ unsigned long def_flags;
- spinlock_t arg_lock; /* protect the below fields */
- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
+ spinlock_t arg_lock; /* protect the below fields */
+ unsigned long start_code, end_code, start_data, end_data;
+ unsigned long start_brk, brk, start_stack;
+ unsigned long arg_start, arg_end, env_start, env_end;
- unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+ unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
- /*
- * Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- struct mm_rss_stat rss_stat;
-
- struct linux_binfmt *binfmt;
+ /*
+ * Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ struct mm_rss_stat rss_stat;
- cpumask_var_t cpu_vm_mask_var;
+ struct linux_binfmt *binfmt;
- /* Architecture-specific MM context */
- mm_context_t context;
+ /* Architecture-specific MM context */
+ mm_context_t context;
- unsigned long flags; /* Must use atomic bitops to access the bits */
+ unsigned long flags; /* Must use atomic bitops to access */
- struct core_state *core_state; /* coredumping support */
+ struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_MEMBARRIER
- atomic_t membarrier_state;
+ atomic_t membarrier_state;
#endif
#ifdef CONFIG_AIO
- spinlock_t ioctx_lock;
- struct kioctx_table __rcu *ioctx_table;
+ spinlock_t ioctx_lock;
+ struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ /*
+ * "owner" points to a task that is regarded as the canonical
+ * user/owner of this mm. All of the following must be true in
+ * order for it to be changed:
+ *
+ * current == mm->owner
+ * current->mm != mm
+ * new_owner->mm == mm
+ * new_owner->alloc_lock is held
+ */
+ struct task_struct __rcu *owner;
#endif
- struct user_namespace *user_ns;
+ struct user_namespace *user_ns;
- /* store ref to file /proc/<pid>/exe symlink points to */
- struct file __rcu *exe_file;
+ /* store ref to file /proc/<pid>/exe symlink points to */
+ struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
- struct mmu_notifier_mm *mmu_notifier_mm;
+ struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
- pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
- struct cpumask cpumask_allocation;
+ pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
- /*
- * numa_next_scan is the next time that the PTEs will be marked
- * pte_numa. NUMA hinting faults will gather statistics and migrate
- * pages to new nodes if necessary.
- */
- unsigned long numa_next_scan;
+ /*
+ * numa_next_scan is the next time that the PTEs will be marked
+ * pte_numa. NUMA hinting faults will gather statistics and
+ * migrate pages to new nodes if necessary.
+ */
+ unsigned long numa_next_scan;
- /* Restart point for scanning and setting pte_numa */
- unsigned long numa_scan_offset;
+ /* Restart point for scanning and setting pte_numa */
+ unsigned long numa_scan_offset;
- /* numa_scan_seq prevents two threads setting pte_numa */
- int numa_scan_seq;
+ /* numa_scan_seq prevents two threads setting pte_numa */
+ int numa_scan_seq;
#endif
- /*
- * An operation with batched TLB flushing is going on. Anything that
- * can move process memory needs to flush the TLB when moving a
- * PROT_NONE or PROT_NUMA mapped page.
- */
- atomic_t tlb_flush_pending;
+ /*
+ * An operation with batched TLB flushing is going on. Anything
+ * that can move process memory needs to flush the TLB when
+ * moving a PROT_NONE or PROT_NUMA mapped page.
+ */
+ atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
- /* See flush_tlb_batched_pending() */
- bool tlb_flush_batched;
+ /* See flush_tlb_batched_pending() */
+ bool tlb_flush_batched;
#endif
- struct uprobes_state uprobes_state;
+ struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
- atomic_long_t hugetlb_usage;
+ atomic_long_t hugetlb_usage;
#endif
- struct work_struct async_put_work;
+ struct work_struct async_put_work;
#if IS_ENABLED(CONFIG_HMM)
- /* HMM needs to track a few things per mm */
- struct hmm *hmm;
+ /* HMM needs to track a few things per mm */
+ struct hmm *hmm;
#endif
-} __randomize_layout;
+ } __randomize_layout;
+
+ /*
+ * The mm_cpumask needs to be at the end of mm_struct, because it
+ * is dynamically sized based on nr_cpu_ids.
+ */
+ unsigned long cpu_bitmap[];
+};
extern struct mm_struct init_mm;
+/* Pointer magic because the dynamic array size confuses some compilers. */
static inline void mm_init_cpumask(struct mm_struct *mm)
{
-#ifdef CONFIG_CPUMASK_OFFSTACK
- mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
- cpumask_clear(mm->cpu_vm_mask_var);
+ unsigned long cpu_bitmap = (unsigned long)mm;
+
+ cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ cpumask_clear((struct cpumask *)cpu_bitmap);
}
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
{
- return mm->cpu_vm_mask_var;
+ return (struct cpumask *)&mm->cpu_bitmap;
}
struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)
void __init proc_caches_init(void)
{
+ unsigned int mm_size;
+
sighand_cachep = kmem_cache_create("sighand_cache",
sizeof(struct sighand_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
sizeof(struct fs_struct), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
NULL);
+
/*
- * FIXME! The "sizeof(struct mm_struct)" currently includes the
- * whole struct cpumask for the OFFSTACK case. We could change
- * this to *only* allocate as much of it as required by the
- * maximum number of CPU's we can ever have. The cpumask_allocation
- * is at the end of the structure, exactly for that reason.
+ * The mm_cpumask is located at the end of mm_struct, and is
+ * dynamically sized based on the maximum CPU number this system
+ * can have, taking hotplug into account (nr_cpu_ids).
*/
+ mm_size = sizeof(struct mm_struct) + cpumask_size();
+
mm_cachep = kmem_cache_create_usercopy("mm_struct",
- sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+ mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
offsetof(struct mm_struct, saved_auxv),
sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
#define INIT_MM_CONTEXT(name)
#endif
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
+ .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
INIT_MM_CONTEXT(init_mm)
};
next prev parent reply other threads:[~2018-07-17 9:34 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-07-16 19:03 [PATCH v6 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
2018-07-16 19:03 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
2018-07-17 9:33 ` tip-bot for Rik van Riel [this message]
2018-08-04 22:28 ` Guenter Roeck
2018-07-16 19:03 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
2018-07-17 9:34 ` [tip:x86/mm] x86/mm/tlb: Leave " tip-bot for Rik van Riel
2018-07-17 11:46 ` Peter Zijlstra
2018-07-25 1:00 ` Anders Roxell
2018-08-16 1:54 ` [PATCH 2/7] x86,tlb: leave " Andy Lutomirski
2018-08-16 5:31 ` Rik van Riel
2018-07-16 19:03 ` [PATCH 3/7] x86,mm: restructure switch_mm_irqs_off Rik van Riel
2018-07-17 9:34 ` [tip:x86/mm] x86/mm/tlb: Restructure switch_mm_irqs_off() tip-bot for Rik van Riel
2018-10-09 14:58 ` tip-bot for Rik van Riel
2018-07-16 19:03 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
2018-07-17 9:35 ` [tip:x86/mm] x86/mm/tlb: Make " tip-bot for Rik van Riel
2018-07-17 11:33 ` Peter Zijlstra
2018-07-18 15:33 ` Rik van Riel
2018-07-18 16:00 ` Peter Zijlstra
[not found] ` <081E558D-DB34-4A18-A35C-896BC47F6EBA@surriel.com>
2018-07-18 18:23 ` Peter Zijlstra
2018-07-18 18:51 ` Rik van Riel
2018-07-19 9:13 ` Peter Zijlstra
2018-07-17 20:04 ` [PATCH 4/7] x86,tlb: make " Andy Lutomirski
[not found] ` <FF977B78-140F-4787-AA57-0EA934017D85@surriel.com>
2018-07-17 21:29 ` Andy Lutomirski
2018-07-17 22:05 ` Rik van Riel
2018-07-17 22:27 ` Andy Lutomirski
2018-07-18 20:58 ` Rik van Riel
2018-07-18 23:13 ` Andy Lutomirski
[not found] ` <B976CC13-D014-433A-83DE-F8DF9AB4F421@surriel.com>
2018-07-19 16:45 ` Andy Lutomirski
2018-07-19 17:04 ` Andy Lutomirski
2018-07-20 4:57 ` Benjamin Herrenschmidt
2018-07-20 8:30 ` Peter Zijlstra
2018-07-23 12:26 ` Rik van Riel
2018-07-24 16:33 ` Will Deacon
[not found] ` <CF849A07-B7CE-4DE9-8246-53AC5A53A705@surriel.com>
2018-07-19 17:18 ` Andy Lutomirski
2018-07-20 8:02 ` Vitaly Kuznetsov
2018-07-20 9:49 ` Peter Zijlstra
2018-07-20 10:18 ` Vitaly Kuznetsov
2018-07-20 9:32 ` Peter Zijlstra
2018-07-20 11:04 ` Peter Zijlstra
2018-07-16 19:03 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
2018-07-17 9:35 ` [tip:x86/mm] x86/mm/tlb: Only " tip-bot for Rik van Riel
2018-07-17 11:39 ` Peter Zijlstra
[not found] ` <1F8BDD25-864D-4105-B872-2109AA417454@surriel.com>
[not found] ` <24AA4367-22A1-450E-8F6A-3CBF39518384@surriel.com>
2018-07-18 16:19 ` Peter Zijlstra
2018-07-16 19:03 ` [PATCH 6/7] x86,mm: always use lazy TLB mode Rik van Riel
2018-07-17 9:36 ` [tip:x86/mm] x86/mm/tlb: Always " tip-bot for Rik van Riel
2018-10-09 14:58 ` tip-bot for Rik van Riel
2018-07-16 19:03 ` [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm Rik van Riel
2018-07-17 9:36 ` [tip:x86/mm] x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off() tip-bot for Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=tip-c1a2f7f0c06454387c2cd7b93ff1491c715a8c69@git.kernel.org \
--to=tipbot@zytor.com \
--cc=dave.hansen@intel.com \
--cc=efault@gmx.de \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-tip-commits@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=peterz@infradead.org \
--cc=riel@surriel.com \
--cc=songliubraving@fb.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).