From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, "H. Peter Anvin" <hpa@linux.intel.com>
Subject: [ 01/86] x86-32, mm: Rip out x86_32 NUMA remapping code
Date: Tue, 26 Feb 2013 16:07:09 -0800
Message-ID: <20130226235913.045473676@linuxfoundation.org>
In-Reply-To: <20130226235912.881663118@linuxfoundation.org>

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Dave Hansen <dave@linux.vnet.ibm.com>

commit f03574f2d5b2d6229dcdf2d322848065f72953c7 upstream.

This code was an optimization for 32-bit NUMA systems.

It has probably been the cause of a number of subtle bugs over
the years, although the conditions to excite them would have
been hard to trigger.  Essentially, we remap part of the kernel
linear mapping area, and then sometimes part of that area gets
freed back into the bootmem allocator.  If those pages get
used by kernel data structures (say mem_map[] or a dentry),
there's no big deal.  But, if anyone ever tried to use the
linear mapping for these pages _and_ cared about their physical
address, bad things happen.

For instance, say you passed __GFP_ZERO to the page allocator
and then happened to get handed one of these pages: it would
zero the remapped page, but it would make a pte to the _old_
page.
There are probably a hundred other ways that it could screw
with things.
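
As a rough illustration of the failure mode described above, here is
a small user-space model in C.  It is only a sketch and is not kernel
code: every name in it (sim_phys, sim_remap, sim_deref,
sim_linear_pfn, sim_user_sees_zeroed_page) is hypothetical, and the
page size and page count are arbitrary.  It assumes that zeroing goes
through the kernel virtual address (so it follows the remap), while
the pte is built from a physical address computed with linear-mapping
arithmetic (so it ignores the remap).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 16u	/* tiny pages keep the model readable */
#define SIM_NR_PAGES  8u

/* Simulated physical memory, indexed by page frame number (pfn). */
static unsigned char sim_phys[SIM_NR_PAGES][SIM_PAGE_SIZE];

/*
 * sim_remap[v] is the pfn that virtual page v really points at.
 * In a purely linear mapping, sim_remap[v] == v for every v.
 */
static unsigned int sim_remap[SIM_NR_PAGES];

/* Reset to a linear mapping and fill every page with stale data. */
static void sim_init(void)
{
	for (unsigned int v = 0; v < SIM_NR_PAGES; v++) {
		sim_remap[v] = v;
		memset(sim_phys[v], 0xAA, SIM_PAGE_SIZE);
	}
}

/* What the MMU does: honor the (possibly remapped) translation. */
static unsigned char *sim_deref(unsigned int vpage)
{
	assert(vpage < SIM_NR_PAGES);
	return sim_phys[sim_remap[vpage]];
}

/* What __pa()-style arithmetic does: assume the mapping is linear. */
static unsigned int sim_linear_pfn(unsigned int vpage)
{
	return vpage;
}

/*
 * Model a zeroed allocation of vpage that is then mapped elsewhere.
 * Returns 1 if the frame the new pte would point at is really zero.
 */
static int sim_user_sees_zeroed_page(unsigned int vpage)
{
	static const unsigned char zeros[SIM_PAGE_SIZE];
	unsigned int pfn;

	/* __GFP_ZERO-like step: clear through the virtual address. */
	memset(sim_deref(vpage), 0, SIM_PAGE_SIZE);

	/* pte-building step: frame derived by linear arithmetic. */
	pfn = sim_linear_pfn(vpage);

	return memcmp(sim_phys[pfn], zeros, SIM_PAGE_SIZE) == 0;
}
```

In this model, after sim_init() an untouched page behaves as
expected, but setting sim_remap[3] = 6 makes
sim_user_sees_zeroed_page(3) report stale data: frame 6 was cleared,
while the pte would point at frame 3.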

We don't need to hang on to performance optimizations for
these old boxes any more.  All my 32-bit NUMA systems are long
dead and buried, and I probably had access to more than most
people.

This code is causing real things to break today:

	https://lkml.org/lkml/2013/1/9/376

I looked into actually fixing this, but it requires surgery
to way too much brittle code, as well as stuff like
per_cpu_ptr_to_phys().

[ hpa: Cc: this for -stable, since it is a memory corruption issue.
  However, an alternative is to simply mark NUMA as depends BROKEN
  rather than EXPERIMENTAL in the X86_32 subclause... ]

Link: http://lkml.kernel.org/r/20130131005616.1C79F411@kernel.stglabs.ibm.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/x86/Kconfig            |    4 -
 arch/x86/mm/numa.c          |    3 
 arch/x86/mm/numa_32.c       |  161 --------------------------------------------
 arch/x86/mm/numa_internal.h |    6 -
 4 files changed, 174 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1243,10 +1243,6 @@ config HAVE_ARCH_BOOTMEM
 	def_bool y
 	depends on X86_32 && NUMA
 
-config HAVE_ARCH_ALLOC_REMAP
-	def_bool y
-	depends on X86_32 && NUMA
-
 config ARCH_HAVE_MEMORY_PRESENT
 	def_bool y
 	depends on X86_32 && DISCONTIGMEM
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -205,9 +205,6 @@ static void __init setup_node_data(int n
 	if (end && (end - start) < NODE_MIN_SIZE)
 		return;
 
-	/* initialize remap allocator before aligning to ZONE_ALIGN */
-	init_alloc_remap(nid, start, end);
-
 	start = roundup(start, ZONE_ALIGN);
 
 	printk(KERN_INFO "Initmem setup node %d %016Lx-%016Lx\n",
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -73,167 +73,6 @@ unsigned long node_memmap_size_bytes(int
 
 extern unsigned long highend_pfn, highstart_pfn;
 
-#define LARGE_PAGE_BYTES (PTRS_PER_PTE * PAGE_SIZE)
-
-static void *node_remap_start_vaddr[MAX_NUMNODES];
-void set_pmd_pfn(unsigned long vaddr, unsigned long pfn, pgprot_t flags);
-
-/*
- * Remap memory allocator
- */
-static unsigned long node_remap_start_pfn[MAX_NUMNODES];
-static void *node_remap_end_vaddr[MAX_NUMNODES];
-static void *node_remap_alloc_vaddr[MAX_NUMNODES];
-
-/**
- * alloc_remap - Allocate remapped memory
- * @nid: NUMA node to allocate memory from
- * @size: The size of allocation
- *
- * Allocate @size bytes from the remap area of NUMA node @nid.  The
- * size of the remap area is predetermined by init_alloc_remap() and
- * only the callers considered there should call this function.  For
- * more info, please read the comment on top of init_alloc_remap().
- *
- * The caller must be ready to handle allocation failure from this
- * function and fall back to regular memory allocator in such cases.
- *
- * CONTEXT:
- * Single CPU early boot context.
- *
- * RETURNS:
- * Pointer to the allocated memory on success, %NULL on failure.
- */
-void *alloc_remap(int nid, unsigned long size)
-{
-	void *allocation = node_remap_alloc_vaddr[nid];
-
-	size = ALIGN(size, L1_CACHE_BYTES);
-
-	if (!allocation || (allocation + size) > node_remap_end_vaddr[nid])
-		return NULL;
-
-	node_remap_alloc_vaddr[nid] += size;
-	memset(allocation, 0, size);
-
-	return allocation;
-}
-
-#ifdef CONFIG_HIBERNATION
-/**
- * resume_map_numa_kva - add KVA mapping to the temporary page tables created
- *                       during resume from hibernation
- * @pgd_base - temporary resume page directory
- */
-void resume_map_numa_kva(pgd_t *pgd_base)
-{
-	int node;
-
-	for_each_online_node(node) {
-		unsigned long start_va, start_pfn, nr_pages, pfn;
-
-		start_va = (unsigned long)node_remap_start_vaddr[node];
-		start_pfn = node_remap_start_pfn[node];
-		nr_pages = (node_remap_end_vaddr[node] -
-			    node_remap_start_vaddr[node]) >> PAGE_SHIFT;
-
-		printk(KERN_DEBUG "%s: node %d\n", __func__, node);
-
-		for (pfn = 0; pfn < nr_pages; pfn += PTRS_PER_PTE) {
-			unsigned long vaddr = start_va + (pfn << PAGE_SHIFT);
-			pgd_t *pgd = pgd_base + pgd_index(vaddr);
-			pud_t *pud = pud_offset(pgd, vaddr);
-			pmd_t *pmd = pmd_offset(pud, vaddr);
-
-			set_pmd(pmd, pfn_pmd(start_pfn + pfn,
-						PAGE_KERNEL_LARGE_EXEC));
-
-			printk(KERN_DEBUG "%s: %08lx -> pfn %08lx\n",
-				__func__, vaddr, start_pfn + pfn);
-		}
-	}
-}
-#endif
-
-/**
- * init_alloc_remap - Initialize remap allocator for a NUMA node
- * @nid: NUMA node to initizlie remap allocator for
- *
- * NUMA nodes may end up without any lowmem.  As allocating pgdat and
- * memmap on a different node with lowmem is inefficient, a special
- * remap allocator is implemented which can be used by alloc_remap().
- *
- * For each node, the amount of memory which will be necessary for
- * pgdat and memmap is calculated and two memory areas of the size are
- * allocated - one in the node and the other in lowmem; then, the area
- * in the node is remapped to the lowmem area.
- *
- * As pgdat and memmap must be allocated in lowmem anyway, this
- * doesn't waste lowmem address space; however, the actual lowmem
- * which gets remapped over is wasted.  The amount shouldn't be
- * problematic on machines this feature will be used.
- *
- * Initialization failure isn't fatal.  alloc_remap() is used
- * opportunistically and the callers will fall back to other memory
- * allocation mechanisms on failure.
- */
-void __init init_alloc_remap(int nid, u64 start, u64 end)
-{
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long end_pfn = end >> PAGE_SHIFT;
-	unsigned long size, pfn;
-	u64 node_pa, remap_pa;
-	void *remap_va;
-
-	/*
-	 * The acpi/srat node info can show hot-add memroy zones where
-	 * memory could be added but not currently present.
-	 */
-	printk(KERN_DEBUG "node %d pfn: [%lx - %lx]\n",
-	       nid, start_pfn, end_pfn);
-
-	/* calculate the necessary space aligned to large page size */
-	size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
-	size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
-	size = ALIGN(size, LARGE_PAGE_BYTES);
-
-	/* allocate node memory and the lowmem remap area */
-	node_pa = memblock_find_in_range(start, end, size, LARGE_PAGE_BYTES);
-	if (!node_pa) {
-		pr_warning("remap_alloc: failed to allocate %lu bytes for node %d\n",
-			   size, nid);
-		return;
-	}
-	memblock_reserve(node_pa, size);
-
-	remap_pa = memblock_find_in_range(min_low_pfn << PAGE_SHIFT,
-					  max_low_pfn << PAGE_SHIFT,
-					  size, LARGE_PAGE_BYTES);
-	if (!remap_pa) {
-		pr_warning("remap_alloc: failed to allocate %lu bytes remap area for node %d\n",
-			   size, nid);
-		memblock_free(node_pa, size);
-		return;
-	}
-	memblock_reserve(remap_pa, size);
-	remap_va = phys_to_virt(remap_pa);
-
-	/* perform actual remap */
-	for (pfn = 0; pfn < size >> PAGE_SHIFT; pfn += PTRS_PER_PTE)
-		set_pmd_pfn((unsigned long)remap_va + (pfn << PAGE_SHIFT),
-			    (node_pa >> PAGE_SHIFT) + pfn,
-			    PAGE_KERNEL_LARGE);
-
-	/* initialize remap allocator parameters */
-	node_remap_start_pfn[nid] = node_pa >> PAGE_SHIFT;
-	node_remap_start_vaddr[nid] = remap_va;
-	node_remap_end_vaddr[nid] = remap_va + size;
-	node_remap_alloc_vaddr[nid] = remap_va;
-
-	printk(KERN_DEBUG "remap_alloc: node %d [%08llx-%08llx) -> [%p-%p)\n",
-	       nid, node_pa, node_pa + size, remap_va, remap_va + size);
-}
-
 void __init initmem_init(void)
 {
 	x86_numa_init();
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,12 +21,6 @@ void __init numa_reset_distance(void);
 
 void __init x86_numa_init(void);
 
-#ifdef CONFIG_X86_64
-static inline void init_alloc_remap(int nid, u64 start, u64 end)	{ }
-#else
-void __init init_alloc_remap(int nid, u64 start, u64 end);
-#endif
-
 #ifdef CONFIG_NUMA_EMU
 void __init numa_emulation(struct numa_meminfo *numa_meminfo,
 			   int numa_dist_cnt);


