* [Feature] x86_64 page tracking for Stratus servers
@ 2006-09-05 17:34 Kimball Murray
2006-09-05 17:40 ` Arjan van de Ven
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Kimball Murray @ 2006-09-05 17:34 UTC (permalink / raw)
To: linux-kernel; +Cc: Kimball Murray, kimball.murray, akpm, ak
Attached is a git patch that implements a feature used by Stratus
fault-tolerant servers running on Intel x86_64 platforms. It provides the
kernel mechanism that allows a loadable module to keep track of recently
dirtied pages for the purpose of copying live, currently active memory to
a spare memory module.
In Stratus servers, this spare memory module (and CPUs) will be brought into
lockstep with the original memory (and CPUs) in order to achieve fault
tolerance.
In order to make this feature work, it is necessary to track pages which have
been recently dirtied. A simplified view of the algorithm used in the kernel
module is this (a compilable sketch follows the list):
1. Turn on the tracking functionality.
2. Copy all memory from active to spare memory module.
3. Until done:
a) Identify all pages that were dirtied in the active memory since
the last copy operation.
b) Copy all pages identified in 3a to the spare memory module.
c) If the number of pages still dirty is less than some threshold,
i. "black out" the system (enter SMI)
ii. copy remaining pages in blackout context
iii. goto step 4
Else
goto 3a.
4. synchronize cpus
5. leave SMI, return to OS
6. System is now "Duplexed", and fault tolerant.
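To make the shape of that loop concrete, here is a minimal user-space sketch
of steps 2 and 3. It compiles and runs stand-alone; the bitmap, the harvest()
workload, and copy_page() are fake stand-ins for the Stratus module and
hardware pieces, and the SMI blackout itself (steps 3c through 5) is elided
because it is board-specific.

/* build: gcc -std=c99 -o duplex_sketch duplex_sketch.c */
#include <stdio.h>
#include <string.h>

#define NPAGES    64
#define THRESHOLD  4			/* assumed: blackout entry threshold */

static unsigned char dirty[NPAGES];	/* stand-in for the pfn bit vector */
static int workload = 32;		/* fake load that tapers off */

/* Step 3a stand-in: mark and count pages dirtied since the last pass. */
static int harvest(void)
{
	int n = 0;

	memset(dirty, 0, sizeof dirty);
	for (int i = 0; i < workload; i++)
		dirty[(i * 7) % NPAGES] = 1;
	workload /= 2;			/* pretend the system quiets down */
	for (int i = 0; i < NPAGES; i++)
		n += dirty[i];
	return n;
}

/* Step 3b stand-in: copy one page to the spare memory module. */
static void copy_page(int pfn) { (void)pfn; }

static void copy_marked(void)
{
	for (int i = 0; i < NPAGES; i++)
		if (dirty[i])
			copy_page(i);
}

int main(void)
{
	int n;

	for (int i = 0; i < NPAGES; i++)	/* step 2: copy everything once */
		copy_page(i);

	do {					/* step 3: converge in brownout */
		n = harvest();
		copy_marked();
		printf("brownout pass: %d dirty pages\n", n);
	} while (n >= THRESHOLD);

	/* steps 3c-5: below threshold -- enter SMI, copy the stragglers in
	 * blackout context, synchronize CPUs, return to the OS. All of that
	 * is hardware-specific and elided here. */
	printf("entering blackout with %d pages still dirty\n", n);
	return 0;
}
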
Most of this work is done in the loadable module, especially since copying
memory from one module to another is specific to Stratus boards, as is the
system blackout and final copy which takes place in an SMI context.
Given that, it is desirable to make the kernel patch as small and
unobtrusive as possible. Also, regardless of whether or not the memory tracking
is active, the OS should handle dirty pages in the same way it always has.
To achieve this, the provided patch borrows an unused bit in the page table
entry and defines it to be a "SOFTDIRTY" bit (bit 9). Its purpose is to
provide an alternate representation for dirty pages. A few OS macros are
changed such that a page is considered dirty if either the hardware dirty
bit or the SOFTDIRTY bit is set. This gives a kernel module the freedom to
scan the page tables and clear the hardware dirty bits in ptes as it sets
the corresponding SOFTDIRTY bit. To the OS, the page is still dirty. But
for the Stratus memory-tracking software (the loadable kernel module), there
now exists the ability to determine whether a dirty page that was previously
copied from active to spare memory has in fact become dirty again.
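Concretely, the swap the scanning module performs on each dirty pte boils
down to the following sketch, using the bit values defined later in this
patch (in the real module the rewrite is paired with a TLB flush so the CPU
will set the hardware bit again on the next write to the page):

#define _PAGE_DIRTY	0x040UL	/* hardware dirty bit (bit 6) */
#define _PAGE_SOFTDIRTY	0x200UL	/* software dirty bit (bit 9) */

/* Move a pte's dirty state from the hardware bit to the SOFTDIRTY bit.
 * pte_dirty() still reports the page as dirty, but if the hardware bit
 * reappears later, the tracker knows the page was re-dirtied after it
 * was last copied to the spare module. */
static unsigned long harvest_pteval(unsigned long pteval)
{
	if (pteval & _PAGE_DIRTY) {
		pteval &= ~_PAGE_DIRTY;
		pteval |= _PAGE_SOFTDIRTY;
	}
	return pteval;
}
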
For those who have made it this far, you may have noticed a hole in the
algorithm above. Since we are traversing the page tables on a periodic
basis, what happens if a page reference is lost between passes? We'd have
no way of knowing whether the discarded pte referred to a dirty page. To plug
that hole, the kernel patch also intercepts set_pte(...) so we can take a
look at the old, departing pte and decide then whether we need to track its page.
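As a toy illustration of that interception (plain user-space C, with an
unsigned long standing in for a pte and the mask values from this patch),
the hooked store inspects the departing entry first:

#include <stdio.h>

#define _PAGE_DIRTY	0x040UL
#define _PAGE_SOFTDIRTY	0x200UL

static unsigned long tracked;	/* stand-in for the pfn bit vector */

/* Record the old entry if it was dirty, then perform the store. */
static void set_pte_hooked(unsigned long *slot, unsigned long newval)
{
	if (*slot & (_PAGE_DIRTY | _PAGE_SOFTDIRTY))
		tracked++;	/* the departing dirty page is not lost */
	*slot = newval;
}

int main(void)
{
	unsigned long pte = _PAGE_DIRTY;	/* page was written to */

	set_pte_hooked(&pte, 0);		/* mapping torn down */
	printf("dirty pages caught at set_pte: %lu\n", tracked);
	return 0;
}
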
And if you're really digging into this, you may be wondering about DMA, which
could lead to dirty pages without a corresponding dirty bit in the pte. This
is handled by Stratus hardware, which can send north-bound DMA traffic to
both the active and spare memory modules at the same time. Finally, Stratus
hardware has the ability to verify the accuracy of the memory copy after it's
all done, providing a means of testing both the kernel module and kernel
patch.
This code has been used by Stratus in several previously released Linux
products (i386), and on an x86_64 platform that ships with Red Hat's RHEL4.
The patch has been modified slightly from the RHEL4 code base, as the newer
kernels use a page table layout like this:
pgd -> pud -> pmd -> pte
whereas RHEL4 used:
pml4 -> pgd -> pmd -> pte
It is my hope that this feature, while essential to Stratus platforms, might
be of some use to somebody who would perhaps like to develop a tool to profile
memory usage. One could, if desired, use this interface to monitor which
physical pages in a system become dirtied over a period of time and under
various loads. Other vendors that would like to mirror live memory might also
be able to use this interface. But for now, I have decided to put this feature
in the kernel hacking arena as a debug option. I have also made the tracking
functions replaceable in case some other use is found for them. The functions
may be replaced safely whenever the mm_tracking_struct.active flag is cleared
(see the sketch after this paragraph).
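For example, a module that wanted different accounting could chain the
exported hooks roughly like this. This is a sketch, not Stratus code; it
assumes only the symbols exported by this patch and swaps the pointer only
while tracking is inactive:

#include <linux/module.h>
#include <asm/mm_track.h>

static void (*orig_track_pte)(void *);

static void my_track_pte(void *val)
{
	/* custom accounting could be done here */
	orig_track_pte(val);	/* then fall through to the default */
}

static int __init my_tracker_init(void)
{
	if (mm_tracking_struct.active)
		return -EBUSY;	/* only swap hooks while tracking is off */
	orig_track_pte = do_mm_track_pte;
	do_mm_track_pte = my_track_pte;
	return 0;
}

static void __exit my_tracker_exit(void)
{
	do_mm_track_pte = orig_track_pte;
}

module_init(my_tracker_init);
module_exit(my_tracker_exit);
MODULE_LICENSE("GPL");
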
Please consider this feature for inclusion in the kernel tree, as it is very
important to Stratus.
Signed-off-by: Kimball Murray <kimball.murray@stratus.com>
--------------- snip -------------------------------------------
diff --git a/arch/x86_64/mm/Makefile b/arch/x86_64/mm/Makefile
index d25ac86..0ac1353 100644
--- a/arch/x86_64/mm/Makefile
+++ b/arch/x86_64/mm/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpag
obj-$(CONFIG_NUMA) += numa.o
obj-$(CONFIG_K8_NUMA) += k8topology.o
obj-$(CONFIG_ACPI_NUMA) += srat.o
+obj-$(CONFIG_TRACK_DIRTY_PAGES) += track.o
hugetlbpage-y = ../../i386/mm/hugetlbpage.o
diff --git a/arch/x86_64/mm/track.c b/arch/x86_64/mm/track.c
new file mode 100644
index 0000000..d797175
--- /dev/null
+++ b/arch/x86_64/mm/track.c
@@ -0,0 +1,109 @@
+#include <linux/config.h>
+#include <linux/module.h>
+#include <asm/atomic.h>
+#include <asm/mm_track.h>
+#include <asm/pgtable.h>
+
+/*
+ * For memory-tracking purposes, see mm_track.h for details.
+ */
+struct mm_tracker mm_tracking_struct = {0, ATOMIC_INIT(0), 0, 0};
+EXPORT_SYMBOL_GPL(mm_tracking_struct);
+
+void default_mm_track_pte(void * val)
+{
+ pte_t *ptep = (pte_t*)val;
+ unsigned long pfn;
+
+ if (!pte_present(*ptep))
+ return;
+
+ if (!(pte_val(*ptep) & _PAGE_DIRTY))
+ return;
+
+ pfn = pte_pfn(*ptep);
+
+ if (pfn >= mm_tracking_struct.bitcnt)
+ return;
+
+ if (!test_and_set_bit(pfn, mm_tracking_struct.vector))
+ atomic_inc(&mm_tracking_struct.count);
+}
+
+#define LARGE_PMD_SIZE (1 << PMD_SHIFT)
+
+void default_mm_track_pmd(void *val)
+{
+ int i;
+ pte_t *pte;
+ pmd_t *pmd = (pmd_t*)val;
+
+ if (!pmd_present(*pmd))
+ return;
+
+ if (unlikely(pmd_large(*pmd))) {
+ unsigned long addr, end;
+
+ if (!(pte_val(*(pte_t*)val) & _PAGE_DIRTY))
+ return;
+
+ addr = pte_pfn(*(pte_t*)val) << PAGE_SHIFT;
+ end = addr + LARGE_PMD_SIZE;
+
+ while (addr < end) {
+ do_mm_track_phys((void*)addr);
+ addr += PAGE_SIZE;
+ }
+ return;
+ }
+
+ pte = pte_offset_kernel(pmd, 0);
+
+ for (i = 0; i < PTRS_PER_PTE; i++, pte++)
+ do_mm_track_pte(pte);
+}
+
+static inline void track_as_pte(void *val) {
+ unsigned long pfn = pte_pfn(*(pte_t*)val);
+ if (pfn >= mm_tracking_struct.bitcnt)
+ return;
+
+ if (!test_and_set_bit(pfn, mm_tracking_struct.vector))
+ atomic_inc(&mm_tracking_struct.count);
+}
+
+void default_mm_track_pud(void *val)
+{
+ track_as_pte(val);
+}
+
+void default_mm_track_pgd(void *val)
+{
+ track_as_pte(val);
+}
+
+void default_mm_track_phys(void *val)
+{
+ unsigned long pfn;
+
+ pfn = (unsigned long)val >> PAGE_SHIFT;
+
+ if (pfn >= mm_tracking_struct.bitcnt)
+ return;
+
+ if (!test_and_set_bit(pfn, mm_tracking_struct.vector))
+ atomic_inc(&mm_tracking_struct.count);
+}
+
+void (*do_mm_track_pte)(void *) = default_mm_track_pte;
+void (*do_mm_track_pmd)(void *) = default_mm_track_pmd;
+void (*do_mm_track_pud)(void *) = default_mm_track_pud;
+void (*do_mm_track_pgd)(void *) = default_mm_track_pgd;
+void (*do_mm_track_phys)(void *) = default_mm_track_phys;
+
+EXPORT_SYMBOL_GPL(do_mm_track_pte);
+EXPORT_SYMBOL_GPL(do_mm_track_pmd);
+EXPORT_SYMBOL_GPL(do_mm_track_pud);
+EXPORT_SYMBOL_GPL(do_mm_track_pgd);
+EXPORT_SYMBOL_GPL(do_mm_track_phys);
+
diff --git a/include/asm-x86_64/mm_track.h b/include/asm-x86_64/mm_track.h
new file mode 100644
index 0000000..9af1ef5
--- /dev/null
+++ b/include/asm-x86_64/mm_track.h
@@ -0,0 +1,77 @@
+#ifndef __X86_64_MMTRACK_H__
+#define __X86_64_MMTRACK_H__
+
+#ifndef CONFIG_TRACK_DIRTY_PAGES
+
+#define mm_track_pte(ptep) do { ; } while (0)
+#define mm_track_pmd(ptep) do { ; } while (0)
+#define mm_track_pud(ptep) do { ; } while (0)
+#define mm_track_pgd(ptep) do { ; } while (0)
+#define mm_track_phys(x) do { ; } while (0)
+
+#else
+
+#include <asm/page.h>
+#include <asm/atomic.h>
+ /*
+ * For memory-tracking purposes, if active is true (non-zero), the other
+ * elements of the structure are available for use. Each time mm_track_pte
+ * is called, it increments count and sets a bit in the bitvector table.
+ * Each bit in the bitvector represents a physical page in memory.
+ *
+ * This is declared in arch/x86_64/mm/track.c.
+ *
+ * The in_use element is used in the code which drives the memory tracking
+ * environment. When tracking is complete, the vector may be freed, but
+ * only after the active flag is set to zero and the in_use count goes to
+ * zero.
+ *
+ * The count element indicates how many pages have been stored in the
+ * bitvector. This is an optimization to avoid counting the bits in the
+ * vector between harvest operations.
+ */
+struct mm_tracker {
+ int active; // non-zero if this structure in use
+ atomic_t count; // number of pages tracked by mm_track()
+ unsigned long * vector; // bit vector of modified pages
+ unsigned long bitcnt; // number of bits in vector
+};
+extern struct mm_tracker mm_tracking_struct;
+
+extern void (*do_mm_track_pte)(void *);
+extern void (*do_mm_track_pmd)(void *);
+extern void (*do_mm_track_pud)(void *);
+extern void (*do_mm_track_pgd)(void *);
+extern void (*do_mm_track_phys)(void *);
+
+/* The mm_track routine is needed by macros in the pgtable-2level.h
+ * and pgtable-3level.h.
+ */
+static __inline__ void mm_track_pte(void * val)
+{
+ if (unlikely(mm_tracking_struct.active))
+ do_mm_track_pte(val);
+}
+static __inline__ void mm_track_pmd(void * val)
+{
+ if (unlikely(mm_tracking_struct.active))
+ do_mm_track_pmd(val);
+}
+static __inline__ void mm_track_pud(void * val)
+{
+ if (unlikely(mm_tracking_struct.active))
+ do_mm_track_pud(val);
+}
+static __inline__ void mm_track_pgd(void * val)
+{
+ if (unlikely(mm_tracking_struct.active))
+ do_mm_track_pgd(val);
+}
+static __inline__ void mm_track_phys(void * val)
+{
+ if (unlikely(mm_tracking_struct.active))
+ do_mm_track_phys(val);
+}
+#endif /* CONFIG_TRACK_DIRTY_PAGES */
+
+#endif /* __X86_64_MMTRACK_H__ */
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index a31ab4e..346346c 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -10,6 +10,9 @@
#include <asm/bitops.h>
#include <linux/threads.h>
#include <asm/pda.h>
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+#include <asm/mm_track.h>
+#endif
extern pud_t level3_kernel_pgt[512];
extern pud_t level3_physmem_pgt[512];
@@ -72,17 +75,26 @@ extern unsigned long empty_zero_page[PAG
static inline void set_pte(pte_t *dst, pte_t val)
{
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+ mm_track_pte(dst);
+#endif
pte_val(*dst) = pte_val(val);
}
#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
static inline void set_pmd(pmd_t *dst, pmd_t val)
{
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+ mm_track_pmd(dst);
+#endif
pmd_val(*dst) = pmd_val(val);
}
static inline void set_pud(pud_t *dst, pud_t val)
{
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+ mm_track_pud(dst);
+#endif
pud_val(*dst) = pud_val(val);
}
@@ -93,6 +105,9 @@ static inline void pud_clear (pud_t *pud
static inline void set_pgd(pgd_t *dst, pgd_t val)
{
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+ mm_track_pgd(dst);
+#endif
pgd_val(*dst) = pgd_val(val);
}
@@ -104,7 +119,15 @@ static inline void pgd_clear (pgd_t * pg
#define pud_page(pud) \
((unsigned long) __va(pud_val(pud) & PHYSICAL_PAGE_MASK))
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, pte_t *xp)
+{
+ mm_track_pte(xp);
+ return __pte(xchg(&(xp)->pte, 0));
+}
+#else
#define ptep_get_and_clear(mm,addr,xp) __pte(xchg(&(xp)->pte, 0))
+#endif
struct mm_struct;
@@ -113,6 +136,9 @@ static inline pte_t ptep_get_and_clear_f
pte_t pte;
if (full) {
pte = *ptep;
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+ mm_track_pte(ptep);
+#endif
*ptep = __pte(0);
} else {
pte = ptep_get_and_clear(mm, addr, ptep);
@@ -151,6 +177,9 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+#define _PAGE_BIT_SOFTDIRTY 9 /* save dirty state when hdw dirty bit cleared */
+#endif
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */
#define _PAGE_PRESENT 0x001
@@ -163,6 +192,9 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_PSE 0x080 /* 2MB page */
#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry */
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+#define _PAGE_SOFTDIRTY 0x200
+#endif
#define _PAGE_PROTNONE 0x080 /* If not present */
#define _PAGE_NX (1UL<<_PAGE_BIT_NX)
@@ -170,7 +202,11 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_SOFTDIRTY)
+#else
#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)
+#endif
#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)
#define PAGE_SHARED __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_NX)
@@ -269,7 +305,11 @@ static inline pte_t pfn_pte(unsigned lon
static inline int pte_user(pte_t pte) { return pte_val(pte) & _PAGE_USER; }
static inline int pte_read(pte_t pte) { return pte_val(pte) & _PAGE_USER; }
static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_USER; }
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+static inline int pte_dirty(pte_t pte) { return pte_val(pte) & (_PAGE_DIRTY | _PAGE_SOFTDIRTY); }
+#else
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; }
+#endif
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
@@ -277,7 +317,11 @@ static inline int pte_huge(pte_t pte) {
static inline pte_t pte_rdprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
static inline pte_t pte_exprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_USER)); return pte; }
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~(_PAGE_SOFTDIRTY|_PAGE_DIRTY))); return pte; }
+#else
static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
+#endif
static inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
static inline pte_t pte_wrprotect(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_RW)); return pte; }
static inline pte_t pte_mkread(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_USER)); return pte; }
@@ -293,7 +337,13 @@ static inline int ptep_test_and_clear_di
{
if (!pte_dirty(*ptep))
return 0;
+#ifdef CONFIG_TRACK_DIRTY_PAGES
+ mm_track_pte(ptep);
+ return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep) |
+ test_and_clear_bit(_PAGE_BIT_SOFTDIRTY, ptep);
+#else
return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
+#endif
}
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 554ee68..ddabdf1 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -39,6 +39,17 @@ config UNUSED_SYMBOLS
you really need it, and what the merge plan to the mainline kernel for
your module is.
+config TRACK_DIRTY_PAGES
+ bool "Enable dirty page tracking"
+ depends on X86_64
+ default n
+ help
+ Turning this on allows third party modules to use a
+ kernel interface that can track dirty page generation
+ in the system. This can be used to copy/mirror live
+ memory to another system, or perhaps even a replacement
+ DIMM. Most users will say n here.
+
config DEBUG_KERNEL
bool "Kernel debugging"
help
^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 17:34 [Feature] x86_64 page tracking for Stratus servers Kimball Murray
@ 2006-09-05 17:40 ` Arjan van de Ven
  2006-09-05 18:38   ` Kimball Murray
  2006-09-05 17:56 ` Dave Hansen
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Arjan van de Ven @ 2006-09-05 17:40 UTC (permalink / raw)
To: Kimball Murray; +Cc: linux-kernel, akpm, ak

On Tue, 2006-09-05 at 13:34 -0400, Kimball Murray wrote:
> Attached is a git patch that implements a feature that is used by Stratus
> fault-tolerant servers running on Intel x86_64 platforms. It provides the
> kernel mechanism that allows a loadable module to be able to keep track of
> recently dirtied pages for the purpose of copying live, currently active
> memory, to a spare memory module.

last time this came up, your module wasn't open source.... has that
changed since?
Maybe you should just post it...

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 17:40 ` Arjan van de Ven
@ 2006-09-05 18:38   ` Kimball Murray
  2006-09-05 20:48     ` Andi Kleen
  0 siblings, 1 reply; 12+ messages in thread
From: Kimball Murray @ 2006-09-05 18:38 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: linux-kernel, akpm, ak

[-- Attachment #1: Type: text/plain, Size: 1695 bytes --]

Hi Arjan,

Unfortunately, our situation has not changed much, so I feel this may be a
replay of the last time you asked. One of our modules is GPL, and that is
the one that implements the page tracking. It's fairly large because it
also provides PCI hotplug functionality and other things for our platform.

As before, Stratus would not object to providing all kernel modules as GPL,
but still our problem is that there is some code that reflects some of our
chipset vendor's proprietary information, so they won't allow us to release
it under GPL. So the best we can do under those circumstances is to mimic
something like the NVIDIA driver model to satisfy our chipset vendors.

The file that implements the page tracking is called harvest.c, and is part
of the GPL module. I have attached the code, in its developmental form (so
please pardon the #ifdef's). The entry point for the page tracking is the
routine fosil_synchronize_memory.

At 1400 lines, I should probably find a server to make this available. But
not having one at the moment, please forgive the attachment.

-kimball

On 9/5/06, Arjan van de Ven <arjan@infradead.org> wrote:
> On Tue, 2006-09-05 at 13:34 -0400, Kimball Murray wrote:
> > Attached is a git patch that implements a feature that is used by Stratus
> > fault-tolerant servers running on Intel x86_64 platforms. It provides the
> > kernel mechanism that allows a loadable module to be able to keep track of
> > recently dirtied pages for the purpose of copying live, currently active
> > memory, to a spare memory module.
>
> last time this came up, your module wasn't open source.... has that
> changed since?
> Maybe you should just post it...

[-- Attachment #2: harvest.c --]
[-- Type: text/x-c, Size: 30949 bytes --]

/*
 * Low-level routines for memory mirroring.  Tracks dirty pages and
 * utilizes provided copy routine to transfer pages which have changed
 * between harvest passes.
 *
 * Copyright (C) 2006 Stratus Technologies Bermuda Ltd.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

#include "fosil.h"
#include "syncd.h"
#include "harvest.h"
#include "smallpmd.h"

/*
 * The following ifdef allows harvest to be built against a kernel
 * which does not implement memory tracking.
 */
#ifdef CONFIG_TRACK_DIRTY_PAGES

#include <linux/interrupt.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <asm/io.h>
#include <asm/atomic.h>
#include <asm/tlb.h>
#include <asm/uaccess.h>

#define HARVEST_BY_TABLE 1
#define SMP_HARVEST 1
//#define USE_SMALL_PMD 1
#define USE_TASK_HARVEST 1

#define MAX_COPYRANGE 100	/* this is actually defined in cc/host.h */
#define MB (1024*1024)
#define NEARBY 6

#ifdef CONFIG_X86_PAE
typedef unsigned long long paddr_t;	/* physical address type */
#define PADDR_FMT "%09lx"		/* printf format for paddr_t */
#else
typedef unsigned long paddr_t;		/* physical address type */
#define PADDR_FMT "%08lx"		/* printf format for paddr_t */
#endif

enum {
	DBGLVL_NONE = 0x00,
	DBGLVL_HARVEST = 0x01,
	DBGLVL_BROWNOUT = 0x02,
	DBGLVL_TIMER = 0x04,
};

static int dbgmask = DBGLVL_NONE;

#define DBGPRNT(lvl, args...) \
({ \
	if (lvl & dbgmask) \
		printk(args); \
})

enum {
	IN_BROWNOUT = 0,
	IN_BLACKOUT = 1,
};

struct timer_set {
	unsigned long long before;
	unsigned long long after;
};

struct syncd_data {
	unsigned long flags[NR_CPUS];
	copy_proc copy;
	sync_proc sync;
	void * context;
	is_pfn_valid_proc is_pfn_valid;
	is_pfnrange_valid_proc is_pfnrange_valid;
	int status;			// return status from command inside blackout
	int max_range;			// # of valid ranges in following CopyRange
	fosil_page_range_t CopyRange[MAX_COPYRANGE];
	unsigned long nearby;		// insert merge tunable
	unsigned int duration_in_ms;	// blackout duration
	unsigned long bitcnt;		// number of bits found in last harvest
	int blackout;			// 1 == blackout, 0 == brownout
	int pass;			// # of times through harvest/process cycle
	int done;			// 1 == done - time to leave
	unsigned int threshold;		// number of dirty pages before sync
	atomic_t r;
};

#ifdef HARVEST_BY_TABLE
void harvest_table(pgd_t * pgd_base, unsigned long start, unsigned long end)
{
	unsigned long i, j, k, l;
	pgd_t *pgd;
	unsigned long pgd_start, pgd_end;
	unsigned long pud_start, pud_end;
	unsigned long pmd_start, pmd_end;
	unsigned long pte_start, pte_end;
	unsigned long pud_last;
	unsigned long pmd_last;
	unsigned long pte_last;

	/*
	 * end marks the first byte after the upper limit of the address
	 * range
	 */
	end--;

	if (!pgd_base) {
		printk("%s: null pgd?\n", __func__);
		return;
	}

	pgd_start = pgd_index(start);
	pgd_end = pgd_index(end);
	pud_start = pud_index(start);
	pud_end = pud_index(end);
	pmd_start = pmd_index(start);
	pmd_end = pmd_index(end);
	pte_start = pte_index(start);
	pte_end = pte_index(end);

	pgd = pgd_base + pgd_start;
	for (i = pgd_start; i <= pgd_end; i++, pgd++) {
		pud_t *pud;

		if (pgd_none(*pgd) || !pgd_present(*pgd))
			continue;

		if (i == pgd_start)
			j = pud_start;
		else
			j = 0;
		pud = pud_offset(pgd, 0) + j;
		if (i == pgd_end)
			pud_last = pud_end;
		else
			pud_last = PTRS_PER_PUD - 1;

		for (; j <= pud_last; j++, pud++) {
			pmd_t *pmd;
			unsigned long addr;
			pud_t tmp_pud = *pud;

			addr = i << PGDIR_SHIFT;
			addr += j << PUD_SHIFT;

			if (pud_none(tmp_pud) || !pud_present(tmp_pud))
				continue;

			if ((i == pgd_start) && (j == pud_start))
				k = pmd_start;
			else
				k = 0;
			pmd = pmd_offset(pud, 0) + k;
			if ((i == pgd_end) && (j == pud_end))
				pmd_last = pmd_end;
			else
				pmd_last = PTRS_PER_PMD - 1;

			for (; k <= pmd_last; k++, pmd++) {
				pte_t *pte;
				pmd_t tmp_pmd = *pmd;

				addr = (i << PGDIR_SHIFT) + (j << PUD_SHIFT) +
				       (k << PMD_SHIFT);

				if (pmd_none(tmp_pmd) || !pmd_present(tmp_pmd))
					continue;

				if (pmd_large(tmp_pmd)) {
					if (pmd_val(tmp_pmd) & _PAGE_DIRTY) {
						pmd_val(tmp_pmd) &= ~_PAGE_DIRTY;
						pmd_val(tmp_pmd) |= _PAGE_SOFTDIRTY;
						mm_track_pmd(pmd);
					}
				}

				if (pmd_val(tmp_pmd) != pmd_val(*pmd)) {
					*pmd = tmp_pmd;
					if (pmd_val(tmp_pmd) & _PAGE_GLOBAL)
						__flush_tlb_global();
					else
						__flush_tlb_one(addr);
					mm_track_phys((void*)virt_to_phys(pmd));
				}

				if (pmd_large(tmp_pmd))
					continue;

				if ((i == pgd_start) && (j == pud_start) && (k == pmd_start))
					l = pte_start;
				else
					l = 0;
				pte = pte_offset_kernel(pmd, 0) + l;
				if ((i == pgd_end) && (j == pud_end) && (k == pmd_end))
					pte_last = pte_end;
				else
					pte_last = PTRS_PER_PTE - 1;

				for (; l <= pte_last; l++, pte++) {
					unsigned long addr_pte = addr + (l << PAGE_SHIFT);
					pte_t tmp_pte = *pte;

					if (pte_val(tmp_pte) & _PAGE_DIRTY) {
						pte_val(tmp_pte) &= ~_PAGE_DIRTY;
						pte_val(tmp_pte) |= _PAGE_SOFTDIRTY;
						mm_track_pte(pte);
					}
					if (pte_val(tmp_pte) != pte_val(*pte)) {
						*pte = tmp_pte;
						if (pte_val(tmp_pte) & _PAGE_GLOBAL)
							__flush_tlb_global();
						else
							__flush_tlb_one(addr_pte);
						mm_track_phys((void*)virt_to_phys(pte));
					}
				}
			}
		}
	}
}
#else
static void harvest_page(pgd_t * pgd_base, unsigned long addr)
{
	// for all pages in this vm_area_struct
	pgd_t *pgd;
	pmd_t *pmd;
	pte_t *pte;

	DBGPRNT(DBGLVL_HARVEST, "pgd_base %p + offset %ld\n",
		pgd_base, pgd_index(addr));
	DBGPRNT(DBGLVL_HARVEST, " pgd[0] %09lx\n", pgd_val(pgd_base[0]));
	DBGPRNT(DBGLVL_HARVEST, " pgd[1] %09lx\n", pgd_val(pgd_base[1]));
	DBGPRNT(DBGLVL_HARVEST, " pgd[2] %09lx\n", pgd_val(pgd_base[2]));
	DBGPRNT(DBGLVL_HARVEST, " pgd[3] %09lx\n", pgd_val(pgd_base[3]));

	pgd = pgd_base + pgd_index(addr);
	if (pgd_none(*pgd))
		return;
	if (!pgd_present(*pgd))
		return;

	pmd = pmd_offset(pgd, addr);
	DBGPRNT(DBGLVL_HARVEST, " pmd[%ld] %p is %lx\n",
		pmd_index(addr), pmd, pmd_val(*pmd));
	if (pmd_none(*pmd))
		return;
	if (!pmd_present(*pmd))
		return;

	// large pages
	if (pmd_large(*pmd)) {
		if ((pmd_val(*pmd) & _PAGE_DIRTY) == 0)
			return;
		mm_track_pmd(pmd);
		pmd_val(*pmd) &= ~_PAGE_DIRTY;
		pmd_val(*pmd) |= _PAGE_SOFTDIRTY;
		return;
	}

	if (addr < PAGE_OFFSET)
		pte = pte_offset_map(pmd, addr);
	else
		pte = pte_offset_kernel(pmd, addr);

	DBGPRNT(DBGLVL_HARVEST, " page table %lx\n", pmd_val(*pmd) & PAGE_MASK);
	DBGPRNT(DBGLVL_HARVEST, " pte[%ld] %p is %lx\n",
		pte_index(addr), pte, pte_val(*pte));

	if (pte_none(*pte) || !pte_present(*pte))
		goto out_unmap_pte;
	if (!(pte_val(*pte) & _PAGE_DIRTY))
		goto out_unmap_pte;

	mm_track_pte(pte);
	pte_val(*pte) &= ~_PAGE_DIRTY;
	pte_val(*pte) |= _PAGE_SOFTDIRTY;

	DBGPRNT(DBGLVL_HARVEST, "a pte[%ld] %p is %lx\n",
		pte_index(addr), pte, pte_val(*pte));

out_unmap_pte:
	if (addr < PAGE_OFFSET)
		pte_unmap(pte);
	return;
}
#endif

static pgd_t *get_pgd(unsigned long address)
{
#if 0
	pgd_t *pgd;

	asm volatile("movq %%cr3,%0" : "=r" (pgd));
	pgd = __va((unsigned long)pgd & PHYSICAL_PAGE_MASK);
	pgd += pgd_index(address);
	if (!pgd_present(*pgd)) {
		printk("%s-KAM: pgd not present (%p)\n", __func__,
			(void*)(pgd_val(*pgd)));
		return NULL;
	}
	if (pgd != init_level4_pgt)
		printk("%s: got %p, but init_level4_pgt is %p\n",
			__func__, pgd, init_level4_pgt);
	return pgd;
#else
	return init_level4_pgt;
#endif
}

/*
 * Traverse the kernel page tables, but don't use the
 * loop-over-all-kernel-address technique, as that
 * takes one fourth of forever on a 64-bit machine.
 * Instead, iterate over the page tables themselves,
 * starting with the pgd table.  This will be a quadruple-
 * nested loop, relying on p[ml4/gd/md/te]_none checks
 * for early-outs in the loops.
 * Be nice to find a useful way to SMP-ify this traversal,
 * but so far haven't found a way.
 */
static void harvest_kernel(void)
{
	unsigned long n, i, j, k;
	pgd_t *pgd_0 = get_pgd(0);
	pgd_t *pgd = get_pgd(PAGE_OFFSET);

	if (!pgd_0) {
		printk("%s: null pgd_0\n", __func__);
		return;
	}
	if (!pgd) {
		printk("%s: null pgd\n", __func__);
		return;
	}

	// Skip addrs < PAGE_OFFSET
	for (n = pgd - pgd_0; n < PTRS_PER_PGD; n++, pgd++) {
		pud_t *pud;

		if (pgd_none(*pgd) || !pgd_present(*pgd))
			continue;

		pud = pud_offset(pgd, 0);
		if (!pud) {
			printk("%s: null pud?\n", __func__);
			return;
		}
		mm_track_phys((void*)virt_to_phys(pud));

		for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
			pmd_t *pmd;

			if (pud_none(*pud) || !pud_present(*pud))
				continue;

			pmd = pmd_offset(pud, 0);
			for (j = 0; j < PTRS_PER_PMD; j++, pmd++) {
				pte_t *pte;
				pmd_t tmp_pmd = *pmd;
				unsigned long addr;

				addr = n << PGDIR_SHIFT;
				addr += i << PUD_SHIFT;
				addr += j << PMD_SHIFT;

				if (pmd_none(tmp_pmd) || !pmd_present(tmp_pmd))
					continue;

				if (pmd_large(tmp_pmd)) {
					if (pmd_val(tmp_pmd) & _PAGE_DIRTY) {
						pmd_val(tmp_pmd) &= ~_PAGE_DIRTY;
						pmd_val(tmp_pmd) |= _PAGE_SOFTDIRTY;
						mm_track_pmd(pmd);
					}
				}

				if (pmd_val(tmp_pmd) != pmd_val(*pmd)) {
					*pmd = tmp_pmd;
					if (pmd_val(tmp_pmd) & _PAGE_GLOBAL)
						__flush_tlb_global();
					else
						__flush_tlb_one(addr);
					mm_track_phys((void*)virt_to_phys(pmd));
				}

				if (pmd_large(tmp_pmd))
					continue;

				pte = pte_offset_kernel(pmd, 0);
				for (k = 0; k < PTRS_PER_PTE; k++, pte++) {
					unsigned long addr_pte = addr + (k << PAGE_SHIFT);
					pte_t tmp_pte = *pte;

					if (pte_none(tmp_pte) || !pte_present(tmp_pte))
						continue;

					if (pte_val(tmp_pte) & _PAGE_DIRTY) {
						pte_val(tmp_pte) &= ~_PAGE_DIRTY;
						pte_val(tmp_pte) |= _PAGE_SOFTDIRTY;
						mm_track_pte(pte);
					}
					if (pte_val(tmp_pte) != pte_val(*pte)) {
						*pte = tmp_pte;
						if (pte_val(tmp_pte) & _PAGE_GLOBAL)
							__flush_tlb_global();
						else
							__flush_tlb_one(addr_pte);
						mm_track_phys((void*)virt_to_phys(pte));
					}
				}
			}
		}
	}
}

static unsigned long map_kernel_small_pmd(void * __attribute__ ((unused)) unused_arg)
{
#ifdef USE_SMALL_PMD
	small_pmd_enter();
#endif
	return 0;
}

static unsigned long map_kernel_orig_pmd(void * __attribute__ ((unused)) unused_arg)
{
#ifdef USE_SMALL_PMD
	small_pmd_leave();
#endif
	return 0;
}

#ifdef USE_TASK_HARVEST
static void harvest_mm(struct mm_struct *mm)
{
	struct vm_area_struct * mmap, * vma;

	vma = mmap = mm->mmap;
	if (!mmap)
		return;
	do {
		unsigned long start, end;

		if (!vma)
			break;
		start = vma->vm_start;
		end = vma->vm_end;
#ifdef HARVEST_BY_TABLE
		harvest_table(mm->pgd, start, end);
#else
		for (addr = start; addr <= end; addr += PAGE_SIZE)
			harvest_page(mm->pgd, addr);
#endif
	} while ((vma = vma->vm_next) != mmap);
}

static void harvest_user(void)
{
	struct task_struct *p;
	int task_count = 0;
	int this_cpu = smp_processor_id();
	int num_cpus = num_online_cpus();

	for_each_process(p) {
		struct mm_struct *mm = p->mm;

#ifdef SMP_HARVEST
		if (this_cpu != (task_count++ % num_cpus))
			continue;
#endif
		if (!mm)
			continue;
		harvest_mm(mm);
	}
}
#else
static void harvest_user(void)
{
	struct list_head * m = NULL;
	struct mm_struct * mmentry;
	struct vm_area_struct * mmap, * vma;
#ifndef HARVEST_BY_TABLE
	unsigned long addr;
#endif
	int num_cpus = num_online_cpus();
	int this_cpu = smp_processor_id();
	int pgd_count = 0;

	m = &init_mm.mmlist;
	do {
		mmentry = list_entry(m, struct mm_struct, mmlist);
		m = m->next;
		vma = mmap = mmentry->mmap;
		if (!mmap)
			continue;
#ifdef SMP_HARVEST
		if (this_cpu != (pgd_count++ % num_cpus))
			continue;
#endif
		do {
			unsigned long start, end;

			if (!vma)
				break;
			start = vma->vm_start;
			end = vma->vm_end;
#ifdef HARVEST_BY_TABLE
			harvest_table(mmentry->pgd, start, end);
#else
			for (addr = start; addr <= end; addr += PAGE_SIZE)
				harvest_page(mmentry->pgd, addr);
#endif
		} while ((vma = vma->vm_next) != mmap);
	} while (m != &init_mm.mmlist);
}
#endif

#define VIRT_TO_PFN(pfn) (virt_to_phys(pfn) >> PAGE_SHIFT)

#define TEST_BIT(x, pfn, vector) \
({ \
	int retval; \
 \
	if (x->blackout) \
		retval = test_bit(pfn, vector); \
	else \
		retval = test_and_clear_bit(pfn, vector); \
 \
	retval; \
})

static void process_vector(void * arg)
{
	struct syncd_data * x = (struct syncd_data *) arg;
	unsigned long pfn;
	unsigned long start_pfn, runlength;

	// traverse the bitmap looking for dirty pages
	runlength = 0;
	start_pfn = 0;
	for (pfn = 0; pfn < mm_tracking_struct.bitcnt; pfn++) {
		if (TEST_BIT(x, pfn, mm_tracking_struct.vector)) {
			if (!(x->blackout))
				atomic_dec(&mm_tracking_struct.count);
			if (x->is_pfn_valid(pfn)) {
				if (runlength == 0) {
					start_pfn = pfn;
				}
				runlength++;
			} else {
				if (runlength) {
					x->status = x->copy(start_pfn, runlength,
							    x->blackout, x->context);
					if (x->status != FOSIL_ERR_OK) {
						printk("%s: %d error (%d), start_pfn %lx, runlength %lx\n",
						       __FUNCTION__, __LINE__,
						       x->status, start_pfn, runlength);
						return;
					}
				}
				start_pfn = runlength = 0;
			}
		} else if (runlength) {
			x->status = x->copy(start_pfn, runlength,
					    x->blackout, x->context);
			if (x->status != FOSIL_ERR_OK) {
				printk("%s: %d error (%d), start_pfn %lx, runlength %lx\n",
				       __FUNCTION__, __LINE__,
				       x->status, start_pfn, runlength);
				return;
			}
			start_pfn = runlength = 0;
		}
	}

	// Catch the last ones.
	if (runlength) {
		x->status = x->copy(start_pfn, runlength, x->blackout, x->context);
		if (x->status != FOSIL_ERR_OK) {
			printk("%s: %d\n", __FUNCTION__, __LINE__);
			return;
		}
	}
}

static void enable_tracking(void)
{
	// initialize tracking structure
	atomic_set(&mm_tracking_struct.count, 0);
	memset(mm_tracking_struct.vector, 0, mm_tracking_struct.bitcnt/8);

	// turn on tracking
	mm_tracking_struct.active = 1;
}

static void disable_tracking(void)
{
	// turn off tracking
	mm_tracking_struct.active = 0;
}

/*
 * Difference or zero
 *
 *	d <- (x - y), x >= y
 *	     0      , x < y
 */
unsigned long doz(unsigned long x, unsigned long y)
{
	if (x < y)
		return 0;
	return x - y;
}

/*
 * Merge closest two regions to free up one slot in table.
 */
int merge(struct syncd_data * x)
{
	int i, ii;
	unsigned long delta, ndelta;
	unsigned long end;
	fosil_page_range_t * cr = x->CopyRange;

	i = 0;
	ii = ~0;
	delta = ~0UL;
	do {
		end = cr[i].start + cr[i].num - 1;
		ndelta = cr[i+1].start - end;
		if (!x->is_pfnrange_valid(end, cr[i+1].start))
			continue;
		if (ndelta <= delta) {
			delta = ndelta;
			ii = i;
		}
	} while (i++ < (x->max_range - 1));

	if (ii == ~0)
		return 1;

	cr[ii].num = (cr[ii+1].start + cr[ii+1].num) - cr[ii].start;

	// shift everyone down
	x->max_range--;
	for (i = ii + 1; i < x->max_range; i++) {
		cr[i].start = cr[i+1].start;
		cr[i].num = cr[i+1].num;
	}
	return 0;
}

int insert_pfn(struct syncd_data * x, unsigned long pfn)
{
	int i;
	int ii;
	fosil_page_range_t * cr = x->CopyRange;
	int nearby = x->nearby;

	if (!x->is_pfn_valid(pfn))
		return 0;

restart:
	for (i = 0; i < x->max_range; i++) {
		unsigned long start = cr[i].start;
		unsigned long len = cr[i].num;
		unsigned long end = start + len - 1;

		// inside current region	start <= pfn <= end
		if ((pfn - start) <= (len - 1))
			return 0;

		// adjacent to start		start-nearby <= pfn <= start
		if ((pfn - doz(start, nearby)) <= start - doz(start, nearby)) {
			cr[i].num += (cr[i].start - pfn);
			cr[i].start = pfn;
			return 0;
		}

		// adjacent to end		end <= pfn <= end+nearby
		if ((pfn - end) <= nearby) {
			cr[i].num += pfn - end;
			end += pfn - end;
again:
			if (i+1 == x->max_range)
				return 0;

			// this range may overlap or abut next range
			// after insert	(next.start - nearby) <= end
			if (doz(cr[i+1].start, nearby) <= end) {
				cr[i].num = (cr[i+1].start + cr[i+1].num) - cr[i].start;
				end = cr[i].start + cr[i].num - 1;

				// shift everyone down
				x->max_range--;
				for (ii = i + 1; ii < x->max_range; ii++) {
					cr[ii].start = cr[ii+1].start;
					cr[ii].num = cr[ii+1].num;
				}
				goto again;
			}
			return 0;
		}

		// before region	pfn < start-nearby		- OR -
		// !at end of array	- AND -
		// between regions	end+nearby < pfn < next.start-nearby
		if ((pfn < doz(start, nearby)) ||
		    ((i+1 == x->max_range) &&
		     ((end+nearby < pfn) && (pfn < doz(cr[i+1].start, nearby))))) {
			if (x->max_range == MAX_COPYRANGE) {
				if (merge(x))
					return 1;
				goto restart;
			}

			// shift everyone up to make room
			for (ii = x->max_range; ii > i; ii--) {
				cr[ii].start = cr[ii-1].start;
				cr[ii].num = cr[ii-1].num;
			}
			if (pfn < doz(start, nearby))
				ii = i;		// before region
			else
				ii = i + 1;	// between regions
			cr[ii].start = pfn;
			cr[ii].num = 1;
			x->max_range++;
			return 0;
		}
	}

	if (x->max_range == MAX_COPYRANGE)
		if (merge(x))
			return 1;

	// append to array
	cr[x->max_range].start = pfn;
	cr[x->max_range].num = 1;
	x->max_range++;
	return 0;
}

/*
 * x86_64 stacks are different from i386 in that you don't get
 * there by adding 2 pages to "current".  Instead, we'll put
 * a dummy variable on the stack and look at its address.
 */
static int add_stack_pages(struct syncd_data * x)
{
	unsigned long dummy;
	unsigned long *stack = &dummy;
	unsigned long pfn, num_pfns;
	int rc = 0;
	unsigned long i;

	pfn = (unsigned long)(virt_to_phys(stack) & ~(THREAD_SIZE - 1));
	pfn >>= PAGE_SHIFT;
	num_pfns = (THREAD_SIZE + (PAGE_SIZE - 1)) >> PAGE_SHIFT;

	for (i = 0; i < num_pfns; i++, pfn++) {
		rc = insert_pfn(x, pfn);
		if (rc)
			return rc;
	}
	return 0;
}

static unsigned long virt_to_pfn(void *addr)
{
	unsigned long phys = virt_to_phys(addr);
	return phys >> PAGE_SHIFT;
}

/*
 * ftmod dirties a page for OS_DEBUG prints (ringbuffer).
 * Provide that page here, so we can track it, as it will be
 * dirtied on the way to the sync point.
 */
void *fosil_scratch_page;
EXPORT_SYMBOL_GPL(fosil_scratch_page);
static spinlock_t *fosil_scratch_lock;
EXPORT_SYMBOL_GPL(fosil_scratch_lock);

/*
 * With a fundamental knowledge of what absolutely MUST be copied
 * by the HOST during blackout, add pfn's to the copy range list.
 *
 * All page tables are added to avoid having to deal with optimizing
 * which ptes might be set dirty or accessed by the last few operations
 * before blackout.
 *
 * After page tables, the remainder of the pages are our known "working
 * set".  This includes the stack frames for this task and processor and
 * the syncd structure (which we are making dirty by compiling this list).
 *
 * Finally, all pages marked as dirty in their pte's are added to the
 * list.  These are pages which escaped the harvest/process cycles and
 * were presumably dirtied between harvest and here.
 */
static int setup_blackout_copylist(struct syncd_data * x)
{
	pgd_t *pgd;
	int i, j, k;
	int n;

	x->max_range = 0;
	pgd = get_pgd(0);
	if (!pgd) {
		printk("%s: no pgd?\n", __func__);
		return FOSIL_ERR_MERGE_FAIL;
	}
	if (insert_pfn(x, virt_to_pfn(pgd)))
		return FOSIL_ERR_MERGE_FAIL;

	for (n = 0; n < PTRS_PER_PGD; n++, pgd++) {
		pud_t *pud;

		if (pgd_none(*pgd) || !pgd_present(*pgd))
			continue;

		pud = pud_offset(pgd, 0);
		if (insert_pfn(x, virt_to_pfn(pud)))
			return FOSIL_ERR_MERGE_FAIL;

		for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
			pmd_t *pmd;

			if (pud_none(*pud) || !pud_present(*pud))
				continue;

			pmd = pmd_offset(pud, 0);
			if (insert_pfn(x, virt_to_pfn(pmd)))
				return FOSIL_ERR_MERGE_FAIL;

			for (j = 0; j < PTRS_PER_PMD; j++, pmd++) {
				pte_t *pte;

				if (pmd_none(*pmd) || !pmd_present(*pmd))
					continue;

				if (pmd_large(*pmd)) {
					/* Skip it if it covers an invalid area */
					if (!x->is_pfn_valid(pmd_pfn(*pmd)))
						continue;
					if ((pmd_val(*pmd) & _PAGE_DIRTY)) {
						unsigned long p, pfn = pmd_pfn(*pmd);

						for (p = 0; p < 0x200; p++) {
							if (insert_pfn(x, pfn + p))
								return FOSIL_ERR_MERGE_FAIL;
						}
					}
					continue;
				}

				pte = pte_offset_kernel(pmd, 0);
				if (insert_pfn(x, virt_to_pfn(pte)))
					return FOSIL_ERR_MERGE_FAIL;

				for (k = 0; k < PTRS_PER_PTE; k++, pte++) {
					if (pte_none(*pte) || !pte_present(*pte))
						continue;
					/* Skip it if it covers an invalid area */
					if (!x->is_pfn_valid(pte_pfn(*pte)))
						continue;
					if (pte_val(*pte) & _PAGE_DIRTY) {
						if (insert_pfn(x, pte_pfn(*pte)))
							return FOSIL_ERR_MERGE_FAIL;
					}
				}
			}
		}
	}

	// copy the stack we are working from
	if (add_stack_pages(x))
		return FOSIL_ERR_MERGE_FAIL;

	// add syncd structure
	if (insert_pfn(x, virt_to_pfn(x)))
		return FOSIL_ERR_MERGE_FAIL;

	// add mm_tracking_struct (disable_tracking() dirtied it).
	if (insert_pfn(x, virt_to_pfn(&mm_tracking_struct)))
		return FOSIL_ERR_MERGE_FAIL;

	// add the scratch page, in case ftmod wants to dirty it on the way to sync.
	if (insert_pfn(x, virt_to_pfn(fosil_scratch_page)))
		return FOSIL_ERR_MERGE_FAIL;
	if (insert_pfn(x, virt_to_pfn(fosil_scratch_lock)))
		return FOSIL_ERR_MERGE_FAIL;

	return FOSIL_ERR_OK;
}

static void __attribute__ ((__unused__)) do_blackout(void * arg)
{
	struct syncd_data * x = (struct syncd_data *) arg;
	fosil_page_range_t ranges[1] = {{0, ~0}};

	x->status = x->sync(1, ranges, &x->duration_in_ms, x->context);
}

static unsigned long harvest(void * arg)
{
#ifdef SMP_HARVEST
	struct syncd_data * x = (struct syncd_data *) arg;
#else
	if (smp_processor_id() != 0)
		return 0;
#endif

	harvest_user();

#ifdef SMP_HARVEST
	rendezvous(&x->r);
	if (smp_processor_id() == 0)
		harvest_kernel();
#else
	harvest_kernel();
#endif
	return 0;
}

/*
 * This is where all processors are sent for blackout processing.
 * The driving processor is diverted to work on harvest and if necessary
 * entry to Common Code for synchronization.  All other processors are
 * spinning in a rendezvous at the bottom of this routine waiting for
 * the driving processor to finish.
 */
static unsigned long brownout_cycle(void * arg)
{
	struct syncd_data * x = (struct syncd_data *) arg;
	unsigned long long before, after;
	unsigned long delta;

#ifndef SMP_HARVEST
	if (smp_processor_id() != 0)
		return 0;
#endif

	rdtscll(before);
	harvest_user();

#ifdef SMP_HARVEST
	rendezvous(&x->r);
	if (smp_processor_id() != 0)
		return 0;
#endif

	harvest_kernel();

	x->bitcnt = atomic_read(&mm_tracking_struct.count);
	if ((x->bitcnt < x->threshold) || (x->pass == 0)) {
		x->blackout = IN_BLACKOUT;
		disable_tracking();
		x->status = setup_blackout_copylist(x);
		if (x->status)
			return 0;
		process_vector(x);
		if (x->status)
			return FOSIL_ERR_COPY_FAIL;
		rdtscll(after);
		x->status = x->sync(x->max_range, x->CopyRange,
				    &x->duration_in_ms, x->context);
		x->done = 1;
		if (x->status) {
			printk("%s: %d (%s)\n", __FUNCTION__, __LINE__,
			       FOSIL_ERR_STRING(x->status));
		}
		delta = after - before;
		return x->duration_in_ms + (delta / cpu_khz);
	}
	return 0;
}

static void roll_time_forward(unsigned long adjust_in_ms)
{
#if 0	// FIXME
	struct timeval tv;
	struct timespec ts;

	do_gettimeofday(&tv);
	ts.tv_sec = tv.tv_sec;
	ts.tv_nsec = tv.tv_usec + adjust_in_ms * USEC_PER_SEC;
	while(unlikely(ts.tv_nsec >= NSEC_PER_SEC)) {
		ts.tv_nsec -= NSEC_PER_SEC;
		++ts.tv_sec;
	}
	do_settimeofday(&ts);
#endif
}

struct blackout_data {
	atomic_t r[5];
	unsigned int oncpu;
	unsigned long (*func)(void *);
	void * arg;
};

static void fosil_blackout(void * arg)
{
	struct blackout_data * x = (struct blackout_data *) arg;
	unsigned long flags;
	unsigned long long before, after;
	unsigned long delta, adjust;
	int this_cpu = smp_processor_id();

	rendezvous(&x->r[0]);
	local_bh_disable();
	rendezvous(&x->r[1]);
	local_irq_save(flags);
	rendezvous(&x->r[2]);

	rdtscll(before);
	adjust = x->func(x->arg);
	rdtscll(after);

	if (adjust) {
		delta = adjust;
	} else {
		delta = after - before;
		delta /= cpu_khz;
	}

	rendezvous(&x->r[3]);
	if (this_cpu == x->oncpu)
		roll_time_forward(delta);
	rendezvous(&x->r[4]);

	local_irq_restore(flags);
	local_bh_enable();
}

/*
 * Call the specified function with the provided argument
 * during a blackout.
 */
void fosil_exec_onall(unsigned long (*func)(void *), void * arg)
{
	struct blackout_data x;

	init_rendezvous(&x.r[0]);
	init_rendezvous(&x.r[1]);
	init_rendezvous(&x.r[2]);
	init_rendezvous(&x.r[3]);
	init_rendezvous(&x.r[4]);
	x.oncpu = smp_processor_id();
	x.func = func;
	x.arg = arg;
	fosil_syncd_start(SYNCD_ON_ALL_PROCESSORS, fosil_blackout, &x, 0);
	fosil_syncd_finish();
}

/*
 * This routine is responsible for driving all of the memory
 * synchronization processing.  When complete the sync function should
 * be called to enter blackout and finalize synchronization.
 */
int fosil_synchronize_memory(copy_proc copy, sync_proc sync,
			     is_pfn_valid_proc is_pfn_valid,
			     is_pfnrange_valid_proc is_pfnrange_valid,
			     unsigned int threshold, unsigned int passes,
			     void * context)
{
	int status;
	struct syncd_data * x;

	x = kmalloc(sizeof(struct syncd_data), GFP_KERNEL);
	if (x == NULL) {
		printk("unable to allocate syncd control structure\n");
		return 1;
	}
	memset(x, 0, sizeof(struct syncd_data));

	x->copy = copy;
	x->sync = sync;
	x->is_pfn_valid = is_pfn_valid;
	x->is_pfnrange_valid = is_pfnrange_valid;
	x->bitcnt = 0;
	x->blackout = IN_BROWNOUT;
	x->done = 0;
	x->threshold = threshold;
	x->context = context;
	x->nearby = NEARBY;

	fosil_exec_onall(map_kernel_small_pmd, NULL);

	// clean up hardware dirty bits
	init_rendezvous(&x->r);
	fosil_exec_onall(harvest, x);

	enable_tracking();

	/*
	 * Copy all memory.  After this we only copy modified
	 * pages, so we need to make sure the unchanged parts
	 * get copied at least once.
	 */
	status = x->copy(0, ~0, 0, x->context);
	if (status != FOSIL_ERR_OK) {
		printk("%s: %d\n", __FUNCTION__, __LINE__);
		goto fosil_sync_out;
	}

	for (x->pass = passes; (x->pass >= 0); x->pass--) {
		DBGPRNT(DBGLVL_BROWNOUT, "brownout pass %d - %ld pages found (%d)\n",
			x->pass, x->bitcnt, atomic_read(&mm_tracking_struct.count));
		init_rendezvous(&x->r);
		fosil_exec_onall(brownout_cycle, x);
		if (x->status) {
			printk("%s(%d): brownout_cycle failed \n",
			       __FUNCTION__, __LINE__);
			// flush any incomplete transactions
			x->copy(~0, ~0, 0, x->context);
			break;
		}
		if (x->done)
			break;

		/* Give back some time to everybody else */
		schedule();

		process_vector(x);
		if (x->status) {
			printk("%s(%d): process_vector failed \n",
			       __FUNCTION__, __LINE__);
			// flush any incomplete transactions
			x->copy(~0, ~0, 0, x->context);
			break;
		}
	}

	disable_tracking();

#if 0
	{
		int i;
		int total = 0;

		for (i = 0; i < x->max_range; i++) {
			printk("start %08lx num %lx(%ld) gap [%ld]\n",
			       x->CopyRange[i].start, x->CopyRange[i].num,
			       x->CopyRange[i].num,
			       (i==0)?0:(x->CopyRange[i].start -
					 (x->CopyRange[i-1].start + x->CopyRange[i-1].num)));
			total += x->CopyRange[i].num;
		}
		printk("max range %d total pages %d\n", x->max_range, total);
	}
#endif

	DBGPRNT(DBGLVL_BROWNOUT, "last pass found %ld pages\n", x->bitcnt);
	DBGPRNT(DBGLVL_BROWNOUT, "copy range max %d\n", x->max_range);

	status = x->status;

fosil_sync_out:
	kfree(x);
	fosil_exec_onall(map_kernel_orig_pmd, NULL);
	return status;
}
EXPORT_SYMBOL(fosil_synchronize_memory);

#ifdef REPLACE_KERN_FUNCS
static void (*orig_track_pmd)(void *);

static int check_old_ptes(pmd_t *pmd, int *u_count, int *k_count)
{
	int i, lost_pte = 0;
	unsigned long pfn;
	pte_t *pte = pte_offset_kernel(pmd, 0);

	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
		if (pte_none(*pte) || !pte_present(*pte))
			continue;
		pfn = pte_pfn(*pte);
		if (pfn >= mm_tracking_struct.bitcnt)
			continue;
		if (!test_and_set_bit(pfn, mm_tracking_struct.vector)) {
			if (pte_val(*pte) & _PAGE_USER)
				(*u_count)++;
			else
				(*k_count)++;
			atomic_inc(&mm_tracking_struct.count);
			lost_pte++;
		}
	}
	return lost_pte;
}

static void fosil_track_pmd(void * val)
{
	pmd_t *pmd = (pmd_t*)val;
	int u_count = 0, k_count = 0;

	if (pmd_large(*pmd)) {
		orig_track_pmd(val);
		return;
	}
	check_old_ptes(pmd, &u_count, &k_count);
	orig_track_pmd(val);
}
#endif

int __init fosil_harvest_init(void)
{
	extern unsigned long num_physpages;

#ifdef USE_SMALL_PMD
	if (small_pmd_init())
		goto out_small_pmd;
#endif

#ifdef REPLACE_KERN_FUNCS
	orig_track_pmd = do_mm_track_pmd;
	do_mm_track_pmd = fosil_track_pmd;
#endif

	mm_tracking_struct.vector = vmalloc(num_physpages/8);
	mm_tracking_struct.bitcnt = num_physpages;
	if (mm_tracking_struct.vector == NULL) {
		printk("%s: unable to allocate memory tracking bitmap\n",
		       __FUNCTION__);
		goto out_no_mm_track;
	}

	fosil_scratch_page = (void*)__get_free_pages(GFP_KERNEL, 0);
	if (!fosil_scratch_page)
		goto out_no_fsp;

	fosil_scratch_lock = (spinlock_t*)kmalloc(sizeof(spinlock_t), GFP_KERNEL);
	if (!fosil_scratch_lock)
		goto out_no_fsl;
	spin_lock_init(fosil_scratch_lock);

	return 0;

out_no_fsl:
	free_pages((unsigned long)fosil_scratch_page, 0);
out_no_fsp:
	vfree(mm_tracking_struct.vector);
out_no_mm_track:
#ifdef USE_SMALL_PMD
	small_pmd_exit();
out_small_pmd:
#endif
	return -ENOMEM;
}

void __exit fosil_harvest_cleanup(void)
{
#ifdef USE_SMALL_PMD
	small_pmd_exit();
#endif

#ifdef REPLACE_KERN_FUNCS
	do_mm_track_pmd = orig_track_pmd;
#endif

	// make sure tracking is definitely turned off
	disable_tracking();

	// wake me after everyone is sure to be done with mm_track()
	synchronize_sched();

	if (mm_tracking_struct.vector)
		vfree(mm_tracking_struct.vector);
	if (fosil_scratch_page)
		free_pages((unsigned long)fosil_scratch_page, 0);
	if (fosil_scratch_lock)
		kfree(fosil_scratch_lock);

	mm_tracking_struct.bitcnt = 0;
	return;
}

#else	/* CONFIG_TRACK_DIRTY_PAGES */

int __init fosil_harvest_init(void)
{
	return 0;
}

void __exit fosil_harvest_cleanup(void)
{
	return;
}

#endif	/* CONFIG_TRACK_DIRTY_PAGES */

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 18:38 ` Kimball Murray
@ 2006-09-05 20:48   ` Andi Kleen
  0 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2006-09-05 20:48 UTC (permalink / raw)
To: Kimball Murray; +Cc: Arjan van de Ven, linux-kernel, akpm

On Tuesday 05 September 2006 20:38, Kimball Murray wrote:

> As before, Stratus would not object to providing all kernel modules as
> GPL, but still our problem is that there is some code that reflects
> some of our chipset vendor's proprietary information, so they won't

Is that Intel? They normally don't have problems with free drivers.

> At 1400 lines, I should probably find a server to make this available.
> But not having one at the moment, please forgive the attachment.

The code would certainly need a lot of cleanup and split up etc.
(Andrew has a nice document on that somewhere, it's called "the perfect
patch")

What I could see is that if you turn that into generic "memory mirror"
VM code and submit the patches to do that together with some simple
module that uses it. I suppose other people could then use it too and it
might end up as a nice generic facility in Linux code.

Then when that generic subsystem is in, your hardware-specific drivers
could possibly use it, although it would be definitely strongly
preferred if those were GPL too.

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 17:34 [Feature] x86_64 page tracking for Stratus servers Kimball Murray
  2006-09-05 17:40 ` Arjan van de Ven
@ 2006-09-05 17:56 ` Dave Hansen
  2006-09-05 18:50   ` Kimball Murray
  2006-09-05 18:16 ` Andi Kleen
  2006-09-06  6:38 ` Nick Piggin
  3 siblings, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2006-09-05 17:56 UTC (permalink / raw)
To: Kimball Murray; +Cc: linux-kernel, akpm, ak

On Tue, 2006-09-05 at 13:34 -0400, Kimball Murray wrote:
> +static __inline__ void mm_track_pte(void * val)
> +{
> +	if (unlikely(mm_tracking_struct.active))
> +		do_mm_track_pte(val);
> +}

This patch just appears to be a big collection of hooks. Could you post
an example user of these hooks? It is obviously GPL from all of the
EXPORT_SYMBOL_GPL()s anyway, right?

-- Dave

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 17:56 ` Dave Hansen
@ 2006-09-05 18:50   ` Kimball Murray
  0 siblings, 0 replies; 12+ messages in thread
From: Kimball Murray @ 2006-09-05 18:50 UTC (permalink / raw)
To: Dave Hansen; +Cc: linux-kernel, akpm, ak

Hi Dave,

The "hooks" have default functions in the patch (see track.c), all of
which do exactly what Stratus needs them to do. Knowing that this
functionality is only needed by Stratus, the hook was my attempt to
allow other users of this interface to make it behave differently. I
wouldn't object to removing the hooks and instead calling the default
functions directly.

-kimball

On 9/5/06, Dave Hansen <haveblue@us.ibm.com> wrote:
> On Tue, 2006-09-05 at 13:34 -0400, Kimball Murray wrote:
> > +static __inline__ void mm_track_pte(void * val)
> > +{
> > +	if (unlikely(mm_tracking_struct.active))
> > +		do_mm_track_pte(val);
> > +}
>
> This patch just appears to be a big collection of hooks. Could you post
> an example user of these hooks? It is obviously GPL from all of the
> EXPORT_SYMBOL_GPL()s anyway, right?
>
> -- Dave

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 17:34 [Feature] x86_64 page tracking for Stratus servers Kimball Murray
  2006-09-05 17:40 ` Arjan van de Ven
  2006-09-05 17:56 ` Dave Hansen
@ 2006-09-05 18:16 ` Andi Kleen
  2006-09-06  6:38 ` Nick Piggin
  3 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2006-09-05 18:16 UTC (permalink / raw)
To: Kimball Murray; +Cc: linux-kernel, akpm

On Tuesday 05 September 2006 19:34, Kimball Murray wrote:
> Attached is a git patch that implements a feature that is used by Stratus
> fault-tolerant servers running on Intel x86_64 platforms. It provides the
> kernel mechanism that allows a loadable module to be able to keep track of
> recently dirtied pages for the purpose of copying live, currently active
> memory, to a spare memory module.

As other people commented, we normally don't merge hook-only patches.
The reason for this is that with Linux being so popular and so many
people wanting hooks for their various addons, if there wasn't a fairly
strict policy about it then at some point the core kernel code might
contain more hooks than actual code doing something.

But I see no reason why mainline shouldn't support Stratus systems, so
if you merge the whole thing in a clean way it might be possible. It
must be already GPL since the exports are _GPL.

Bonus points if the dirty tracking is done in a generic enough way that
it might be useful for other kernel things too (is that possible? I know
Xen etc. use it at the Hypervisor level for various stuff)

-Andi

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-05 17:34 [Feature] x86_64 page tracking for Stratus servers Kimball Murray
  ` (2 preceding siblings ...)
  2006-09-05 18:16 ` Andi Kleen
@ 2006-09-06  6:38 ` Nick Piggin
  2006-09-06  7:36   ` Andi Kleen
  2006-09-06 14:10   ` Kimball Murray
  3 siblings, 2 replies; 12+ messages in thread
From: Nick Piggin @ 2006-09-06 6:38 UTC (permalink / raw)
To: Kimball Murray; +Cc: linux-kernel, akpm, ak

Kimball Murray wrote:

>Attached is a git patch that implements a feature that is used by Stratus
>fault-tolerant servers running on Intel x86_64 platforms. It provides the
>kernel mechanism that allows a loadable module to be able to keep track of
>recently dirtied pages for the purpose of copying live, currently active
>memory, to a spare memory module.
>
>In Stratus servers, this spare memory module (and CPUs) will be brought into
>lockstep with the original memory (and CPUs) in order to achieve fault
>tolerance.
>
>In order to make this feature work, it is necessary to track pages which have
>been recently dirtied. A simplified view of the algorithm used in the kernel
>module is this:
>
>1. Turn on the tracking functionality.
>2. Copy all memory from active to spare memory module.
>3. Until done:
>   a) Identify all pages that were dirtied in the active memory since
>      the last copy operation.
>   b) Copy all pages identified in 3a to the spare memory module.
>   c) If number of pages still dirty is less than some threshhold,
>      i.   "black out" the system (enter SMI)
>      ii.  copy remaining pages in blackout context
>      iii. goto step 4
>      Else
>      goto 3a.
>4. synchronize cpus
>5. leave SMI, return to OS
>6. System is now "Duplexed", and fault tolerant.
>

Silly question, why can't you do all this from stop_machine_run context
(or your SMI) that doesn't have to worry about other CPUs dirtying
memory?

>Please consider this feature for inclusion in the kernel tree, as it is very
>important to Stratus.
>

Given that it doesn't touch core mm/ code, I don't really care about
it[*] except that it doesn't make sense to have the tracking hooks in
generic code because it is pretty specific to your module.

Also, as far as a "notifier" goes, it is of course completely broken
unless your module is the only one that is ever going to use it. Which
I suspect may be the case ;) But you do really need to at least WARN_ON
or fail if someone is going to break it like this.

[*] Though if it gets included, it would not stop me lamenting the
proliferation of complexities to support *tiny* obscure userbases. Can
we wait until your hardware is smart enough to snoop the cc? :)

--
Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
  2006-09-06  6:38 ` Nick Piggin
@ 2006-09-06  7:36   ` Andi Kleen
  2006-09-06  9:36     ` Nick Piggin
  2006-09-06 15:10     ` Kimball Murray
  1 sibling, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2006-09-06 7:36 UTC (permalink / raw)
To: Nick Piggin; +Cc: Kimball Murray, linux-kernel, akpm

> Silly question, why can't you do all this from stop_machine_run context (or
> your SMI) that doesn't have to worry about other CPUs dirtying memory?

Because that would be too slow for continuous mirroring.

You can't go through 10+GB of virtual memory (or more with shared
memory because the scan has to be virtual) in an interrupt.

The only sane way is to do it continuously.

> [*] Though if it gets included, it would not stop me lamenting the
> proliferation of complexities to support *tiny* obscure userbases. Can
> we wait until your hardware is smart enough to snoop the cc? :)

My guess is that if we had a generic memory mirror subsystem other
people would find uses for it too. e.g. a lot of systems support spare
DIMMs these days and mirroring some memory to it seems like a smart
idea. That means normally the hardware does it, but perhaps some stuff
can be done better by doing it in software.

Or it is also a bit similar to the algorithms Xen uses for live
migration. If that was implemented on the kernel level something like
this might be useful too. I think OpenVZ has some kind of migration
support, but it's currently not live.

-andi

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [Feature] x86_64 page tracking for Stratus servers
2006-09-06  7:36 ` Andi Kleen
@ 2006-09-06  9:36 ` Nick Piggin
2006-09-06 15:10 ` Kimball Murray
1 sibling, 0 replies; 12+ messages in thread
From: Nick Piggin @ 2006-09-06 9:36 UTC (permalink / raw)
To: Andi Kleen; +Cc: Kimball Murray, linux-kernel, akpm

Andi Kleen wrote:

>>Silly question: why can't you do all this from stop_machine_run context (or
>>your SMI), which doesn't have to worry about other CPUs dirtying memory?
>
>Because that would be too slow for continuous mirroring.
>
>You can't go through 10+GB of virtual memory (or more with shared memory,
>because the scan has to be virtual) in an interrupt.
>
>The only sane way is to do it continuously.
>

Presumably it is not continuous but there are checkpoints, and in the worst
case, enforcement of a checkpoint will require an SMI. But OK, heuristically
I'm sure it is much quicker to already have most memory saved. I guess this
is a requirement, otherwise they would have done it the obvious way... but
my question is just about what exact requirement this satisfies that a
stop_machine would not.

>My guess is that if we had a generic memory-mirror subsystem, other people
>would find uses for it too. For example, a lot of systems support spare
>DIMMs these days, and mirroring some memory to them seems like a smart
>idea. Normally the hardware would do that, but perhaps some of it can be
>done better in software.
>
>It is also a bit similar to the algorithms Xen uses for live migration. If
>that were implemented at the kernel level, something like this might be
>useful too. I think OpenVZ has some kind of migration support, but it's
>currently not live.
>

Yep, I'm not saying it couldn't be useful.

Send instant messages to your online friends http://au.messenger.yahoo.com
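For comparison, the "obvious way" under discussion would look something like
the following with the 2.6-era stop_machine_run() API. The helper
copy_all_dirty_pages_to_spare() is hypothetical; stop_machine_run() itself
is the real API of that era, where passing NR_CPUS means "run on any cpu".

#include <linux/stop_machine.h>

extern int copy_all_dirty_pages_to_spare(void);	/* hypothetical */

static int do_final_copy(void *unused)
{
	/* Every other CPU is spinning with interrupts disabled here, so
	 * nothing can dirty memory behind our back -- but the blackout
	 * lasts as long as the whole copy does. */
	return copy_all_dirty_pages_to_spare();
}

static int final_copy_stopped(void)
{
	return stop_machine_run(do_final_copy, NULL, NR_CPUS);
}

The trade-off the thread is circling: this is simple and race-free, but the
blackout is proportional to everything left to copy, which is exactly what
the continuous harvesting is designed to shrink.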
* Re: [Feature] x86_64 page tracking for Stratus servers
2006-09-06  7:36 ` Andi Kleen
2006-09-06  9:36 ` Nick Piggin
@ 2006-09-06 15:10 ` Kimball Murray
1 sibling, 0 replies; 12+ messages in thread
From: Kimball Murray @ 2006-09-06 15:10 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nick Piggin, linux-kernel, akpm

On 9/6/06, Andi Kleen <ak@suse.de> wrote:
>
> > Silly question: why can't you do all this from stop_machine_run context
> > (or your SMI), which doesn't have to worry about other CPUs dirtying
> > memory?
>
> Because that would be too slow for continuous mirroring.
>
> You can't go through 10+GB of virtual memory (or more with shared memory,
> because the scan has to be virtual) in an interrupt.
>
> The only sane way is to do it continuously.

That's exactly right on all counts. We can "harvest" dirty pages in a
continuous fashion, as a background task. But when we finally decide to
dive into machine blackout for the final copy, we need to ensure that the
blackout is as short as possible. Toward that goal, we have built a sort of
DMA engine that can move pages over the backplane to the spare cpu/memory
module without interfering (much) with the active module's activities. We
continue doing this until the number of pages left for the final copy is
small enough that we can be sure the final blackout copy will take a
sufficiently short time.

And you're right about the virtual scan as well. Trouble is, it's a
potentially unbounded scan. With the right machine load, we can find
ourselves in a state where it takes so long to scan all the page tables
that the pages are all dirty again before we start the next pass. For this
situation we have two choices. We can either give up and dive into the
blackout, passing a dirty list that basically includes all of memory (long
blackout time!), or we can continue scanning for a long time, waiting and
hoping the machine load will decrease to the point where our target number
of dirty pages is reached. This sort of thing will always be a problem
with a software solution.

-kimball
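The convergence loop described here, reduced to a sketch. The threshold,
pass limit, and helper names are invented for illustration, not taken from
the Stratus module.

#include <linux/errno.h>

#define BLACKOUT_THRESHOLD	256	/* pages cheap enough to copy in SMI */
#define MAX_PASSES		1000	/* give up if load never drops */

extern long harvest_and_copy_dirty_pages(void);	/* hypothetical */
extern int enter_blackout_and_finish(void);	/* hypothetical */

static int duplex_memory(void)
{
	int pass;

	for (pass = 0; pass < MAX_PASSES; pass++) {
		/* background sweep: harvest dirty pages and DMA them to
		 * the spare module while the OS keeps running */
		long ndirty = harvest_and_copy_dirty_pages();

		if (ndirty <= BLACKOUT_THRESHOLD)
			return enter_blackout_and_finish();
	}

	/* The unbounded-scan case: either black out with a huge final-copy
	 * list (long blackout) or tell the caller to try again later. */
	return -EAGAIN;
}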
* Re: [Feature] x86_64 page tracking for Stratus servers
2006-09-06  6:38 ` Nick Piggin
2006-09-06  7:36 ` Andi Kleen
@ 2006-09-06 14:10 ` Kimball Murray
1 sibling, 0 replies; 12+ messages in thread
From: Kimball Murray @ 2006-09-06 14:10 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, akpm, ak

Hi Nick,

On 9/6/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Silly question: why can't you do all this from stop_machine_run context
> (or your SMI), which doesn't have to worry about other CPUs dirtying
> memory?

It's not a silly question at all. We use an SMI handler for two reasons.
One is to perform the final memory copy in a state where we know that the
OS and interrupt handlers (even the NMI) are quiesced. Admittedly, there
are other ways to achieve this state. But the second reason is that after
the memory is mirrored, it is necessary to cause the CPUs on both modules
to execute in lockstep. Both CPU modules are driven by the same front-side
bus clock, but there's a lot more to getting the CPUs running in lockstep
than simply clocking them the same way. The processors must be put in
exactly the same state before leaving the SMI. And here I'm referring to
internal processor state as well; things that are not in the Intel
developer manuals are at play here. While in the SMI handler, we also
tweak some of the processors' internal settings so that they behave
deterministically afterward. Otherwise, they would fall out of lock in
short order.

I suppose I should also have mentioned that we support 3 different OSes on
this hardware: Linux, Windows, and one of our own called VOS. By burying
some of this mirror-and-lockstep magic in the SMI handler, we are able to
run the same SMI payload for all 3 OSes. It just makes life easier for us
from a product standpoint. All the OS and its modules need to do is
generate a "final-copy" list of dirty pages, store that list in a known
location, and trigger an SMI. The SMI handler looks at the list and does
the final copy. Then it resets the processors, applies some deterministic
Intel magic, and exits the SMI. I'm oversimplifying some of this, but it
should provide an idea of what has to happen.

> >Please consider this feature for inclusion in the kernel tree, as it is
> >very important to Stratus.
>
> Given that it doesn't touch core mm/ code, I don't really care about
> it[*], except that it doesn't make sense to have the tracking hooks in
> generic code, because they are pretty specific to your module.

It's generic only to x86_64. I could try to restrict it further to Intel
versions of that platform, but as I stated before, I had hoped the
tracking interface might be useful to someone.

-kimball
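A sketch of the OS-to-firmware handoff described above: publish the
final-copy list at an agreed physical address, then raise an SMI so the
firmware does the blackout copy. The mailbox layout, address, and command
byte are all hypothetical; writing the 0xB2 APM command port is a common
way to raise an SMI on Intel chipsets, but whether Stratus firmware uses
that mechanism is an assumption.

#include <linux/types.h>
#include <linux/string.h>
#include <asm/io.h>
#include <asm/system.h>

struct final_copy_list {		/* hypothetical layout */
	u64 nr_pages;
	u64 pfn[0];			/* pfns still dirty at blackout time */
};

#define FINAL_COPY_MAILBOX_PHYS	0xff000UL	/* hypothetical "known location" */
#define STRATUS_SMI_CMD		0x42		/* hypothetical command byte */

static void trigger_final_copy(struct final_copy_list *list, size_t bytes)
{
	void *mbox = __va(FINAL_COPY_MAILBOX_PHYS);

	memcpy(mbox, list, bytes);	/* where the SMI handler will look */
	wmb();				/* list must be visible before the SMI */

	/* SMI handler copies the listed pages, resets and re-syncs the
	 * CPUs, applies the lockstep setup, and returns to the OS. */
	outb(STRATUS_SMI_CMD, 0xb2);
}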
Thread overview: 12+ messages
2006-09-05 17:34 [Feature] x86_64 page tracking for Stratus servers Kimball Murray
2006-09-05 17:40 ` Arjan van de Ven
2006-09-05 18:38   ` Kimball Murray
2006-09-05 20:48     ` Andi Kleen
2006-09-05 17:56 ` Dave Hansen
2006-09-05 18:50   ` Kimball Murray
2006-09-05 18:16 ` Andi Kleen
2006-09-06  6:38 ` Nick Piggin
2006-09-06  7:36   ` Andi Kleen
2006-09-06  9:36     ` Nick Piggin
2006-09-06 15:10     ` Kimball Murray
2006-09-06 14:10   ` Kimball Murray