* [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
@ 2013-07-30  7:48 Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 01/10] sched: Introduce per node numa weights Srikar Dronamraju
                   ` (11 more replies)
  0 siblings, 12 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

Here is an approach that tries to consolidate a workload's tasks within
nodes. This results in much improved performance. I would assume this work
is complementary to Mel's work on numa faulting.

Here are the advantages of this approach.
1. Provides excellent consolidation of tasks.
 From my experiments, I have found that the better the task
 consolidation, the better the memory layout we achieve, which results in
 better performance.

2. Provides good improvement in most cases, but there are some regressions.

3. Extends the load balancer, especially when the cpus are idling.

Here is the outline of the approach.

- Every process has a per node array where we store the weight of all
  its tasks running on that node. This array gets updated on task
  enqueue/dequeue. (A small sketch of this bookkeeping and of the move
  decision follows this outline.)

- Added a 2 pass mechanism (somewhat taken from numacore but not
  exactly) while choosing tasks to move across nodes.

  In the first pass, choose only tasks that are ideal to be moved.
  While choosing a task, look at the per node process arrays to see if
  moving the task helps.
  If the first pass fails to move a task, any task can be chosen on the
  second pass.

- If the regular load balancer (rebalance_domains()) fails to balance the
  load (or finds no imbalance) and the cpu is idle, use that idle cpu to
  consolidate tasks onto nodes by using the information in the per
  node process arrays.

  Every idle cpu, if it doesn't have tasks queued after load balance,
  - walks through the cpus in its node and checks if there are buddy
    tasks that are not running on this node but ideally should have
    been part of this node.
  - To make sure that we don't pull all buddy tasks and create an
    imbalance, we look at the load on the node, pinned tasks and the
    process's contribution to the load on this node.
  - looks at the node which has the least number of buddy tasks
    running and tries to pull tasks from that node.

  - Once it finds the cpu from which to pull the tasks, it triggers
    active balancing. This type of active balancing runs just one
    pass, i.e. it only fetches tasks that increase numa locality.
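
To make this concrete, here is a small self-contained user-space sketch of
the per-process node-weight bookkeeping and of the ratio test used to decide
whether a cross-node move helps consolidation. The struct, the 2-node setup
and the numbers are illustrative only and are not part of the patches; the
actual code keeps the weights in mm->numa_weights and compares runqueue
lengths.

#include <stdio.h>

#define NR_NODES 2

struct proc {
	int weight[NR_NODES];	/* tasks of this process running on each node */
	int total;		/* total tasks of this process, all nodes */
};

/*
 * A move from src to dst is treated as helpful when the process has a
 * proportionally larger presence on the destination node than on the
 * source node (cross-multiplied to avoid division) and the destination
 * is not busier than the source.
 */
static int move_helps(const struct proc *p, int dst, int dst_nr_running,
		      int src, int src_nr_running)
{
	if (p->total < 2)
		return 0;	/* single task: nothing to consolidate */
	if (dst_nr_running > src_nr_running)
		return 0;	/* would worsen the overall balance */
	return p->weight[dst] * src_nr_running >= p->weight[src] * dst_nr_running;
}

int main(void)
{
	/* 6 tasks of one process: 2 on node 0, 4 on node 1; each node runs 8 tasks */
	struct proc p = { .weight = { 2, 4 }, .total = 6 };

	printf("pull towards node 1 helps: %d\n", move_helps(&p, 1, 8, 0, 8)); /* 1 */
	printf("pull towards node 0 helps: %d\n", move_helps(&p, 0, 8, 1, 8)); /* 0 */
	return 0;
}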

Here are results of specjbb run on a 2 node machine.
Specjbb was run on 3 vms.
In the fit case, the big vm was sized to fit within one node.
In the no-fit case, the big vm was bigger than one node.

-------------------------------------------------------------------------------------
|kernel        |                          nofit|                            fit|   vm|
|kernel        |          noksm|            ksm|          noksm|            ksm|   vm|
|kernel        |  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|   vm|
--------------------------------------------------------------------------------------
|v3.9          | 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253| vm_1|
|v3.9          |  66041|  84779|  64564|  86645|  67426|  84427|  63657|  85043| vm_2|
|v3.9          |  67322|  83301|  63731|  85394|  65015|  85156|  63838|  84199| vm_3|
--------------------------------------------------------------------------------------
|v3.9 + Mel(v5)| 133170| 177883| 136385| 176716| 140650| 174535| 132811| 190120| vm_1|
|v3.9 + Mel(v5)|  65021|  81707|  62876|  81826|  63635|  84943|  58313|  78997| vm_2|
|v3.9 + Mel(v5)|  61915|  82198|  60106|  81723|  64222|  81123|  59559|  78299| vm_3|
| % change     |  -2.12|  -6.09|   0.76|  -5.36|   2.68|  -8.94|  -2.86|   3.18| vm_1|
| % change     |  -1.54|  -3.62|  -2.61|  -5.56|  -5.62|   0.61|  -8.39|  -7.11| vm_2|
| % change     |  -8.03|  -1.32|  -5.69|  -4.30|  -1.22|  -4.74|  -6.70|  -7.01| vm_3|
--------------------------------------------------------------------------------------
|v3.9 + this   | 136766| 189704| 148642| 180723| 147474| 184711| 139270| 186768| vm_1|
|v3.9 + this   |  72742|  86980|  67561|  91659|  69781|  87741|  65989|  83508| vm_2|
|v3.9 + this   |  66075|  90591|  66135|  90059|  67942|  87229|  66100|  85908| vm_3|
| % change     |   0.52|   0.15|   9.81|  -3.21|   7.66|  -3.63|   1.86|   1.36| vm_1|
| % change     |  10.15|   2.60|   4.64|   5.79|   3.49|   3.93|   3.66|  -1.80| vm_2|
| % change     |  -1.85|   8.75|   3.77|   5.46|   4.50|   2.43|   3.54|   2.03| vm_3|
--------------------------------------------------------------------------------------


Autonuma benchmark results on a 2 node machine:
KernelVersion: 3.9.0
		Testcase:      Min      Max      Avg   StdDev
		  numa01:   118.98   122.37   120.96     1.17
     numa01_THREAD_ALLOC:   279.84   284.49   282.53     1.65
		  numa02:    36.84    37.68    37.09     0.31
	      numa02_SMT:    44.67    48.39    47.32     1.38

KernelVersion: 3.9.0 + Mel's v5
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   115.02   123.08   120.83     3.04    0.11%
     numa01_THREAD_ALLOC:   268.59   298.47   281.15    11.16    0.46%
		  numa02:    36.31    37.34    36.68     0.43    1.10%
	      numa02_SMT:    43.18    43.43    43.29     0.08    9.28%

KernelVersion: 3.9.0 + this patchset
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   103.46   112.31   106.44     3.10   12.93%
     numa01_THREAD_ALLOC:   277.51   289.81   283.88     4.98   -0.47%
		  numa02:    36.72    40.81    38.42     1.85   -3.26%
	      numa02_SMT:    56.50    60.00    58.08     1.23  -17.93%

KernelVersion: 3.9.0(HT)
		Testcase:      Min      Max      Avg   StdDev
		  numa01:   241.23   244.46   242.94     1.31
     numa01_THREAD_ALLOC:   301.95   307.39   305.04     2.20
		  numa02:    41.31    43.92    42.98     1.02
	      numa02_SMT:    37.02    37.58    37.44     0.21

KernelVersion: 3.9.0 + Mel's v5 (HT)
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   238.42   242.62   241.60     1.60    0.55%
     numa01_THREAD_ALLOC:   285.01   298.23   291.54     5.37    4.53%
		  numa02:    38.08    38.16    38.11     0.03   12.76%
	      numa02_SMT:    36.20    36.64    36.36     0.17    2.95%

KernelVersion: 3.9.0 + this patchset(HT)
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   175.17   189.61   181.90     5.26   32.19%
     numa01_THREAD_ALLOC:   285.79   365.26   305.27    30.35   -0.06%
		  numa02:    38.26    38.97    38.50     0.25   11.50%
	      numa02_SMT:    44.66    49.22    46.22     1.60  -17.84%


Autonuma benchmark results on a 4 node machine:
# dmidecode | grep 'Product Name:'
	Product Name: System x3750 M4 -[8722C1A]-
# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65468 MB
node 0 free: 63890 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 64033 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 64236 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65536 MB
node 3 free: 64162 MB
node distances:
node   0   1   2   3 
  0:  10  11  11  12 
  1:  11  10  12  11 
  2:  11  12  10  11 
  3:  12  11  11  10 

KernelVersion: 3.9.0
		Testcase:      Min      Max      Avg   StdDev
		  numa01:   581.35   761.95   681.23    80.97
     numa01_THREAD_ALLOC:   140.39   164.45   150.34     7.98
		  numa02:    18.47    20.12    19.25     0.65
	      numa02_SMT:    16.40    25.30    21.06     2.86

KernelVersion: 3.9.0 + Mel's v5 patchset
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   733.15   767.99   748.88    14.51   -8.81%
     numa01_THREAD_ALLOC:   154.18   169.13   160.48     5.76   -6.00%
		  numa02:    19.09    22.15    21.02     1.03   -7.99%
	      numa02_SMT:    23.01    25.53    23.98     0.83  -11.44%

KernelVersion: 3.9.0 + this patchset
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   409.64   457.91   444.55    17.66   51.69%
     numa01_THREAD_ALLOC:   158.10   174.89   169.32     5.84  -10.85%
		  numa02:    18.89    22.36    19.98     1.29   -3.26%
	      numa02_SMT:    23.33    27.87    25.02     1.68  -14.21%


KernelVersion: 3.9.0 (HT)
		Testcase:      Min      Max      Avg   StdDev
		  numa01:   567.62   752.06   620.26    66.72
     numa01_THREAD_ALLOC:   145.84   172.44   160.73    10.34
		  numa02:    18.11    20.06    19.10     0.67
	      numa02_SMT:    17.59    22.83    19.94     2.17

KernelVersion: 3.9.0 + Mel's v5 patchset (HT)
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   741.13   753.91   748.10     4.51  -16.96%
     numa01_THREAD_ALLOC:   153.57   162.45   158.22     3.18    1.55%
		  numa02:    19.15    20.96    20.04     0.64   -4.48%
	      numa02_SMT:    22.57    25.92    23.87     1.15  -15.16%

KernelVersion: 3.9.0 + this patchset (HT)
		Testcase:      Min      Max      Avg   StdDev  %Change
		  numa01:   418.46   457.77   436.00    12.81   40.25%
     numa01_THREAD_ALLOC:   156.21   169.79   163.75     4.37   -1.78%
		  numa02:    18.41    20.18    19.06     0.60    0.20%
	      numa02_SMT:    22.72    27.24    25.29     1.76  -19.64%


Autonuma benchmark results on an 8 node machine:

# dmidecode | grep 'Product Name:'
	Product Name: IBM x3950-[88722RZ]-

# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32510 MB
node 0 free: 31475 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32512 MB
node 1 free: 31709 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32512 MB
node 2 free: 31737 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32512 MB
node 3 free: 31736 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32512 MB
node 4 free: 31739 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32512 MB
node 5 free: 31639 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65280 MB
node 6 free: 63836 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 65280 MB
node 7 free: 64043 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  20  20  20  20  20  20  20 
  1:  20  10  20  20  20  20  20  20 
  2:  20  20  10  20  20  20  20  20 
  3:  20  20  20  10  20  20  20  20 
  4:  20  20  20  20  10  20  20  20 
  5:  20  20  20  20  20  10  20  20 
  6:  20  20  20  20  20  20  10  20 
  7:  20  20  20  20  20  20  20  10 

KernelVersion: 3.9.0
	Testcase:      Min      Max      Avg   StdDev
	  numa01:  1796.11  1848.89  1812.39    19.35
	  numa02:    55.05    62.32    58.30     2.37

KernelVersion: 3.9.0-mel_numa_balancing+()
	Testcase:      Min      Max      Avg   StdDev  %Change
	  numa01:  1758.01  1929.12  1853.15    77.15   -2.11%
	  numa02:    50.96    53.63    52.12     0.90   11.52%

KernelVersion: 3.9.0-numa_balancing_v39+()
	Testcase:      Min      Max      Avg   StdDev  %Change
	  numa01:  1081.66  1939.94  1500.01   350.20   16.10%
	  numa02:    35.32    43.92    38.64     3.35   44.76%


TODOs:
1. Use task loads for numa weights
2. Use numa faults as secondary key while moving threads


Andrea Arcangeli (1):
  x86, mm: Prevent gcc to re-read the pagetables

Srikar Dronamraju (9):
  sched: Introduce per node numa weights
  sched: Use numa weights while migrating tasks
  sched: Select a better task to pull across node using iterations
  sched: Move active_load_balance_cpu_stop to a new helper function
  sched: Extend idle balancing to look for consolidation of tasks
  sched: Limit migrations from a node
  sched: Pass hint to active balancer about the task to be chosen
  sched: Prevent a task from migrating immediately after an active balance
  sched: Choose a runqueue that has lesser local affinity tasks

 arch/x86/mm/gup.c        |   23 ++-
 fs/exec.c                |    6 +
 include/linux/mm_types.h |    2 +
 include/linux/sched.h    |    4 +
 kernel/fork.c            |   11 +-
 kernel/sched/core.c      |    2 +
 kernel/sched/fair.c      |  443 ++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |    4 +
 mm/memory.c              |    2 +-
 9 files changed, 475 insertions(+), 22 deletions(-)


* [RFC PATCH 01/10] sched: Introduce per node numa weights
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 02/10] sched: Use numa weights while migrating tasks Srikar Dronamraju
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

The load balancer spreads the load evenly for fairness and for maintaining
balance across different domains. However, where possible, related tasks
should be scheduled in the same domain (especially at node domains) to allow
tasks to have more local memory accesses. This consolidation can be done
without affecting fairness and while leaving the domains balanced.

To better consolidate the loads, account weights per-mm per-node. These
stats are used in later patches to select more appropriate tasks during
load balance.

TODO: Modify to capture and use the actual task weights rather than task
counts
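
For reference while reading the diff below: numa_weights holds nr_node_ids + 1
atomic counters. Slot i counts this mm's tasks currently enqueued on node i,
and the extra slot at index nr_node_ids counts the total across all nodes.
For example, on a 2 node box a process with 3 tasks enqueued on node 0 and
1 task enqueued on node 1 would have numa_weights = {3, 1, 4}.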

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 fs/exec.c                |    5 +++++
 include/linux/mm_types.h |    1 +
 kernel/fork.c            |   10 +++++++---
 kernel/sched/fair.c      |   34 ++++++++++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a96a488..b086e9e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -833,6 +833,11 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	task_unlock(tsk);
 	arch_pick_mmap_layout(mm);
+#ifdef CONFIG_NUMA_BALANCING
+	mm->numa_weights = kzalloc(sizeof(atomic_t) * (nr_node_ids + 1), GFP_KERNEL);
+	atomic_inc(&mm->numa_weights[cpu_to_node(task_cpu(tsk))]);
+	atomic_inc(&mm->numa_weights[nr_node_ids]);
+#endif
 	if (old_mm) {
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..45d02df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -435,6 +435,7 @@ struct mm_struct {
 	 * a different node than Make PTE Scan Go Now.
 	 */
 	int first_nid;
+	atomic_t *numa_weights;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 1766d32..21421bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,9 @@ void mmput(struct mm_struct *mm)
 		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
+#ifdef CONFIG_NUMA_BALANCING
+		kfree(mm->numa_weights);
+#endif
 		if (!list_empty(&mm->mmlist)) {
 			spin_lock(&mmlist_lock);
 			list_del(&mm->mmlist);
@@ -823,9 +826,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
-#ifdef CONFIG_NUMA_BALANCING
-	mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
 
@@ -844,6 +844,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
 
+#ifdef CONFIG_NUMA_BALANCING
+	mm->first_nid = NUMA_PTE_SCAN_INIT;
+	mm->numa_weights = kzalloc(sizeof(atomic_t) * (nr_node_ids + 1), GFP_KERNEL);
+#endif
 	return mm;
 
 free_pt:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..8a2b5aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -995,10 +995,40 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 		}
 	}
 }
+
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = rq_of(cfs_rq);
+	int curnode = cpu_to_node(cpu_of(rq));
+
+	if (mm && mm->numa_weights) {
+		atomic_inc(&mm->numa_weights[curnode]);
+		atomic_inc(&mm->numa_weights[nr_node_ids]);
+	}
+}
+
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+	struct mm_struct *mm = p->mm;
+	struct rq *rq = rq_of(cfs_rq);
+	int curnode = cpu_to_node(cpu_of(rq));
+
+	if (mm && mm->numa_weights) {
+		atomic_dec(&mm->numa_weights[curnode]);
+		atomic_dec(&mm->numa_weights[nr_node_ids]);
+	}
+}
 #else
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 }
+static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
+static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
+{
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static void
@@ -1713,6 +1743,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
+	if (entity_is_task(se))
+		account_numa_enqueue(cfs_rq, task_of(se));
 
 	if (cfs_rq->nr_running == 1) {
 		list_add_leaf_cfs_rq(cfs_rq);
@@ -1810,6 +1842,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq);
+	if (entity_is_task(se))
+		account_numa_dequeue(cfs_rq, task_of(se));
 }
 
 /*
-- 
1.7.1


* [RFC PATCH 02/10] sched: Use numa weights while migrating tasks
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 01/10] sched: Introduce per node numa weights Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 03/10] sched: Select a better task to pull across node using iterations Srikar Dronamraju
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

While migrating a task, check if moving it improves consolidation.
However, make sure that such a move doesn't upset fairness or create an
imbalance across the nodes.
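
As a rough worked example of the check added below (the numbers are made up):
if the process has 4 tasks on the destination node and 2 on the source node,
and both runqueues have 8 runnable tasks, then 4 * 8 >= 2 * 8 holds and the
move is treated as improving numa affinity; with the per-node task counts
reversed it would not be.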

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    1 +
 kernel/sched/fair.c   |   80 ++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..a77c3cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
+#define SD_NUMA			0x4000	/* cross-node balancing */
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..e792312 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6136,6 +6136,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 1*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
+					| 1*SD_NUMA
 					| sd_local_flags(level)
 					,
 		.last_balance		= jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a2b5aa..3df7f76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3326,6 +3326,58 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Return -1 if the task has no numa affinity.
+ * Return 1 if the task has numa affinity towards the destination
+ * node and the destination runqueue is no busier than the source.
+ * Else return 0.
+ */
+static int
+can_numa_migrate_task(struct task_struct *p, struct rq *dst_rq, struct rq *src_rq)
+{
+	struct mm_struct *mm;
+	int dst_node, src_node;
+	int dst_running, src_running;
+
+	mm = p->mm;
+	if (!mm || !mm->numa_weights)
+		return -1;
+
+	if (atomic_read(&mm->numa_weights[nr_node_ids]) < 2)
+		return -1;
+
+	dst_node = cpu_to_node(cpu_of(dst_rq));
+	src_node = cpu_to_node(cpu_of(src_rq));
+	dst_running = atomic_read(&mm->numa_weights[dst_node]);
+	src_running = atomic_read(&mm->numa_weights[src_node]);
+
+	if (dst_rq->nr_running <= src_rq->nr_running) {
+		if (dst_running * src_rq->nr_running >= src_running * dst_rq->nr_running)
+			return 1;
+	}
+
+	return 0;
+}
+#else
+static int
+can_numa_migrate_task(struct task_struct *p, struct rq *dst_rq, struct rq *src_rq)
+{
+	return -1;
+}
+#endif
+/*
+ * Don't move a task if the source runqueue has more numa affinity.
+ */
+static bool
+check_numa_affinity(struct task_struct *p, int cpu, int prev_cpu)
+{
+	struct rq *src_rq = cpu_rq(prev_cpu);
+	struct rq *dst_rq = cpu_rq(cpu);
+
+	return (can_numa_migrate_task(p, dst_rq, src_rq) != 0);
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3351,7 +3403,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+				check_numa_affinity(p, cpu, prev_cpu))
 			want_affine = 1;
 		new_cpu = prev_cpu;
 	}
@@ -3899,6 +3952,17 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+static bool force_migrate(struct lb_env *env, struct task_struct *p)
+{
+	if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
+		return true;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	return (can_numa_migrate_task(p, env->dst_rq, env->src_rq) == 1);
+}
+
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -3949,21 +4013,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or
 	 * 2) too many balance attempts have failed.
+	 * 3) has numa affinity
 	 */
-
 	tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
-	if (!tsk_cache_hot ||
-		env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
+	if (tsk_cache_hot) {
+		if (force_migrate(env, p)) {
 #ifdef CONFIG_SCHEDSTATS
-		if (tsk_cache_hot) {
 			schedstat_inc(env->sd, lb_hot_gained[env->idle]);
 			schedstat_inc(p, se.statistics.nr_forced_migrations);
-		}
 #endif
-		return 1;
-	}
-
-	if (tsk_cache_hot) {
+			return 1;
+		}
 		schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
 		return 0;
 	}
-- 
1.7.1


* [RFC PATCH 03/10] sched: Select a better task to pull across node using iterations
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 01/10] sched: Introduce per node numa weights Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 02/10] sched: Use numa weights while migrating tasks Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 04/10] sched: Move active_load_balance_cpu_stop to a new helper function Srikar Dronamraju
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

While selecting a task to pull across nodes, try to choose a task that
improves locality, i.e. choose a task that has more affinity to the
destination node.

To achieve this, walk the list of tasks in multiple iterations. For now,
use just two iterations. In the first iteration, a task is chosen to be
moved if and only if moving it helps improve node locality. In the last
iteration, fall back to the default behaviour, i.e. a task is chosen
irrespective of whether it improves node locality (the behaviour before
this change). This iteration logic applies only to cross node migration
and only with CONFIG_NUMA_BALANCING enabled.

So if there are two tasks in a runqueue, both eligible to be migrated to
another runqueue belonging to a different node, then this change tries to
choose the one that improves locality.

Similar logic was first used in Peter Zijlstra's numa core.
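
As a rough, self-contained user-space sketch of the two-pass selection (the
task list and the prefers_dst flag below are made up; the real code walks the
runqueue's cfs_tasks list and drives the passes via env->iterations):

#include <stdbool.h>
#include <stdio.h>

struct task_sample { int id; bool prefers_dst; };

static int pick_task(struct task_sample *tasks, int n)
{
	int pass, i;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			/* first pass: only tasks whose locality improves */
			if (pass == 0 && !tasks[i].prefers_dst)
				continue;
			return tasks[i].id;
		}
	}
	return -1;	/* nothing movable at all */
}

int main(void)
{
	struct task_sample t[] = { { 1, false }, { 2, true }, { 3, false } };

	printf("picked task %d\n", pick_task(t, 3));	/* 2: preferred task wins */
	return 0;
}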

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3df7f76..8fcbf96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3906,6 +3906,7 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	unsigned int		iterations;
 };
 
 /*
@@ -4030,6 +4031,21 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	return 1;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	if (!(env->sd->flags & SD_NUMA))
+		return false;
+
+	return (can_numa_migrate_task(p, env->dst_rq, env->src_rq) == 1);
+}
+#else
+static bool preferred_node(struct task_struct *p, struct lb_env *env)
+{
+	return false;
+}
+#endif
+
 /*
  * move_one_task tries to move exactly one task from busiest to this_rq, as
  * part of active balancing operations within "domain".
@@ -4041,7 +4057,11 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+again:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+		if (!preferred_node(p, env))
+			continue;
+
 		if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
 			continue;
 
@@ -4049,6 +4069,7 @@ static int move_one_task(struct lb_env *env)
 			continue;
 
 		move_task(p, env);
+
 		/*
 		 * Right now, this is only the second place move_task()
 		 * is called, so we can safely collect move_task()
@@ -4057,6 +4078,9 @@ static int move_one_task(struct lb_env *env)
 		schedstat_inc(env->sd, lb_gained[env->idle]);
 		return 1;
 	}
+	if (!env->iterations++)
+		goto again;
+
 	return 0;
 }
 
@@ -4096,6 +4120,9 @@ static int move_tasks(struct lb_env *env)
 			break;
 		}
 
+		if (!preferred_node(p, env))
+			goto next;
+
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;
 
@@ -5099,6 +5126,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.iterations	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -5130,6 +5158,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	ld_moved = 0;
 	lb_iterations = 1;
 	if (busiest->nr_running > 1) {
+#ifdef CONFIG_NUMA_BALANCING
+		if (sd->flags & SD_NUMA) {
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+				env.iterations = 0;
+		}
+#endif
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
@@ -5160,6 +5194,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			goto more_balance;
 		}
 
+		if (!ld_moved && !env.iterations++) {
+			env.loop	 = 0;
+			env.loop_break	 = sched_nr_migrate_break;
+			goto more_balance;
+		}
+
 		/*
 		 * some other cpu did the load balance for us.
 		 */
@@ -5407,8 +5447,16 @@ static int active_load_balance_cpu_stop(void *data)
 			.src_cpu	= busiest_rq->cpu,
 			.src_rq		= busiest_rq,
 			.idle		= CPU_IDLE,
+			.iterations	= 1,
 		};
 
+#ifdef CONFIG_NUMA_BALANCING
+		if ((sd->flags & SD_NUMA)) {
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+				env.iterations = 0;
+		}
+#endif
+
 		schedstat_inc(sd, alb_count);
 
 		if (move_one_task(&env))
-- 
1.7.1


* [RFC PATCH 04/10] sched: Move active_load_balance_cpu_stop to a new helper function
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 03/10] sched: Select a better task to pull across node using iterations Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 05/10] sched: Extend idle balancing to look for consolidation of tasks Srikar Dronamraju
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

Due to the way active_load_balance_cpu_stop gets called and the parameters
passed to it, the stop_one_cpu_nowait() call gets split across multiple
lines. Instead, move it into a separate helper function.

This is a cleanup; no functional changes.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8fcbf96..debb75a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5103,6 +5103,12 @@ static int need_active_balance(struct lb_env *env)
 
 static int active_load_balance_cpu_stop(void *data);
 
+static void active_load_balance(struct rq *rq)
+{
+	stop_one_cpu_nowait(cpu_of(rq), active_load_balance_cpu_stop, rq,
+			&rq->active_balance_work);
+}
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -5290,11 +5296,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			}
 			raw_spin_unlock_irqrestore(&busiest->lock, flags);
 
-			if (active_balance) {
-				stop_one_cpu_nowait(cpu_of(busiest),
-					active_load_balance_cpu_stop, busiest,
-					&busiest->active_balance_work);
-			}
+			if (active_balance)
+				active_load_balance(busiest);
 
 			/*
 			 * We've kicked active balancing, reset the failure
-- 
1.7.1


* [RFC PATCH 05/10] sched: Extend idle balancing to look for consolidation of tasks
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 04/10] sched: Move active_load_balance_cpu_stop to a new helper function Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 06/10] sched: Limit migrations from a node Srikar Dronamraju
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

If the cpu is idle even after a regular load balance, then
try to move a task from another node to this node, such that
node locality improves.

While choosing a task to pull, choose a task/address-space from the
currently running set of tasks on this node. Make sure that the chosen
address-space has a numa affinity to the current node. Choose another
node that has the least number of tasks that belong to this address
space.

This change might induce a slight imbalance, but there are enough checks
to make sure that the imbalance stays within limits. The slight imbalance
that is created can act as a catalyst/opportunity for the other node to
pull its node-affine tasks.

TODO: current checks that look at nr_running should be modified to
look at task loads instead.
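
Below is a simplified, self-contained user-space model of the node selection
described above (select_node_to_pull() in the patch). The arrays and numbers
are made up, and several of the checks in the real code (the proportion tests
and, later, the migration limits) are omitted:

#include <stdio.h>

#define NR_NODES 4

static int select_node_to_pull(const int weights[], const int nr_running[],
			       int this_node)
{
	int least = weights[this_node];
	int least_node = -1, node;

	for (node = 0; node < NR_NODES; node++) {
		if (node == this_node || weights[node] == 0)
			continue;
		if (weights[node] > least)		/* more affinity there, don't move */
			continue;
		if (nr_running[this_node] > nr_running[node])
			continue;			/* other node already lightly loaded */
		least = weights[node];
		least_node = node;
	}
	return least_node;
}

int main(void)
{
	int weights[NR_NODES]    = { 5, 1, 3, 0 };	/* this mm's tasks per node */
	int nr_running[NR_NODES] = { 8, 10, 9, 8 };	/* total runnable tasks per node */

	printf("pull from node %d\n", select_node_to_pull(weights, nr_running, 0));
	return 0;
}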

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |  156 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 156 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index debb75a..43af8d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5609,6 +5609,77 @@ void update_max_interval(void)
 	max_load_balance_interval = HZ*num_online_cpus()/10;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static struct task_struct *
+select_task_to_pull(struct mm_struct *this_mm, int this_cpu, int nid)
+{
+	struct mm_struct *mm;
+	struct rq *rq;
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(nid)) {
+		rq = cpu_rq(cpu);
+		mm = rq->curr->mm;
+
+		if (mm == this_mm) {
+			if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(rq->curr)))
+				return rq->curr;
+		}
+	}
+	return NULL;
+}
+
+static int
+select_node_to_pull(struct mm_struct *mm, unsigned int nr_running, int nid)
+{
+	atomic_t *weights = mm->numa_weights;
+	int least_running, running, other_nr_running, other_running;
+	int least_node = -1;
+	int other_node, cpu;
+
+	least_running = atomic_read(&weights[nid]);
+	running = least_running;
+	for_each_online_node(other_node) {
+		/* our own node? skip */
+		if (other_node == nid)
+			continue;
+
+		other_running = atomic_read(&weights[other_node]);
+		/* no interesting thread in this node */
+		if (other_running == 0)
+			continue;
+
+		/* other_node has more numa affinity? Don't move. */
+		if (other_running > least_running)
+			continue;
+
+		other_nr_running = 0;
+		for_each_cpu(cpu, cpumask_of_node(other_node))
+			other_nr_running += cpu_rq(cpu)->nr_running;
+
+		/* other_node is already lightly loaded? */
+		if (nr_running > other_nr_running)
+			continue;
+
+		/*
+		 * If the other node runs a significant proportion of the load
+		 * of the process in question, or relatively has more affinity
+		 * to this address space than the current node, then don't
+		 * move.
+		 */
+		if (other_nr_running < 2 * other_running)
+			continue;
+
+		if (nr_running * other_running >= other_nr_running * running)
+			continue;
+
+		least_running = other_running;
+		least_node = other_node;
+	}
+	return least_node;
+}
+#endif
+
 /*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
@@ -5674,6 +5745,91 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 		if (!balance)
 			break;
 	}
+#ifdef CONFIG_NUMA_BALANCING
+	if (!rq->nr_running) {
+		struct mm_struct *prev_mm, *mm;
+		struct task_struct *p = NULL;
+		unsigned int nr_running = 0;
+		int curr_running, total_running;
+		int other_node, nid, dcpu;
+
+		nid = cpu_to_node(cpu);
+		prev_mm = NULL;
+
+		for_each_cpu(dcpu, cpumask_of_node(nid))
+			nr_running += cpu_rq(dcpu)->nr_running;
+
+		for_each_cpu(dcpu, cpumask_of_node(nid)) {
+			struct rq *this_rq;
+
+			this_rq = cpu_rq(dcpu);
+			mm = this_rq->curr->mm;
+			if (!mm || !mm->numa_weights)
+				continue;
+
+			/*
+			 * Don't retry if the previous and the current
+			 * requests share the same address space.
+			 */
+			if (mm == prev_mm)
+				continue;
+
+			curr_running = atomic_read(&mm->numa_weights[nid]);
+			total_running = atomic_read(&mm->numa_weights[nr_node_ids]);
+
+			if (curr_running < 2 || total_running < 2)
+				continue;
+
+			prev_mm = mm;
+
+			/* all threads have consolidated */
+			if (curr_running == total_running)
+				continue;
+
+			/*
+			 * insignificant proportion of the load running on
+			 * this node?
+			 */
+			if (total_running > curr_running * (nr_node_ids + 1)) {
+				if (nr_running > 2 * curr_running)
+					continue;
+			}
+
+			other_node = select_node_to_pull(mm, nr_running, nid);
+			if (other_node == -1)
+				continue;
+			p = select_task_to_pull(mm, cpu, other_node);
+			if (p)
+				break;
+		}
+		if (p) {
+			struct rq *this_rq;
+			unsigned long flags;
+			int active_balance;
+
+			this_rq = task_rq(p);
+			active_balance = 0;
+
+			/*
+			 * ->active_balance synchronizes accesses to
+			 * ->active_balance_work.  Once set, it's cleared
+			 * only after active load balance is finished.
+			 */
+			raw_spin_lock_irqsave(&this_rq->lock, flags);
+			if (task_rq(p) == this_rq) {
+				if (!this_rq->active_balance) {
+					this_rq->active_balance = 1;
+					this_rq->push_cpu = cpu;
+					active_balance = 1;
+				}
+			}
+			raw_spin_unlock_irqrestore(&this_rq->lock, flags);
+
+			if (active_balance)
+				active_load_balance(this_rq);
+		}
+	}
+#endif
 	rcu_read_unlock();
 
 	/*
-- 
1.7.1


* [RFC PATCH 06/10] sched: Limit migrations from a node
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 05/10] sched: Extend idle balancing to look for consolidation of tasks Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 07/10] sched: Pass hint to active balancer about the task to be chosen Srikar Dronamraju
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

While tasks are being moved from one node to another, runqueues look
at nodes that have the least numa affinity. However, this can lead to more
requests to pull tasks from a single node than there are non-local
tasks available on that node.

Add a counter that limits the number of simultaneous
migrations. With this counter, if a source node (the node with the
least numa affinity for an address space) already has enough requests to
relinquish tasks, then we choose the node with the next least number of
affine threads for that address space.
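
A minimal user-space sketch of the capping idea (a plain counter standing in
for the atomic_add_unless() used in the patch; the limit and the number of
requests are made up):

#include <stdbool.h>
#include <stdio.h>

static bool try_claim(int *in_flight, int limit)
{
	if (*in_flight >= limit)	/* node already has enough pull requests */
		return false;		/* caller should try the next node */
	(*in_flight)++;
	return true;
}

int main(void)
{
	int in_flight = 0, limit = 2, i;

	for (i = 0; i < 4; i++)
		printf("request %d: %s\n", i,
		       try_claim(&in_flight, limit) ? "granted" : "try next node");
	return 0;
}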

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 fs/exec.c                |    1 +
 include/linux/mm_types.h |    1 +
 kernel/fork.c            |    1 +
 kernel/sched/fair.c      |    9 +++++++++
 4 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index b086e9e..9ce5cab 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -835,6 +835,7 @@ static int exec_mmap(struct mm_struct *mm)
 	arch_pick_mmap_layout(mm);
 #ifdef CONFIG_NUMA_BALANCING
 	mm->numa_weights = kzalloc(sizeof(atomic_t) * (nr_node_ids + 1), GFP_KERNEL);
+	mm->limit_migrations = kzalloc(sizeof(atomic_t) * nr_node_ids, GFP_KERNEL);
 	atomic_inc(&mm->numa_weights[cpu_to_node(task_cpu(tsk))]);
 	atomic_inc(&mm->numa_weights[nr_node_ids]);
 #endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 45d02df..4b0ba71 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -436,6 +436,7 @@ struct mm_struct {
 	 */
 	int first_nid;
 	atomic_t *numa_weights;
+	atomic_t *limit_migrations;
 #endif
 	struct uprobes_state uprobes_state;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 21421bd..2b55676 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -847,6 +847,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_NUMA_BALANCING
 	mm->first_nid = NUMA_PTE_SCAN_INIT;
 	mm->numa_weights = kzalloc(sizeof(atomic_t) * (nr_node_ids + 1), GFP_KERNEL);
+	mm->limit_migrations = kzalloc(sizeof(atomic_t) * nr_node_ids, GFP_KERNEL);
 #endif
 	return mm;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43af8d9..17027e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5673,6 +5673,12 @@ select_node_to_pull(struct mm_struct *mm, unsigned int nr_running, int nid)
 		if (nr_running * other_running >= other_nr_running * running)
 			continue;
 
+		if (!atomic_add_unless(&mm->limit_migrations[other_node], 1, other_running))
+			continue;
+
+		if (least_node != -1)
+			atomic_dec(&mm->limit_migrations[least_node]);
+
 		least_running = other_running;
 		least_node = other_node;
 	}
@@ -5801,6 +5807,7 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 			p = select_task_to_pull(mm, cpu, other_node);
 			if (p)
 				break;
+			atomic_dec(&mm->limit_migrations[other_node]);
 		}
 		if (p) {
 			struct rq *this_rq;
@@ -5827,6 +5834,8 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 
 			if (active_balance)
 				active_load_balance(this_rq);
+
+			atomic_dec(&mm->limit_migrations[other_node]);
 		}
 	}
 #endif
-- 
1.7.1


* [RFC PATCH 07/10] sched: Pass hint to active balancer about the task to be chosen
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 06/10] sched: Limit migrations from a node Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 08/10] sched: Prevent a task from migrating immediately after an active balance Srikar Dronamraju
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

If a task that improves numa affinity has already been chosen for
active balancing, then pass that task on to the actual migration.

This helps in 2 ways.
- We don't have to iterate through the list of tasks and choose a
  task again.
- If the chosen task has already moved off the runqueue, we avoid moving
  some other task that may or may not provide consolidation.
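
The flow as wired up below: the idle cpu records the chosen task in
rq->push_task (along with push_cpu) under the runqueue lock and kicks active
balancing; move_one_task() then moves exactly that task if it is still on
that runqueue, and active_load_balance_cpu_stop() clears push_task when it is
done. A tiny user-space model of this hint hand-off (illustrative only, plain
ints instead of task pointers and no locking):

#include <stdio.h>

struct rq_model { int push_task; };	/* 0 means "no hint recorded" */

static int move_one_task(struct rq_model *src)
{
	if (src->push_task) {		/* hint present: move exactly that task */
		int id = src->push_task;

		src->push_task = 0;	/* consume the hint */
		return id;
	}
	return 42;			/* otherwise: pretend we scanned the list */
}

int main(void)
{
	struct rq_model rq = { .push_task = 7 };

	printf("moved task %d\n", move_one_task(&rq));	/* 7, from the hint */
	printf("moved task %d\n", move_one_task(&rq));	/* 42, from the scan */
	return 0;
}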

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c  |   20 +++++++++++++++++++-
 kernel/sched/sched.h |    3 +++
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 17027e0..e04703e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4057,6 +4057,18 @@ static int move_one_task(struct lb_env *env)
 {
 	struct task_struct *p, *n;
 
+#ifdef CONFIG_NUMA_BALANCING
+	p = env->src_rq->push_task;
+	if (p) {
+		if (p->on_rq && task_cpu(p) == env->src_rq->cpu) {
+			move_task(p, env);
+			schedstat_inc(env->sd, lb_gained[env->idle]);
+			return 1;
+		}
+		return 0;
+	}
+#endif
+
 again:
 	list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
 		if (!preferred_node(p, env))
@@ -5471,6 +5483,9 @@ static int active_load_balance_cpu_stop(void *data)
 	double_unlock_balance(busiest_rq, target_rq);
 out_unlock:
 	busiest_rq->active_balance = 0;
+#ifdef CONFIG_NUMA_BALANCING
+	busiest_rq->push_task = NULL;
+#endif
 	raw_spin_unlock_irq(&busiest_rq->lock);
 	return 0;
 }
@@ -5621,6 +5636,8 @@ select_task_to_pull(struct mm_struct *this_mm, int this_cpu, int nid)
 		rq = cpu_rq(cpu);
 		mm = rq->curr->mm;
 
+		if (rq->push_task)
+			continue;
 		if (mm == this_mm) {
 			if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(rq->curr)))
 				return rq->curr;
@@ -5823,10 +5840,11 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
 			 * only after active load balance is finished.
 			 */
 			raw_spin_lock_irqsave(&this_rq->lock, flags);
-			if (task_rq(p) == this_rq) {
+			if (task_rq(p) == this_rq && !this_rq->push_task) {
 				if (!this_rq->active_balance) {
 					this_rq->active_balance = 1;
 					this_rq->push_cpu = cpu;
+					this_rq->push_task = p;
 					active_balance = 1;
 				}
 			}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..9f60d74 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -484,6 +484,9 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+#ifdef CONFIG_NUMA_BALANCING
+	struct task_struct *push_task;
+#endif
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.1


* [RFC PATCH 08/10] sched: Prevent a task from migrating immediately after an active balance
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 07/10] sched: Pass hint to active balancer about the task to be chosen Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 09/10] sched: Choose a runqueue that has lesser local affinity tasks Srikar Dronamraju
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

Once a task has been carefully chosen to be migrated from a source
node to a destination node, try to avoid other cpus moving this task
away from the destination node.

Otherwise, tasks might end up ping-ponging: one cpu pulling a task
because of numa affinity, another cpu pulling it back because of the
slight imbalance created by the previous migration (instead of actually
trying to pull a task that might lead to more consolidation).
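
A small user-space sketch of the hold-off (illustrative; in the patch the
freshly pulled task gets migrate_seq = 3 and the regular first-pass balance
attempts skip it while decrementing the counter):

#include <stdbool.h>
#include <stdio.h>

struct task_sample { int migrate_seq; };

static bool skip_for_now(struct task_sample *p)
{
	if (!p->migrate_seq)
		return false;	/* hold-off expired, treat the task normally */
	p->migrate_seq--;
	return true;
}

int main(void)
{
	struct task_sample t = { .migrate_seq = 3 };
	int attempt;

	for (attempt = 1; attempt <= 5; attempt++)
		printf("attempt %d: %s\n", attempt,
		       skip_for_now(&t) ? "skip" : "eligible");
	return 0;
}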

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    1 +
 kernel/sched/fair.c   |    9 +++++++++
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a77c3cd..ba188f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,6 +1506,7 @@ struct task_struct {
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
+	int migrate_seq;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e792312..453d989 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
 	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
 	p->numa_work.next = &p->numa_work;
+	p->migrate_seq = 0;
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e04703e..a99aebc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1014,6 +1014,7 @@ static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
 	struct rq *rq = rq_of(cfs_rq);
 	int curnode = cpu_to_node(cpu_of(rq));
 
+	p->migrate_seq = 0;
 	if (mm && mm->numa_weights) {
 		atomic_dec(&mm->numa_weights[curnode]);
 		atomic_dec(&mm->numa_weights[nr_node_ids]);
@@ -4037,6 +4038,13 @@ static bool preferred_node(struct task_struct *p, struct lb_env *env)
 	if (!(env->sd->flags & SD_NUMA))
 		return false;
 
+	if (env->iterations) {
+		if (!p->migrate_seq)
+			return true;
+
+		p->migrate_seq--;
+		return false;
+	}
 	return (can_numa_migrate_task(p, env->dst_rq, env->src_rq) == 1);
 }
 #else
@@ -4063,6 +4071,7 @@ static int move_one_task(struct lb_env *env)
 		if (p->on_rq && task_cpu(p) == env->src_rq->cpu) {
 			move_task(p, env);
 			schedstat_inc(env->sd, lb_gained[env->idle]);
+			p->migrate_seq = 3;
 			return 1;
 		}
 		return 0;
-- 
1.7.1


* [RFC PATCH 09/10] sched: Choose a runqueue that has lesser local affinity tasks
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 08/10] sched: Prevent a task from migrating immediately after an active balance Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  7:48 ` [RFC PATCH 10/10] x86, mm: Prevent gcc to re-read the pagetables Srikar Dronamraju
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

While migrating tasks to a different node, choosing the busiest runqueue
may not always be the right choice. The busiest runqueue might hold
tasks that are already consolidated, and choosing such a runqueue might
actually hurt performance.

Instead, choose a runqueue that has fewer node-local tasks,
i.e. more tasks that would benefit from running on a node other than
their current node. The load balancer would then pitch in to move load
from the busiest runqueue to the runqueue from which tasks for cross node
migration were picked. So the load would end up being better consolidated.
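
A simplified user-space model of the runqueue choice described above (the cpu
numbers and counts are made up; the real find_numa_queue() walks the sched
group and honours env->cpus):

#include <stdio.h>

struct rq_sample { int cpu; int nr_running; int non_local_task_count; };

static int find_numa_queue(const struct rq_sample *rqs, int n, int busiest)
{
	int i, best = busiest;

	for (i = 0; i < n; i++) {
		if (rqs[i].nr_running <= 1)
			continue;
		if (rqs[i].non_local_task_count > rqs[best].non_local_task_count)
			best = i;
	}
	return rqs[best].cpu;
}

int main(void)
{
	struct rq_sample rqs[] = {
		{ .cpu = 8,  .nr_running = 4, .non_local_task_count = 0 },
		{ .cpu = 9,  .nr_running = 3, .non_local_task_count = 2 },
		{ .cpu = 10, .nr_running = 5, .non_local_task_count = 1 },
	};

	/* index 0 plays the busiest queue found by plain load */
	printf("pull from cpu %d\n", find_numa_queue(rqs, 3, 0));	/* cpu 9 */
	return 0;
}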

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h |    2 +
 kernel/sched/fair.c   |   82 +++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h  |    1 +
 3 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba188f1..c5d0a13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,8 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	struct callback_head numa_work;
 	int migrate_seq;
+	bool pinned_task;
+	bool local_task;
 #endif /* CONFIG_NUMA_BALANCING */
 
 	struct rcu_head rcu;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a99aebc..e749650 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,6 +805,36 @@ static void task_numa_placement(struct task_struct *p)
 	/* FIXME: Scheduling placement policy hints go here */
 }
 
+static void update_local_task_count(struct task_struct *p)
+{
+	struct rq *rq = task_rq(p);
+	int curnode = cpu_to_node(cpu_of(rq));
+	int cur_numa_weight = 0;
+	int total_numa_weight = 0;
+
+	if (!p->pinned_task) {
+		if (p->mm && p->mm->numa_weights) {
+			cur_numa_weight = atomic_read(&p->mm->numa_weights[curnode]);
+			total_numa_weight = atomic_read(&p->mm->numa_weights[nr_node_ids]);
+		}
+
+		/*
+		 * Account tasks that are neither pinned nor numa-affine
+		 * as non-local tasks.
+		 */
+		if (p->local_task != (cur_numa_weight * nr_node_ids > total_numa_weight)) {
+			if (!p->local_task) {
+				rq->non_local_task_count--;
+				p->local_task = true;
+			} else {
+				rq->non_local_task_count++;
+				p->local_task = false;
+			}
+
+		}
+	}
+}
+
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
@@ -826,6 +856,9 @@ void task_numa_fault(int node, int pages, bool migrated)
 			p->numa_scan_period + jiffies_to_msecs(10));
 
 	task_numa_placement(p);
+
+	/* Should this be moved to update_curr()? */
+	update_local_task_count(p);
 }
 
 static void reset_ptenuma_scan(struct task_struct *p)
@@ -996,16 +1029,31 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	}
 }
 
+static void add_non_local_task_count(struct rq *rq, struct task_struct *p,
+		int value)
+{
+	if (p->pinned_task || p->local_task)
+		return;
+	else
+		rq->non_local_task_count += value;
+}
+
 static void account_numa_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p)
 {
 	struct mm_struct *mm = p->mm;
 	struct rq *rq = rq_of(cfs_rq);
 	int curnode = cpu_to_node(cpu_of(rq));
+	int cur_numa_weight = 0;
+	int total_numa_weight = 0;
 
 	if (mm && mm->numa_weights) {
-		atomic_inc(&mm->numa_weights[curnode]);
-		atomic_inc(&mm->numa_weights[nr_node_ids]);
+		cur_numa_weight = atomic_inc_return(&mm->numa_weights[curnode]);
+		total_numa_weight = atomic_inc_return(&mm->numa_weights[nr_node_ids]);
 	}
+
+	p->pinned_task = (p->nr_cpus_allowed == 1);
+	p->local_task = (cur_numa_weight * nr_node_ids > total_numa_weight);
+	add_non_local_task_count(rq, p, 1);
 }
 
 static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
@@ -1019,6 +1067,10 @@ static void account_numa_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p)
 		atomic_dec(&mm->numa_weights[curnode]);
 		atomic_dec(&mm->numa_weights[nr_node_ids]);
 	}
+
+	add_non_local_task_count(rq, p, -1);
+	p->pinned_task = false;
+	p->local_task = false;
 }
 #else
 static void task_tick_numa(struct rq *rq, struct task_struct *curr)
@@ -5046,6 +5098,27 @@ find_busiest_group(struct lb_env *env, int *balance)
 	return NULL;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static struct rq *find_numa_queue(struct lb_env *env,
+				struct sched_group *group, struct rq *busy_rq)
+{
+	struct rq *rq;
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		if (!cpumask_test_cpu(i, env->cpus))
+			continue;
+
+		rq = cpu_rq(i);
+		if (rq->nr_running > 1) {
+			if (rq->non_local_task_count > busy_rq->non_local_task_count)
+				busy_rq = rq;
+		}
+	}
+	return busy_rq;
+}
+#endif
+
 /*
  * find_busiest_queue - find the busiest runqueue among the cpus in group.
  */
@@ -5187,8 +5260,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	if (busiest->nr_running > 1) {
 #ifdef CONFIG_NUMA_BALANCING
 		if (sd->flags & SD_NUMA) {
-			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu))
+			if (cpu_to_node(env.dst_cpu) != cpu_to_node(env.src_cpu)) {
 				env.iterations = 0;
+				busiest = find_numa_queue(&env, group, busiest);
+			}
+
 		}
 #endif
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f60d74..5e620b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -486,6 +486,7 @@ struct rq {
 	struct sched_avg avg;
 #ifdef CONFIG_NUMA_BALANCING
 	struct task_struct *push_task;
+	unsigned int non_local_task_count;
 #endif
 };
 
-- 
1.7.1


* [RFC PATCH 10/10] x86, mm: Prevent gcc to re-read the pagetables
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (8 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 09/10] sched: Choose a runqueue that has lesser local affinity tasks Srikar Dronamraju
@ 2013-07-30  7:48 ` Srikar Dronamraju
  2013-07-30  8:17 ` [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Peter Zijlstra
  2013-07-31 13:33 ` Andrew Theurer
  11 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  7:48 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, Ingo Molnar
  Cc: Andrea Arcangeli, Johannes Weiner, Linux-MM, LKML,
	Preeti U Murthy, Linus Torvalds, Srikar Dronamraju

From: Andrea Arcangeli <aarcange@redhat.com>

GCC is very likely to read the pagetables just once and cache the result in
the local stack or in a register, but it can also decide to re-read the
pagetables. The problem is that the pagetable entries in those places can
change out from under gcc.

In the page fault path we only hold ->mmap_sem for reading; both the page
fault and MADV_DONTNEED take ->mmap_sem for reading, and we don't hold any
PT lock yet.

In get_user_pages_fast() the TLB shootdown code can clear the pagetables
before firing any TLB flush (the page can't be freed until the TLB
flushing IPI has been delivered but the pagetables will be cleared well
before sending any TLB flushing IPI).

With THP/hugetlbfs the pmd (and the pud for hugetlbfs giga pages) can also
change under gup_fast; it won't just be cleared, for the same reasons
described above for the pte in the page fault case.
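
To make the refetch hazard concrete, here is a minimal userspace sketch
(illustration only, not part of the patch; it reuses the ACCESS_ONCE()
definition from include/linux/compiler.h):

#include <stdio.h>

/* same idea as the kernel's ACCESS_ONCE(): force exactly one load */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

static long check_then_use_plain(long *p)
{
	long v = *p;	/* plain load: gcc is allowed to re-read *p below */

	if (v == 0)
		return -1;
	return v;	/* a re-read here could observe a newer value */
}

static long check_then_use_once(long *p)
{
	long v = ACCESS_ONCE(*p);	/* sampled exactly once */

	if (v == 0)
		return -1;
	return v;	/* always the same sample that was checked above */
}

int main(void)
{
	long x = 42;

	printf("%ld %ld\n", check_then_use_plain(&x), check_then_use_once(&x));
	return 0;
}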

[ This patch was picked up from the AutoNUMA tree. It also stayed in Ingo
Molnar's NUMA tree for a while. ]

Originally-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

---
Ingo, Andrea, Please let me know if I can add your signed-off-by.
---
 arch/x86/mm/gup.c |   23 ++++++++++++++++++++---
 mm/memory.c       |    2 +-
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..6dc9921 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -150,7 +150,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 
 	pmdp = pmd_offset(&pud, addr);
 	do {
-		pmd_t pmd = *pmdp;
+		/*
+		 * With THP and hugetlbfs the pmd can change from
+		 * under us and it can be cleared as well by the TLB
+		 * shootdown, so read it with ACCESS_ONCE to do all
+		 * computations on the same sampling.
+		 */
+		pmd_t pmd = ACCESS_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
 		/*
@@ -220,7 +226,13 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 
 	pudp = pud_offset(&pgd, addr);
 	do {
-		pud_t pud = *pudp;
+		/*
+		 * With hugetlbfs giga pages the pud can change from
+		 * under us and it can be cleared as well by the TLB
+		 * shootdown, so read it with ACCESS_ONCE to do all
+		 * computations on the same sampling.
+		 */
+		pud_t pud = ACCESS_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
@@ -280,7 +292,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	local_irq_save(flags);
 	pgdp = pgd_offset(mm, addr);
 	do {
-		pgd_t pgd = *pgdp;
+		/*
+		 * The pgd could be cleared by the TLB shootdown from
+		 * under us so read it with ACCESS_ONCE to do all
+		 * computations on the same sampling.
+		 */
+		pgd_t pgd = ACCESS_ONCE(*pgdp);
 
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..6254dd2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3703,7 +3703,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	pte_t entry;
 	spinlock_t *ptl;
 
-	entry = *pte;
+	entry = ACCESS_ONCE(*pte);
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (9 preceding siblings ...)
  2013-07-30  7:48 ` [RFC PATCH 10/10] x86, mm: Prevent gcc to re-read the pagetables Srikar Dronamraju
@ 2013-07-30  8:17 ` Peter Zijlstra
  2013-07-30  8:20   ` Peter Zijlstra
  2013-07-31 13:33 ` Andrew Theurer
  11 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2013-07-30  8:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this work
> is complementary to Mel's work with numa faulting.

I highly dislike the use of task weights here. It seems completely
unrelated to the problem at hand.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  8:17 ` [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Peter Zijlstra
@ 2013-07-30  8:20   ` Peter Zijlstra
  2013-07-30  9:03     ` Srikar Dronamraju
  2013-07-30  9:15     ` Srikar Dronamraju
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2013-07-30  8:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > Here is an approach that looks to consolidate workloads across nodes.
> > This results in much improved performance. Again I would assume this work
> > is complementary to Mel's work with numa faulting.
> 
> I highly dislike the use of task weights here. It seems completely
> unrelated to the problem at hand.

I also don't particularly like the fact that it's purely process based.
The faults information we have gives much richer task relations.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  8:20   ` Peter Zijlstra
@ 2013-07-30  9:03     ` Srikar Dronamraju
  2013-07-30  9:10       ` Peter Zijlstra
  2013-07-30  9:15     ` Srikar Dronamraju
  1 sibling, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  9:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

* Peter Zijlstra <peterz@infradead.org> [2013-07-30 10:20:01]:

> On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > Here is an approach that looks to consolidate workloads across nodes.
> > > This results in much improved performance. Again I would assume this work
> > > is complementary to Mel's work with numa faulting.
> > 
> > I highly dislike the use of task weights here. It seems completely
> > unrelated to the problem at hand.
> 
> I also don't particularly like the fact that it's purely process based.
> The faults information we have gives much richer task relations.
> 

With a purely fault-information based approach, I am not seeing any
major improvement in task/memory consolidation. I still see memory
spread across different nodes and tasks getting ping-ponged to different
nodes. And if there are multiple unrelated processes, then we see a mix
of tasks from different processes on each of the nodes.

This spreading of load, as per my observation, isn't helping
performance. This is especially true with bigger boxes, and I would take
this as a hint that we need to consolidate tasks for better performance.

Now I could just use the number of tasks rather than the task weights I
use in the current patchset, but I don't think that would be ideal
either; especially since it wouldn't work with fair share scheduling.

For example: let's say there are 2 VMs running similar loads on a 2 node
machine. We would get the best performance if we could cleanly segregate
the load. I know all problems cannot be generalized into just this set;
my thinking is to get at least this set of problems solved.

Do you see any alternatives other than numa faults/task weights that we
could use to better consolidate tasks?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  9:03     ` Srikar Dronamraju
@ 2013-07-30  9:10       ` Peter Zijlstra
  2013-07-30  9:26         ` Peter Zijlstra
  2013-07-30  9:46         ` Srikar Dronamraju
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2013-07-30  9:10 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2013-07-30 10:20:01]:
> 
> > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > This results in much improved performance. Again I would assume this work
> > > > is complementary to Mel's work with numa faulting.
> > > 
> > > I highly dislike the use of task weights here. It seems completely
> > > unrelated to the problem at hand.
> > 
> > I also don't particularly like the fact that it's purely process based.
> > The faults information we have gives much richer task relations.
> > 
> 
> With just pure fault information based approach, I am not seeing any
> major improvement in tasks/memory consolidation. I still see memory
> spread across different nodes and tasks getting ping-ponged to different
> nodes. And if there are multiple unrelated processes, then we see a mix
> of tasks of different processes in each of the node.

The fault thing isn't finished. Mel explicitly said it doesn't yet have
inter-task relations. And you run everything in a VM which is like a big
nasty mangler for anything sane.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  8:20   ` Peter Zijlstra
  2013-07-30  9:03     ` Srikar Dronamraju
@ 2013-07-30  9:15     ` Srikar Dronamraju
  2013-07-30  9:33       ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  9:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

* Peter Zijlstra <peterz@infradead.org> [2013-07-30 10:20:01]:

> On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > Here is an approach that looks to consolidate workloads across nodes.
> > > This results in much improved performance. Again I would assume this work
> > > is complementary to Mel's work with numa faulting.
> > 
> > I highly dislike the use of task weights here. It seems completely
> > unrelated to the problem at hand.
> 
> I also don't particularly like the fact that it's purely process based.
> The faults information we have gives much richer task relations.
> 

Peter, 

Can you please suggest workloads that I could try which might showcase
why you hate a pure process-based approach?

I know numa02_SMT does regress with my patches, but I think that is
mostly my implementation's fault and not an issue with the approach.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  9:10       ` Peter Zijlstra
@ 2013-07-30  9:26         ` Peter Zijlstra
  2013-07-30  9:46         ` Srikar Dronamraju
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2013-07-30  9:26 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, Jul 30, 2013 at 11:10:21AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2013-07-30 10:20:01]:
> > 
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > > This results in much improved performance. Again I would assume this work
> > > > > is complementary to Mel's work with numa faulting.
> > > > 
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > > 
> > > I also don't particularly like the fact that it's purely process based.
> > > The faults information we have gives much richer task relations.
> > > 
> > 
> > With just pure fault information based approach, I am not seeing any
> > major improvement in tasks/memory consolidation. I still see memory
> > spread across different nodes and tasks getting ping-ponged to different
> > nodes. And if there are multiple unrelated processes, then we see a mix
> > of tasks of different processes in each of the node.
> 
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.

Also, the last time you posted this, I already said that if you'd use
the faults data to do grouping you'd get similar results. Task weight
is a completely unrelated and random measure. I think you even conceded
this.

So I really don't get why you're still using task weight for this.

Also, Ingo already showed that you can get task grouping from the fault
information itself, no need to use mm information to do this.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  9:15     ` Srikar Dronamraju
@ 2013-07-30  9:33       ` Peter Zijlstra
  2013-07-31 17:35         ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2013-07-30  9:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:

> Can you please suggest workloads that I could try which might showcase
> why you hate pure process based approach?

2 processes, 1 sysvshm segment. I know there's multi-process MPI
libraries out there.

Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  9:10       ` Peter Zijlstra
  2013-07-30  9:26         ` Peter Zijlstra
@ 2013-07-30  9:46         ` Srikar Dronamraju
  2013-07-31 15:09           ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-30  9:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

* Peter Zijlstra <peterz@infradead.org> [2013-07-30 11:10:21]:

> On Tue, Jul 30, 2013 at 02:33:45PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2013-07-30 10:20:01]:
> > 
> > > On Tue, Jul 30, 2013 at 10:17:55AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jul 30, 2013 at 01:18:15PM +0530, Srikar Dronamraju wrote:
> > > > > Here is an approach that looks to consolidate workloads across nodes.
> > > > > This results in much improved performance. Again I would assume this work
> > > > > is complementary to Mel's work with numa faulting.
> > > > 
> > > > I highly dislike the use of task weights here. It seems completely
> > > > unrelated to the problem at hand.
> > > 
> > > I also don't particularly like the fact that it's purely process based.
> > > The faults information we have gives much richer task relations.
> > > 
> > 
> > With just pure fault information based approach, I am not seeing any
> > major improvement in tasks/memory consolidation. I still see memory
> > spread across different nodes and tasks getting ping-ponged to different
> > nodes. And if there are multiple unrelated processes, then we see a mix
> > of tasks of different processes in each of the node.
> 
> The fault thing isn't finished. Mel explicitly said it doesn't yet have
> inter-task relations. And you run everything in a VM which is like a big
> nasty mangler for anything sane.
> 

I am not against faults; fault based handling is very much needed.
I have already said that this approach is complementary to the numa
faults that Mel is proposing.

Right now I think that if we can first get the tasks to consolidate on
nodes and then use the numa faults to place the tasks, we would be able
to arrive at a very good solution.

Plain fault information actually causes confusion in a fair number of
cases, especially if the initial set of pages is all consolidated into a
smaller set of nodes. With plain fault information, "memory follows cpu"
and "cpu follows memory" conflict with each other: memory wants to move
to the nodes where the tasks are currently running, while the tasks plan
to move to the nodes where the memory currently is.

Also, most of the consolidation that I have proposed is pretty
conservative and is mostly done at idle balance time, so it would not
affect the numa faulting in any way. When I run with my patches (along
with some debug code), the consolidation happens pretty quickly. Once
consolidation has happened, numa faults would be of immense value.

Here is how I am looking at the solution.

1. Until the initial scan delay, allow tasks to consolidate.

2. From the first scan delay to the next scan delay, account numa
   faults and allow memory to move, but don't yet use numa faults to
   drive scheduling decisions. Here too, tasks continue to consolidate.

	This will lead to tasks and memory moving to specific nodes,
	resulting in consolidation.
	
3. After the second scan delay, continue to account numa faults and
allow numa faults to drive scheduling decisions.

Whether we should also use task weights at stage 3 or just numa faults,
and which one should get more preference, is something I am not clear
about at this time. For now, I would think we need to factor in both of
them.
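
Roughly, the ordering I have in mind looks like this (plain C sketch for
illustration only; the stage names and the scan counter are made-up, they
are not fields from either patchset):

enum numa_stage {
	STAGE_CONSOLIDATE,	/* 1: consolidate tasks, no fault data used yet */
	STAGE_ACCOUNT_FAULTS,	/* 2: account faults and let memory move        */
	STAGE_FAULT_PLACEMENT,	/* 3: faults also drive scheduling decisions    */
};

struct numa_task {
	int scans_completed;	/* numa scan periods elapsed for this task */
};

static enum numa_stage numa_stage_of(const struct numa_task *t)
{
	if (t->scans_completed < 1)
		return STAGE_CONSOLIDATE;
	if (t->scans_completed < 2)
		return STAGE_ACCOUNT_FAULTS;
	return STAGE_FAULT_PLACEMENT;
}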

I think this approach would mean tasks get consolidated while the
inter-process and inter-task relations that you are looking for also
remain strong.

Is this an acceptable solution?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
                   ` (10 preceding siblings ...)
  2013-07-30  8:17 ` [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Peter Zijlstra
@ 2013-07-31 13:33 ` Andrew Theurer
  2013-07-31 15:43   ` Srikar Dronamraju
  11 siblings, 1 reply; 24+ messages in thread
From: Andrew Theurer @ 2013-07-31 13:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, 2013-07-30 at 13:18 +0530, Srikar Dronamraju wrote:
> Here is an approach that looks to consolidate workloads across nodes.
> This results in much improved performance. Again I would assume this work
> is complementary to Mel's work with numa faulting.
> 
> Here are the advantages of this approach.
> 1. Provides excellent consolidation of tasks.
>  From my experiments, I have found that the better the task
>  consolidation, we achieve better the memory layout, which results in
>  better the performance.
> 
> 2. Provides good improvement in most cases, but there are some regressions.
> 
> 3. Looks to extend the load balancer esp when the cpus are idling.
> 
> Here is the outline of the approach.
> 
> - Every process has a per node array where we store the weight of all
>   its tasks running on that node. This arrays gets updated on task
>   enqueue/dequeue.
> 
> - Added a 2 pass mechanism (somewhat taken from numacore but not
>   exactly) while choosing tasks to move across nodes.
> 
>   In the first pass, choose only tasks that are ideal to be moved.
>   While choosing a task, look at the per node process arrays to see if
>   moving task helps.
>   If the first pass fails to move a task, any task can be chosen on the
>   second pass.
> 
> - If the regular load balancer (rebalance_domain()) fails to balance the
>   load (or finds no imbalance) and there is a cpu, use the cpu to
>   consolidate tasks to the nodes by using the information in the per
>   node process arrays.
> 
>   Every idle cpu if its doesnt have tasks queued after load balance,
>   - will walk thro the cpus in its node and checks if there are buddy
>     tasks that are not part of the node but should have been ideally
>     part of this node.
>   - To make sure that we dont pull all buddy tasks and create an
>     imbalance, we look at load on the load, pinned tasks and the
>     processes contribution to the load for this node.
>   - Each cpu looks at the node which has the least number of buddy tasks
>     running and tries to pull the tasks from such nodes.
> 
>   - Once it finds the cpu from which to pull the tasks, it triggers
>     active_balancing. This type of active balancing triggers just one
>     pass. i.e it only fetches tasks that increase numa locality.
> 
> Here are results of specjbb run on a 2 node machine.

Here's a comparison with 4 KVM VMs running dbench on a 4 socket, 40
core, 80 thread host.

kernel                          total dbench throughput

3.9-numabal-on                  21242
3.9-numabal-off                 20455
3.9-numabal-on-consolidate      22541
3.9-numabal-off-consolidate     21632
3.9-numabal-off-node-pinning    26450
3.9-numabal-on-node-pinning     25265

Based on the node pinning results, we have a long way to go with either
numa-balancing and/or consolidation.  One thing the consolidation does help
with is actually getting the sibling tasks running on the same node:

% CPU usage by node for 1st VM
node00 node01 node02  node03
094%    002%    001%    001%

However, the node which was chosen to consolidate tasks is
not the same node where most of the memory for the tasks is located:

% memory per node for 1st VM
              host-node00  host-node01    host-node02    host-node03
             -----------    -----------    -----------   ----------
 VM-node00   295937(034%)   550400(064%)   6144(000%)       0(000%) 


By comparison, same stats for numa-balancing on and no consolidation:

% CPU usage by node for 1st VM
node00 node01 node02 node03
  028%   027%  020%    023%   <-CPU usage spread across whole system

% memory per node for 1st VM
             host-node00    host-node01    host-node02    host-node03
             -----------    -----------    -----------    -----------  
 VM-node00    49153(006%)   673792(083%)    51712(006%)   36352(004%)

I think the consolidation is a nice concept, but it needs a much tighter
integration with numa balancing.  The action of clumping tasks onto the
same node's runqueues should be triggered by detecting that they also
access the same memory.

> Specjbb was run on 3 vms.
> In the fit case, one vm was big to fit one node size.
> In the no-fit case, one vm was bigger than the node size.
> 
> -------------------------------------------------------------------------------------
> |kernel        |                          nofit|                            fit|   vm|
> |kernel        |          noksm|            ksm|          noksm|            ksm|   vm|
> |kernel        |  nothp|    thp|  nothp|    thp|  nothp|    thp|  nothp|    thp|   vm|
> --------------------------------------------------------------------------------------
> |v3.9          | 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253| vm_1|
> |v3.9          |  66041|  84779|  64564|  86645|  67426|  84427|  63657|  85043| vm_2|
> |v3.9          |  67322|  83301|  63731|  85394|  65015|  85156|  63838|  84199| vm_3|
> --------------------------------------------------------------------------------------
> |v3.9 + Mel(v5)| 133170| 177883| 136385| 176716| 140650| 174535| 132811| 190120| vm_1|
> |v3.9 + Mel(v5)|  65021|  81707|  62876|  81826|  63635|  84943|  58313|  78997| vm_2|
> |v3.9 + Mel(v5)|  61915|  82198|  60106|  81723|  64222|  81123|  59559|  78299| vm_3|
> | % change     |  -2.12|  -6.09|   0.76|  -5.36|   2.68|  -8.94|  -2.86|   3.18| vm_1|
> | % change     |  -1.54|  -3.62|  -2.61|  -5.56|  -5.62|   0.61|  -8.39|  -7.11| vm_2|
> | % change     |  -8.03|  -1.32|  -5.69|  -4.30|  -1.22|  -4.74|  -6.70|  -7.01| vm_3|
> --------------------------------------------------------------------------------------
> |v3.9 + this   | 136766| 189704| 148642| 180723| 147474| 184711| 139270| 186768| vm_1|
> |v3.9 + this   |  72742|  86980|  67561|  91659|  69781|  87741|  65989|  83508| vm_2|
> |v3.9 + this   |  66075|  90591|  66135|  90059|  67942|  87229|  66100|  85908| vm_3|
> | % change     |   0.52|   0.15|   9.81|  -3.21|   7.66|  -3.63|   1.86|   1.36| vm_1|
> | % change     |  10.15|   2.60|   4.64|   5.79|   3.49|   3.93|   3.66|  -1.80| vm_2|
> | % change     |  -1.85|   8.75|   3.77|   5.46|   4.50|   2.43|   3.54|   2.03| vm_3|
> --------------------------------------------------------------------------------------
> 
> 
> Autonuma benchmark results on a 2 node machine:
> KernelVersion: 3.9.0
> 		Testcase:      Min      Max      Avg   StdDev
> 		  numa01:   118.98   122.37   120.96     1.17
>      numa01_THREAD_ALLOC:   279.84   284.49   282.53     1.65
> 		  numa02:    36.84    37.68    37.09     0.31
> 	      numa02_SMT:    44.67    48.39    47.32     1.38
> 
> KernelVersion: 3.9.0 + Mel's v5
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   115.02   123.08   120.83     3.04    0.11%
>      numa01_THREAD_ALLOC:   268.59   298.47   281.15    11.16    0.46%
> 		  numa02:    36.31    37.34    36.68     0.43    1.10%
> 	      numa02_SMT:    43.18    43.43    43.29     0.08    9.28%
> 
> KernelVersion: 3.9.0 + this patchset
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   103.46   112.31   106.44     3.10   12.93%
>      numa01_THREAD_ALLOC:   277.51   289.81   283.88     4.98   -0.47%
> 		  numa02:    36.72    40.81    38.42     1.85   -3.26%
> 	      numa02_SMT:    56.50    60.00    58.08     1.23  -17.93%
> 
> KernelVersion: 3.9.0(HT)
> 		Testcase:      Min      Max      Avg   StdDev
> 		  numa01:   241.23   244.46   242.94     1.31
>      numa01_THREAD_ALLOC:   301.95   307.39   305.04     2.20
> 		  numa02:    41.31    43.92    42.98     1.02
> 	      numa02_SMT:    37.02    37.58    37.44     0.21
> 
> KernelVersion: 3.9.0 + Mel's v5 (HT)
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   238.42   242.62   241.60     1.60    0.55%
>      numa01_THREAD_ALLOC:   285.01   298.23   291.54     5.37    4.53%
> 		  numa02:    38.08    38.16    38.11     0.03   12.76%
> 	      numa02_SMT:    36.20    36.64    36.36     0.17    2.95%
> 
> KernelVersion: 3.9.0 + this patchset(HT)
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   175.17   189.61   181.90     5.26   32.19%
>      numa01_THREAD_ALLOC:   285.79   365.26   305.27    30.35   -0.06%
> 		  numa02:    38.26    38.97    38.50     0.25   11.50%
> 	      numa02_SMT:    44.66    49.22    46.22     1.60  -17.84%
> 
> 
> Autonuma benchmark results on a 4 node machine:
> # dmidecode | grep 'Product Name:'
> 	Product Name: System x3750 M4 -[8722C1A]-
> # numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
> node 0 size: 65468 MB
> node 0 free: 63890 MB
> node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
> node 1 size: 65536 MB
> node 1 free: 64033 MB
> node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
> node 2 size: 65536 MB
> node 2 free: 64236 MB
> node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
> node 3 size: 65536 MB
> node 3 free: 64162 MB
> node distances:
> node   0   1   2   3 
>   0:  10  11  11  12 
>   1:  11  10  12  11 
>   2:  11  12  10  11 
>   3:  12  11  11  10 
> 
> KernelVersion: 3.9.0
> 		Testcase:      Min      Max      Avg   StdDev
> 		  numa01:   581.35   761.95   681.23    80.97
>      numa01_THREAD_ALLOC:   140.39   164.45   150.34     7.98
> 		  numa02:    18.47    20.12    19.25     0.65
> 	      numa02_SMT:    16.40    25.30    21.06     2.86
> 
> KernelVersion: 3.9.0 + Mel's v5 patchset
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   733.15   767.99   748.88    14.51   -8.81%
>      numa01_THREAD_ALLOC:   154.18   169.13   160.48     5.76   -6.00%
> 		  numa02:    19.09    22.15    21.02     1.03   -7.99%
> 	      numa02_SMT:    23.01    25.53    23.98     0.83  -11.44%
> 
> KernelVersion: 3.9.0 + this patchset
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   409.64   457.91   444.55    17.66   51.69%
>      numa01_THREAD_ALLOC:   158.10   174.89   169.32     5.84  -10.85%
> 		  numa02:    18.89    22.36    19.98     1.29   -3.26%
> 	      numa02_SMT:    23.33    27.87    25.02     1.68  -14.21%
> 
> 
> KernelVersion: 3.9.0 (HT)
> 		Testcase:      Min      Max      Avg   StdDev
> 		  numa01:   567.62   752.06   620.26    66.72
>      numa01_THREAD_ALLOC:   145.84   172.44   160.73    10.34
> 		  numa02:    18.11    20.06    19.10     0.67
> 	      numa02_SMT:    17.59    22.83    19.94     2.17
> 
> KernelVersion: 3.9.0 + Mel's v5 patchset (HT)
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   741.13   753.91   748.10     4.51  -16.96%
>      numa01_THREAD_ALLOC:   153.57   162.45   158.22     3.18    1.55%
> 		  numa02:    19.15    20.96    20.04     0.64   -4.48%
> 	      numa02_SMT:    22.57    25.92    23.87     1.15  -15.16%
> 
> KernelVersion: 3.9.0 + this patchset (HT)
> 		Testcase:      Min      Max      Avg   StdDev  %Change
> 		  numa01:   418.46   457.77   436.00    12.81   40.25%
>      numa01_THREAD_ALLOC:   156.21   169.79   163.75     4.37   -1.78%
> 		  numa02:    18.41    20.18    19.06     0.60    0.20%
> 	      numa02_SMT:    22.72    27.24    25.29     1.76  -19.64%
> 
> 
> Autonuma results on a 8 node machine:
> 
> # dmidecode | grep 'Product Name:'
> 	Product Name: IBM x3950-[88722RZ]-
> 
> # numactl -H
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 32510 MB
> node 0 free: 31475 MB
> node 1 cpus: 8 9 10 11 12 13 14 15
> node 1 size: 32512 MB
> node 1 free: 31709 MB
> node 2 cpus: 16 17 18 19 20 21 22 23
> node 2 size: 32512 MB
> node 2 free: 31737 MB
> node 3 cpus: 24 25 26 27 28 29 30 31
> node 3 size: 32512 MB
> node 3 free: 31736 MB
> node 4 cpus: 32 33 34 35 36 37 38 39
> node 4 size: 32512 MB
> node 4 free: 31739 MB
> node 5 cpus: 40 41 42 43 44 45 46 47
> node 5 size: 32512 MB
> node 5 free: 31639 MB
> node 6 cpus: 48 49 50 51 52 53 54 55
> node 6 size: 65280 MB
> node 6 free: 63836 MB
> node 7 cpus: 56 57 58 59 60 61 62 63
> node 7 size: 65280 MB
> node 7 free: 64043 MB
> node distances:
> node   0   1   2   3   4   5   6   7 
>   0:  10  20  20  20  20  20  20  20 
>   1:  20  10  20  20  20  20  20  20 
>   2:  20  20  10  20  20  20  20  20 
>   3:  20  20  20  10  20  20  20  20 
>   4:  20  20  20  20  10  20  20  20 
>   5:  20  20  20  20  20  10  20  20 
>   6:  20  20  20  20  20  20  10  20 
>   7:  20  20  20  20  20  20  20  10 
> 
> KernelVersion: 3.9.0
> 	Testcase:      Min      Max      Avg   StdDev
> 	  numa01:  1796.11  1848.89  1812.39    19.35
> 	  numa02:    55.05    62.32    58.30     2.37
> 
> KernelVersion: 3.9.0-mel_numa_balancing+()
> 	Testcase:      Min      Max      Avg   StdDev  %Change
> 	  numa01:  1758.01  1929.12  1853.15    77.15   -2.11%
> 	  numa02:    50.96    53.63    52.12     0.90   11.52%
> 
> KernelVersion: 3.9.0-numa_balancing_v39+()
> 	Testcase:      Min      Max      Avg   StdDev  %Change
> 	  numa01:  1081.66  1939.94  1500.01   350.20   16.10%
> 	  numa02:    35.32    43.92    38.64     3.35   44.76%
> 
> 
> TODOs:
> 1. Use task loads for numa weights
> 2. Use numa faults as secondary key while moving threads
> 
> 
> Andrea Arcangeli (1):
>   x86, mm: Prevent gcc to re-read the pagetables
> 
> Srikar Dronamraju (9):
>   sched: Introduce per node numa weights
>   sched: Use numa weights while migrating tasks
>   sched: Select a better task to pull across node using iterations
>   sched: Move active_load_balance_cpu_stop to a new helper function
>   sched: Extend idle balancing to look for consolidation of tasks
>   sched: Limit migrations from a node
>   sched: Pass hint to active balancer about the task to be chosen
>   sched: Prevent a task from migrating immediately after an active balance
>   sched: Choose a runqueue that has lesser local affinity tasks
> 
>  arch/x86/mm/gup.c        |   23 ++-
>  fs/exec.c                |    6 +
>  include/linux/mm_types.h |    2 +
>  include/linux/sched.h    |    4 +
>  kernel/fork.c            |   11 +-
>  kernel/sched/core.c      |    2 +
>  kernel/sched/fair.c      |  443 ++++++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h     |    4 +
>  mm/memory.c              |    2 +-
>  9 files changed, 475 insertions(+), 22 deletions(-)
> 

-Andrew


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  9:46         ` Srikar Dronamraju
@ 2013-07-31 15:09           ` Peter Zijlstra
  2013-07-31 18:06             ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2013-07-31 15:09 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

On Tue, Jul 30, 2013 at 03:16:50PM +0530, Srikar Dronamraju wrote:
> I am not against fault and fault based handling is very much needed. 
> I have listed that this approach is complementary to numa faults that
> Mel is proposing. 
> 
> Right now I think if we can first get the tasks to consolidate on nodes
> and then use the numa faults to place the tasks, then we would be able
> to have a very good solution. 
> 
> Plain fault information is actually causing confusion in enough number
> of cases esp if the initial set of pages is all consolidated into fewer
> set of nodes. With plain fault information, memory follows cpu, cpu
> follows memory are conflicting with each other. memory wants to move to
> nodes where the tasks are currently running and the tasks are planning
> to move nodes where the current memory is around.

Since task weights are a completely random measure the above story
completely fails to make any sense. If you can collate on an arbitrary
number, why can't you collate on faults?

The fact that the placement policies so far have not had inter-task
relations doesn't mean it's not possible.

> Also most of the consolidation that I have proposed is pretty
> conservative or either done at idle balance time. This would not affect
> the numa faulting in any way. When I run with my patches (along with
> some debug code), the consolidation happens pretty pretty quickly.
> Once consolidation has happened, numa faults would be of immense value.

And also completely broken in various 'fun' ways. You're far too fond of
nr_running for one.

Also, afaict it never does anything if the machine is overloaded and we
never hit the !nr_running case in rebalance_domains.

> Here is how I am looking at the solution.
> 
> 1. Till the initial scan delay, allow tasks to consolidate

I would really want to not change regular balance behaviour for now;
you're also adding far too many atomic operations to the scheduler fast
path; that's going to make people terribly unhappy.

> 2. After the first scan delay to the next scan delay, account numa
>    faults, allow memory to move. But dont use numa faults as yet to
>    drive scheduling decisions. Here also task continue to consolidate.
> 
> 	This will lead to tasks and memory moving to specific nodes and
> 	leading to consolidation.

This is just plain silly: once you have fault information you'd better
use it to move tasks towards where the memory is; doing anything else
is, like I said, silly.

> 3. After the second scan delay, continue to account numa faults and
> allow numa faults to drive scheduling decisions.
> 
> Should we use also use task weights at stage 3 or just numa faults or
> which one should get more preference is something that I am not clear at
> this time. At this time, I would think we would need to factor in both
> of them.
> 
> I think this approach would mean tasks get consolidated but the inter
> process, inter task relations that you are looking for also remain
> strong.
> 
> Is this a acceptable solution?

No, again, task weight is a completely random number unrelated to
anything we want to do. Furthermore we simply cannot add mm wide atomics
to the scheduler hot paths.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-31 13:33 ` Andrew Theurer
@ 2013-07-31 15:43   ` Srikar Dronamraju
  0 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-31 15:43 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Andrea Arcangeli,
	Johannes Weiner, Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

* Andrew Theurer <habanero@linux.vnet.ibm.com> [2013-07-31 08:33:44]:
>              -----------    -----------    -----------    -----------  
>  VM-node00|   49153(006%)   673792(083%)    51712(006%)   36352(004%) 
> 
> I think the consolidation is a nice concept, but it needs a much tighter
> integration with numa balancing.  The action to clump tasks on same node's
> runqueues should be triggered by detecting that they also access
> the same memory.
> 

Thanks Andrew for testing and reporting your results and analysis.
I will try to focus on getting consolidation plus a tighter integration
with numa balancing.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-30  9:33       ` Peter Zijlstra
@ 2013-07-31 17:35         ` Srikar Dronamraju
  0 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-31 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

* Peter Zijlstra <peterz@infradead.org> [2013-07-30 11:33:21]:

> On Tue, Jul 30, 2013 at 02:45:43PM +0530, Srikar Dronamraju wrote:
> 
> > Can you please suggest workloads that I could try which might showcase
> > why you hate pure process based approach?
> 
> 2 processes, 1 sysvshm segment. I know there's multi-process MPI
> libraries out there.
> 
> Something like: perf bench numa mem -p 2 -G 4096 -0 -z --no-data_rand_walk -Z
> 

The above dumped core; looks like -T is a must with -G.

I tried "perf bench numa mem -p 2 -T 32 -G 4096 -0 -z --no-data_rand_walk -Z".
It still didn't seem to do anything on my 4 node box (almost 2 hours and
nothing happened).

Finally I ran "perf bench numa mem -a"
(both with ht disabled and enabled)

Convergence-wise, my patchset did really well.

Bandwidth looks like a mixed bag. Though there are improvements, we also
see degradations, and I am not sure how to quantify which of the three
was best. The Nx1 tests were the ones where this patchset showed a
negative change, but a positive one for the others.

Is this what you were looking for? Or was it something else?

(Lower is better)
testcase		3.9.0		Mels v5		this_patchset 	Units
------------------------------------------------------------------------------
1x3-convergence		0.320		100.060		100.204		secs
1x4-convergence		100.139		100.162		100.155		secs
1x6-convergence		100.455		100.179		1.078		secs
2x3-convergence		100.261		100.339		9.743		secs
3x3-convergence		100.213		100.168		10.073		secs
4x4-convergence		100.307		100.201		19.686		secs
4x4-convergence-NOTHP	100.229		100.221		3.189		secs
4x6-convergence		101.441		100.632		6.204		secs
4x8-convergence		100.680		100.588		5.275		secs
8x4-convergence		100.335		100.365		34.069		secs
8x4-convergence-NOTHP	100.331		100.412		100.478		secs
3x1-convergence		1.227		1.536		0.576		secs
4x1-convergence		1.224		1.063		1.390		secs
8x1-convergence		1.713		2.437		1.704		secs
16x1-convergence	2.750		2.677		1.856		secs
32x1-convergence	1.985		1.795		1.391		secs


(Higher is better)
testcase		3.9.0		Mels v5		this_patchset 	Units
------------------------------------------------------------------------------
RAM-bw-local		3.341		3.340		3.325		GB/sec
RAM-bw-local-NOTHP	3.308		3.307		3.290		GB/sec
RAM-bw-remote		1.815		1.815		1.815		GB/sec
RAM-bw-local-2x		6.410		6.413		6.412		GB/sec
RAM-bw-remote-2x	3.020		3.041		3.027		GB/sec
RAM-bw-cross		4.397		3.425		4.374		GB/sec
2x1-bw-process		3.481		3.442		3.492		GB/sec
3x1-bw-process		5.423		7.547		5.445		GB/sec
4x1-bw-process		5.108		11.009		5.118		GB/sec
8x1-bw-process		8.929		10.935		8.825		GB/sec
8x1-bw-process-NOTHP	12.754		11.442		22.889		GB/sec
16x1-bw-process		12.886		12.685		13.546		GB/sec
4x1-bw-thread		19.147		17.964		9.622		GB/sec
8x1-bw-thread		26.342		30.171		14.679		GB/sec
16x1-bw-thread		41.527		36.363		40.070		GB/sec
32x1-bw-thread		45.005		40.950		49.846		GB/sec
2x3-bw-thread		9.493		14.444		8.145		GB/sec
4x4-bw-thread		18.309		16.382		45.384		GB/sec
4x6-bw-thread		14.524		18.502		17.058		GB/sec
4x8-bw-thread		13.315		16.852		33.693		GB/sec
4x8-bw-thread-NOTHP	12.273		12.226		24.887		GB/sec
3x3-bw-thread		17.614		11.960		16.119		GB/sec
5x5-bw-thread		13.415		17.585		24.251		GB/sec
2x16-bw-thread		11.718		11.174		17.971		GB/sec
1x32-bw-thread		11.360		10.902		14.330		GB/sec
numa02-bw		48.999		44.173		54.795		GB/sec
numa02-bw-NOTHP		47.655		42.600		53.445		GB/sec
numa01-bw-thread	36.983		39.692		45.254		GB/sec
numa01-bw-thread-NOTHP	38.486		35.208		44.118		GB/sec



With HT ON

(Lower is better)
testcase		3.9.0		Mels v5		this_patchset 	Units
------------------------------------------------------------------------------
1x3-convergence		100.114		100.138		100.084		secs
1x4-convergence		0.468		100.227		100.153		secs
1x6-convergence		100.278		100.400		100.197		secs
2x3-convergence		100.186		1.833		13.132		secs
3x3-convergence		100.302		100.457		2.087		secs
4x4-convergence		100.237		100.178		2.466		secs
4x4-convergence-NOTHP	100.148		100.251		2.985		secs
4x6-convergence		100.931		3.632		9.184		secs
4x8-convergence		100.398		100.456		4.801		secs
8x4-convergence		100.649		100.458		4.179		secs
8x4-convergence-NOTHP	100.391		100.428		9.758		secs
3x1-convergence		1.472		1.501		0.727		secs
4x1-convergence		1.478		1.489		1.408		secs
8x1-convergence		2.380		2.385		2.432		secs
16x1-convergence	3.260		3.399		2.219		secs
32x1-convergence	2.622		2.067		1.951		secs



(Higher is better)
testcase		3.9.0		Mels v5		this_patchset 	Units
------------------------------------------------------------------------------
RAM-bw-local		3.333		3.342		3.345		GB/sec
RAM-bw-local-NOTHP	3.305		3.306		3.307		GB/sec
RAM-bw-remote		1.814		1.814		1.816		GB/sec
RAM-bw-local-2x		7.896		6.400		6.538		GB/sec
RAM-bw-remote-2x	2.982		3.038		3.034		GB/sec
RAM-bw-cross		4.313		3.427		4.372		GB/sec
2x1-bw-process		3.473		4.708		3.784		GB/sec
3x1-bw-process		5.397		4.983		5.399		GB/sec
4x1-bw-process		5.040		8.775		5.098		GB/sec
8x1-bw-process		8.989		6.862		13.745		GB/sec
8x1-bw-process-NOTHP	8.457		19.094		8.118		GB/sec
16x1-bw-process		13.482		23.067		15.138		GB/sec
4x1-bw-thread		14.904		18.258		9.713		GB/sec
8x1-bw-thread		24.160		29.153		12.495		GB/sec
16x1-bw-thread		41.283		36.642		32.140		GB/sec
32x1-bw-thread		46.983		43.068		48.153		GB/sec
2x3-bw-thread		9.718		15.344		10.846		GB/sec
4x4-bw-thread		12.602		15.758		13.148		GB/sec
4x6-bw-thread		13.807		11.278		18.540		GB/sec
4x8-bw-thread		13.316		11.677		22.795		GB/sec
4x8-bw-thread-NOTHP	12.548		21.797		30.807		GB/sec
3x3-bw-thread		13.500		18.758		18.569		GB/sec
5x5-bw-thread		14.575		14.199		36.521		GB/sec
2x16-bw-thread		11.345		11.434		19.569		GB/sec
1x32-bw-thread		14.123		10.586		14.587		GB/sec
numa02-bw		50.963		44.092		53.419		GB/sec
numa02-bw-NOTHP		50.553		42.724		51.106		GB/sec
numa01-bw-thread	33.724		33.050		37.801		GB/sec
numa01-bw-thread-NOTHP	39.064		35.139		43.314		GB/sec



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks
  2013-07-31 15:09           ` Peter Zijlstra
@ 2013-07-31 18:06             ` Srikar Dronamraju
  0 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2013-07-31 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Andrea Arcangeli, Johannes Weiner,
	Linux-MM, LKML, Preeti U Murthy, Linus Torvalds

* Peter Zijlstra <peterz@infradead.org> [2013-07-31 17:09:23]:

> On Tue, Jul 30, 2013 at 03:16:50PM +0530, Srikar Dronamraju wrote:
> > I am not against fault and fault based handling is very much needed. 
> > I have listed that this approach is complementary to numa faults that
> > Mel is proposing. 
> > 
> > Right now I think if we can first get the tasks to consolidate on nodes
> > and then use the numa faults to place the tasks, then we would be able
> > to have a very good solution. 
> > 
> > Plain fault information is actually causing confusion in enough number
> > of cases esp if the initial set of pages is all consolidated into fewer
> > set of nodes. With plain fault information, memory follows cpu, cpu
> > follows memory are conflicting with each other. memory wants to move to
> > nodes where the tasks are currently running and the tasks are planning
> > to move nodes where the current memory is around.
> 
> Since task weights are a completely random measure the above story
> completely fails to make any sense. If you can collate on an arbitrary
> number, why can't you collate on faults?

Task weights contribute to cpu load, and we want to keep the loads
balanced and make sure that we don't do excessive consolidation where we
end up imbalanced across cpus/nodes. For example, in the numa02 case (a
single process running across all nodes), we don't want tasks to
consolidate or to make the system imbalanced. So I thought task weights
would give me hints about whether we should consolidate or back off from
consolidation. How do I derive hints to stop consolidation based on numa
faults?
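
Concretely, the kind of hint I mean is the local_task test in the enqueue
path earlier in this series: a node only counts as holding a "local" share
of a process when its share of the process's task weight exceeds the
1/nr_node_ids average. A simplified restatement (illustration only):

/*
 * Simplified restatement of the enqueue-time check
 * cur_numa_weight * nr_node_ids > total_numa_weight:
 * on a 4-node box, a process with total weight 8 needs more than
 * 8/4 = 2 units of weight on a node before tasks there are treated
 * as local to it.
 */
static int node_share_above_average(int node_weight, int total_weight,
				    int nr_nodes)
{
	return node_weight * nr_nodes > total_weight;
}

For an evenly spread numa02-style load this test does not fire on any node,
which is the kind of signal I want for backing off from consolidation.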

> 
> The fact that the placement policies so far have not had inter-task
> relations doesn't mean its not possible.
> 

Do you have ideas that I can try out that could help build these
inter-task relations?

> > Also most of the consolidation that I have proposed is pretty
> > conservative or either done at idle balance time. This would not affect
> > the numa faulting in any way. When I run with my patches (along with
> > some debug code), the consolidation happens pretty pretty quickly.
> > Once consolidation has happened, numa faults would be of immense value.
> 
> And also completely broken in various 'fun' ways. You're far too fond of
> nr_running for one.

Yeah, I too feel I am too attached to nr_running.
> 
> Also, afaict it never does anything if the machine is overloaded and we
> never hit the !nr_running case in rebalance_domains.

Actually no; in most of my testing, cpu utilization is close to 100%.
And I have the find_numa_queue and preferred_node logic that should kick
in. My idea is that we could achieve consolidation much more easily in an
overloaded case, since we don't actually have to do active migration.
Further, there are hints at task wake-up time.

If we can make the load balancer intelligent enough that it schedules the
right task on the right cpu/node, will we still need to migrate tasks on
faults? Aren't we making the code complicated by introducing too many
more points where we do pseudo load balancing?


> 
> > Here is how I am looking at the solution.
> > 
> > 1. Till the initial scan delay, allow tasks to consolidate
> 
> I would really want to not change regular balance behaviour for now;
> you're also adding far too many atomic operations to the scheduler fast
> path, that's going to make people terribly unhappy.
> 
> > 2. After the first scan delay to the next scan delay, account numa
> >    faults, allow memory to move. But dont use numa faults as yet to
> >    drive scheduling decisions. Here also task continue to consolidate.
> > 
> > 	This will lead to tasks and memory moving to specific nodes and
> > 	leading to consolidation.
> 
> This is just plain silly, once you have fault information you'd better
> use it to move tasks towards where the memory is, doing anything else
> is, like said, silly.
> 
> > 3. After the second scan delay, continue to account numa faults and
> > allow numa faults to drive scheduling decisions.
> > 
> > Should we use also use task weights at stage 3 or just numa faults or
> > which one should get more preference is something that I am not clear at
> > this time. At this time, I would think we would need to factor in both
> > of them.
> > 
> > I think this approach would mean tasks get consolidated but the inter
> > process, inter task relations that you are looking for also remain
> > strong.
> > 
> > Is this a acceptable solution?
> 
> No, again, task weight is a completely random number unrelated to
> anything we want to do. Furthermore we simply cannot add mm wide atomics
> to the scheduler hot paths.
> 

How do I maintain per-mm, per-node data?

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

Thread overview: 24+ messages
2013-07-30  7:48 [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 01/10] sched: Introduce per node numa weights Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 02/10] sched: Use numa weights while migrating tasks Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 03/10] sched: Select a better task to pull across node using iterations Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 04/10] sched: Move active_load_balance_cpu_stop to a new helper function Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 05/10] sched: Extend idle balancing to look for consolidation of tasks Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 06/10] sched: Limit migrations from a node Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 07/10] sched: Pass hint to active balancer about the task to be chosen Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 08/10] sched: Prevent a task from migrating immediately after an active balance Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 09/10] sched: Choose a runqueue that has lesser local affinity tasks Srikar Dronamraju
2013-07-30  7:48 ` [RFC PATCH 10/10] x86, mm: Prevent gcc to re-read the pagetables Srikar Dronamraju
2013-07-30  8:17 ` [RFC PATCH 00/10] Improve numa scheduling by consolidating tasks Peter Zijlstra
2013-07-30  8:20   ` Peter Zijlstra
2013-07-30  9:03     ` Srikar Dronamraju
2013-07-30  9:10       ` Peter Zijlstra
2013-07-30  9:26         ` Peter Zijlstra
2013-07-30  9:46         ` Srikar Dronamraju
2013-07-31 15:09           ` Peter Zijlstra
2013-07-31 18:06             ` Srikar Dronamraju
2013-07-30  9:15     ` Srikar Dronamraju
2013-07-30  9:33       ` Peter Zijlstra
2013-07-31 17:35         ` Srikar Dronamraju
2013-07-31 13:33 ` Andrew Theurer
2013-07-31 15:43   ` Srikar Dronamraju
