From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: linux-mm@kvack.org
Cc: Rik van Riel <riel@redhat.com>,
Marcelo Tosatti <marcelo.tosatti@cyclades.com>,
Rahul Iyer <rni@andrew.cmu.edu>
Subject: Zoned CART
Date: Fri, 12 Aug 2005 16:37:09 +0200 [thread overview]
Message-ID: <1123857429.14899.59.camel@twins> (raw)
[-- Attachment #1: Type: text/plain, Size: 2183 bytes --]
Hi All,
I've been thinking on how to implement a zoned CART; and I think I have
found a nice concept.
My ideas are based on the initial cart patch by Rahul and the
non-resident code of Rik.
For a zoned page replacement algorithm we have per zone resident list(s)
and global non-resident list(s). CART specific we would have a T1_i and
T2_i, where 0 <= i <= nr_zones, and global B1 and B2 lists.
Because B1 and B2 are variable size and the B1_i target size q_i is zone
specific we need some tricks. However since |B1| + |B2| = c we could get
away with a single hash_table of c entries if we can manage to balance
the entries within.
I propose to do this by using a 2 hand bucket and using the 2 MSB of the
cookie (per bucket uniqueness; 30 bits of uniqueness should be enough on
a ~64 count bucket). The cookies MSB is used to distinguish B1/B2 and
the MSB-1 is used for the filter bit.
Let us denote the buckets with the subscript j: |B1_j| + |B2_j| = c_j.
Each hand keeps a FIFO for its corresponding type: B1/B2; eg. rotating
H1_j will select the next oldest B1_j page for removal.
We need to balance the per zone values:
T1_i, T2_i, |T1_i|, |T2_i|
p_i, Ns_i, Nl_i
|B1_i|, |B2_i|, q_i
agains the per bucket values:
B1_j, B2_j.
This can be done with two simple modifications to the algorithm:
- explicitly keep |B1_i| and |B2_i| - needed for the p,q targets
- merge the history replacement (lines 6-10) in the replace (lines
36-40) code so that: adding the new MRU page and removing the old LRU
page becomes one action.
This will keep:
|B1_j| |B1| Sum^i(|B1_i|)
-------- ~ ------ = -------------
|B2_j| |B2| Sum^i(|B2_i|)
however it will violate strict FIFO order within the buckets; although I
guess it won't be too bad.
This approach does away with explicitly keeping the FIFO lists for the
non-resident pages and merges them.
Attached is a modification of rik his non-resident code that implements
the buckets described herein.
I shall attempt to merge this code into the Rahuls new cart-patch-2 if
you guys don't see any big problems with the approach, or beat me to it.
Kind regards,
--
Peter Zijlstra <a.p.zijlstra@chello.nl>
[-- Attachment #2: nonresident-pages.patch --]
[-- Type: text/x-patch, Size: 9331 bytes --]
diff -NaurX linux-2.6.13-rc6/Documentation/dontdiff linux-2.6.13-rc6/include/linux/nonresident.h linux-2.6.13-rc6-cart/include/linux/nonresident.h
--- linux-2.6.13-rc6/include/linux/nonresident.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.13-rc6-cart/include/linux/nonresident.h 2005-08-12 13:55:54.000000000 +0200
@@ -0,0 +1,11 @@
+#ifndef __LINUX_NONRESIDENT_H
+#define __LINUX_NONRESIDENT_H
+
+#define NR_filter 0x01 /* short/long */
+#define NR_list 0x02 /* b1/b2; correlates to PG_active */
+
+#define EVICT_MASK 0x80000000
+#define EVICT_B1 0x00000000
+#define EVICT_B2 0x80000000
+
+#endif /* __LINUX_NONRESIDENT_H */
diff -NaurX linux-2.6.13-rc6/Documentation/dontdiff linux-2.6.13-rc6/include/linux/swap.h linux-2.6.13-rc6-cart/include/linux/swap.h
--- linux-2.6.13-rc6/include/linux/swap.h 2005-08-08 20:57:50.000000000 +0200
+++ linux-2.6.13-rc6-cart/include/linux/swap.h 2005-08-12 14:00:26.000000000 +0200
@@ -154,6 +154,11 @@
/* linux/mm/memory.c */
extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
+/* linux/mm/nonresident.c */
+extern u32 remember_page(struct address_space *, unsigned long, unsigned int);
+extern unsigned int recently_evicted(struct address_space *, unsigned long);
+extern void init_nonresident(void);
+
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalhigh_pages;
@@ -292,6 +297,11 @@
#define grab_swap_token() do { } while(0)
#define has_swap_token(x) 0
+/* linux/mm/nonresident.c */
+#define init_nonresident() do { } while (0)
+#define remember_page(x,y,z) 0
+#define recently_evicted(x,y) 0
+
#endif /* CONFIG_SWAP */
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff -NaurX linux-2.6.13-rc6/Documentation/dontdiff linux-2.6.13-rc6/init/main.c linux-2.6.13-rc6-cart/init/main.c
--- linux-2.6.13-rc6/init/main.c 2005-08-08 20:57:51.000000000 +0200
+++ linux-2.6.13-rc6-cart/init/main.c 2005-08-10 08:33:38.000000000 +0200
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/swap.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -494,6 +495,7 @@
}
#endif
vfs_caches_init_early();
+ init_nonresident();
mem_init();
kmem_cache_init();
setup_per_cpu_pageset();
diff -NaurX linux-2.6.13-rc6/Documentation/dontdiff linux-2.6.13-rc6/mm/Makefile linux-2.6.13-rc6-cart/mm/Makefile
--- linux-2.6.13-rc6/mm/Makefile 2005-08-08 20:57:52.000000000 +0200
+++ linux-2.6.13-rc6-cart/mm/Makefile 2005-08-10 08:33:39.000000000 +0200
@@ -12,7 +12,8 @@
readahead.o slab.o swap.o truncate.o vmscan.o \
prio_tree.o $(mmu-y)
-obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
+obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o \
+ nonresident.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
diff -NaurX linux-2.6.13-rc6/Documentation/dontdiff linux-2.6.13-rc6/mm/nonresident.c linux-2.6.13-rc6-cart/mm/nonresident.c
--- linux-2.6.13-rc6/mm/nonresident.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.13-rc6-cart/mm/nonresident.c 2005-08-12 14:00:26.000000000 +0200
@@ -0,0 +1,211 @@
+/*
+ * mm/nonresident.c
+ * (C) 2004,2005 Red Hat, Inc
+ * Written by Rik van Riel <riel@redhat.com>
+ * Released under the GPL, see the file COPYING for details.
+ * Adapted by Peter Zijlstra <a.p.zijlstra@chello.nl> for use by ARC
+ * like algorithms.
+ *
+ * Keeps track of whether a non-resident page was recently evicted
+ * and should be immediately promoted to the active list. This also
+ * helps automatically tune the inactive target.
+ *
+ * The pageout code stores a recently evicted page in this cache
+ * by calling remember_page(mapping/mm, index/vaddr)
+ * and can look it up in the cache by calling recently_evicted()
+ * with the same arguments.
+ *
+ * Note that there is no way to invalidate pages after eg. truncate
+ * or exit, we let the pages fall out of the non-resident set through
+ * normal replacement.
+ *
+ *
+ * Modified to work with ARC like algorithms who:
+ * - need to balance two lists; |b1| + |b2| = c,
+ * - keep a flag per non-resident page.
+ *
+ * This is accomplished by extending the buckets to two hands; one
+ * for each list. And modifying the cookie to put two state flags
+ * in its MSBs.
+ *
+ * On insertion time it is specified from which list an entry is to
+ * be reused; then the corresponding hand is rotated until a cookie
+ * of the proper type is encountered (MSB; NR_list).
+ *
+ * Because two hands and clock search are too much for
+ * preempt_disable() the bucket is guarded by a spinlock.
+ */
+#include <linux/mm.h>
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/bootmem.h>
+#include <linux/hash.h>
+#include <linux/prefetch.h>
+#include <linux/kernel.h>
+#include <linux/nonresident.h>
+
+#define TARGET_SLOTS 128
+#define NR_CACHELINES (TARGET_SLOTS*sizeof(u32) / L1_CACHE_BYTES);
+#define NR_SLOTS (((NR_CACHELINES * L1_CACHE_BYTES) - sizeof(spinlock_t) - 2*sizeof(u16)) / sizeof(u32))
+#if NR_SLOTS < TARGET_SLOTS / 2
+#warning very small slot size
+#if NR_SLOTS == 0
+#error no room for slots left
+#endif
+#endif
+
+#define FLAGS_BITS 2
+#define FLAGS_SHIFT (sizeof(u32)*8 - FLAGS_BITS)
+#define FLAGS_MASK (~(((1 << FLAGS_BITS) - 1) << FLAGS_SHIFT))
+
+struct nr_bucket
+{
+ spinlock_t lock;
+ u16 hand[2];
+ u32 slot[NR_SLOTS];
+} ____cacheline_aligned;
+
+/* The non-resident page hash table. */
+static struct nr_bucket * nonres_table;
+static unsigned int nonres_shift;
+static unsigned int nonres_mask;
+
+/* hash the address into a bucket */
+static struct nr_bucket * nr_hash(void * mapping, unsigned long index)
+{
+ unsigned long bucket;
+ unsigned long hash;
+
+ hash = hash_ptr(mapping, BITS_PER_LONG);
+ hash = 37 * hash + hash_long(index, BITS_PER_LONG);
+ bucket = hash & nonres_mask;
+
+ return nonres_table + bucket;
+}
+
+/* hash the address, inode and flags into a cookie */
+/* the two msb are flags; where msb-1 is a type flag and msb a period flag */
+static u32 nr_cookie(struct address_space * mapping, unsigned long index, unsigned int flags)
+{
+ u32 c;
+ unsigned long cookie;
+
+ cookie = hash_ptr(mapping, BITS_PER_LONG);
+ cookie = 37 * cookie + hash_long(index, BITS_PER_LONG);
+
+ if (mapping->host) {
+ cookie = 37 * cookie + hash_long(mapping->host->i_ino, BITS_PER_LONG);
+ }
+
+ c = (u32)(cookie >> (BITS_PER_LONG - 32));
+ c = (c & FLAGS_MASK) | flags << FLAGS_SHIFT;
+ return c;
+}
+
+unsigned int recently_evicted(struct address_space * mapping, unsigned long index)
+{
+ struct nr_bucket * nr_bucket;
+ u32 wanted;
+ unsigned int r_flags = 0;
+ int i;
+
+ prefetch(mapping->host);
+ nr_bucket = nr_hash(mapping, index);
+
+ spin_lock_prefetch(nr_bucket); // prefetch_range(nr_bucket, NR_CACHELINES);
+ wanted = nr_cookie(mapping, index, 0);
+
+ spin_lock(&nr_bucket->lock);
+ for (i = 0; i < NR_SLOTS; ++i) {
+ if (nr_bucket->slot[i] & FLAGS_MASK == wanted) {
+ r_flags = nr_bucket->slot[i] >> FLAGS_SHIFT;
+ r_flags |= EVICT_MASK;
+ nr_bucket->slot[i] = 0;
+ break;
+ }
+ }
+ spin_unlock(&nr_bucket->lock);
+
+ return r_flags;
+}
+
+/* flags:
+ * logical and of the page flags (NR_filter, NR_list) and
+ * an EVICT_ target
+ */
+u32 remember_page(struct address_space * mapping, unsigned long index, unsigned int flags)
+{
+ struct nr_bucket * nr_bucket;
+ u32 cookie;
+ u32 * slot;
+ int i, slots;
+
+ prefetch(mapping->host);
+ nr_bucket = nr_hash(mapping, index);
+
+ spin_lock_prefetch(nr_bucket); // prefetchw_range(nr_bucket, NR_CACHELINES);
+ cookie = nr_cookie(mapping, index, flags);
+
+ flags &= EVICT_MASK;
+ spin_lock(&nr_bucket->lock);
+again:
+ slots = NR_SLOTS;
+ do {
+ i = ++nr_bucket->hand[!!flags];
+ if (unlikely(i >= NR_SLOTS))
+ i = nr_bucket->hand[!!flags] = 0;
+ slot = &nr_bucket->slot[i];
+ } while (*slot && *slot & EVICT_MASK != flags && --slots);
+ if (unlikely(!slots)) {
+ flags ^= EVICT_MASK;
+ goto again;
+ }
+ xchg(slot, cookie);
+ spin_unlock(&nr_bucket->lock);
+
+ return cookie;
+}
+
+/*
+ * For interactive workloads, we remember about as many non-resident pages
+ * as we have actual memory pages. For server workloads with large inter-
+ * reference distances we could benefit from remembering more.
+ */
+static __initdata unsigned long nonresident_factor = 1;
+void __init init_nonresident(void)
+{
+ int target;
+ int i;
+
+ /*
+ * Calculate the non-resident hash bucket target. Use a power of
+ * two for the division because alloc_large_system_hash rounds up.
+ */
+ target = nr_all_pages * nonresident_factor;
+ target /= (sizeof(struct nr_bucket) / sizeof(u32));
+
+ nonres_table = alloc_large_system_hash("Non-resident page tracking",
+ sizeof(struct nr_bucket),
+ target,
+ 0,
+ HASH_EARLY | HASH_HIGHMEM,
+ &nonres_shift,
+ &nonres_mask,
+ 0);
+
+ for (i = 0; i < (1 << nonres_shift); i++) {
+ spin_lock_init(&nonres_table[i].lock);
+ nonres_table[i].hand[0] = nonres_table[i].hand[1] = 0;
+ for (j = 0; j < NR_SLOTS; ++j)
+ nonres_table[i].slot[j] = 0;
+ }
+}
+
+static int __init set_nonresident_factor(char * str)
+{
+ if (!str)
+ return 0;
+ nonresident_factor = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("nonresident_factor=", set_nonresident_factor);
next reply other threads:[~2005-08-12 14:37 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-08-12 14:37 Peter Zijlstra [this message]
2005-08-12 15:42 ` Zoned CART Rahul Iyer
2005-08-12 15:52 ` Peter Zijlstra
2005-08-12 23:08 ` Marcelo Tosatti
2005-08-13 19:00 ` Rahul Iyer
2005-08-13 19:08 ` Marcelo Tosatti
2005-08-13 21:30 ` Rik van Riel
2005-08-14 18:31 ` Peter Zijlstra
2005-08-12 20:21 ` Marcelo Tosatti
2005-08-12 22:28 ` Marcelo Tosatti
2005-08-13 19:03 ` Rahul Iyer
2005-08-14 12:58 ` Peter Zijlstra
2005-08-15 21:31 ` Peter Zijlstra
2005-08-16 19:53 ` Rahul Iyer
2005-08-16 20:49 ` Christoph Lameter
2005-08-25 22:39 ` Peter Zijlstra
2005-08-26 0:01 ` Christoph Lameter
2005-08-26 3:59 ` Rahul Iyer
2005-08-26 7:09 ` Peter Zijlstra
2005-08-26 12:24 ` Rik van Riel
2005-08-26 21:03 ` Peter Zijlstra
2005-08-27 19:46 ` [RFC][PATCH] " Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1123857429.14899.59.camel@twins \
--to=a.p.zijlstra@chello.nl \
--cc=linux-mm@kvack.org \
--cc=marcelo.tosatti@cyclades.com \
--cc=riel@redhat.com \
--cc=rni@andrew.cmu.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.