public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] [rmap] operator-sparse Fibonacci hashing of waitqueues
@ 2002-02-17  9:01 William Lee Irwin III
  2002-02-18 22:31 ` Daniel Phillips
  0 siblings, 1 reply; 8+ messages in thread
From: William Lee Irwin III @ 2002-02-17  9:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: riel, davem, phillips, rwhron

After distilling with hpa's help the results of some weeks-old
numerological experiments^W^Wnumber crunching, I've devised a patch
here for -rmap to make the waitqueue hashing somewhat more palatable
for SPARC and several others.

This patch uses some operator-sparse Fibonacci hashing primes in order
to allow shift/add implementations of the hash function used for hashed
waitqueues.


Dan, Dave, could you take a look here and please comment?

Randy, any chance you could benchmark -rmap with this on top for
comparison against standard -rmap to ensure there is no regression?


Thanks,
Bill

P.S.: Dave: Sorry I took so long to get this thing out, but here it is.


# This is a BitKeeper generated patch for the following project:
# Project Name: Long-term Linux VM development
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.199   ->        
#	include/linux/mm_inline.h	1.14    -> 1.15   
#	        mm/filemap.c	1.52    -> 1.53   
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/02/17	wli@holomorphy.com	1.200
# Use operator-sparse Fibonacci hashing primes for the hashed waitqueues so as to reduce the cost of
# the hashing computation on several major RISC architectures and most 64-bit machines.
# 
# Specifically the search for operator-sparse primes was documented around the definitions of the
# GOLDEN_RATIO_PRIME values, where the number of terms in the sum of powers of 2 was chosen in
# advance and the numbers generated this way were sorted by their continued fraction expansion
# and then filtered for primality.
# 
# In turn the definition of page_waitqueue() also needed to be altered so that for 64-bit machines,
# whose compilers cannot perform the transformation of multiplication by a constant to shifts and
# adds, the transformation is explcitly coded, and documentation is included for it describing how
# the shifts and adds correspond to the signed sum of powers of two representation of the golden
# ratio prime.
# --------------------------------------------
#
diff -Nru a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h	Sun Feb 17 00:49:26 2002
+++ b/include/linux/mm_inline.h	Sun Feb 17 00:49:26 2002
@@ -9,12 +9,43 @@
  * Chuck Lever verified the effectiveness of this technique for several
  * hash tables in his paper documenting the benchmark results:
  * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These prime numbers were determined by searching all numbers
+ * composed of sums of seven terms of the form 2^n with n decreasing
+ * with the index of the summand and sorting by the continued fraction
+ * expansion of p/2^BITS_PER_LONG, then filtering for primality.
+ * Verifying that the confidence tests on the chi^2 statistics passed
+ * is necessary because this process can produce "bad" factors, but on
+ * the other hand it often produces excellent factors, so it's a very
+ * useful process for generating hash functions.
  */
+
 #if BITS_PER_LONG == 32
-#define GOLDEN_RATIO_PRIME 2654435761UL
+/*
+ * 2654404609 / 2^32 has the continued fraction expansion
+ * 0,1,1,1,1,1,1,1,1,1,1,1,1,18,7,1,3,7,18,1,1,1,1,1,1,1,1,1,1,2,0
+ * which is very close to phi.
+ * It is of the form 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1
+ * which makes it suitable for compiler optimization to shift/add
+ * implementations on machines where multiplication is a slow operation.
+ * gcc is successfully able to optimize this on 32-bit SPARC and MIPS.
+ */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
 
 #elif BITS_PER_LONG == 64
-#define GOLDEN_RATIO_PRIME 11400714819323198549UL
+/*
+ * 11400862456688148481 / 2^64 has the continued fraction expansion
+ * 0,1,1,1,1,1,1,1,1,1,1,1,2,1,14,1,1048579,15,1,2,1,1,3,9,1,1,8,3,1,7,
+ *    1,1,9,1,2,1,1,1,1,2,0
+ * which is very close to phi = (sqrt(5)-1)/2 = 0,1,1,1,....
+ *
+ * It is of the form 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1
+ * which makes it suitable for shift/add implementations of the hash
+ * function. 64-bit architectures typically have slow multiplies (or
+ * no hardware multiply) and also gcc is unable to optimize 64-bit
+ * multiplies for these bit patterns so explicit expansion is used.
+ */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
 
 #else
 #error Define GOLDEN_RATIO_PRIME in mm_inline.h for your wordsize.
diff -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c	Sun Feb 17 00:49:26 2002
+++ b/mm/filemap.c	Sun Feb 17 00:49:26 2002
@@ -787,6 +787,9 @@
  * collisions. This cost is great enough that effective hashing
  * is necessary to maintain performance.
  */
+
+#if BITS_PER_LONG == 32
+
 static inline wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const zone_t *zone = page_zone(page);
@@ -798,6 +801,50 @@
 
 	return &wait[hash];
 }
+
+#elif BITS_PER_LONG == 64
+
+static inline wait_queue_head_t *page_waitqueue(struct page *page)
+{
+	const zone_t *zone = page_zone(page);
+	wait_queue_head_t *wait = zone->wait_table;
+	unsigned long hash = (unsigned long)page;
+	unsigned long n;
+
+	/*
+	 * The bit-sparse GOLDEN_RATIO_PRIME is a sum of powers of 2
+	 * with varying signs: 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1
+	 * The shifts in the code below correspond to the differences
+	 * in the exponents above.
+	 * The primary reason why the work of expanding the multiplication
+	 * this way is not given to the compiler is because 64-bit code
+	 * generation does not seem to be capable of doing it. So the code
+	 * here manually expands it.
+	 * The differences are used in order to both reduce the number of
+	 * variables used and also to reduce the size of the immediates
+	 * needed for the shift instructions, whose precision is limited
+	 * on some architectures.
+	 */
+
+	n = hash;
+	n <<= 18;
+	hash -= n;
+	n <<= 33;
+	hash -= n;
+	n <<= 3;
+	hash += n;
+	n <<= 3;
+	hash -= n;
+	n <<= 4;
+	hash += n;
+	n <<= 2;
+	hash += n;
+
+	hash >>= zone->wait_table_shift;
+	return &wait[hash];
+}
+
+#endif /* BITS_PER_LONG == 64 */
 
 
 /* 

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH] [rmap] operator-sparse Fibonacci hashing of waitqueues
@ 2002-02-17 23:01 rwhron
  2002-02-18  0:02 ` William Lee Irwin III
  0 siblings, 1 reply; 8+ messages in thread
From: rwhron @ 2002-02-17 23:01 UTC (permalink / raw)
  To: wli; +Cc: linux-kernel

> Randy, any chance you could benchmark -rmap with this on top for
> comparison against standard -rmap to ensure there is no regression?

System is k6-2 475 mhz and 384 MB ram with IDE disks.
Here it is with a few notes on things I looked at more closely.
I also have the output of readprofile between each test.

dbench 128 processes
2.4.17-rmap12f       **********   5.0  MB/sec
2.4.17-rmap12f-wli   **********   5.1  MB/sec

unixbench-4.1.0         2.4.17-rmap12f-wli  2.4.17-rmap12f
Execl Throughput                  293.5           303.7  lps (3  samples)
Pipe Throughput                395378.4        391759.4  lps (10 samples)
Pipe-based Context Switching    78095.8        117697.0  lps (10 samples)
Process Creation                 1181.5          1143.0  lps (3  samples)
System Call Overhead           335203.8        335211.1  lps (10 samples)
Shell Scripts (1 concurrent)      586.9           591.0  lpm (3  samples)

pipe context switching in unixbench results initially looked concerning,
but looking at how the test varies, and context switch and pipe tests
in lmbench suggest that the lower number may not be significant.


LMbench-2.0p2 average of 9 samples.
-------------
Processor, Processes - times in microseconds - smaller is better
OS                  null null      open  slct sig  sig   fork  exec   sh
                    call I/O  stat clos  TCP  inst hndl  proc  proc  proc
2.4.17-rmap12f      0.43 0.71 3.43 5.47  49.3 1.46 3.25   904  3606 14035
2.4.17-rmap12f-wli  0.43 0.71 3.20 5.32  44.9 1.44 3.35   951  3671 13960


Context switching - times in microseconds - smaller is better
OS                  2p/0K  2p/16K  2p/64K  8p/16K 8p/64K  16p/16K 16p/64K
                    ctxsw  ctxsw   ctxsw   ctxsw  ctxsw   ctxsw   ctxsw
2.4.17-rmap12f       0.91   24.52  195.18  58.21  211.57  60.72   227.85
2.4.17-rmap12f-wli   1.07   22.21  189.00  56.71  206.15  60.90   226.34


*Local* Communication latencies in microseconds - smaller is better
OS                  Pipe    AF    UDP    RPC/   TCP    RPC/   TCP
                           UNIX           UDP           TCP   conn
2.4.17-rmap12f      10.52  20.22  40.52  147.4  68.74  198.0  279.52
2.4.17-rmap12f-wli  11.06  19.85  40.39  142.1  66.52  191.3  274.65


File & VM system latencies in microseconds - smaller is better
OS                    0K    File     10K    File     Mmap    Prot   Page
                    Create  Delete  Create  Delete  Latency  Fault  Fault
2.4.17-rmap12f      121.02  164.96  566.05  248.88   3300.3  1.18    3.8
2.4.17-rmap12f-wli  134.03  171.19  690.52  254.61   3314.6  1.17    4.0


*Local* Communication bandwidths in MB/s - bigger is better
OS                  Pipe   AF   TCP    File   Mmap   Bcopy  Bcopy  Mem  
                          UNIX        reread reread (libc) (hand)  read 
2.4.17-rmap12f      64.04 48.73 60.15  58.62 237.62  58.48  59.62 237.43
2.4.17-rmap12f-wli  64.93 41.48 42.89  62.43 237.59  58.43  59.57 237.42

The TCP and AF bandwdith numbers vary a lot between tests.  Oddly, the TCP
test appears to vary by almost 100% between iterations.  Here are the 
individual test results;

OS                           AF     TCP 
                            UNIX        
--------------------------  -----  -----
Linux 2.4.17-rmap12f        46.47  39.43
Linux 2.4.17-rmap12f        51.26  78.44
Linux 2.4.17-rmap12f        51.31  77.34
Linux 2.4.17-rmap12f        50.52  74.49
Linux 2.4.17-rmap12f        47.49  40.32
Linux 2.4.17-rmap12f        48.20  79.06
Linux 2.4.17-rmap12f        44.00  38.66
Linux 2.4.17-rmap12f        48.65  38.14
Linux 2.4.17-rmap12f        50.63  75.50
Linux 2.4.17-rmap12f-wli    43.24  39.72
Linux 2.4.17-rmap12f-wli    42.92  82.14
Linux 2.4.17-rmap12f-wli    40.96  38.19
Linux 2.4.17-rmap12f-wli    42.46  37.77
Linux 2.4.17-rmap12f-wli    44.61  38.10
Linux 2.4.17-rmap12f-wli    40.86  37.15
Linux 2.4.17-rmap12f-wli    39.26  40.63
Linux 2.4.17-rmap12f-wli    38.35  36.27
Linux 2.4.17-rmap12f-wli    40.63  36.04

The tbench test below doesn't show a large variation between the
two kernels.


Memory latencies in nanoseconds - smaller is better
OS                  Mhz   L1 $   L2 $    Main mem
2.4.17-rmap12f       476   4.20  190.41  261.95
2.4.17-rmap12f-wli   476   4.20  186.52  262.02


tiobench Average of 3 samples.
--------
2048 MB worth of files on ext2 fs. (filesize = 2048MB / num_threads)
Read, write, and seek rates in MB/sec. 
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.


Sequential Reads   Num                  Avg     Maximum   Lat%    Lat%  CPU
Kernel             Thr  Rate  (CPU%)  Latency   Latency    >2s    >0s   Eff
------------------ --- ----------------------------------------------------
2.4.17-rmap12f       8   9.37 13.97%    9.788    980.34 0.00000 0.00000  67
2.4.17-rmap12f-wli   8   9.38 13.74%    9.759    886.07 0.00000 0.00000  68

2.4.17-rmap12f      16   9.90 14.99%   18.234   1301.80 0.00000 0.00000  66
2.4.17-rmap12f-wli  16   9.89 14.72%   18.311   1330.12 0.00000 0.00000  67

2.4.17-rmap12f      32  10.05 16.46%   35.398   2455.18 0.00000 0.00000  61
2.4.17-rmap12f-wli  32  10.14 16.61%   35.303   2483.88 0.00000 0.00000  61

2.4.17-rmap12f      64   9.65 18.54%   71.995   4745.45 0.00000 0.00000  52
2.4.17-rmap12f-wli  64   9.65 18.29%   71.986   4792.92 0.00000 0.00000  53

2.4.17-rmap12f     128   9.25 26.66%  148.095 136165.27 4.71210 0.02498  35
2.4.17-rmap12f-wli 128   9.31 27.08%  146.307 111566.57 4.54960 0.01964  34


Random Reads       Num                  Avg     Maximum   Lat%    Lat%  CPU
Kernel             Thr  Rate  (CPU%)  Latency   Latency    >2s    >10s  Eff
------------------ --- ----------------------------------------------------
2.4.17-rmap12f       8   0.42 1.058%  217.323    852.61 0.00000 0.00000  40
2.4.17-rmap12f-wli   8   0.42 1.073%  217.232   1016.86 0.00000 0.00000  39

2.4.17-rmap12f      16   0.48 1.406%  375.897   1485.76 0.00000 0.00000  34
2.4.17-rmap12f-wli  16   0.47 1.296%  375.872   1537.88 0.00000 0.00000  37

2.4.17-rmap12f      32   0.52 1.760%  667.107   3468.37 0.00000 0.00000  30
2.4.17-rmap12f-wli  32   0.53 1.640%  655.669   3331.63 0.00000 0.00000  32

2.4.17-rmap12f      64   0.53 2.176% 1283.121   8672.69 0.75605 0.00000  24
2.4.17-rmap12f-wli  64   0.54 1.784% 1288.162   7699.61 0.90726 0.00000  30

2.4.17-rmap12f     128   0.54 3.198% 2508.958  16549.40 5.24697 0.00000  17
2.4.17-rmap12f-wli 128   0.54 3.071% 2492.784  17542.43 5.80141 0.00000  18


Sequential Writes  Num                  Avg     Maximum   Lat%    Lat%  CPU
Kernel             Thr  Rate  (CPU%)  Latency   Latency    >2s    >10s  Eff
------------------ --- ----------------------------------------------------
2.4.17-rmap12f       8  18.11 82.47%    4.542   4140.63 0.00000 0.00000  22
2.4.17-rmap12f-wli   8  18.06 82.49%    4.553   5403.78 0.00306 0.00000  22

2.4.17-rmap12f      16  18.04 81.69%    8.921   7486.25 0.00744 0.00000  22
2.4.17-rmap12f-wli  16  17.99 82.15%    8.893   5086.09 0.00076 0.00000  22

2.4.17-rmap12f      32  17.31 76.08%   17.610  10609.26 0.02595 0.00000  23
2.4.17-rmap12f-wli  32  17.31 77.04%   17.808  11199.52 0.03033 0.00000  22

2.4.17-rmap12f      64  15.07 60.04%   40.287  15381.66 0.39444 0.00000  25
2.4.17-rmap12f-wli  64  15.18 61.79%   40.052  20057.21 0.37003 0.00000  25

2.4.17-rmap12f     128  12.52 47.48%   95.497  31261.15 1.64452 0.00095  26
2.4.17-rmap12f-wli 128  12.62 46.73%   94.113  29901.54 1.50032 0.00114  27


Random Writes      Num                  Avg     Maximum   Lat%    Lat%  CPU
Kernel             Thr  Rate  (CPU%)  Latency   Latency    >2s    >10s  Eff
------------------ --- ----------------------------------------------------
2.4.17-rmap12f       8   0.67 4.278%    0.490    666.50 0.00000 0.00000  16
2.4.17-rmap12f-wli   8   0.66 4.322%    0.659    688.43 0.00000 0.00000  15

2.4.17-rmap12f      16   0.68 4.492%    1.941    578.57 0.00000 0.00000  15
2.4.17-rmap12f-wli  16   0.68 4.419%    2.249    632.19 0.00000 0.00000  15

2.4.17-rmap12f      32   0.67 4.467%    0.818    835.38 0.00000 0.00000  15
2.4.17-rmap12f-wli  32   0.68 4.550%    0.663    562.35 0.00000 0.00000  15

2.4.17-rmap12f      64   0.69 4.668%    1.076    621.97 0.00000 0.00000  15
2.4.17-rmap12f-wli  64   0.69 4.585%    0.980    861.52 0.00000 0.00000  15

2.4.17-rmap12f     128   0.71 4.973%    0.975    917.19 0.00000 0.00000  14
2.4.17-rmap12f-wli 128   0.70 4.990%    0.803    940.80 0.00000 0.00000  14


bonnie++ average of 3 samples
1024MB file
Version 1.02a      --------------------Sequential Output-------------------
                   -----Per Char----- ------Block------- -----Rewrite------
Kernel             MB/sec  %CPU   Eff MB/sec  %CPU   Eff MB/sec  %CPU   Eff
2.4.17-rmap12f       3.38  99.0  3.41  15.12  97.0 15.58   5.14  40.7 12.64
2.4.17-rmap12f-wli   3.38  99.0  3.42  15.16  97.0 15.63   5.66  45.3 12.48

                    -----------Sequential Input----------- -----Random-----
                    -----Per Char-----  ------Block------- -----Seeks------
Kernel              MB/sec  %CPU   Eff  MB/sec  %CPU   Eff /sec  %CPU   Eff
2.4.17-rmap12f        3.55  85.0  4.17   19.07  23.7 80.57  122   2.0  6096
2.4.17-rmap12f-wli    3.56  85.0  4.19   19.14  23.7 80.87  123   2.0  6140

16384 small files
                   -------------Sequential Create-------
                   ------Create-----    -----Delete-----
                    /sec  %CPU   Eff    /sec  %CPU   Eff
2.4.17-rmap12f      4395  95.7  4594    4548  99.3  4578
2.4.17-rmap12f-wli  4214  98.3  4286    4268  91.7  4655

                   -------------Random Create------------
                   ------Create-----     -----Delete-----
                    /sec  %CPU   Eff     /sec  %CPU   Eff
2.4.17-rmap12f      4460  99.0  4504     3927  93.3  4207
2.4.17-rmap12f-wli  4154  97.7  4253     4109  96.3  4265


Time in seconds to build/run things.
kernel             autoconf    perl    kernel  updatedb  updatedb 5x
2.4.17-rmap12f     1217        1589    1295    29        44       
2.4.17-rmap12f-wli 1218        1609    1270    28        45       


tbench 32 processes (average of 3 runs)
2.4.17-rmap12f       ********************************  16.3  MB/sec
2.4.17-rmap12f-wli   ********************************  16.2  MB/sec


More details on testing method and other kernels tested at
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html

-- 
Randy Hron


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2002-02-19  1:27 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-02-17  9:01 [PATCH] [rmap] operator-sparse Fibonacci hashing of waitqueues William Lee Irwin III
2002-02-18 22:31 ` Daniel Phillips
2002-02-19  0:34   ` William Lee Irwin III
2002-02-19  0:46     ` William Lee Irwin III
2002-02-19  1:01     ` Daniel Phillips
2002-02-19  1:27     ` William Lee Irwin III
  -- strict thread matches above, loose matches on Subject: below --
2002-02-17 23:01 rwhron
2002-02-18  0:02 ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox