* [PATCH RESEND v1 00/16] vfs: hot data tracking
From: zwu.kernel @ 2012-12-20 14:43 UTC
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Hi guys,
This patchset has been tested for scalability and performance
with fs_mark, ffsb and compilebench.
The perf testing was done on Linux 3.7.0-rc8+ on an Intel(R) Core(TM)
i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G RAM and a 260G disk.
Any comments or ideas are appreciated, thanks.
NOTE:
The patchset can be obtained via my kernel dev git on github:
git://github.com/wuzhy/kernel.git hot_tracking
If you're interested, you can also review them via
https://github.com/wuzhy/kernel/commits/hot_tracking
For more info, please check hot_tracking.txt in Documentation
Below is the perf testing report:
1. fs_mark test
w/o: without hot tracking
w/ : with hot tracking
Count Size FSUse%(w/o) FSUse%(w/) Files/sec(w/o) Files/sec(w/) AppOverhead(w/o) AppOverhead(w/)
800000 1 2 3 13756.4 32144.9 5350627 5436291
1600000 1 4 5 1163.4 1799.3 20848119 21708216
2400000 1 6 6 1360.8 1252.5 6798705 8715322
3200000 1 8 8 1600.1 1196.3 5751129 6013792
4000000 1 9 9 1071.4 1191.2 17204725 26786369
4800000 1 10 10 1483.5 1447.9 19555541 8383046
5600000 1 11 11 1457.9 1699.5 5783588 10074681
6400000 1 12 13 1658.8 1628.5 6992697 6185551
7200000 1 14 14 1662.4 1857.1 5796793 13772592
8000000 1 15 15 2930.0 2653.8 12431682 6152573
8800000 1 16 17 1630.8 1665.0 7666719 13682765
9600000 1 18 18 1530.3 1583.9 5823644 10171644
10400000 1 19 19 1437.9 1798.6 20935224 6048083
11200000 1 20 20 1529.0 1550.6 6647450 6003151
12000000 1 21 22 1558.6 1501.8 12539509 18144939
12800000 1 23 23 1644.2 1432.1 7074419 28101975
13600000 1 24 24 1753.6 1650.2 7164297 20888972
14400000 1 25 25 2750.0 1483.9 12756692 7441225
15200000 1 27 27 1551.1 1514.3 5741066 8250443
16000000 1 28 28 1610.8 1635.9 72193860 8545285
16800000 1 29 29 1646.7 1907.7 8945856 11703513
17600000 1 30 31 1496.6 2722.3 5858961 8989393
18400000 1 32 32 1457.7 1565.7 10914475 26504660
19200000 1 33 33 1437.6 1518.7 6708975 213303618
20000000 1 34 34 1825.4 1521.1 5722086 12490907
20800000 1 36 35 1718.4 1611.5 5873290 17942534
21600000 1 37 37 2152.6 1536.9 113050627 8717940
22400000 1 38 38 2443.7 1788.2 7398122 19834765
23200000 1 39 39 1518.5 1587.6 5770959 10134882
24000000 1 41 41 1536.8 2164.0 5751248 7214626
24800000 1 42 42 1576.6 2939.4 7390314 6070271
25600000 1 43 43 1707.4 1535.9 11075939 6052896
26400000 1 44 44 1522.5 1563.1 10142987 22549898
27200000 1 46 46 1827.4 1608.5 11613016 24828125
28000000 1 47 47 3420.5 1741.9 8059985 16599156
28800000 1 48 48 1815.5 1944.4 7847931 9043277
29600000 1 50 49 1650.0 1596.6 5636323 7929164
30400000 1 51 51 1683.7 1573.3 5766323 19369146
31200000 1 52 52 1610.1 1669.8 9256111 9899107
32000000 1 53 53 1645.2 3081.0 7855010 6057257
32800000 1 54 55 1835.3 3122.0 6899141 6143875
33600000 1 56 56 1916.8 1734.8 10271967 6049509
34400000 1 57 57 3119.2 1630.8 11503274 13975417
35200000 1 58 58 1629.2 1695.7 6827225 6214248
36000000 1 60 60 1636.5 1695.4 38077664 16211067
36800000 1 61 61 1665.2 2069.1 19948817 9358494
37600000 1 62 62 1734.5 1931.5 26487196 8954836
38400000 1 63 63 1625.8 1654.0 6649289 9131844
39200000 1 65 65 1778.4 1663.3 11653376 7144960
40000000 1 66 66 1851.0 1935.6 8164470 11288753
40800000 1 67 67 3171.0 3431.6 12358380 6072820
41600000 1 69 69 1714.3 1954.3 13765035 9364495
42400000 1 70 70 1591.0 1681.8 18733304 7407689
43200000 1 71 71 1537.2 1642.8 19534908 6163018
44000000 1 72 72 1630.3 1641.2 23479883 10967509
44800000 1 74 74 1877.5 1651.9 8174965 9484587
45600000 1 75 75 3322.4 1653.6 14740938 7497831
46400000 1 76 76 1706.9 1840.6 10348550 23296562
47200000 1 77 78 1837.7 2515.3 13917543 14683192
48000000 1 79 79 1642.6 2368.6 14365759 6080942
48800000 1 80 80 1827.1 1655.2 9234312 7412406
49600000 1 81 81 1631.0 1858.7 7543970 18610881
50400000 1 82 82 1560.5 1865.0 21374219 6598771
From the above table, when the same number of files of the same size is
created, filesystem usage is basically the same with and without hot tracking.
2. FFSB test
w/o hot tracking w/ hot tracking ratio
v1 v2 (v2-v1)/v1
large_file_create
1 thread
- Trans/sec 28918.75 29014.48 +0.33%
- Throughput 113MB/sec 113MB/sec +0.0%
- %CPU 4.8% 5.1% +6.3%
- Trans/%CPU 602473.96 568911.37 -5.6%
8 threads
- Trans/sec 28480.37 28541.25 +0.2%
- Throughput 111MB/sec 111MB/sec +0.0%
- %CPU 5.6% 5.9% +5.4%
- Trans/%CPU 508578.04 483750 -4.9%
32 threads
- Trans/sec 25011.86 26992.32 +7.9%
- Throughput 97.7MB/sec 105MB/sec +7.5%
- %CPU 6.2% 7.1% +14.8%
- Trans/%CPU 403417.10 380173.52 -5.8%
large_file_seq_read
1 thread
- Trans/sec 35303.23 34838.02 -1.3%
- Throughput 138MB/sec 136MB/sec -1.4%
- %CPU 5.4% 5.4% +0.0%
- Trans/%CPU 653763.52 645148.52 -1.3%
8 threads
- Trans/sec 11902.82 11205.22 -5.9%
- Throughput 46.5MB/sec 43.8MB/sec -5.8%
- %CPU 2.1% 2.0% -4.8%
- Trans/%CPU 566800.95 560261 -1.2%
32 threads
- Trans/sec 5068.48 5316.36 +4.9%
- Throughput 19.8MB/sec 20.8MB/sec +5.1%
- %CPU 0.9% 1.0% +11.1%
- Trans/%CPU 563164.45 531636 -5.6%
random_write
1 thread
- Trans/sec 729.01 738.89 +1.4%
- Throughput 99.7MB/sec 101MB/sec +1.3%
- %CPU 0.1% 0.1% +0.0%
- Trans/%CPU 72901 73889 +1.4%
8 threads
- Trans/sec 714.56 714.57 +0.0%
- Throughput 97.7MB/sec 97.7MB/sec +0.0%
- %CPU 0.2% 0.2% +0.0%
- Trans/%CPU 35728 35728.5 +0.0%
32 threads
- Trans/sec 698.62 692.59 -0.9%
- Throughput 95.5MB/sec 94.7MB/sec -0.8%
- %CPU 0.2% 0.2% +0.0%
- Trans/%CPU 34931 34629.5 -0.9%
random_read
1 thread
- Trans/sec 225.49 227.03 +0.7%
- Throughput 902KB/sec 908KB/sec +0.7%
- %CPU 1.1% 1.1% +0.0%
- Trans/%CPU 20499.10 20639.10 +0.7%
8 threads
- Trans/sec 106.72 105.76 -0.9%
- Throughput 427KB/sec 423KB/sec -0.9%
- %CPU 0.5% 0.5% +0.0%
- Trans/%CPU 2134.4 2115.2 -0.9%
32 threads
- Trans/sec 107.44 108.26 +0.8%
- Throughput 430KB/sec 433KB/sec +0.7%
- %CPU 0.5% 0.5% +0.0%
- Trans/%CPU 2148.8 2165.2 +0.8%
mail_server
1 thread
- Trans/sec 681.67 732.66 +7.5%
- Throughput [read] 1.77MB/sec 1.99MB/sec +12.4%
- Throughput [write] 858KB/sec 887KB/sec +3.4%
- %CPU 0.6% 0.6% +0.0%
- Trans/%CPU 11361.17 12211 +7.5%
8 threads
- Trans/sec 630.48 597.08 -5.3%
- Throughput [read] 1.64MB/sec 1.54MB/sec -6.1%
- Throughput [write] 814KB/sec 784KB/sec -3.7%
- %CPU 0.6% 0.5% -16.7%
- Trans/%CPU 10508 11941.6 +13.6%
32 threads
- Trans/sec 598.68 566.05 -5.5%
- Throughput [read] 1.53MB/sec 1.5MB/sec -2.0%
- Throughput [write] 804KB/sec 705KB/sec -12.3%
- %CPU 0.7% 0.6% -14.2%
- Trans/%CPU 8552.57 9434.17 +10.3%
3. Compilebench test
w/o hot tracking w/ hot tracking ratio
v1 v2 (v2-v1)/v1
intial create 114.81 MB/s 118.32 MB/s +3.1%
create 11.98 MB/s 12.26 MB/s +2.3%
patch 3.61 MB/s 3.66 MB/s +1.4%
compile 46.40 MB/s 48.07 MB/s +3.6%
clean 126.33 MB/s 128.75 MB/s +1.9%
read tree 9.93 MB/s 9.71 MB/s -2.2%
read compiled tree 17.19 MB/s 17.52 MB/s +1.9%
delete tree 12.23 seconds 11.13 seconds -9.0%
delete compiled tree 12.98 seconds 16.05 seconds +26.7%
stat tree 7.03 seconds 5.51 seconds -21.6%
stat compiled tree 12.19 seconds 9.06 seconds -25.7%
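For anyone who wants to recompute the ratio column, here is a minimal sketch of the (v2-v1)/v1 formula used in the FFSB and Compilebench tables. The helper name ratio_percent() is made up for illustration and is not part of the patchset or of any benchmark tool.

```c
#include <assert.h>

/*
 * Relative change of v2 against the baseline v1, in percent:
 * (v2 - v1) / v1 * 100. ratio_percent() is a hypothetical helper,
 * not something taken from the patchset.
 */
double ratio_percent(double v1, double v2)
{
	assert(v1 != 0.0);	/* the baseline must be non-zero */
	return (v2 - v1) / v1 * 100.0;
}
```

For example, ratio_percent(28918.75, 29014.48) gives roughly +0.33, matching the 1 thread large_file_create Trans/sec row. For rows measured in seconds a negative ratio means the run with hot tracking was faster.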
Changelog:
- Solved 64-bit inode number issue. [David Sterba]
- Embed struct hot_type in struct file_system_type [Darrick J. Wong]
- Clean up some issues [David Sterba]
- Use a static hot debugfs root [Greg KH]
- Rewrote debugfs support based on seq_file operations. [Dave Chinner]
- Refactored workqueue support. [Dave Chinner]
- Turn some macros into tunables [Zhiyong, Zheng Liu]
  TIME_TO_KICK and HEAT_UPDATE_DELAY
- Introduce hot func registering framework [Zhiyong]
- Remove global variable for hot tracking [Zhiyong]
- Add xfs hot tracking support [Dave Chinner]
- Add ext4 hot tracking support [Zheng Liu]
- Cleaned up a lot of other issues [Dave Chinner]
- Added memory shrinker [Dave Chinner]
- Converted to one workqueue to update map info periodically [Dave Chinner]
- Cleaned up a lot of other issues [Dave Chinner]
- Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
- Add btrfs hot tracking support [Zhiyong]
- The first three patches can probably just be flattened into one.
  [Marco Stornelli, Dave Chinner]
Zhi Yong Wu (16):
vfs: introduce some data structures
vfs: add init and cleanup functions
vfs: add I/O frequency update function
vfs: add two map arrays
vfs: add hooks to enable hot tracking
vfs: add temp calculation function
vfs: add map info update function
vfs: add aging function
vfs: add one work queue
vfs: add FS hot type support
vfs: register one shrinker
vfs: add one ioctl interface
vfs: add debugfs support
proc: add two hot_track proc files
btrfs: add hot tracking support
vfs: add documentation
Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/hot_tracking.txt | 255 ++++++
fs/Makefile | 2 +-
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 22 +-
fs/compat_ioctl.c | 5 +
fs/dcache.c | 2 +
fs/direct-io.c | 6 +
fs/hot_tracking.c | 1345 ++++++++++++++++++++++++++++
fs/hot_tracking.h | 52 ++
fs/ioctl.c | 74 ++
include/linux/fs.h | 5 +
include/linux/hot_tracking.h | 152 ++++
kernel/sysctl.c | 14 +
mm/filemap.c | 6 +
mm/page-writeback.c | 12 +
mm/readahead.c | 7 +
17 files changed, 1960 insertions(+), 2 deletions(-)
create mode 100644 Documentation/filesystems/hot_tracking.txt
create mode 100644 fs/hot_tracking.c
create mode 100644 fs/hot_tracking.h
create mode 100644 include/linux/hot_tracking.h
--
1.7.6.5
* [PATCH RESEND v1 01/16] vfs: introduce some data structures
From: zwu.kernel @ 2012-12-20 14:43 UTC
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
One root structure, hot_info, is defined and hooked
up in super_block. It will be used to hold the radix tree
root, the hash list root and some other information.
Add the hot_inode_tree struct to keep track of
frequently accessed files, keyed by {inode, offset}.
Trees contain hot_inode_items representing those files
and ranges.
Having these trees means that the VFS can quickly determine the
temperature of some data by doing some calculations on the
hot_freq_data struct that hangs off of the tree item.
Define two items, hot_inode_item and hot_range_item.
The former represents one tracked file and
keeps track of its access frequency and the tree of
ranges in this file, while the latter represents
a file range of one inode.
Each of the two structures contains a hot_freq_data
struct with its frequency-of-access metrics (number of
{reads, writes}, last {read, write} time, frequency of
{reads, writes}).
Also, each hot_inode_item contains one hot_range_tree
struct, which is keyed by {inode, offset, length}
and used to keep track of all the ranges in this file.
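To make the embedding described above easier to picture, here is a rough userspace sketch of how the common part sits inside both item types and how the enclosing item is recovered from it. The simplified struct names (freq_data, comm_item, inode_item, range_item) and the two helpers are stand-ins invented for illustration; the rb_node, locks and kref from the real structures are left out, and container_of() is redefined locally because this is not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified frequency data: counters only, timestamps omitted. */
struct freq_data {
	uint32_t nr_reads;
	uint32_t nr_writes;
};

/* Common part embedded in both item types (tree node and lock omitted). */
struct comm_item {
	struct freq_data freq;
	int refs;
};

/* One tracked file. */
struct inode_item {
	struct comm_item hot_inode;	/* embedded common part */
	uint64_t i_ino;
};

/* One tracked range inside a file. */
struct range_item {
	struct comm_item hot_range;	/* embedded common part */
	struct inode_item *hot_inode;	/* owning inode item */
	int64_t start;
	size_t len;
};

/* Recover the enclosing inode item from a pointer to its common part. */
struct inode_item *inode_item_of(struct comm_item *ci)
{
	assert(ci != NULL);
	return container_of(ci, struct inode_item, hot_inode);
}

/* Recover the enclosing range item from a pointer to its common part. */
struct range_item *range_item_of(struct comm_item *ci)
{
	assert(ci != NULL);
	return container_of(ci, struct range_item, hot_range);
}
```

The point of the shared common part is that tree walking code only needs to know about one node type, and each caller converts back to the inode or range item it expects.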
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/Makefile | 2 +-
fs/dcache.c | 2 +
fs/hot_tracking.c | 109 ++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 22 ++++++++
include/linux/hot_tracking.h | 78 ++++++++++++++++++++++++++++++
5 files changed, 212 insertions(+), 1 deletions(-)
create mode 100644 fs/hot_tracking.c
create mode 100644 fs/hot_tracking.h
create mode 100644 include/linux/hot_tracking.h
diff --git a/fs/Makefile b/fs/Makefile
index 1d7af79..f966dea 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
- stack.o fs_struct.o statfs.o
+ stack.o fs_struct.o statfs.o hot_tracking.o
ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/fs/dcache.c b/fs/dcache.c
index 3a463d0..7d5be16 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
#include <linux/rculist_bl.h>
#include <linux/prefetch.h>
#include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
#include "internal.h"
#include "mount.h"
@@ -3172,4 +3173,5 @@ void __init vfs_caches_init(unsigned long mempages)
mnt_init();
bdev_cache_init();
chrdev_init();
+ hot_cache_init();
}
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
new file mode 100644
index 0000000..ef7ff09
--- /dev/null
+++ b/fs/hot_tracking.c
@@ -0,0 +1,109 @@
+/*
+ * fs/hot_tracking.c
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#include <linux/list.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/hardirq.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/types.h>
+#include <linux/limits.h>
+#include "hot_tracking.h"
+
+/* kmem_cache pointers for slab caches */
+static struct kmem_cache *hot_inode_item_cachep __read_mostly;
+static struct kmem_cache *hot_range_item_cachep __read_mostly;
+
+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_inode_tree_init(struct hot_info *root)
+{
+ root->hot_inode_tree.map = RB_ROOT;
+ spin_lock_init(&root->lock);
+}
+
+/*
+ * Initialize the hot range tree. Should be called for each new inode
+ * access or other user of the hot_range interface.
+ */
+void hot_range_tree_init(struct hot_inode_item *he)
+{
+ he->hot_range_tree.map = RB_ROOT;
+ spin_lock_init(&he->lock);
+}
+
+/*
+ * Initialize a new hot_range_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using hot_range_item_put()
+ */
+static void hot_range_item_init(struct hot_range_item *hr, loff_t start,
+ struct hot_inode_item *he)
+{
+ hr->start = start;
+ hr->len = RANGE_SIZE;
+ hr->hot_inode = he;
+ kref_init(&hr->hot_range.refs);
+ spin_lock_init(&hr->hot_range.lock);
+ hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
+ hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
+ hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
+}
+
+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using hot_inode_item_put()
+ */
+static void hot_inode_item_init(struct hot_inode_item *he,
+ u64 ino,
+ struct hot_rb_tree *hot_inode_tree)
+{
+ he->i_ino = ino;
+ he->hot_inode_tree = hot_inode_tree;
+ kref_init(&he->hot_inode.refs);
+ spin_lock_init(&he->hot_inode.lock);
+ he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
+ he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
+ he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
+ hot_range_tree_init(he);
+}
+
+/*
+ * Initialize kmem cache for hot_inode_item and hot_range_item.
+ */
+void __init hot_cache_init(void)
+{
+ hot_inode_item_cachep = kmem_cache_create("hot_inode_item",
+ sizeof(struct hot_inode_item), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+ NULL);
+ if (!hot_inode_item_cachep)
+ return;
+
+ hot_range_item_cachep = kmem_cache_create("hot_range_item",
+ sizeof(struct hot_range_item), 0,
+ SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+ NULL);
+ if (!hot_range_item_cachep)
+ goto err;
+
+ return;
+
+err:
+ kmem_cache_destroy(hot_inode_item_cachep);
+}
+EXPORT_SYMBOL_GPL(hot_cache_init);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 0000000..d58a461
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,22 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/workqueue.h>
+#include <linux/hot_tracking.h>
+
+/* values for hot_freq_data flags */
+#define FREQ_DATA_TYPE_INODE (1 << 0)
+#define FREQ_DATA_TYPE_RANGE (1 << 1)
+
+#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
new file mode 100644
index 0000000..294c4c9
--- /dev/null
+++ b/include/linux/hot_tracking.h
@@ -0,0 +1,78 @@
+/*
+ * include/linux/hot_tracking.h
+ *
+ * This file has definitions for VFS hot data tracking
+ * structures etc.
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_HOTTRACK_H
+#define _LINUX_HOTTRACK_H
+
+#include <linux/types.h>
+#include <linux/rbtree.h>
+#include <linux/kref.h>
+#include <linux/fs.h>
+
+struct hot_rb_tree {
+ struct rb_root map;
+};
+
+/*
+ * A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq_data {
+ struct timespec last_read_time;
+ struct timespec last_write_time;
+ u32 nr_reads;
+ u32 nr_writes;
+ u64 avg_delta_reads;
+ u64 avg_delta_writes;
+ u32 flags;
+ u32 last_temp;
+};
+
+/* The common info for both following structures */
+struct hot_comm_item {
+ struct rb_node rb_node; /* rbtree index */
+ struct hot_freq_data hot_freq_data; /* frequency data */
+ spinlock_t lock; /* protects object data */
+ struct kref refs; /* prevents kfree */
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+ struct hot_comm_item hot_inode; /* node in hot_inode_tree */
+ struct hot_rb_tree hot_range_tree; /* tree of ranges */
+ spinlock_t lock; /* protect range tree */
+ struct hot_rb_tree *hot_inode_tree;
+ u64 i_ino; /* inode number from inode */
+};
+
+/*
+ * An item representing a range inside of
+ * an inode whose frequency is being tracked
+ */
+struct hot_range_item {
+ struct hot_comm_item hot_range;
+ struct hot_inode_item *hot_inode; /* associated hot_inode_item */
+ loff_t start; /* item offset in bytes in hot_range_tree */
+ size_t len; /* length in bytes */
+};
+
+struct hot_info {
+ struct hot_rb_tree hot_inode_tree;
+ spinlock_t lock; /*protect inode tree */
+};
+
+extern void __init hot_cache_init(void);
+
+#endif /* _LINUX_HOTTRACK_H */
--
1.7.6.5
* [PATCH RESEND v1 02/16] vfs: add init and cleanup functions
From: zwu.kernel @ 2012-12-20 14:43 UTC
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add an initialization function that creates some
key data structures when hot tracking is enabled,
and clean them up when hot tracking is disabled.
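The cleanup side relies on reference counting: an item starts with one reference and is released when the last reference is dropped. Below is a minimal userspace sketch of that pattern, assuming a plain int counter instead of struct kref. The names item_alloc(), item_get() and item_put() and the freed_flag field are hypothetical and exist only for the demonstration; unlike kref, this counter is not atomic.

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	int refs;		/* plain counter; the patch uses struct kref */
	int *freed_flag;	/* set on release, for demonstration only */
};

/* Allocate an item holding one reference, similar in spirit to kref_init(). */
struct item *item_alloc(int *freed_flag)
{
	struct item *it = calloc(1, sizeof(*it));

	if (!it)
		return NULL;
	it->refs = 1;
	it->freed_flag = freed_flag;
	return it;
}

/* Take one more reference on an item that is still alive. */
void item_get(struct item *it)
{
	assert(it->refs > 0);
	it->refs++;
}

/*
 * Drop one reference. Returns 1 if this was the last reference and the
 * item was released, 0 otherwise.
 */
int item_put(struct item *it)
{
	assert(it->refs > 0);
	if (--it->refs > 0)
		return 0;
	if (it->freed_flag)
		*it->freed_flag = 1;
	free(it);
	return 1;
}
```

In the patch the release callback additionally unlinks the item from its tree before freeing it, which this sketch does not model.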
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/hot_tracking.c | 117 ++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 4 ++
include/linux/hot_tracking.h | 3 +
3 files changed, 124 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index ef7ff09..a73477c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -76,12 +76,94 @@ static void hot_inode_item_init(struct hot_inode_item *he,
he->hot_inode_tree = hot_inode_tree;
kref_init(&he->hot_inode.refs);
spin_lock_init(&he->hot_inode.lock);
+ INIT_LIST_HEAD(&he->hot_inode.n_list);
he->hot_inode.hot_freq_data.avg_delta_reads = (u64) -1;
he->hot_inode.hot_freq_data.avg_delta_writes = (u64) -1;
he->hot_inode.hot_freq_data.flags = FREQ_DATA_TYPE_INODE;
hot_range_tree_init(he);
}
+static void hot_range_item_free(struct kref *kref)
+{
+ struct hot_comm_item *comm_item = container_of(kref,
+ struct hot_comm_item, refs);
+ struct hot_range_item *hr = container_of(comm_item,
+ struct hot_range_item, hot_range);
+
+ rb_erase(&hr->hot_range.rb_node,
+ &hr->hot_inode->hot_range_tree.map);
+ kmem_cache_free(hot_range_item_cachep, hr);
+}
+
+/*
+ * Drops the reference out on hot_range_item by one
+ * and free the structure if the reference count hits zero
+ */
+static void hot_range_item_put(struct hot_range_item *hr)
+{
+ kref_put(&hr->hot_range.refs, hot_range_item_free);
+}
+
+/* Frees the entire hot_range_tree. */
+static void hot_range_tree_free(struct hot_inode_item *he)
+{
+ struct rb_node *node;
+ struct hot_comm_item *ci;
+ struct hot_range_item *hr;
+
+ /* Free hot inode and range trees on fs root */
+ spin_lock(&he->lock);
+ while ((node = rb_first(&he->hot_range_tree.map))) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ hr = container_of(ci,
+ struct hot_range_item, hot_range);
+ hot_range_item_put(hr);
+ }
+ spin_unlock(&he->lock);
+}
+
+static void hot_inode_item_free(struct kref *kref)
+{
+ struct hot_comm_item *comm_item = container_of(kref,
+ struct hot_comm_item, refs);
+ struct hot_inode_item *he = container_of(comm_item,
+ struct hot_inode_item, hot_inode);
+
+ hot_range_tree_free(he);
+ spin_lock(&he->hot_inode.lock);
+ rb_erase(&he->hot_inode.rb_node, &he->hot_inode_tree->map);
+ spin_unlock(&he->hot_inode.lock);
+ kmem_cache_free(hot_inode_item_cachep, he);
+}
+
+/*
+ * Drops the reference out on hot_inode_item by one
+ * and free the structure if the reference count hits zero
+ */
+void hot_inode_item_put(struct hot_inode_item *he)
+{
+ kref_put(&he->hot_inode.refs, hot_inode_item_free);
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_put);
+
+/* Frees the entire hot_inode_tree. */
+static void hot_inode_tree_exit(struct hot_info *root)
+{
+ struct rb_node *node;
+ struct hot_comm_item *ci;
+ struct hot_inode_item *he;
+
+ /* Free hot inode and range trees on fs root */
+ spin_lock(&root->lock);
+ while ((node = rb_first(&root->hot_inode_tree.map))) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ he = container_of(ci,
+ struct hot_inode_item, hot_inode);
+ hot_inode_item_put(he);
+ }
+ spin_unlock(&root->lock);
+}
+
/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
@@ -107,3 +189,38 @@ err:
kmem_cache_destroy(hot_inode_item_cachep);
}
EXPORT_SYMBOL_GPL(hot_cache_init);
+
+/*
+ * Initialize the data structures for hot data tracking.
+ */
+int hot_track_init(struct super_block *sb)
+{
+ struct hot_info *root;
+ int ret = -ENOMEM;
+
+ root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
+ if (!root) {
+ printk(KERN_ERR "%s: Failed to malloc memory for "
+ "hot_info\n", __func__);
+ return ret;
+ }
+
+ hot_inode_tree_init(root);
+
+ sb->s_hot_root = root;
+
+ printk(KERN_INFO "VFS: Turning on hot data tracking\n");
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(hot_track_init);
+
+void hot_track_exit(struct super_block *sb)
+{
+ struct hot_info *root = sb->s_hot_root;
+
+ hot_inode_tree_exit(root);
+ sb->s_hot_root = NULL;
+ kfree(root);
+}
+EXPORT_SYMBOL_GPL(hot_track_exit);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a823d4b..c42dc37 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@
#include <linux/lockdep.h>
#include <linux/percpu-rwsem.h>
#include <linux/blk_types.h>
+#include <linux/hot_tracking.h>
#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -1320,6 +1321,9 @@ struct super_block {
/* Being remounted read-only */
int s_readonly_remount;
+
+ /* Hot data tracking*/
+ struct hot_info *s_hot_root;
};
/* superblock cache pruning functions */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 294c4c9..3b0dfcf 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -74,5 +74,8 @@ struct hot_info {
};
extern void __init hot_cache_init(void);
+extern int hot_track_init(struct super_block *sb);
+extern void hot_track_exit(struct super_block *sb);
+extern void hot_inode_item_put(struct hot_inode_item *he);
#endif /* _LINUX_HOTTRACK_H */
--
1.7.6.5
* [PATCH RESEND v1 03/16] vfs: add I/O frequency update function
From: zwu.kernel @ 2012-12-20 14:43 UTC
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add some utility helpers to update the access frequencies
of one file or of its ranges.
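As a rough illustration of the arithmetic involved, the sketch below mirrors the average update done in hot_rw_freq_calc() and the range index computation done in hot_update_freqs(), rewritten as plain userspace functions. The function names avg_delta_update(), range_first() and range_end() are made up for this sketch, and RANGE_SIZE is widened to 64 bits here, which is an assumption and differs from the (1 << RANGE_BITS) definition in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define FREQ_POWER 4			/* same value as in the patch */
#define RANGE_BITS 20			/* same value as in the patch */
#define RANGE_SIZE (1ULL << RANGE_BITS)	/* 64-bit here, by assumption */

/*
 * One step of the running average of the time between accesses.
 * The new delta is shifted right by FREQ_POWER before it is mixed in,
 * and the sum is shifted right by FREQ_POWER again afterwards.
 */
uint64_t avg_delta_update(uint64_t avg, uint64_t delta_ns)
{
	uint64_t new_delta = delta_ns >> FREQ_POWER;

	avg = (avg << FREQ_POWER) - avg + new_delta;
	return avg >> FREQ_POWER;
}

/* Index of the first RANGE_SIZE-sized range touched by an access. */
uint64_t range_first(uint64_t start)
{
	return start >> RANGE_BITS;
}

/* One past the index of the last range touched by [start, start + len). */
uint64_t range_end(uint64_t start, uint64_t len)
{
	assert(len > 0);
	return (start + len + RANGE_SIZE - 1) >> RANGE_BITS;
}
```

Written out, one step is avg' = (15 * avg + delta / 16) / 16, so if I read the shifts correctly a constant delta makes the stored value settle near delta / 16 rather than delta itself. An access of 2 bytes starting at RANGE_SIZE - 1 touches range indexes 0 and 1, so range_end() returns 2 for it.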
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/hot_tracking.c | 178 ++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 5 +
include/linux/hot_tracking.h | 4 +
3 files changed, 187 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index a73477c..6f587fa 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -164,6 +164,135 @@ static void hot_inode_tree_exit(struct hot_info *root)
spin_unlock(&root->lock);
}
+struct hot_inode_item
+*hot_inode_item_lookup(struct hot_info *root, u64 ino)
+{
+ struct rb_node **p = &root->hot_inode_tree.map.rb_node;
+ struct rb_node *parent = NULL;
+ struct hot_comm_item *ci;
+ struct hot_inode_item *entry;
+
+ /* walk tree to find insertion point */
+ spin_lock(&root->lock);
+ while (*p) {
+ parent = *p;
+ ci = rb_entry(parent, struct hot_comm_item, rb_node);
+ entry = container_of(ci, struct hot_inode_item, hot_inode);
+ if (ino < entry->i_ino)
+ p = &(*p)->rb_left;
+ else if (ino > entry->i_ino)
+ p = &(*p)->rb_right;
+ else {
+ spin_unlock(&root->lock);
+ kref_get(&entry->hot_inode.refs);
+ return entry;
+ }
+ }
+ spin_unlock(&root->lock);
+
+ entry = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
+ if (!entry)
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock(&root->lock);
+ hot_inode_item_init(entry, ino, &root->hot_inode_tree);
+ rb_link_node(&entry->hot_inode.rb_node, parent, p);
+ rb_insert_color(&entry->hot_inode.rb_node,
+ &root->hot_inode_tree.map);
+ spin_unlock(&root->lock);
+
+ kref_get(&entry->hot_inode.refs);
+ return entry;
+}
+EXPORT_SYMBOL_GPL(hot_inode_item_lookup);
+
+static loff_t hot_range_end(struct hot_range_item *hr)
+{
+ if (hr->start + hr->len < hr->start)
+ return (loff_t)-1;
+
+ return hr->start + hr->len - 1;
+}
+
+static struct hot_range_item
+*hot_range_item_lookup(struct hot_inode_item *he,
+ loff_t start)
+{
+ struct rb_node **p = &he->hot_range_tree.map.rb_node;
+ struct rb_node *parent = NULL;
+ struct hot_comm_item *ci;
+ struct hot_range_item *entry;
+
+ /* walk tree to find insertion point */
+ spin_lock(&he->lock);
+ while (*p) {
+ parent = *p;
+ ci = rb_entry(parent, struct hot_comm_item, rb_node);
+ entry = container_of(ci, struct hot_range_item, hot_range);
+ if (start < entry->start)
+ p = &(*p)->rb_left;
+ else if (start > hot_range_end(entry))
+ p = &(*p)->rb_right;
+ else {
+ spin_unlock(&he->lock);
+ kref_get(&entry->hot_range.refs);
+ return entry;
+ }
+ }
+ spin_unlock(&he->lock);
+
+ entry = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
+ if (!entry)
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock(&he->lock);
+ hot_range_item_init(entry, start, he);
+ rb_link_node(&entry->hot_range.rb_node, parent, p);
+ rb_insert_color(&entry->hot_range.rb_node,
+ &he->hot_range_tree.map);
+ spin_unlock(&he->lock);
+
+ kref_get(&entry->hot_range.refs);
+ return entry;
+}
+
+/*
+ * This function does the actual work of updating
+ * the frequency numbers, whatever they turn out to be.
+ */
+static void hot_rw_freq_calc(struct timespec old_atime,
+ struct timespec cur_time, u64 *avg)
+{
+ struct timespec delta_ts;
+ u64 new_delta;
+
+ delta_ts = timespec_sub(cur_time, old_atime);
+ new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
+
+ *avg = (*avg << FREQ_POWER) - *avg + new_delta;
+ *avg = *avg >> FREQ_POWER;
+}
+
+static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+{
+ struct timespec cur_time = current_kernel_time();
+
+ if (write) {
+ freq_data->nr_writes += 1;
+ hot_rw_freq_calc(freq_data->last_write_time,
+ cur_time,
+ &freq_data->avg_delta_writes);
+ freq_data->last_write_time = cur_time;
+ } else {
+ freq_data->nr_reads += 1;
+ hot_rw_freq_calc(freq_data->last_read_time,
+ cur_time,
+ &freq_data->avg_delta_reads);
+ freq_data->last_read_time = cur_time;
+ }
+}
+
/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
@@ -191,6 +320,55 @@ err:
EXPORT_SYMBOL_GPL(hot_cache_init);
/*
+ * Main function to update access frequency from read/writepage(s) hooks
+ */
+void hot_update_freqs(struct inode *inode, loff_t start,
+ size_t len, int rw)
+{
+ struct hot_info *root = inode->i_sb->s_hot_root;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+ loff_t cur, end;
+
+ if (!root || (len == 0))
+ return;
+
+ he = hot_inode_item_lookup(root, inode->i_ino);
+ if (IS_ERR(he)) {
+ WARN_ON(1);
+ return;
+ }
+
+ spin_lock(&he->hot_inode.lock);
+ hot_freq_data_update(&he->hot_inode.hot_freq_data, rw);
+ spin_unlock(&he->hot_inode.lock);
+
+ /*
+ * Align ranges on RANGE_SIZE boundary
+ * to prevent proliferation of range structs
+ */
+ end = (start + len + RANGE_SIZE - 1) >> RANGE_BITS;
+ for (cur = (start >> RANGE_BITS); cur < end; cur++) {
+ hr = hot_range_item_lookup(he, cur);
+ if (IS_ERR(hr)) {
+ WARN(1, "hot_range_item_lookup returns %ld\n",
+ PTR_ERR(hr));
+ hot_inode_item_put(he);
+ return;
+ }
+
+ spin_lock(&hr->hot_range.lock);
+ hot_freq_data_update(&hr->hot_range.hot_freq_data, rw);
+ spin_unlock(&hr->hot_range.lock);
+
+ hot_range_item_put(hr);
+ }
+
+ hot_inode_item_put(he);
+}
+EXPORT_SYMBOL_GPL(hot_update_freqs);
+
+/*
* Initialize the data structures for hot data tracking.
*/
int hot_track_init(struct super_block *sb)
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index d58a461..8571186 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -19,4 +19,9 @@
#define FREQ_DATA_TYPE_INODE (1 << 0)
#define FREQ_DATA_TYPE_RANGE (1 << 1)
+/* size of sub-file ranges */
+#define RANGE_BITS 20
+#define RANGE_SIZE (1 << RANGE_BITS)
+#define FREQ_POWER 4
+
#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 3b0dfcf..d555046 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -77,5 +77,9 @@ extern void __init hot_cache_init(void);
extern int hot_track_init(struct super_block *sb);
extern void hot_track_exit(struct super_block *sb);
extern void hot_inode_item_put(struct hot_inode_item *he);
+extern void hot_update_freqs(struct inode *inode, loff_t start,
+ size_t len, int rw);
+extern struct hot_inode_item *hot_inode_item_lookup(struct hot_info *root,
+ u64 ino);
#endif /* _LINUX_HOTTRACK_H */
--
1.7.6.5
* [PATCH RESEND v1 04/16] vfs: add two map arrays
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add two map arrays, one for inodes and one for ranges. Each array
holds HEAT_MAP_SIZE list heads, one per temperature level, so that
the data temperature of a file or of its ranges can be looked up
efficiently.
Each list head in a map array records the temperature it
represents.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/hot_tracking.c | 67 ++++++++++++++++++++++++++++++++++++++++++
include/linux/hot_tracking.h | 17 ++++++++++
2 files changed, 84 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 6f587fa..5f164c8 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -58,6 +58,7 @@ static void hot_range_item_init(struct hot_range_item *hr, loff_t start,
hr->hot_inode = he;
kref_init(&hr->hot_range.refs);
spin_lock_init(&hr->hot_range.lock);
+ INIT_LIST_HEAD(&hr->hot_range.n_list);
hr->hot_range.hot_freq_data.avg_delta_reads = (u64) -1;
hr->hot_range.hot_freq_data.avg_delta_writes = (u64) -1;
hr->hot_range.hot_freq_data.flags = FREQ_DATA_TYPE_RANGE;
@@ -89,9 +90,20 @@ static void hot_range_item_free(struct kref *kref)
struct hot_comm_item, refs);
struct hot_range_item *hr = container_of(comm_item,
struct hot_range_item, hot_range);
+ struct hot_info *root = container_of(
+ hr->hot_inode->hot_inode_tree,
+ struct hot_info, hot_inode_tree);
+
+ spin_lock(&hr->hot_range.lock);
+ if (!list_empty(&hr->hot_range.n_list)) {
+ list_del_init(&hr->hot_range.n_list);
+ root->hot_map_nr--;
+ }
rb_erase(&hr->hot_range.rb_node,
&hr->hot_inode->hot_range_tree.map);
+ spin_unlock(&hr->hot_range.lock);
+
kmem_cache_free(hot_range_item_cachep, hr);
}
@@ -128,6 +140,15 @@ static void hot_inode_item_free(struct kref *kref)
struct hot_comm_item, refs);
struct hot_inode_item *he = container_of(comm_item,
struct hot_inode_item, hot_inode);
+ struct hot_info *root = container_of(he->hot_inode_tree,
+ struct hot_info, hot_inode_tree);
+
+ spin_lock(&he->hot_inode.lock);
+ if (!list_empty(&he->hot_inode.n_list)) {
+ list_del_init(&he->hot_inode.n_list);
+ root->hot_map_nr--;
+ }
+ spin_unlock(&he->hot_inode.lock);
hot_range_tree_free(he);
spin_lock(&he->hot_inode.lock);
@@ -294,6 +315,50 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
}
/*
+ * Initialize inode and range map info.
+ */
+static void hot_map_init(struct hot_info *root)
+{
+ int i;
+ for (i = 0; i < HEAT_MAP_SIZE; i++) {
+ INIT_LIST_HEAD(&root->heat_inode_map[i].node_list);
+ INIT_LIST_HEAD(&root->heat_range_map[i].node_list);
+ root->heat_inode_map[i].temp = i;
+ root->heat_range_map[i].temp = i;
+ spin_lock_init(&root->heat_inode_map[i].lock);
+ spin_lock_init(&root->heat_range_map[i].lock);
+ }
+}
+
+static void hot_map_list_free(struct list_head *node_list,
+ struct hot_info *root)
+{
+ struct list_head *pos, *next;
+ struct hot_comm_item *node;
+
+ list_for_each_safe(pos, next, node_list) {
+ node = list_entry(pos, struct hot_comm_item, n_list);
+ list_del_init(&node->n_list);
+ root->hot_map_nr--;
+ }
+
+}
+
+/* Free inode and range map info */
+static void hot_map_exit(struct hot_info *root)
+{
+ int i;
+ for (i = 0; i < HEAT_MAP_SIZE; i++) {
+ spin_lock(&root->heat_inode_map[i].lock);
+ hot_map_list_free(&root->heat_inode_map[i].node_list, root);
+ spin_unlock(&root->heat_inode_map[i].lock);
+ spin_lock(&root->heat_range_map[i].lock);
+ hot_map_list_free(&root->heat_range_map[i].node_list, root);
+ spin_unlock(&root->heat_range_map[i].lock);
+ }
+}
+
+/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
void __init hot_cache_init(void)
@@ -384,6 +449,7 @@ int hot_track_init(struct super_block *sb)
}
hot_inode_tree_init(root);
+ hot_map_init(root);
sb->s_hot_root = root;
@@ -397,6 +463,7 @@ void hot_track_exit(struct super_block *sb)
{
struct hot_info *root = sb->s_hot_root;
+ hot_map_exit(root);
hot_inode_tree_exit(root);
sb->s_hot_root = NULL;
kfree(root);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index d555046..23edad92 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -20,6 +20,9 @@
#include <linux/kref.h>
#include <linux/fs.h>
+#define HEAT_MAP_BITS 8
+#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)
+
struct hot_rb_tree {
struct rb_root map;
};
@@ -40,12 +43,20 @@ struct hot_freq_data {
u32 last_temp;
};
+/* List heads in hot map array */
+struct hot_map_head {
+ struct list_head node_list;
+ u8 temp;
+ spinlock_t lock;
+};
+
/* The common info for both following structures */
struct hot_comm_item {
struct rb_node rb_node; /* rbtree index */
struct hot_freq_data hot_freq_data; /* frequency data */
spinlock_t lock; /* protects object data */
struct kref refs; /* prevents kfree */
+ struct list_head n_list; /* list node index */
};
/* An item representing an inode and its access frequency */
@@ -71,6 +82,12 @@ struct hot_range_item {
struct hot_info {
struct hot_rb_tree hot_inode_tree;
spinlock_t lock; /*protect inode tree */
+
+ /* map of inode temperature */
+ struct hot_map_head heat_inode_map[HEAT_MAP_SIZE];
+ /* map of range temperature */
+ struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
+ unsigned int hot_map_nr;
};
extern void __init hot_cache_init(void);
--
1.7.6.5
* [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (3 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 04/16] vfs: add two map arrays zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2013-01-10 0:52 ` David Sterba
2012-12-20 14:43 ` [PATCH RESEND v1 06/16] vfs: add temp calculation function zwu.kernel
` (14 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add hot_update_freqs() hooks to the direct I/O, page cache read,
readahead and writeback paths, so that reads and writes going through
these paths update the hot data tracking information.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
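Note for reviewers: every hook in this patch hands hot_update_freqs() a
byte offset and a length, and patch 03 rounds that span out to
RANGE_SIZE-aligned ranges. The user-space sketch below only illustrates
that arithmetic. first_range() and end_range() are hypothetical helper
names, not functions from this series, and the 4 KiB page size used in
the note after the code is an assumption.

```c
#include <stdint.h>

#define RANGE_BITS 20                     /* value from fs/hot_tracking.h */
#define RANGE_SIZE (1ULL << RANGE_BITS)   /* 1 MiB per tracked range */

/* Hypothetical helper: index of the first range touched by the access. */
static uint64_t first_range(uint64_t start)
{
	return start >> RANGE_BITS;
}

/*
 * Hypothetical helper: bound of the range loop in hot_update_freqs(),
 * using the same expression as patch 03. It is always at least one
 * past the last range touched by [start, start + len).
 */
static uint64_t end_range(uint64_t start, uint64_t len)
{
	return (start + len + RANGE_SIZE - 1) >> RANGE_BITS;
}
```

Assuming 4 KiB pages, a single-page read at page index 256 starts at
byte offset 1 MiB, so first_range() gives 1 and end_range() gives 2:
only range 1 is updated.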
fs/direct-io.c | 6 ++++++
mm/filemap.c | 6 ++++++
mm/page-writeback.c | 12 ++++++++++++
mm/readahead.c | 7 +++++++
4 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index cf5b44b..834d142 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -37,6 +37,7 @@
#include <linux/uio.h>
#include <linux/atomic.h>
#include <linux/prefetch.h>
+#include "hot_tracking.h"
/*
* How many user pages to map in one call to get_user_pages(). This determines
@@ -1299,6 +1300,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
prefetch(bdev->bd_queue);
prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
+ /* Hot data tracking */
+ hot_update_freqs(inode, offset,
+ iov_length(iov, nr_segs),
+ rw & WRITE);
+
return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
nr_segs, get_block, end_io,
submit_io, flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..6141374 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/cleancache.h>
+#include <linux/hot_tracking.h>
#include "internal.h"
/*
@@ -1224,6 +1225,11 @@ readpage:
* PG_error will be set again if readpage fails.
*/
ClearPageError(page);
+
+ /* Hot data tracking */
+ hot_update_freqs(inode, (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+
/* Start the actual read. The read will unlock the page. */
error = mapping->a_ops->readpage(filp, page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6f42712..7c5739f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h> /* __set_page_dirty_buffers */
#include <linux/pagevec.h>
#include <linux/timer.h>
+#include <linux/hot_tracking.h>
#include <trace/events/writeback.h>
/*
@@ -1902,13 +1903,24 @@ EXPORT_SYMBOL(generic_writepages);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
int ret;
+ loff_t start = 0;
+ size_t count = 0;
if (wbc->nr_to_write <= 0)
return 0;
+
+ start = (loff_t)mapping->writeback_index << PAGE_CACHE_SHIFT;
+ count = wbc->nr_to_write;
+
if (mapping->a_ops->writepages)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+ /* Hot data tracking */
+ hot_update_freqs(mapping->host, start,
+ (count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
+
return ret;
}
diff --git a/mm/readahead.c b/mm/readahead.c
index 7963f23..d1ab688 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -19,6 +19,7 @@
#include <linux/pagemap.h>
#include <linux/syscalls.h>
#include <linux/file.h>
+#include <linux/hot_tracking.h>
/*
* Initialise a struct file's readahead state. Assumes that the caller has
@@ -138,6 +139,12 @@ static int read_pages(struct address_space *mapping, struct file *filp,
out:
blk_finish_plug(&plug);
+ /* Hot data tracking */
+ hot_update_freqs(mapping->host,
+ (loff_t)(list_entry(pages->prev, struct page, lru)->index)
+ << PAGE_CACHE_SHIFT,
+ (size_t)nr_pages * PAGE_CACHE_SIZE, 0);
+
return ret;
}
--
1.7.6.5
* [PATCH RESEND v1 06/16] vfs: add temp calculation function
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (4 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2013-01-10 0:53 ` David Sterba
2012-12-20 14:43 ` [PATCH RESEND v1 07/16] vfs: add map info update function zwu.kernel
` (13 subsequent siblings)
19 siblings, 1 reply; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
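A user-space restatement of the arithmetic in hot_temp_calc(), offered
as a sketch for review rather than as the code in this patch. The
struct freq_sample type and the count_heat(), age_heat(), delta_heat()
and temp_calc() names are hypothetical. Timestamps are plain
nanosecond counts and the current time is passed in as a parameter,
which is an assumption made so the sketch can be exercised outside the
kernel. The 32-bit counter widths are also assumed.

```c
#include <stdint.h>

/* Constants copied from the fs/hot_tracking.h hunk in this patch. */
#define NRR_MULTIPLIER_POWER 20
#define NRR_COEFF_POWER 0
#define NRW_MULTIPLIER_POWER 20
#define NRW_COEFF_POWER 0
#define LTR_DIVIDER_POWER 30
#define LTR_COEFF_POWER 1
#define LTW_DIVIDER_POWER 30
#define LTW_COEFF_POWER 1
#define AVR_DIVIDER_POWER 40
#define AVR_COEFF_POWER 0
#define AVW_DIVIDER_POWER 40
#define AVW_COEFF_POWER 0

/* Hypothetical stand-in for struct hot_freq_data, times in ns. */
struct freq_sample {
	uint32_t nr_reads, nr_writes;
	uint64_t last_read_ns, last_write_ns;
	uint64_t avg_delta_reads, avg_delta_writes;
};

/* Access count term: scaled count, truncated to 32 bits as in the patch. */
static uint32_t count_heat(uint32_t nr, unsigned mult, unsigned coeff)
{
	uint32_t heat = (uint32_t)((uint64_t)nr << mult);

	return heat >> (3 - coeff);
}

/* Age term: the more recent the last access, the larger the value. */
static uint32_t age_heat(uint64_t age_ns, unsigned div, unsigned coeff)
{
	uint64_t heat = age_ns >> div;

	heat = (heat >= (1ULL << 32)) ? 0 : (1ULL << 32) - heat;
	return (uint32_t)(heat >> (3 - coeff));
}

/* Average delta term: the smaller the average gap, the larger the value. */
static uint32_t delta_heat(uint64_t avg_ns, unsigned div, unsigned coeff)
{
	uint64_t heat = (UINT64_MAX - avg_ns) >> div;

	if (heat >= (1ULL << 32))
		heat = UINT32_MAX;
	return (uint32_t)(heat >> (3 - coeff));
}

/* Sum of the six terms; the u32 addition may wrap, as in the patch. */
static uint32_t temp_calc(const struct freq_sample *f, uint64_t now_ns)
{
	return count_heat(f->nr_reads, NRR_MULTIPLIER_POWER, NRR_COEFF_POWER) +
	       count_heat(f->nr_writes, NRW_MULTIPLIER_POWER, NRW_COEFF_POWER) +
	       age_heat(now_ns - f->last_read_ns, LTR_DIVIDER_POWER, LTR_COEFF_POWER) +
	       age_heat(now_ns - f->last_write_ns, LTW_DIVIDER_POWER, LTW_COEFF_POWER) +
	       delta_heat(f->avg_delta_reads, AVR_DIVIDER_POWER, AVR_COEFF_POWER) +
	       delta_heat(f->avg_delta_writes, AVW_DIVIDER_POWER, AVW_COEFF_POWER);
}
```

For an item last touched exactly at now_ns, with no accesses counted
and both averages still at their initial (u64) -1, the two age terms
contribute 2^30 each and every other term is zero, which gives
0x80000000.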
fs/hot_tracking.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 21 +++++++++++++++
2 files changed, 95 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 5f164c8..ba4decf 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -25,6 +25,14 @@
static struct kmem_cache *hot_inode_item_cachep __read_mostly;
static struct kmem_cache *hot_range_item_cachep __read_mostly;
+static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
+{
+ if (dir)
+ return counter << bits;
+ else
+ return counter >> bits;
+}
+
/*
* Initialize the inode tree. Should be called for each new inode
* access or other user of the hot_inode interface.
@@ -315,6 +323,72 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
}
/*
+ * hot_temp_calc() is responsible for distilling the six heat
+ * criteria down into a single temperature value for the data,
+ * which is an integer between 0 and HEAT_MAX_VALUE.
+ */
+static u32 hot_temp_calc(struct hot_freq_data *freq_data)
+{
+ u32 result = 0;
+
+ struct timespec ckt = current_kernel_time();
+ u64 cur_time = timespec_to_ns(&ckt);
+
+ u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
+ NRR_MULTIPLIER_POWER, true);
+ u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
+ NRW_MULTIPLIER_POWER, true);
+
+ u64 ltr_heat =
+ hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
+ LTR_DIVIDER_POWER, false);
+ u64 ltw_heat =
+ hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
+ LTW_DIVIDER_POWER, false);
+
+ u64 avr_heat =
+ hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
+ AVR_DIVIDER_POWER, false);
+ u64 avw_heat =
+ hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
+ AVW_DIVIDER_POWER, false);
+
+ /* ltr_heat is now guaranteed to be u32 safe */
+ if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
+ ltr_heat = 0;
+ else
+ ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
+
+ /* ltw_heat is now guaranteed to be u32 safe */
+ if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
+ ltw_heat = 0;
+ else
+ ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
+
+ /* avr_heat is now guaranteed to be u32 safe */
+ if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
+ avr_heat = (u32) -1;
+
+ /* avw_heat is now guaranteed to be u32 safe */
+ if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
+ avw_heat = (u32) -1;
+
+ nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
+ (3 - NRR_COEFF_POWER), false);
+ nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
+ (3 - NRW_COEFF_POWER), false);
+ ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
+ ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
+ avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
+ avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
+
+ result = nrr_heat + nrw_heat + (u32) ltr_heat +
+ (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
+
+ return result;
+}
+
+/*
* Initialize inode and range map info.
*/
static void hot_map_init(struct hot_info *root)
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 8571186..f33066f 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -24,4 +24,25 @@
#define RANGE_SIZE (1 << RANGE_BITS)
#define FREQ_POWER 4
+/* NRR/NRW heat unit = 2^X accesses */
+#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
+#define NRR_COEFF_POWER 0
+#define NRW_MULTIPLIER_POWER 20 /* NRW - number of writes since mount */
+#define NRW_COEFF_POWER 0
+
+/* LTR/LTW heat unit = 2^X ns of age */
+#define LTR_DIVIDER_POWER 30 /* LTR - time elapsed since last read(ns) */
+#define LTR_COEFF_POWER 1
+#define LTW_DIVIDER_POWER 30 /* LTW - time elapsed since last write(ns) */
+#define LTW_COEFF_POWER 1
+
+/*
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+ */
+#define AVR_DIVIDER_POWER 40 /* AVR - average delta between recent reads(ns) */
+#define AVR_COEFF_POWER 0
+#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
+#define AVW_COEFF_POWER 0
+
#endif /* __HOT_TRACKING__ */
--
1.7.6.5
* [PATCH RESEND v1 07/16] vfs: add map info update function
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (5 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 06/16] vfs: add temp calculation function zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 08/16] vfs: add aging function zwu.kernel
` (12 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
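For review, a small user-space sketch of how hot_map_update() picks a
list: the top HEAT_MAP_BITS bits of the 32-bit temperature select one
of the HEAT_MAP_SIZE buckets, and an item is relinked only when it is
not on a list yet or when that bucket index changes. heat_bucket() and
needs_move() are hypothetical names used for illustration only.

```c
#include <stdint.h>

#define HEAT_MAP_BITS 8                     /* from include/linux/hot_tracking.h */
#define HEAT_MAP_SIZE (1 << HEAT_MAP_BITS)  /* 256 buckets */

/* Hypothetical helper: bucket index is the top 8 bits of the temperature. */
static uint8_t heat_bucket(uint32_t temp)
{
	return (uint8_t)(temp >> (32 - HEAT_MAP_BITS));
}

/*
 * Hypothetical helper: nonzero when the item has to be (re)linked,
 * mirroring the "list_empty() || a_temp != b_temp" test in the patch.
 */
static int needs_move(uint32_t new_temp, uint32_t last_temp, int on_list)
{
	return !on_list || heat_bucket(new_temp) != heat_bucket(last_temp);
}
```

A temperature change inside the same bucket, for example from
0x80FFFFFF to 0x80000000, leaves the item where it is; a change from
0x80FFFFFF to 0x81000000 moves it from bucket 128 to bucket 129.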
fs/hot_tracking.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 67 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index ba4decf..6a5ef53 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -389,6 +389,73 @@ static u32 hot_temp_calc(struct hot_freq_data *freq_data)
}
/*
+ * Calculate a new temperature and, if necessary,
+ * move the list_head corresponding to this inode or range
+ * to the proper list with the new temperature
+ */
+static void hot_map_update(struct hot_freq_data *freq_data,
+ struct hot_info *root)
+{
+ struct hot_map_head *buckets, *cur_bucket;
+ struct hot_comm_item *comm_item;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+ u32 temp = hot_temp_calc(freq_data);
+ u8 a_temp = (u8)hot_raw_shift((u64)temp, (32 - HEAT_MAP_BITS), false);
+ u8 b_temp = (u8)hot_raw_shift((u64)freq_data->last_temp,
+ (32 - HEAT_MAP_BITS), false);
+
+ comm_item = container_of(freq_data,
+ struct hot_comm_item, hot_freq_data);
+
+ if (freq_data->flags & FREQ_DATA_TYPE_INODE) {
+ he = container_of(comm_item,
+ struct hot_inode_item, hot_inode);
+ buckets = root->heat_inode_map;
+
+ if (he == NULL)
+ return;
+
+ spin_lock(&he->hot_inode.lock);
+ if (list_empty(&he->hot_inode.n_list) || (a_temp != b_temp)) {
+ if (!list_empty(&he->hot_inode.n_list)) {
+ list_del_init(&he->hot_inode.n_list);
+ root->hot_map_nr--;
+ }
+
+ cur_bucket = buckets + a_temp;
+ list_add_tail(&he->hot_inode.n_list,
+ &cur_bucket->node_list);
+ root->hot_map_nr++;
+ freq_data->last_temp = temp;
+ }
+ spin_unlock(&he->hot_inode.lock);
+ } else if (freq_data->flags & FREQ_DATA_TYPE_RANGE) {
+ hr = container_of(comm_item,
+ struct hot_range_item, hot_range);
+ buckets = root->heat_range_map;
+
+ if (hr == NULL)
+ return;
+
+ spin_lock(&hr->hot_range.lock);
+ if (list_empty(&hr->hot_range.n_list) || (a_temp != b_temp)) {
+ if (!list_empty(&hr->hot_range.n_list)) {
+ list_del_init(&hr->hot_range.n_list);
+ root->hot_map_nr--;
+ }
+
+ cur_bucket = buckets + a_temp;
+ list_add_tail(&hr->hot_range.n_list,
+ &cur_bucket->node_list);
+ root->hot_map_nr++;
+ freq_data->last_temp = temp;
+ }
+ spin_unlock(&hr->hot_range.lock);
+ }
+}
+
+/*
* Initialize inode and range map info.
*/
static void hot_map_init(struct hot_info *root)
--
1.7.6.5
* [PATCH RESEND v1 08/16] vfs: add aging function
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (6 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 07/16] vfs: add map info update function zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 09/16] vfs: add one work queue zwu.kernel
` (11 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
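The aging test in hot_is_obsolete() can be restated in user space as
below. This is a sketch under assumptions: is_obsolete() is a
hypothetical name, and timestamps are plain nanosecond counts passed in
by the caller instead of being read with current_kernel_time().

```c
#include <stdbool.h>
#include <stdint.h>

#define TIME_TO_KICK 300               /* seconds, from fs/hot_tracking.h */
#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: an item is obsolete only when both its last
 * read and its last write are older than TIME_TO_KICK seconds.
 */
static bool is_obsolete(uint64_t now_ns, uint64_t last_read_ns,
			uint64_t last_write_ns)
{
	uint64_t kick_ns = TIME_TO_KICK * NSEC_PER_SEC;

	return (now_ns - last_read_ns) > kick_ns &&
	       (now_ns - last_write_ns) > kick_ns;
}
```

Both comparisons are strict, so an item whose last access is exactly
TIME_TO_KICK seconds old is still kept, and a recent write alone is
enough to keep a range that has not been read for a long time.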
fs/hot_tracking.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 6 ++++++
2 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 6a5ef53..45d0164 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -388,6 +388,24 @@ static u32 hot_temp_calc(struct hot_freq_data *freq_data)
return result;
}
+static bool hot_is_obsolete(struct hot_freq_data *freq_data)
+{
+ int ret = 0;
+ struct timespec ckt = current_kernel_time();
+
+ u64 cur_time = timespec_to_ns(&ckt);
+ u64 last_read_ns =
+ (cur_time - timespec_to_ns(&freq_data->last_read_time));
+ u64 last_write_ns =
+ (cur_time - timespec_to_ns(&freq_data->last_write_time));
+ u64 kick_ns = TIME_TO_KICK * NSEC_PER_SEC;
+
+ if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
+ ret = 1;
+
+ return ret;
+}
+
/*
* Calculate a new temperature and, if necessary,
* move the list_head corresponding to this inode or range
@@ -455,6 +473,37 @@ static void hot_map_update(struct hot_freq_data *freq_data,
}
}
+/* Update temperatures for each range item for aging purposes */
+static void hot_range_update(struct hot_inode_item *he,
+ struct hot_info *root)
+{
+ struct rb_node *node;
+ struct hot_comm_item *ci;
+ struct hot_range_item *hr;
+ bool obsolete;
+
+ spin_lock(&he->lock);
+ node = rb_first(&he->hot_range_tree.map);
+ while (node) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ hr = container_of(ci, struct hot_range_item, hot_range);
+ kref_get(&hr->hot_range.refs);
+ hot_map_update(&hr->hot_range.hot_freq_data, root);
+
+ spin_lock(&hr->hot_range.lock);
+ obsolete = hot_is_obsolete(
+ &hr->hot_range.hot_freq_data);
+ spin_unlock(&hr->hot_range.lock);
+
+ node = rb_next(node);
+
+ hot_range_item_put(hr);
+ if (obsolete)
+ hot_range_item_put(hr);
+ }
+ spin_unlock(&he->lock);
+}
+
/*
* Initialize inode and range map info.
*/
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index f33066f..46d068a 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -24,6 +24,12 @@
#define RANGE_SIZE (1 << RANGE_BITS)
#define FREQ_POWER 4
+/*
+ * time to quit keeping track of
+ * tracking data (seconds)
+ */
+#define TIME_TO_KICK 300
+
/* NRR/NRW heat unit = 2^X accesses */
#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
#define NRR_COEFF_POWER 0
--
1.7.6.5
* [PATCH RESEND v1 09/16] vfs: add one work queue
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (7 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 08/16] vfs: add aging function zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 10/16] vfs: add FS hot type support zwu.kernel
` (10 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add a per-superblock workqueue and a delayed_work
to run periodic work to update map info on each superblock.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
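The worker sorts each bucket list so that the hottest items come first.
The sketch below shows the same ordering in user space with qsort() on
a plain array; temp_cmp_desc() and sort_hottest_first() are
hypothetical names. It compares the unsigned values directly instead of
taking a signed difference as hot_temp_cmp() does. Within one bucket
the two forms appear to agree, because items that share the top
HEAT_MAP_BITS bits differ by less than 2^24.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical comparator: higher temperature sorts first. */
static int temp_cmp_desc(const void *a, const void *b)
{
	uint32_t at = *(const uint32_t *)a;
	uint32_t bt = *(const uint32_t *)b;

	if (at > bt)
		return -1;
	if (at < bt)
		return 1;
	return 0;
}

/* Hypothetical helper: sort an array of temperatures, hottest first. */
static void sort_hottest_first(uint32_t *temps, size_t n)
{
	qsort(temps, n, sizeof(*temps), temp_cmp_desc);
}
```

Unlike list_sort(), qsort() is not guaranteed to be stable, so items
with equal temperature may end up in either order in this sketch.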
fs/hot_tracking.c | 85 ++++++++++++++++++++++++++++++++++++++++++
fs/hot_tracking.h | 3 +
include/linux/hot_tracking.h | 3 +
3 files changed, 91 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 45d0164..383cc54 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -15,9 +15,12 @@
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/hardirq.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/types.h>
+#include <linux/list_sort.h>
#include <linux/limits.h>
#include "hot_tracking.h"
@@ -548,6 +551,67 @@ static void hot_map_exit(struct hot_info *root)
}
}
+/* Temperature compare function */
+static int hot_temp_cmp(void *priv, struct list_head *a,
+ struct list_head *b)
+{
+ struct hot_comm_item *ap =
+ container_of(a, struct hot_comm_item, n_list);
+ struct hot_comm_item *bp =
+ container_of(b, struct hot_comm_item, n_list);
+
+ int diff = ap->hot_freq_data.last_temp
+ - bp->hot_freq_data.last_temp;
+ if (diff > 0)
+ return -1;
+ if (diff < 0)
+ return 1;
+ return 0;
+}
+
+/*
+ * Every sync period we update temperatures for
+ * each hot inode item and hot range item for aging
+ * purposes.
+ */
+static void hot_update_worker(struct work_struct *work)
+{
+ struct hot_info *root = container_of(to_delayed_work(work),
+ struct hot_info, update_work);
+ struct rb_node *node;
+ struct hot_comm_item *ci;
+ struct hot_inode_item *he;
+ int i;
+
+ node = rb_first(&root->hot_inode_tree.map);
+ while (node) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ he = container_of(ci, struct hot_inode_item, hot_inode);
+ kref_get(&he->hot_inode.refs);
+ hot_map_update(
+ &he->hot_inode.hot_freq_data, root);
+ hot_range_update(he, root);
+ node = rb_next(node);
+ hot_inode_item_put(he);
+ }
+
+ /* Sort temperature map info */
+ for (i = 0; i < HEAT_MAP_SIZE; i++) {
+ spin_lock(&root->heat_inode_map[i].lock);
+ list_sort(NULL, &root->heat_inode_map[i].node_list,
+ hot_temp_cmp);
+ spin_unlock(&root->heat_inode_map[i].lock);
+ spin_lock(&root->heat_range_map[i].lock);
+ list_sort(NULL, &root->heat_range_map[i].node_list,
+ hot_temp_cmp);
+ spin_unlock(&root->heat_range_map[i].lock);
+ }
+
+ /* Insert next delayed work */
+ queue_delayed_work(root->update_wq, &root->update_work,
+ msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+}
+
/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
@@ -641,11 +705,30 @@ int hot_track_init(struct super_block *sb)
hot_inode_tree_init(root);
hot_map_init(root);
+ root->update_wq = alloc_workqueue(
+ "hot_update_wq", WQ_NON_REENTRANT, 0);
+ if (!root->update_wq) {
+ printk(KERN_ERR "%s: Failed to create "
+ "hot update workqueue\n", __func__);
+ goto failed_wq;
+ }
+
+ /* Initialize hot tracking wq and arm one delayed work */
+ INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
+ queue_delayed_work(root->update_wq, &root->update_work,
+ msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+
sb->s_hot_root = root;
printk(KERN_INFO "VFS: Turning on hot data tracking\n");
return 0;
+
+failed_wq:
+ hot_map_exit(root);
+ hot_inode_tree_exit(root);
+ kfree(root);
+ return ret;
}
EXPORT_SYMBOL_GPL(hot_track_init);
@@ -653,6 +736,8 @@ void hot_track_exit(struct super_block *sb)
{
struct hot_info *root = sb->s_hot_root;
+ cancel_delayed_work_sync(&root->update_work);
+ destroy_workqueue(root->update_wq);
hot_map_exit(root);
hot_inode_tree_exit(root);
sb->s_hot_root = NULL;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 46d068a..96379a6 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -30,6 +30,9 @@
*/
#define TIME_TO_KICK 300
+/* set how often to update temperatures (seconds) */
+#define HEAT_UPDATE_DELAY 300
+
/* NRR/NRW heat unit = 2^X accesses */
#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
#define NRR_COEFF_POWER 0
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 23edad92..1feead2 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -88,6 +88,9 @@ struct hot_info {
/* map of range temperature */
struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
unsigned int hot_map_nr;
+
+ struct workqueue_struct *update_wq;
+ struct delayed_work update_work;
};
extern void __init hot_cache_init(void);
--
1.7.6.5
* [PATCH RESEND v1 10/16] vfs: add FS hot type support
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (8 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 09/16] vfs: add one work queue zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 11/16] vfs: register one shrinker zwu.kernel
` (9 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Introduce a way for a specific FS to inject
its own hot tracking type.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
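The default handling in hot_track_init() follows a common pattern: any
callback or parameter the filesystem leaves zeroed is replaced by the
generic one. A reduced user-space sketch of that pattern follows. All
names ending in _sketch, the single simplified callback, and
example_fs_temp_calc() are hypothetical and stand in for the real
struct hot_type and struct hot_func_ops.

```c
#include <stddef.h>
#include <stdint.h>

#define RANGE_BITS 20   /* generic default from fs/hot_tracking.h */

/* Hypothetical, simplified callback type (the real one takes freq data). */
typedef uint32_t (*temp_calc_sketch_fn)(uint32_t nr_reads);

struct hot_ops_sketch {
	temp_calc_sketch_fn temp_calc;
};

struct hot_type_sketch {
	uint64_t range_bits;
	struct hot_ops_sketch ops;
};

/* Stand-in for the generic VFS implementation. */
static uint32_t default_temp_calc(uint32_t nr_reads)
{
	return nr_reads;
}

/* Stand-in for a filesystem-specific override. */
static uint32_t example_fs_temp_calc(uint32_t nr_reads)
{
	return nr_reads * 2;
}

/* Fill in every field the filesystem left at zero. */
static void hot_type_fill_defaults(struct hot_type_sketch *t)
{
	if (!t->ops.temp_calc)
		t->ops.temp_calc = default_temp_calc;
	if (t->range_bits == 0)
		t->range_bits = RANGE_BITS;
}
```

In the patch, hot_type is embedded in struct file_system_type, so the
defaults are written into a structure shared by every superblock of
that filesystem type, not into per-superblock state.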
fs/hot_tracking.c | 43 +++++++++++++++++++++++++++++++----------
fs/hot_tracking.h | 1 -
include/linux/fs.h | 1 +
include/linux/hot_tracking.h | 19 ++++++++++++++++++
4 files changed, 52 insertions(+), 12 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 383cc54..07a2a81 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -64,8 +64,11 @@ void hot_range_tree_init(struct hot_inode_item *he)
static void hot_range_item_init(struct hot_range_item *hr, loff_t start,
struct hot_inode_item *he)
{
+ struct hot_info *root = container_of(he->hot_inode_tree,
+ struct hot_info, hot_inode_tree);
+
hr->start = start;
- hr->len = RANGE_SIZE;
+ hr->len = hot_raw_shift(1, root->hot_type->range_bits, true);
hr->hot_inode = he;
kref_init(&hr->hot_range.refs);
spin_lock_init(&hr->hot_range.lock);
@@ -305,19 +308,21 @@ static void hot_rw_freq_calc(struct timespec old_atime,
*avg = *avg >> FREQ_POWER;
}
-static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
+static void hot_freq_data_update(struct hot_info *root,
+ struct hot_freq_data *freq_data, bool write)
{
struct timespec cur_time = current_kernel_time();
if (write) {
freq_data->nr_writes += 1;
- hot_rw_freq_calc(freq_data->last_write_time,
+ root->hot_type->ops.hot_rw_freq_calc_fn(
+ freq_data->last_write_time,
cur_time,
&freq_data->avg_delta_writes);
freq_data->last_write_time = cur_time;
} else {
freq_data->nr_reads += 1;
- hot_rw_freq_calc(freq_data->last_read_time,
+ root->hot_type->ops.hot_rw_freq_calc_fn(
freq_data->last_read_time,
cur_time,
&freq_data->avg_delta_reads);
@@ -421,7 +426,7 @@ static void hot_map_update(struct hot_freq_data *freq_data,
struct hot_comm_item *comm_item;
struct hot_inode_item *he;
struct hot_range_item *hr;
- u32 temp = hot_temp_calc(freq_data);
+ u32 temp = root->hot_type->ops.hot_temp_calc_fn(freq_data);
u8 a_temp = (u8)hot_raw_shift((u64)temp, (32 - HEAT_MAP_BITS), false);
u8 b_temp = (u8)hot_raw_shift((u64)freq_data->last_temp,
(32 - HEAT_MAP_BITS), false);
@@ -494,7 +499,7 @@ static void hot_range_update(struct hot_inode_item *he,
hot_map_update(&hr->hot_range.hot_freq_data, root);
spin_lock(&hr->hot_range.lock);
- obsolete = hot_is_obsolete(
+ obsolete = root->hot_type->ops.hot_is_obsolete_fn(
&hr->hot_range.hot_freq_data);
spin_unlock(&hr->hot_range.lock);
@@ -647,6 +652,7 @@ void hot_update_freqs(struct inode *inode, loff_t start,
struct hot_info *root = inode->i_sb->s_hot_root;
struct hot_inode_item *he;
struct hot_range_item *hr;
+ u64 range_size;
loff_t cur, end;
if (!root || (len == 0))
@@ -659,15 +665,19 @@ void hot_update_freqs(struct inode *inode, loff_t start,
}
spin_lock(&he->hot_inode.lock);
- hot_freq_data_update(&he->hot_inode.hot_freq_data, rw);
+ hot_freq_data_update(root, &he->hot_inode.hot_freq_data, rw);
spin_unlock(&he->hot_inode.lock);
/*
- * Align ranges on RANGE_SIZE boundary
+ * Align ranges on range size boundary
* to prevent proliferation of range structs
*/
- end = (start + len + RANGE_SIZE - 1) >> RANGE_BITS;
- for (cur = (start >> RANGE_BITS); cur < end; cur++) {
+ range_size = hot_raw_shift(1,
+ root->hot_type->range_bits, true);
+ end = hot_raw_shift((start + len + range_size - 1),
+ root->hot_type->range_bits, false);
+ cur = hot_raw_shift(start, root->hot_type->range_bits, false);
+ for (; cur < end; cur++) {
hr = hot_range_item_lookup(he, cur);
if (IS_ERR(hr)) {
WARN(1, "hot_range_item_lookup returns %ld\n",
@@ -677,7 +687,7 @@ void hot_update_freqs(struct inode *inode, loff_t start,
}
spin_lock(&hr->hot_range.lock);
- hot_freq_data_update(&hr->hot_range.hot_freq_data, rw);
+ hot_freq_data_update(root, &hr->hot_range.hot_freq_data, rw);
spin_unlock(&hr->hot_range.lock);
hot_range_item_put(hr);
@@ -705,6 +715,17 @@ int hot_track_init(struct super_block *sb)
hot_inode_tree_init(root);
hot_map_init(root);
+ /* Get hot type for specific FS */
+ root->hot_type = &sb->s_type->hot_type;
+ if (!root->hot_type->ops.hot_rw_freq_calc_fn)
+ root->hot_type->ops.hot_rw_freq_calc_fn = hot_rw_freq_calc;
+ if (!root->hot_type->ops.hot_temp_calc_fn)
+ root->hot_type->ops.hot_temp_calc_fn = hot_temp_calc;
+ if (!root->hot_type->ops.hot_is_obsolete_fn)
+ root->hot_type->ops.hot_is_obsolete_fn = hot_is_obsolete;
+ if (root->hot_type->range_bits == 0)
+ root->hot_type->range_bits = RANGE_BITS;
+
root->update_wq = alloc_workqueue(
"hot_update_wq", WQ_NON_REENTRANT, 0);
if (!root->update_wq) {
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 96379a6..73d2a3e 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -21,7 +21,6 @@
/* size of sub-file ranges */
#define RANGE_BITS 20
-#define RANGE_SIZE (1 << RANGE_BITS)
#define FREQ_POWER 4
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c42dc37..a8cb14e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1821,6 +1821,7 @@ struct file_system_type {
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
void (*kill_sb) (struct super_block *);
+ struct hot_type hot_type;
struct module *owner;
struct file_system_type * next;
struct hlist_head fs_supers;
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 1feead2..003af47 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -79,6 +79,24 @@ struct hot_range_item {
size_t len; /* length in bytes */
};
+typedef void (hot_rw_freq_calc_fn) (struct timespec old_atime,
+ struct timespec cur_time, u64 *avg);
+typedef u32 (hot_temp_calc_fn) (struct hot_freq_data *freq_data);
+typedef bool (hot_is_obsolete_fn) (struct hot_freq_data *freq_data);
+
+struct hot_func_ops {
+ hot_rw_freq_calc_fn *hot_rw_freq_calc_fn;
+ hot_temp_calc_fn *hot_temp_calc_fn;
+ hot_is_obsolete_fn *hot_is_obsolete_fn;
+};
+
+/* identifies a hot type */
+struct hot_type {
+ u64 range_bits;
+ /* fields provided by specific FS */
+ struct hot_func_ops ops;
+};
+
struct hot_info {
struct hot_rb_tree hot_inode_tree;
spinlock_t lock; /*protect inode tree */
@@ -91,6 +109,7 @@ struct hot_info {
struct workqueue_struct *update_wq;
struct delayed_work update_work;
+ struct hot_type *hot_type;
};
extern void __init hot_cache_init(void);
--
1.7.6.5
* [PATCH RESEND v1 11/16] vfs: register one shrinker
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (9 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 10/16] vfs: add FS hot type support zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 12/16] vfs: add one ioctl interface zwu.kernel
` (8 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Register a shrinker to control the amount of
memory that is used in tracking hot regions - if we are throwing
inodes out of memory due to memory pressure, we most definitely are
going to need to reduce the amount of memory the tracking code is
using, even if it means losing useful information (i.e. the shrinker
accelerates the aging process).
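As a rough illustration of the pruning order, here is a minimal userspace
sketch, not the kernel code itself. It assumes each heat-map bucket can be
reduced to a plain item count (the hypothetical `struct bucket`), with
bucket 0 being the coldest; the names `prune_map` and `prune` are made up
for the example. It follows the callback in this patch in two respects: a
scan count of zero only reports the current total, and range items are
dropped before inode items.

```c
#include <stddef.h>

/* Hypothetical stand-in for one heat-map bucket: just a count of items. */
struct bucket {
	unsigned int nr_items;
};

/*
 * Drop up to nr items, starting at bucket 0 (assumed coldest) and moving
 * toward hotter buckets. Returns the part of the budget left unused.
 */
static int prune_map(struct bucket *map, size_t nbuckets, int nr)
{
	size_t i;

	for (i = 0; i < nbuckets && nr > 0; i++) {
		while (map[i].nr_items > 0 && nr > 0) {
			map[i].nr_items--;
			nr--;
		}
	}
	return nr;
}

/*
 * Model of the shrinker callback: nr_to_scan == 0 is a pure query,
 * otherwise prune ranges first and inodes with whatever budget remains.
 * Returns the number of items still tracked.
 */
unsigned int prune(struct bucket *ranges, struct bucket *inodes,
		   size_t nbuckets, unsigned int *total, int nr_to_scan)
{
	int left;

	if (nr_to_scan == 0)
		return *total;

	left = prune_map(ranges, nbuckets, nr_to_scan);
	if (left > 0)
		left = prune_map(inodes, nbuckets, left);

	*total -= (unsigned int)(nr_to_scan - left);
	return *total;
}
```

Starting from the coldest bucket means that, under memory pressure, the
entries given up first are presumably the ones the normal aging process
would have discarded soonest anyway.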
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/hot_tracking.c | 63 ++++++++++++++++++++++++++++++++++++++++++
include/linux/hot_tracking.h | 1 +
2 files changed, 64 insertions(+), 0 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 07a2a81..a59521c 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -643,6 +643,63 @@ err:
}
EXPORT_SYMBOL_GPL(hot_cache_init);
+static int hot_track_prune_map(struct hot_map_head *map_head,
+ bool type, int nr)
+{
+ struct hot_comm_item *node;
+ int i;
+
+ for (i = 0; i < HEAT_MAP_SIZE; i++) {
+ spin_lock(&(map_head + i)->lock);
+ while (!list_empty(&(map_head + i)->node_list)) {
+ if (nr-- <= 0)
+ break;
+
+ node = list_first_entry(&(map_head + i)->node_list,
+ struct hot_comm_item, n_list);
+ if (type) {
+ struct hot_inode_item *hot_inode =
+ container_of(node,
+ struct hot_inode_item, hot_inode);
+ hot_inode_item_put(hot_inode);
+ } else {
+ struct hot_range_item *hot_range =
+ container_of(node,
+ struct hot_range_item, hot_range);
+ hot_range_item_put(hot_range);
+ }
+ }
+ spin_unlock(&(map_head + i)->lock);
+ }
+
+ return nr;
+}
+
+/* The shrinker callback function */
+static int hot_track_prune(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct hot_info *root =
+ container_of(shrink, struct hot_info, hot_shrink);
+ int ret;
+
+ if (sc->nr_to_scan == 0)
+ return root->hot_map_nr;
+
+ if (!(sc->gfp_mask & __GFP_FS))
+ return -1;
+
+ ret = hot_track_prune_map(root->heat_range_map,
+ false, sc->nr_to_scan);
+ if (ret > 0)
+ ret = hot_track_prune_map(root->heat_inode_map,
+ true, ret);
+ if (ret > 0)
+ root->hot_map_nr -= (sc->nr_to_scan - ret);
+
+ return root->hot_map_nr;
+}
+
/*
* Main function to update access frequency from read/writepage(s) hooks
*/
@@ -739,6 +796,11 @@ int hot_track_init(struct super_block *sb)
queue_delayed_work(root->update_wq, &root->update_work,
msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+ /* Register a shrinker callback */
+ root->hot_shrink.shrink = hot_track_prune;
+ root->hot_shrink.seeks = DEFAULT_SEEKS;
+ register_shrinker(&root->hot_shrink);
+
sb->s_hot_root = root;
printk(KERN_INFO "VFS: Turning on hot data tracking\n");
@@ -757,6 +819,7 @@ void hot_track_exit(struct super_block *sb)
{
struct hot_info *root = sb->s_hot_root;
+ unregister_shrinker(&root->hot_shrink);
cancel_delayed_work_sync(&root->update_work);
destroy_workqueue(root->update_wq);
hot_map_exit(root);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 003af47..36dac41 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -110,6 +110,7 @@ struct hot_info {
struct workqueue_struct *update_wq;
struct delayed_work update_work;
struct hot_type *hot_type;
+ struct shrinker hot_shrink;
};
extern void __init hot_cache_init(void);
--
1.7.6.5
* [PATCH RESEND v1 12/16] vfs: add one ioctl interface
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (10 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 11/16] vfs: register one shrinker zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 13/16] vfs: add debugfs support zwu.kernel
` (7 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
FS_IOC_GET_HEAT_INFO: return a struct containing the various
metrics collected in hot_freq_data structs, and also return a
calculated data temperature based on those metrics. Optionally, retrieve
the temperature from the hot data hash list instead of recalculating it.
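For illustration, a userspace caller might look like the sketch below. It
assumes a kernel with this series applied; the struct layout and the ioctl
number are copied from this patch because no installed header provides
them, and `get_heat_info` is a hypothetical helper name. On a kernel
without the series the ioctl is expected to fail, and the helper simply
hands the error back to the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Copied from this patch; not available from mainline headers. */
struct hot_heat_info {
	__u64 avg_delta_reads;
	__u64 avg_delta_writes;
	__u64 last_read_time;
	__u64 last_write_time;
	__u32 num_reads;
	__u32 num_writes;
	__u32 temp;
	__u8 live;
};

#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, struct hot_heat_info)

/*
 * Query heat information for the file at path.
 * live != 0 asks for a freshly calculated temperature, live == 0 for the
 * last stored one. Returns 0 on success or a negative errno value.
 */
int get_heat_info(const char *path, int live, struct hot_heat_info *out)
{
	int fd, ret = 0;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* The handler reads 'live' from our struct before filling it in. */
	memset(out, 0, sizeof(*out));
	out->live = live ? 1 : 0;

	if (ioctl(fd, FS_IOC_GET_HEAT_INFO, out) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

With the handler in this patch, a file that has not been tracked yet would
be reported as -ENODATA rather than as a zeroed struct.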
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/compat_ioctl.c | 5 +++
fs/ioctl.c | 74 ++++++++++++++++++++++++++++++++++++++++++
include/linux/hot_tracking.h | 19 +++++++++++
3 files changed, 98 insertions(+), 0 deletions(-)
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index e2f57a0..a55a885 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -57,6 +57,7 @@
#include <linux/i2c-dev.h>
#include <linux/atalk.h>
#include <linux/gfp.h>
+#include <linux/hot_tracking.h>
#include <net/bluetooth/bluetooth.h>
#include <net/bluetooth/hci.h>
@@ -1403,6 +1404,9 @@ COMPATIBLE_IOCTL(TIOCSTART)
COMPATIBLE_IOCTL(TIOCSTOP)
#endif
+/*Hot data tracking*/
+COMPATIBLE_IOCTL(FS_IOC_GET_HEAT_INFO)
+
/* fat 'r' ioctls. These are handled by fat with ->compat_ioctl,
but we don't want warnings on other file systems. So declare
them as compatible here. */
@@ -1582,6 +1586,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
case FIBMAP:
case FIGETBSZ:
case FIONREAD:
+ case FS_IOC_GET_HEAT_INFO:
if (S_ISREG(f.file->f_path.dentry->d_inode->i_mode))
break;
/*FALL THROUGH*/
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 3bdad6d..79fe81f 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -15,6 +15,7 @@
#include <linux/writeback.h>
#include <linux/buffer_head.h>
#include <linux/falloc.h>
+#include <linux/hot_tracking.h>
#include <asm/ioctls.h>
@@ -537,6 +538,76 @@ static int ioctl_fsthaw(struct file *filp)
}
/*
+ * Retrieve information about access frequency for the given file. Return it in
+ * a userspace-friendly struct for btrfsctl (or another tool) to parse.
+ *
+ * The temperature that is returned can be "live" -- that is, recalculated when
+ * the ioctl is called -- or it can be returned from the hashtable, reflecting
+ * the (possibly old) value that the system will use when considering files
+ * for migration. This behavior is determined by hot_heat_info->live.
+ */
+static int ioctl_heat_info(struct file *file, void __user *argp)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct hot_heat_info heat_info;
+ struct hot_inode_item *he;
+ int ret = 0;
+
+ if (copy_from_user((void *)&heat_info,
+ argp,
+ sizeof(struct hot_heat_info)) != 0) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ he = hot_inode_item_lookup(inode->i_sb->s_hot_root, inode->i_ino);
+ if (!he) {
+ /* we don't have any info on this file yet */
+ ret = -ENODATA;
+ goto err;
+ }
+
+ spin_lock(&he->hot_inode.lock);
+ heat_info.avg_delta_reads =
+ (__u64) he->hot_inode.hot_freq_data.avg_delta_reads;
+ heat_info.avg_delta_writes =
+ (__u64) he->hot_inode.hot_freq_data.avg_delta_writes;
+ heat_info.last_read_time =
+ (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time);
+ heat_info.last_write_time =
+ (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time);
+ heat_info.num_reads =
+ (__u32) he->hot_inode.hot_freq_data.nr_reads;
+ heat_info.num_writes =
+ (__u32) he->hot_inode.hot_freq_data.nr_writes;
+
+ if (heat_info.live > 0) {
+ /*
+ * got a request for live temperature,
+ * call the hot_temp_calc_fn hook to recalculate
+ */
+ heat_info.temp =
+ inode->i_sb->s_hot_root->hot_type->ops.hot_temp_calc_fn(
+ &he->hot_inode.hot_freq_data);
+ } else {
+ /* not live temperature, get it from the hashlist */
+ heat_info.temp = he->hot_inode.hot_freq_data.last_temp;
+ }
+ spin_unlock(&he->hot_inode.lock);
+
+ hot_inode_item_put(he);
+
+ if (copy_to_user(argp, (void *)&heat_info,
+ sizeof(struct hot_heat_info))) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+err:
+ return ret;
+}
+
+/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
*
@@ -591,6 +662,9 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
case FIGETBSZ:
return put_user(inode->i_sb->s_blocksize, argp);
+ case FS_IOC_GET_HEAT_INFO:
+ return ioctl_heat_info(filp, argp);
+
default:
if (S_ISREG(inode->i_mode))
error = file_ioctl(filp, cmd, arg);
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index 36dac41..bd55a34 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -43,6 +43,17 @@ struct hot_freq_data {
u32 last_temp;
};
+struct hot_heat_info {
+ __u64 avg_delta_reads;
+ __u64 avg_delta_writes;
+ __u64 last_read_time;
+ __u64 last_write_time;
+ __u32 num_reads;
+ __u32 num_writes;
+ __u32 temp;
+ __u8 live;
+};
+
/* List heads in hot map array */
struct hot_map_head {
struct list_head node_list;
@@ -113,6 +124,14 @@ struct hot_info {
struct shrinker hot_shrink;
};
+/*
+ * Hot data tracking ioctls:
+ *
+ * HOT_INFO - retrieve info on frequency of access
+ */
+#define FS_IOC_GET_HEAT_INFO _IOR('f', 17, \
+ struct hot_heat_info)
+
extern void __init hot_cache_init(void);
extern int hot_track_init(struct super_block *sb);
extern void hot_track_exit(struct super_block *sb);
--
1.7.6.5
* [PATCH RESEND v1 13/16] vfs: add debugfs support
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (11 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 12/16] vfs: add one ioctl interface zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 14/16] proc: add two hot_track proc files zwu.kernel
` (6 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add a /sys/kernel/debug/hot_track/<device_name>/ directory for each
volume that contains four files. `rt_stats_inode' and `rt_stats_range'
walk the inode and range trees and print the heat information for every
tracked inode and subfile range. `hot_spots_inode' and `hot_spots_range'
print the same fields, but walk the heat map lists from the hottest
bucket down to the coldest.
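The lines these files emit follow the seq_printf() formats of the show
handlers in this patch, so a monitoring tool could parse them with sscanf().
The sketch below is only an illustration under that assumption: the struct
and function names (`inode_stat`, `range_stat`, `parse_inode_stat`,
`parse_range_stat`) are hypothetical, and the formats would need updating
if the kernel side changes.

```c
#include <stdio.h>

/* Hypothetical userspace views of one output line each. */
struct inode_stat {
	unsigned long long ino;
	unsigned int reads, writes, temp;
};

struct range_stat {
	unsigned long long ino, start, len;
	unsigned int reads, writes, temp;
};

/* Parse "inode N, reads R, writes W, temp T". Returns 1 on success. */
int parse_inode_stat(const char *line, struct inode_stat *st)
{
	return sscanf(line, "inode %llu, reads %u, writes %u, temp %u",
		      &st->ino, &st->reads, &st->writes, &st->temp) == 4;
}

/* Parse "inode N, range S+L, reads R, writes W, temp T". Returns 1 on success. */
int parse_range_stat(const char *line, struct range_stat *st)
{
	return sscanf(line,
		      "inode %llu, range %llu+%llu, reads %u, writes %u, temp %u",
		      &st->ino, &st->start, &st->len,
		      &st->reads, &st->writes, &st->temp) == 6;
}
```

Both parsers reject a line of the other kind, so a tool can try one and
fall back to the other when reading a file whose type it does not know.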
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/hot_tracking.c | 513 +++++++++++++++++++++++++++++++++++++++++-
fs/hot_tracking.h | 5 +
include/linux/hot_tracking.h | 1 +
3 files changed, 517 insertions(+), 2 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index a59521c..94fe029 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -21,9 +21,12 @@
#include <linux/blkdev.h>
#include <linux/types.h>
#include <linux/list_sort.h>
+#include <linux/debugfs.h>
#include <linux/limits.h>
#include "hot_tracking.h"
+static struct dentry *hot_debugfs_root;
+
/* kmem_cache pointers for slab caches */
static struct kmem_cache *hot_inode_item_cachep __read_mostly;
static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -218,8 +221,8 @@ struct hot_inode_item
else if (ino > entry->i_ino)
p = &(*p)->rb_right;
else {
- spin_unlock(&root->lock);
kref_get(&entry->hot_inode.refs);
+ spin_unlock(&root->lock);
return entry;
}
}
@@ -269,8 +272,8 @@ static struct hot_range_item
else if (start > hot_range_end(entry))
p = &(*p)->rb_right;
else {
- spin_unlock(&he->lock);
kref_get(&entry->hot_range.refs);
+ spin_unlock(&he->lock);
return entry;
}
}
@@ -617,6 +620,499 @@ static void hot_update_worker(struct work_struct *work)
msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
}
+static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct hot_info *root = seq->private;
+ struct rb_node *node, *node2;
+ struct hot_comm_item *ci;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr;
+ loff_t l = *pos;
+
+ spin_lock(&root->lock);
+ node = rb_first(&root->hot_inode_tree.map);
+ while (node) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ he = container_of(ci, struct hot_inode_item, hot_inode);
+ spin_lock(&he->lock);
+ node2 = rb_first(&he->hot_range_tree.map);
+ while (node2) {
+ if (!l--) {
+ ci = rb_entry(node2,
+ struct hot_comm_item, rb_node);
+ hr = container_of(ci,
+ struct hot_range_item, hot_range);
+ kref_get(&hr->hot_range.refs);
+ spin_unlock(&he->lock);
+ spin_unlock(&root->lock);
+ return hr;
+ }
+ node2 = rb_next(node2);
+ }
+ node = rb_next(node);
+ spin_unlock(&he->lock);
+ }
+ spin_unlock(&root->lock);
+ return NULL;
+}
+
+static void *hot_range_seq_next(struct seq_file *seq,
+ void *v, loff_t *pos)
+{
+ struct rb_node *node, *node2;
+ struct hot_comm_item *ci;
+ struct hot_inode_item *he;
+ struct hot_range_item *hr_next = NULL, *hr = v;
+
+ spin_lock(&hr->hot_range.lock);
+ (*pos)++;
+ node2 = rb_next(&hr->hot_range.rb_node);
+ if (node2)
+ goto next;
+
+ node = rb_next(&hr->hot_inode->hot_inode.rb_node);
+ if (node) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ he = container_of(ci, struct hot_inode_item, hot_inode);
+ node2 = rb_first(&he->hot_range_tree.map);
+ if (node2) {
+next:
+ ci = rb_entry(node2,
+ struct hot_comm_item, rb_node);
+ hr_next = container_of(ci,
+ struct hot_range_item, hot_range);
+ kref_get(&hr_next->hot_range.refs);
+ }
+ }
+ spin_unlock(&hr->hot_range.lock);
+
+ hot_range_item_put(hr);
+ return hr_next;
+}
+
+static void hot_range_seq_stop(struct seq_file *seq, void *v)
+{
+ struct hot_range_item *hr = v;
+
+ if (hr)
+ hot_range_item_put(hr);
+}
+
+static int hot_range_seq_show(struct seq_file *seq, void *v)
+{
+ struct hot_range_item *hr = v;
+ struct hot_inode_item *he = hr->hot_inode;
+ struct hot_freq_data *freq_data = &hr->hot_range.hot_freq_data;
+ struct hot_info *root = container_of(he->hot_inode_tree,
+ struct hot_info, hot_inode_tree);
+ loff_t start = hr->start * hot_raw_shift(1,
+ root->hot_type->range_bits, true);
+
+ /* Always lock hot_inode_item first */
+ spin_lock(&he->hot_inode.lock);
+ spin_lock(&hr->hot_range.lock);
+ seq_printf(seq, "inode %llu, range " \
+ "%llu+%llu, reads %u, writes %u, temp %u\n",
+ he->i_ino, (unsigned long long)start,
+ (unsigned long long)hr->len,
+ freq_data->nr_reads,
+ freq_data->nr_writes,
+ (u8)hot_raw_shift((u64)freq_data->last_temp,
+ (32 - HEAT_MAP_BITS), false));
+ spin_unlock(&hr->hot_range.lock);
+ spin_unlock(&he->hot_inode.lock);
+
+ return 0;
+}
+
+static void *hot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct hot_info *root = seq->private;
+ struct rb_node *node;
+ struct hot_comm_item *ci;
+ struct hot_inode_item *he = NULL;
+ loff_t l = *pos;
+
+ spin_lock(&root->lock);
+ node = rb_first(&root->hot_inode_tree.map);
+ while (node) {
+ if (!l--) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ he = container_of(ci,
+ struct hot_inode_item, hot_inode);
+ kref_get(&he->hot_inode.refs);
+ break;
+ }
+ node = rb_next(node);
+ }
+ spin_unlock(&root->lock);
+
+ return he;
+}
+
+static void *hot_inode_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct hot_inode_item *he_next = NULL, *he = v;
+ struct rb_node *node;
+ struct hot_comm_item *ci;
+
+ spin_lock(&he->hot_inode.lock);
+ (*pos)++;
+ node = rb_next(&he->hot_inode.rb_node);
+ if (node) {
+ ci = rb_entry(node, struct hot_comm_item, rb_node);
+ he_next = container_of(ci,
+ struct hot_inode_item, hot_inode);
+ kref_get(&he_next->hot_inode.refs);
+ }
+ spin_unlock(&he->hot_inode.lock);
+
+ hot_inode_item_put(he);
+
+ return he_next;
+}
+
+static void hot_inode_seq_stop(struct seq_file *seq, void *v)
+{
+ struct hot_inode_item *he = v;
+
+ if (he)
+ hot_inode_item_put(he);
+}
+
+static int hot_inode_seq_show(struct seq_file *seq, void *v)
+{
+ struct hot_inode_item *he = v;
+ struct hot_freq_data *freq_data = &he->hot_inode.hot_freq_data;
+
+ spin_lock(&he->hot_inode.lock);
+ seq_printf(seq, "inode %llu, reads %u, writes %u, temp %d\n",
+ he->i_ino,
+ freq_data->nr_reads,
+ freq_data->nr_writes,
+ (u8)hot_raw_shift((u64)freq_data->last_temp,
+ (32 - HEAT_MAP_BITS), false));
+ spin_unlock(&he->hot_inode.lock);
+
+ return 0;
+}
+
+static void *hot_spot_range_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct hot_info *root = seq->private;
+ struct hot_range_item *hr;
+ struct hot_comm_item *comm_item;
+ struct list_head *n_list;
+ int i;
+
+ for (i = HEAT_MAP_SIZE - 1; i >= 0; i--) {
+ spin_lock(&root->heat_range_map[i].lock);
+ n_list = seq_list_start(
+ &root->heat_range_map[i].node_list, *pos);
+ if (n_list) {
+ comm_item = container_of(n_list,
+ struct hot_comm_item, n_list);
+ hr = container_of(comm_item,
+ struct hot_range_item, hot_range);
+ kref_get(&hr->hot_range.refs);
+ spin_unlock(&root->heat_range_map[i].lock);
+ return hr;
+ }
+ spin_unlock(&root->heat_range_map[i].lock);
+ }
+
+ return NULL;
+}
+
+static void *hot_spot_range_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct hot_info *root = seq->private;
+ struct hot_range_item *hr_next, *hr = v;
+ struct hot_comm_item *comm_item;
+ struct list_head *n_list;
+ int i, j;
+ spin_lock(&hr->hot_range.lock);
+ i = (int)hot_raw_shift(hr->hot_range.hot_freq_data.last_temp,
+ (32 - HEAT_MAP_BITS), false);
+ spin_unlock(&hr->hot_range.lock);
+
+ spin_lock(&root->heat_range_map[i].lock);
+ n_list = seq_list_next(&hr->hot_range.n_list,
+ &root->heat_range_map[i].node_list, pos);
+ hot_range_item_put(hr);
+next:
+ j = i;
+ if (n_list) {
+ comm_item = container_of(n_list,
+ struct hot_comm_item, n_list);
+ hr_next = container_of(comm_item,
+ struct hot_range_item, hot_range);
+ kref_get(&hr_next->hot_range.refs);
+ spin_unlock(&root->heat_range_map[i].lock);
+ return hr_next;
+ } else if (--i >= 0) {
+ spin_unlock(&root->heat_range_map[j].lock);
+ spin_lock(&root->heat_range_map[i].lock);
+ n_list = seq_list_next(&root->heat_range_map[i].node_list,
+ &root->heat_range_map[i].node_list, pos);
+ goto next;
+ }
+
+ spin_unlock(&root->heat_range_map[j].lock);
+ return NULL;
+}
+
+static void *hot_spot_inode_seq_start(struct seq_file *seq, loff_t *pos)
+{
+ struct hot_info *root = seq->private;
+ struct hot_inode_item *he;
+ struct hot_comm_item *comm_item;
+ struct list_head *n_list;
+ int i;
+
+ for (i = HEAT_MAP_SIZE - 1; i >= 0; i--) {
+ spin_lock(&root->heat_inode_map[i].lock);
+ n_list = seq_list_start(
+ &root->heat_inode_map[i].node_list, *pos);
+ if (n_list) {
+ comm_item = container_of(n_list,
+ struct hot_comm_item, n_list);
+ he = container_of(comm_item,
+ struct hot_inode_item, hot_inode);
+ kref_get(&he->hot_inode.refs);
+ spin_unlock(&root->heat_inode_map[i].lock);
+ return he;
+ }
+ spin_unlock(&root->heat_inode_map[i].lock);
+ }
+
+ return NULL;
+}
+
+static void *hot_spot_inode_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+ struct hot_info *root = seq->private;
+ struct hot_inode_item *he_next, *he = v;
+ struct hot_comm_item *comm_item;
+ struct list_head *n_list;
+ int i, j;
+ spin_lock(&he->hot_inode.lock);
+ i = (int)hot_raw_shift(he->hot_inode.hot_freq_data.last_temp,
+ (32 - HEAT_MAP_BITS), false);
+ spin_unlock(&he->hot_inode.lock);
+
+ spin_lock(&root->heat_inode_map[i].lock);
+ n_list = seq_list_next(&he->hot_inode.n_list,
+ &root->heat_inode_map[i].node_list, pos);
+ hot_inode_item_put(he);
+next:
+ j = i;
+ if (n_list) {
+ comm_item = container_of(n_list,
+ struct hot_comm_item, n_list);
+ he_next = container_of(comm_item,
+ struct hot_inode_item, hot_inode);
+ kref_get(&he_next->hot_inode.refs);
+ spin_unlock(&root->heat_inode_map[i].lock);
+ return he_next;
+ } else if (--i >= 0) {
+ spin_unlock(&root->heat_inode_map[j].lock);
+ spin_lock(&root->heat_inode_map[i].lock);
+ n_list = seq_list_next(&root->heat_inode_map[i].node_list,
+ &root->heat_inode_map[i].node_list, pos);
+ goto next;
+ }
+
+ spin_unlock(&root->heat_inode_map[j].lock);
+ return NULL;
+}
+
+static const struct seq_operations hot_range_seq_ops = {
+ .start = hot_range_seq_start,
+ .next = hot_range_seq_next,
+ .stop = hot_range_seq_stop,
+ .show = hot_range_seq_show
+};
+
+static const struct seq_operations hot_inode_seq_ops = {
+ .start = hot_inode_seq_start,
+ .next = hot_inode_seq_next,
+ .stop = hot_inode_seq_stop,
+ .show = hot_inode_seq_show
+};
+
+static const struct seq_operations hot_spot_range_seq_ops = {
+ .start = hot_spot_range_seq_start,
+ .next = hot_spot_range_seq_next,
+ .stop = hot_range_seq_stop,
+ .show = hot_range_seq_show
+};
+
+static const struct seq_operations hot_spot_inode_seq_ops = {
+ .start = hot_spot_inode_seq_start,
+ .next = hot_spot_inode_seq_next,
+ .stop = hot_inode_seq_stop,
+ .show = hot_inode_seq_show
+};
+
+static int hot_range_seq_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open_private(file, &hot_range_seq_ops, 0);
+ if (ret == 0) {
+ struct seq_file *seq = file->private_data;
+ seq->private = inode->i_private;
+ }
+ return ret;
+}
+
+static int hot_inode_seq_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open_private(file, &hot_inode_seq_ops, 0);
+ if (ret == 0) {
+ struct seq_file *seq = file->private_data;
+ seq->private = inode->i_private;
+ }
+ return ret;
+}
+
+static int hot_spot_range_seq_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open_private(file, &hot_spot_range_seq_ops, 0);
+ if (ret == 0) {
+ struct seq_file *seq = file->private_data;
+ seq->private = inode->i_private;
+ }
+ return ret;
+}
+
+static int hot_spot_inode_seq_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open_private(file, &hot_spot_inode_seq_ops, 0);
+ if (ret == 0) {
+ struct seq_file *seq = file->private_data;
+ seq->private = inode->i_private;
+ }
+ return ret;
+}
+
+/* fops to override for printing range data */
+static const struct file_operations hot_debugfs_range_fops = {
+ .open = hot_range_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* fops to override for printing inode data */
+static const struct file_operations hot_debugfs_inode_fops = {
+ .open = hot_inode_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* fops to override for printing temperature data */
+static const struct file_operations hot_debugfs_spot_range_fops = {
+ .open = hot_spot_range_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static const struct file_operations hot_debugfs_spot_inode_fops = {
+ .open = hot_spot_inode_seq_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+static const struct hot_debugfs hot_debugfs[] = {
+ {
+ .name = "rt_stats_range",
+ .fops = &hot_debugfs_range_fops,
+ },
+ {
+ .name = "rt_stats_inode",
+ .fops = &hot_debugfs_inode_fops,
+ },
+ {
+ .name = "hot_spots_range",
+ .fops = &hot_debugfs_spot_range_fops,
+ },
+ {
+ .name = "hot_spots_inode",
+ .fops = &hot_debugfs_spot_inode_fops,
+ },
+};
+
+/* initialize debugfs */
+static int hot_debugfs_init(struct super_block *sb)
+{
+ static const char hot_name[] = "hot_track";
+ struct dentry *dentry;
+ int i, ret = 0;
+
+ /* Create the hot debugfs root if it does not exist yet */
+ if (!hot_debugfs_root) {
+ hot_debugfs_root = debugfs_create_dir(hot_name, NULL);
+ if (IS_ERR(hot_debugfs_root)) {
+ ret = PTR_ERR(hot_debugfs_root);
+ return ret;
+ }
+ }
+
+ if (!S_ISDIR(hot_debugfs_root->d_inode->i_mode))
+ return -ENOTDIR;
+
+ /* create debugfs folder for this volume by mounted dev name */
+ sb->s_hot_root->vol_dentry =
+ debugfs_create_dir(sb->s_id, hot_debugfs_root);
+ if (IS_ERR(sb->s_hot_root->vol_dentry)) {
+ ret = PTR_ERR(sb->s_hot_root->vol_dentry);
+ goto err;
+ }
+
+ /* create debugfs hot data files */
+ for (i = 0; i < ARRAY_SIZE(hot_debugfs); i++) {
+ dentry = debugfs_create_file(hot_debugfs[i].name,
+ S_IFREG | S_IRUSR | S_IWUSR,
+ sb->s_hot_root->vol_dentry,
+ sb->s_hot_root,
+ hot_debugfs[i].fops);
+ if (IS_ERR(dentry)) {
+ ret = PTR_ERR(dentry);
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ debugfs_remove_recursive(sb->s_hot_root->vol_dentry);
+
+ if (list_empty(&hot_debugfs_root->d_subdirs)) {
+ debugfs_remove(hot_debugfs_root);
+ hot_debugfs_root = NULL;
+ }
+
+ return ret;
+}
+
+/* remove dentries for debugfs */
+static void hot_debugfs_exit(struct super_block *sb)
+{
+ /* remove all debugfs entries recursively from the volume root */
+ if (sb->s_hot_root->vol_dentry)
+ debugfs_remove_recursive(sb->s_hot_root->vol_dentry);
+ else
+ BUG();
+
+ if (list_empty(&hot_debugfs_root->d_subdirs)) {
+ debugfs_remove(hot_debugfs_root);
+ hot_debugfs_root = NULL;
+ }
+}
+
/*
* Initialize kmem cache for hot_inode_item and hot_range_item.
*/
@@ -803,10 +1299,22 @@ int hot_track_init(struct super_block *sb)
sb->s_hot_root = root;
+ ret = hot_debugfs_init(sb);
+ if (ret) {
+ printk(KERN_ERR "%s: hot_debugfs_init error: %d\n",
+ __func__, ret);
+ goto failed_debugfs;
+ }
+
printk(KERN_INFO "VFS: Turning on hot data tracking\n");
return 0;
+failed_debugfs:
+ unregister_shrinker(&root->hot_shrink);
+ cancel_delayed_work_sync(&root->update_work);
+ destroy_workqueue(root->update_wq);
+ sb->s_hot_root = NULL;
failed_wq:
hot_map_exit(root);
hot_inode_tree_exit(root);
@@ -819,6 +1327,7 @@ void hot_track_exit(struct super_block *sb)
{
struct hot_info *root = sb->s_hot_root;
+ hot_debugfs_exit(sb);
unregister_shrinker(&root->hot_shrink);
cancel_delayed_work_sync(&root->update_work);
destroy_workqueue(root->update_wq);
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index 73d2a3e..a969940 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -53,4 +53,9 @@
#define AVW_DIVIDER_POWER 40 /* AVW - average delta between recent writes(ns) */
#define AVW_COEFF_POWER 0
+struct hot_debugfs {
+ const char *name;
+ const struct file_operations *fops;
+};
+
#endif /* __HOT_TRACKING__ */
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index bd55a34..a735f58 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -122,6 +122,7 @@ struct hot_info {
struct delayed_work update_work;
struct hot_type *hot_type;
struct shrinker hot_shrink;
+ struct dentry *vol_dentry;
};
/*
--
1.7.6.5
* [PATCH RESEND v1 14/16] proc: add two hot_track proc files
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (12 preceding siblings ...)
2012-12-20 14:43 ` [PATCH RESEND v1 13/16] vfs: add debugfs support zwu.kernel
@ 2012-12-20 14:43 ` zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 15/16] btrfs: add hot tracking support zwu.kernel
` (5 subsequent siblings)
19 siblings, 0 replies; 35+ messages in thread
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add two proc files, hot-kick-time and hot-update-delay,
under /proc/sys/fs/ in order to make
TIME_TO_KICK and HEAT_UPDATE_DELAY tunable.
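The effect of hot-kick-time can be sketched in plain C as follows. This is
a userspace model of the check in hot_is_obsolete(), assuming the
timestamps are passed in as nanosecond counts; `is_obsolete` and its
parameter names are chosen for the example. The cast to a 64-bit type
before multiplying is an addition of the sketch, so that converting seconds
to nanoseconds cannot overflow a 32-bit int.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * An item counts as obsolete only when BOTH its last read and its last
 * write are older than kick_time_sec seconds.
 */
bool is_obsolete(uint64_t now_ns, uint64_t last_read_ns,
		 uint64_t last_write_ns, int kick_time_sec)
{
	/* Widen before multiplying: 300 s is already 3 * 10^11 ns. */
	uint64_t kick_ns = (uint64_t)kick_time_sec * NSEC_PER_SEC;

	return (now_ns - last_read_ns) > kick_ns &&
	       (now_ns - last_write_ns) > kick_ns;
}
```

With the default of 300 seconds, an item last read 400 seconds ago and
last written 350 seconds ago would be dropped, while one written 200
seconds ago would be kept regardless of its read age.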
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/hot_tracking.c | 12 +++++++++---
fs/hot_tracking.h | 9 ---------
include/linux/hot_tracking.h | 7 +++++++
kernel/sysctl.c | 14 ++++++++++++++
4 files changed, 30 insertions(+), 12 deletions(-)
diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 94fe029..74d01da 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -27,6 +27,12 @@
static struct dentry *hot_debugfs_root;
+int sysctl_hot_kick_time __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_kick_time);
+
+int sysctl_hot_update_delay __read_mostly = 300;
+EXPORT_SYMBOL_GPL(sysctl_hot_update_delay);
+
/* kmem_cache pointers for slab caches */
static struct kmem_cache *hot_inode_item_cachep __read_mostly;
static struct kmem_cache *hot_range_item_cachep __read_mostly;
@@ -409,7 +415,7 @@ static bool hot_is_obsolete(struct hot_freq_data *freq_data)
(cur_time - timespec_to_ns(&freq_data->last_read_time));
u64 last_write_ns =
(cur_time - timespec_to_ns(&freq_data->last_write_time));
- u64 kick_ns = TIME_TO_KICK * NSEC_PER_SEC;
+ u64 kick_ns = sysctl_hot_kick_time * NSEC_PER_SEC;
if ((last_read_ns > kick_ns) && (last_write_ns > kick_ns))
ret = 1;
@@ -617,7 +623,7 @@ static void hot_update_worker(struct work_struct *work)
/* Instert next delayed work */
queue_delayed_work(root->update_wq, &root->update_work,
- msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+ msecs_to_jiffies(sysctl_hot_update_delay * MSEC_PER_SEC));
}
static void *hot_range_seq_start(struct seq_file *seq, loff_t *pos)
@@ -1290,7 +1296,7 @@ int hot_track_init(struct super_block *sb)
/* Initialize hot tracking wq and arm one delayed work */
INIT_DELAYED_WORK(&root->update_work, hot_update_worker);
queue_delayed_work(root->update_wq, &root->update_work,
- msecs_to_jiffies(HEAT_UPDATE_DELAY * MSEC_PER_SEC));
+ msecs_to_jiffies(sysctl_hot_update_delay * MSEC_PER_SEC));
/* Register a shrinker callback */
root->hot_shrink.shrink = hot_track_prune;
diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
index a969940..ab6d603 100644
--- a/fs/hot_tracking.h
+++ b/fs/hot_tracking.h
@@ -23,15 +23,6 @@
#define RANGE_BITS 20
#define FREQ_POWER 4
-/*
- * time to quit keeping track of
- * tracking data (seconds)
- */
-#define TIME_TO_KICK 300
-
-/* set how often to update temperatures (seconds) */
-#define HEAT_UPDATE_DELAY 300
-
/* NRR/NRW heat unit = 2^X accesses */
#define NRR_MULTIPLIER_POWER 20 /* NRR - number of reads since mount */
#define NRR_COEFF_POWER 0
diff --git a/include/linux/hot_tracking.h b/include/linux/hot_tracking.h
index a735f58..6130687 100644
--- a/include/linux/hot_tracking.h
+++ b/include/linux/hot_tracking.h
@@ -126,6 +126,13 @@ struct hot_info {
};
/*
+ * Tunables exported through /proc/sys/fs/, both in seconds:
+ * 1. sysctl_hot_kick_time: age after which unaccessed tracking data is dropped
+ * 2. sysctl_hot_update_delay: how often temperatures are updated
+ */
+extern int sysctl_hot_kick_time, sysctl_hot_update_delay;
+
+/*
* Hot data tracking ioctls:
*
* HOT_INFO - retrieve info on frequency of access
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c88878d..b7e233e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1586,6 +1586,20 @@ static struct ctl_table fs_table[] = {
.proc_handler = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
+ {
+ .procname = "hot-kick-time",
+ .data = &sysctl_hot_kick_time,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "hot-update-delay",
+ .data = &sysctl_hot_update_delay,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
{ }
};
--
1.7.6.5
* [PATCH RESEND v1 15/16] btrfs: add hot tracking support
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Introduce a new mount option, '-o hot_track',
and add parsing support for it.
Usage examples:
mount -o hot_track
mount -o nouser,hot_track
mount -o nouser,hot_track,loop
mount -o hot_track,nouser
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
fs/btrfs/ctree.h | 1 +
fs/btrfs/super.c | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 547b7b0..e9781e8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1829,6 +1829,7 @@ struct btrfs_ioctl_defrag_range_args {
#define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20)
#define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21)
#define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22)
+#define BTRFS_MOUNT_HOT_TRACK (1 << 23)
#define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 99545df..7dcf79e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -41,6 +41,7 @@
#include <linux/slab.h>
#include <linux/cleancache.h>
#include <linux/ratelimit.h>
+#include <linux/hot_tracking.h>
#include "compat.h"
#include "delayed-inode.h"
#include "ctree.h"
@@ -309,6 +310,10 @@ static void btrfs_put_super(struct super_block *sb)
* last process that kept it busy. Or segfault in the aforementioned
* process... Whom would you report that to?
*/
+
+ /* Hot data tracking */
+ if (btrfs_test_opt(btrfs_sb(sb)->tree_root, HOT_TRACK))
+ hot_track_exit(sb);
}
enum {
@@ -321,7 +326,7 @@ enum {
Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache,
Opt_no_space_cache, Opt_recovery, Opt_skip_balance,
Opt_check_integrity, Opt_check_integrity_including_extent_data,
- Opt_check_integrity_print_mask, Opt_fatal_errors,
+ Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_hot_track,
Opt_err,
};
@@ -362,6 +367,7 @@ static match_table_t tokens = {
{Opt_check_integrity_including_extent_data, "check_int_data"},
{Opt_check_integrity_print_mask, "check_int_print_mask=%d"},
{Opt_fatal_errors, "fatal_errors=%s"},
+ {Opt_hot_track, "hot_track"},
{Opt_err, NULL},
};
@@ -624,6 +630,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
goto out;
}
break;
+ case Opt_hot_track:
+ btrfs_set_opt(info->mount_opt, HOT_TRACK);
+ break;
case Opt_err:
printk(KERN_INFO "btrfs: unrecognized mount option "
"'%s'\n", p);
@@ -851,11 +860,20 @@ static int btrfs_fill_super(struct super_block *sb,
goto fail_close;
}
+ if (btrfs_test_opt(fs_info->tree_root, HOT_TRACK)) {
+ err = hot_track_init(sb);
+ if (err)
+ goto fail_hot;
+ }
+
save_mount_options(sb, data);
cleancache_init_fs(sb);
sb->s_flags |= MS_ACTIVE;
return 0;
+fail_hot:
+ dput(sb->s_root);
+ sb->s_root = NULL;
fail_close:
close_ctree(fs_info->tree_root);
return err;
@@ -951,6 +969,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
seq_puts(seq, ",skip_balance");
if (btrfs_test_opt(root, PANIC_ON_FATAL_ERROR))
seq_puts(seq, ",fatal_errors=panic");
+ if (btrfs_test_opt(root, HOT_TRACK))
+ seq_puts(seq, ",hot_track");
return 0;
}
--
1.7.6.5
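
Note on the option handling above: the patch records hot_track as one more bit in the mount option word. The standalone sketch below shows that bit being set, tested and cleared. The two flag values and the btrfs_set_opt/btrfs_clear_opt macros are copied from the ctree.h context in the diff; sketch_test_opt() and sketch_apply_hot_track() are hypothetical stand-ins (the real btrfs_test_opt() takes a root, not the option word).

```c
#include <assert.h>

/* Flag values as they appear in fs/btrfs/ctree.h in this patch. */
#define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR	(1 << 22)
#define BTRFS_MOUNT_HOT_TRACK			(1 << 23)

/* Macros copied from the ctree.h context lines of the diff. */
#define btrfs_clear_opt(o, opt)	((o) &= ~BTRFS_MOUNT_##opt)
#define btrfs_set_opt(o, opt)	((o) |= BTRFS_MOUNT_##opt)

/* Hypothetical stand-in: tests a bit in a plain option word. */
#define sketch_test_opt(o, opt)	((o) & BTRFS_MOUNT_##opt)

/* Hypothetical helper: return mount_opt with HOT_TRACK set or cleared. */
static unsigned long sketch_apply_hot_track(unsigned long mount_opt, int enable)
{
	if (enable) {
		btrfs_set_opt(mount_opt, HOT_TRACK);
		assert(sketch_test_opt(mount_opt, HOT_TRACK));
	} else {
		btrfs_clear_opt(mount_opt, HOT_TRACK);
		assert(!sketch_test_opt(mount_opt, HOT_TRACK));
	}
	return mount_opt;
}
```

Other option bits are left untouched, which is why hot_track can appear anywhere in a comma-separated option list.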
* [PATCH RESEND v1 16/16] vfs: add documentation
From: zwu.kernel @ 2012-12-20 14:43 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Add a document describing the VFS hot tracking feature.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/hot_tracking.txt | 255 ++++++++++++++++++++++++++++
2 files changed, 257 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/hot_tracking.txt
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 7b52ba7..b973fd8 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -120,3 +120,5 @@ xfs.txt
- info and mount options for the XFS filesystem.
xip.txt
- info on execute-in-place for file mappings.
+hot_tracking.txt
+ - info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..cd6a592
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,255 @@
+Hot Data Tracking
+
+September, 2012 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+ The feature adds experimental support for tracking data temperature
+information in the VFS layer. Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects what data is "hot," and using that
+temperature to move data to SSDs.
+
+ The long-term goal of the feature is to allow some FSs,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+ Of course, users are warned not to run this code outside of development
+environments. These patches are EXPERIMENTAL, and as such they might eat
+your data and/or memory. That said, the code should be relatively safe
+when the hot_track mount option is disabled.
+
+
+2. Motivation
+
+ The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+<https://btrfs.wiki.kernel.org/index.php/Project_ideas>.
+The work is divided into two steps: the VFS provides the hot data tracking
+function, while each specific FS provides the hot data relocation function.
+So, as the first step toward this goal, it is hoped that the patchset
+for hot data tracking will eventually mature and be merged into the VFS.
+
+ This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+
+3. The Design
+
+The design includes the following parts:
+
+ * Hooks in existing vfs functions to track data access frequency
+
+ * New radix-trees for tracking access frequency of inodes and sub-file
+ranges
+ The relationship between super_block and radix-tree is as below:
+hot_info.hot_inode_tree
+ Each FS instance can find its hot tracking info through s_hotinfo.
+This hot_info stores the hot tracking state, such as hot_inode_tree and
+the inode and range lists.
+
+ * A list for indexing data by its temperature
+
+ * A debugfs interface for dumping data from the radix-trees
+
+ * A background kthread for updating inode heat info
+
+ * Mount options for enabling temperature tracking (-o hot_track;
+tracking is disabled by default)
+ * An ioctl to retrieve the frequency information collected for a certain
+file
+ * Ioctls to enable/disable frequency tracking per inode.
+
+Their relationship is as follows:
+
+ * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+ * hot_inode_item contains access frequency data for that inode
+
+ * hot_inode_item holds a heat list node to index the access
+frequency data for that inode
+
+ * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+ * hot_range_item contains access frequency data for that range
+
+ * hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+ * hot_info.heat_inode_map indexes per-inode heat list nodes
+
+ * hot_info.heat_range_map indexes per-range heat list nodes
+
+ How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+heat_inode_map hot_inode_tree
+ | |
+ | V
+ | +-------hot_comm_item--------+
+ | | frequency data |
++---+ | list_head |
+| V ^ | V
+| ...<--hot_comm_item-->... | | ...<--hot_comm_item-->...
+| frequency data | | frequency data
++-------->list_head----------+ +--------->list_head--->.....
+ hot_range_tree hot_range_tree
+ |
+ heat_range_map V
+ | +-------hot_comm_item--------+
+ | | frequency data |
+ +---+ | list_head |
+ | V ^ | V
+ | ...<--hot_comm_item-->... | | ...<--hot_comm_item-->...
+ | frequency data | | frequency data
+ +-------->list_head----------+ +--------->list_head--->.....
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_rw_freq_calc()
+
+ This function does the actual work of updating the frequency numbers,
+whatever they turn out to be. FREQ_POWER determines how many atime
+deltas we keep track of (as a power of 2). So, setting it to anything above
+16ish is probably overkill. Also, the higher the power, the more bits get
+right shifted out of the timestamp, reducing precision, so take note of that
+as well.
+
+ The caller should have already locked freq_data's parent's spinlock.
+
+ FREQ_POWER, defined immediately below, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (ie 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) Some Macro Explanation
+
+ The following comments explain what exactly comprises a unit of heat.
+Each of six values of heat are calculated and combined in order to form an
+overall temperature for the data:
+
+ * NRR - number of reads since mount
+ * NRW - number of writes since mount
+ * LTR - time elapsed since last read (ns)
+ * LTW - time elapsed since last write (ns)
+ * AVR - average delta between recent reads (ns)
+ * AVW - average delta between recent writes (ns)
+
+ These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values defined below to bring the numbers into a reasonable range. You can
+modify these values to fit your needs. However, each heat unit is a u32 and
+thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+carefully or else they could max out or be stuck at zero quite easily.
+(E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+delta would bring the temperature above zero, ever.)
+
+ Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+ * AVR/AVW cold unit = 2^X ns of average delta
+ * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+ E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+3.) hot_temp_calc()
+
+ This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+ To accomplish this, the raw values from the hot_freq_data structure
+are shifted various ways in order to make the temperature calculation more
+or less sensitive to each value.
+
+ Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+ Finally, we use the HEAT_HASH_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+ This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+ https://github.com/wuzhy/kernel.git hot_tracking
+ git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] vfs: turning on hot data tracking
+
+2.) Mount debugfs first:
+
+$ mount -t debugfs none /sys/kernel/debug
+$ ls -l /sys/kernel/debug/hot_track/
+total 0
+drwxr-xr-x 2 root root 0 Aug 8 04:40 sdb
+$ ls -l /sys/kernel/debug/hot_track/sdb
+total 0
+-rw-r--r-- 1 root root 0 Aug 8 04:40 rt_stats_inode
+-rw-r--r-- 1 root root 0 Aug 8 04:40 rt_stats_range
+
+3.) View information about hot tracking from debugfs:
+
+$ echo "hot tracking test" > /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/rt_stats_inode
+inode #279, reads 0, writes 1, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/rt_stats_range
+inode #279, range 0+1048576, reads 0, writes 1, temp 64
+
+$ echo "hot data tracking test" >> /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/rt_stats_inode
+inode #279, reads 0, writes 2, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/rt_stats_range
+inode #279, range 0+1048576, reads 0, writes 2, temp 64
+
+4.) Check the temperature-sorted list of some inodes:
+
+$ cat /sys/kernel/debug/hot_track/loop0/hot_spots_inode
+inode #5248773, reads 0, writes 244, temp 111
+inode #878523, reads 0, writes 1, temp 109
+inode #878524, reads 0, writes 1, temp 109
+
+5.) Tune some hot tracking parameters as below:
+
+$ cat /proc/sys/fs/hot-kick-time
+300
+$ echo 360 > /proc/sys/fs/hot-kick-time
+$ cat /proc/sys/fs/hot-kick-time
+360
+$ cat /proc/sys/fs/hot-update-delay
+300
+$ echo 360 > /proc/sys/fs/hot-update-delay
+$ cat /proc/sys/fs/hot-update-delay
+360
--
1.7.6.5
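
Note on section 4 of hot_tracking.txt above: the FREQ_POWER weighting it describes can be read as a shift-based moving average of access deltas. The sketch below is one way to write that, assuming the running average and the newest delta are both plain nanosecond counts. avg_delta_update() is a hypothetical name, and this is not necessarily the arithmetic used by hot_rw_freq_calc() in the series; only the FREQ_POWER value of 4 is taken from fs/hot_tracking.h.

```c
#include <assert.h>
#include <stdint.h>

#define FREQ_POWER 4	/* value from fs/hot_tracking.h in this series */

/*
 * Hypothetical helper: fold the newest access delta into the running
 * average with weights (2^FREQ_POWER - 1) : 1, using shifts only.
 * For FREQ_POWER = 4 this computes (15 * old_avg + new_delta) / 16.
 */
static uint64_t avg_delta_update(uint64_t old_avg, uint64_t new_delta)
{
	uint64_t weighted_old;

	/* Both inputs must leave FREQ_POWER bits of headroom. */
	assert(old_avg <= (UINT64_MAX >> FREQ_POWER));
	assert(new_delta <= (UINT64_MAX >> FREQ_POWER));

	weighted_old = (old_avg << FREQ_POWER) - old_avg;
	return (weighted_old + new_delta) >> FREQ_POWER;
}
```

A steady stream of identical deltas leaves the average unchanged, while a single outlier moves it by only one sixteenth of the difference, which is the damping behaviour the document describes.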
* Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
From: Zhi Yong Wu @ 2012-12-20 14:55 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
Hi,
The raw data is very large, so I am not sure whether it is appropriate to post
it here. If you would like a copy, please let me know.
--
Regards,
Zhi Yong Wu
* Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
From: Zhi Yong Wu @ 2013-01-07 13:49 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
Hi,
Any comments? Some graphs have been generated from the following benchmark data.
1. fs_mark test
https://github.com/wuzhy/perf_report/commit/8dbc7a00d280d7d7e625e336aac57391fa3e422c
2. ffsb test
1.) large_file_create
https://github.com/wuzhy/perf_report/commit/1582cf3fed06d32ffd93dbe109572a2a2fe2990f
2.) large_file_read
https://github.com/wuzhy/perf_report/commit/3b8a13c02c4b86de349e59215d8dbcdf88c5821c
3.) random_write
https://github.com/wuzhy/perf_report/commit/d4fc2aaa0626d40f72a4a06f8c5b5a10d5a949c3
4.) random_read
https://github.com/wuzhy/perf_report/commit/7e010b4fe193c6cf49c8c946be301ed3d3ac9eae
5.) mail_server
https://github.com/wuzhy/perf_report/commit/225f2b51ae3baab774e7cda6009d3745ec742d18
On Thu, Dec 20, 2012 at 10:43 PM, <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> HI, guys,
>
> This patchset has been done scalability or performance tests
> by fs_mark, ffsb and compilebench.
> I have done the perf testing on Linux 3.7.0-rc8+ with Intel(R) Core(TM)
> i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G ram and 260G disk.
>
> Any comments or ideas are appreciated, thanks.
>
> NOTE:
>
> The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
> If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
> For more info, please check hot_tracking.txt in Documentation
>
> Below is the perf testing report:
>
> 1. fs_mark test
>
> w/o: without hot tracking
> w/ : with hot tracking
>
> Count Size FSUse% Files/sec App Overhead
>
> w/o w/ w/o w/ w/o w/
>
> 800000 1 2 3 13756.4 32144.9 5350627 5436291
> 1600000 1 4 5 1163.4 1799.3 20848119 21708216
> 2400000 1 6 6 1360.8 1252.5 6798705 8715322
> 3200000 1 8 8 1600.1 1196.3 5751129 6013792
> 4000000 1 9 9 1071.4 1191.2 17204725 26786369
> 4800000 1 10 10 1483.5 1447.9 19555541 8383046
> 5600000 1 11 11 1457.9 1699.5 5783588 10074681
> 6400000 1 12 13 1658.8 1628.5 6992697 6185551
> 7200000 1 14 14 1662.4 1857.1 5796793 13772592
> 8000000 1 15 15 2930.0 2653.8 12431682 6152573
> 8800000 1 16 17 1630.8 1665.0 7666719 13682765
> 9600000 1 18 18 1530.3 1583.9 5823644 10171644
> 10400000 1 19 19 1437.9 1798.6 20935224 6048083
> 11200000 1 20 20 1529.0 1550.6 6647450 6003151
> 12000000 1 21 22 1558.6 1501.8 12539509 18144939
> 12800000 1 23 23 1644.2 1432.1 7074419 28101975
> 13600000 1 24 24 1753.6 1650.2 7164297 20888972
> 14400000 1 25 25 2750.0 1483.9 12756692 7441225
> 15200000 1 27 27 1551.1 1514.3 5741066 8250443
> 16000000 1 28 28 1610.8 1635.9 72193860 8545285
> 16800000 1 29 29 1646.7 1907.7 8945856 11703513
> 17600000 1 30 31 1496.6 2722.3 5858961 8989393
> 18400000 1 32 32 1457.7 1565.7 10914475 26504660
> 19200000 1 33 33 1437.6 1518.7 6708975 213303618
> 20000000 1 34 34 1825.4 1521.1 5722086 12490907
> 20800000 1 36 35 1718.4 1611.5 5873290 17942534
> 21600000 1 37 37 2152.6 1536.9 113050627 8717940
> 22400000 1 38 38 2443.7 1788.2 7398122 19834765
> 23200000 1 39 39 1518.5 1587.6 5770959 10134882
> 24000000 1 41 41 1536.8 2164.0 5751248 7214626
> 24800000 1 42 42 1576.6 2939.4 7390314 6070271
> 25600000 1 43 43 1707.4 1535.9 11075939 6052896
> 26400000 1 44 44 1522.5 1563.1 10142987 22549898
> 27200000 1 46 46 1827.4 1608.5 11613016 24828125
> 28000000 1 47 47 3420.5 1741.9 8059985 16599156
> 28800000 1 48 48 1815.5 1944.4 7847931 9043277
> 29600000 1 50 49 1650.0 1596.6 5636323 7929164
> 30400000 1 51 51 1683.7 1573.3 5766323 19369146
> 31200000 1 52 52 1610.1 1669.8 9256111 9899107
> 32000000 1 53 53 1645.2 3081.0 7855010 6057257
> 32800000 1 54 55 1835.3 3122.0 6899141 6143875
> 33600000 1 56 56 1916.8 1734.8 10271967 6049509
> 34400000 1 57 57 3119.2 1630.8 11503274 13975417
> 35200000 1 58 58 1629.2 1695.7 6827225 6214248
> 36000000 1 60 60 1636.5 1695.4 38077664 16211067
> 36800000 1 61 61 1665.2 2069.1 19948817 9358494
> 37600000 1 62 62 1734.5 1931.5 26487196 8954836
> 38400000 1 63 63 1625.8 1654.0 6649289 9131844
> 39200000 1 65 65 1778.4 1663.3 11653376 7144960
> 40000000 1 66 66 1851.0 1935.6 8164470 11288753
> 40800000 1 67 67 3171.0 3431.6 12358380 6072820
> 41600000 1 69 69 1714.3 1954.3 13765035 9364495
> 42400000 1 70 70 1591.0 1681.8 18733304 7407689
> 43200000 1 71 71 1537.2 1642.8 19534908 6163018
> 44000000 1 72 72 1630.3 1641.2 23479883 10967509
> 44800000 1 74 74 1877.5 1651.9 8174965 9484587
> 45600000 1 75 75 3322.4 1653.6 14740938 7497831
> 46400000 1 76 76 1706.9 1840.6 10348550 23296562
> 47200000 1 77 78 1837.7 2515.3 13917543 14683192
> 48000000 1 79 79 1642.6 2368.6 14365759 6080942
> 48800000 1 80 80 1827.1 1655.2 9234312 7412406
> 49600000 1 81 81 1631.0 1858.7 7543970 18610881
> 50400000 1 82 82 1560.5 1865.0 21374219 6598771
>
>
> From the above table, when the same count files with same size are created, how FS is full is
> basically same.
>
> 2. FFSB test
> w/o hot tracking w/ hot tracking ratio
> v1 v2 (v2-v1)/v1
> large_file_create
> 1 thread
> - Trans/sec 28918.75 29014.48 +0.33%
> - Throughput 113MB/sec 113MB/sec +0.0%
> - %CPU 4.8% 5.1% +6.3%
> - Trans/%CPU 602473.96 568911.37 -5.6%
> 8 threads
> - Trans/sec 28480.37 28541.25 +0.2%
> - Throughput 111MB/sec 111MB/sec +0.0%
> - %CPU 5.6% 5.9% +5.4%
> - Trans/%CPU 508578.04 483750 -4.9%
> 32 threads
> - Trans/sec 25011.86 26992.32 +7.9%
> - Throughput 97.7MB/sec 105MB/sec +7.5%
> - %CPU 6.2% 7.1% +14.8%
> - Trans/%CPU 403417.10 380173.52 -5.8%
>
> large_file_seq_read
> 1 thread
> - Trans/sec 35303.23 34838.02 -1.3%
> - Throughput 138MB/sec 136MB/sec -1.4%
> - %CPU 5.4% 5.4% +0.0%
> - Trans/%CPU 653763.52 645148.52 -1.3%
> 8 threads
> - Trans/sec 11902.82 11205.22 -5.9%
> - Throughput 46.5MB/sec 43.8MB/sec -5.8%
> - %CPU 2.1% 2.0% -4.8%
> - Trans/%CPU 566800.95 560261 -1.2%
> 32 threads
> - Trans/sec 5068.48 5316.36 +4.9%
> - Throughput 19.8MB/sec 20.8MB/sec +5.1%
> - %CPU 0.9% 1.0% +11.1%
> - Trans/%CPU 563164.45 531636 -5.6%
>
> random_write
> 1 thread
> - Trans/sec 729.01 738.89 +1.4%
> - Throughput 99.7MB/sec 101MB/sec +1.3%
> - %CPU 0.1% 0.1% +0.0%
> - Trans/%CPU 72901 73889 +1.4%
> 8 threads
> - Trans/sec 714.56 714.57 +0.0%
> - Throughput 97.7MB/sec 97.7MB/sec +0.0%
> - %CPU 0.2% 0.2% +0.0%
> - Trans/%CPU 35728 35728.5 +0.0%
> 32 threads
> - Trans/sec 698.62 692.59 -0.9%
> - Throughput 95.5MB/sec 94.7MB/sec -0.8%
> - %CPU 0.2% 0.2% +0.0%
> - Trans/%CPU 34931 34629.5 -0.9%
>
> random_read
> 1 thread
> - Trans/sec 225.49 227.03 +0.7%
> - Throughput 902KB/sec 908KB/sec +0.7%
> - %CPU 1.1% 1.1% +0.0%
> - Trans/%CPU 20499.10 20639.10 +0.7%
> 8 threads
> - Trans/sec 106.72 105.76 -0.9%
> - Throughput 427KB/sec 423KB/sec -0.9%
> - %CPU 0.5% 0.5% +0.0%
> - Trans/%CPU 2134.4 2115.2 -0.9%
> 32 threads
> - Trans/sec 107.44 108.26 +0.8%
> - Throughput 430KB/sec 433KB/sec +0.7%
> - %CPU 0.5% 0.5% +0.0%
> - Trans/%CPU 2148.8 2165.2 +0.8%
>
> mail_server
> 1 thread
> - Trans/sec 681.67 732.66 +7.5%
> - Throughput [read] 1.77MB/sec 1.99MB/sec +12.4%
> - Throughput [write] 858KB/sec 887KB/sec +3.4%
> - %CPU 0.6% 0.6% +0.0%
> - Trans/%CPU 11361.17 12211 +7.5%
> 8 threads
> - Trans/sec 630.48 597.08 -5.3%
> - Throughput [read] 1.64MB/sec 1.54MB/sec -6.1%
> - Throughput [write] 814KB/sec 784KB/sec -3.7%
> - %CPU 0.6% 0.5% -16.7%
> - Trans/%CPU 10508 11941.6 +13.6%
> 32 threads
> - Trans/sec 598.68 566.05 -5.5%
> - Throughput [read] 1.53MB/sec 1.5MB/sec -2.0%
> - Throughput [write] 804KB/sec 705KB/sec -12.3%
> - %CPU 0.7% 0.6% -14.2%
> - Trans/%CPU 8552.57 9434.17 +10.3%
>
> 3. Compilebench test
>
> w/o hot tracking w/ hot tracking ratio
> v1 v2 (v2-v1)/v1
> intial create 114.81 MB/s 118.32 MB/s +3.1%
> create 11.98 MB/s 12.26 MB/s +2.3%
> patch 3.61 MB/s 3.66 MB/s +1.4%
> compile 46.40 MB/s 48.07 MB/s +3.6%
> clean 126.33 MB/s 128.75 MB/s +1.9%
> read tree 9.93 MB/s 9.71 MB/s -2.2%
> read compiled tree 17.19 MB/s 17.52 MB/s +1.9%
> delete tree 12.23 seconds 11.13 seconds -9.0%
> delete compiled tree 12.98 seconds 16.05 seconds +26.7%
> stat tree 7.03 seconds 5.51 seconds -21.6%
> stat compiled tree 12.19 seconds 9.06 seconds -25.7%
>
> Changelog:
>
> - Solved 64 bits inode number issue. [David Sterba]
> - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
> - Cleanup Some issues [David Sterba]
> - Use a static hot debugfs root [Greg KH]
> - Rewritten debugfs support based on seq_file operation. [Dave Chinner]
> - Refactored workqueue support. [Dave Chinner]
> - Turn some Micro into be tunable [Zhiyong, Zheng Liu]
> TIME_TO_KICK, and HEAT_UPDATE_DELAY
> - Introduce hot func registering framework [Zhiyong]
> - Remove global variable for hot tracking [Zhiyong]
> - Add xfs hot tracking support [Dave Chinner]
> - Add ext4 hot tracking support [Zheng Liu]
> - Cleanedup a lot of other issues [Dave Chinner]
> - Added memory shrinker [Dave Chinner]
> - Converted to one workqueue to update map info periodically [Dave Chinner]
> - Cleanedup a lot of other issues [Dave Chinner]
> - Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
> - Add btrfs hot tracking support [Zhiyong]
> - The first three patches can probably just be flattened into one.
> [Marco Stornelli , Dave Chinner]
>
> Zhi Yong Wu (16):
> vfs: introduce some data structures
> vfs: add init and cleanup functions
> vfs: add I/O frequency update function
> vfs: add two map arrays
> vfs: add hooks to enable hot tracking
> vfs: add temp calculation function
> vfs: add map info update function
> vfs: add aging function
> vfs: add one work queue
> vfs: add FS hot type support
> vfs: register one shrinker
> vfs: add one ioctl interface
> vfs: add debugfs support
> proc: add two hot_track proc files
> btrfs: add hot tracking support
> vfs: add documentation
>
> Documentation/filesystems/00-INDEX | 2 +
> Documentation/filesystems/hot_tracking.txt | 255 ++++++
> fs/Makefile | 2 +-
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/super.c | 22 +-
> fs/compat_ioctl.c | 5 +
> fs/dcache.c | 2 +
> fs/direct-io.c | 6 +
> fs/hot_tracking.c | 1345 ++++++++++++++++++++++++++++
> fs/hot_tracking.h | 52 ++
> fs/ioctl.c | 74 ++
> include/linux/fs.h | 5 +
> include/linux/hot_tracking.h | 152 ++++
> kernel/sysctl.c | 14 +
> mm/filemap.c | 6 +
> mm/page-writeback.c | 12 +
> mm/readahead.c | 7 +
> 17 files changed, 1960 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/filesystems/hot_tracking.txt
> create mode 100644 fs/hot_tracking.c
> create mode 100644 fs/hot_tracking.h
> create mode 100644 include/linux/hot_tracking.h
>
> --
> 1.7.6.5
>
--
Regards,
Zhi Yong Wu
* Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (17 preceding siblings ...)
2013-01-07 13:49 ` Zhi Yong Wu
@ 2013-01-08 7:52 ` Zhi Yong Wu
2013-02-22 0:32 ` Zhi Yong Wu
19 siblings, 0 replies; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-08 7:52 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
On Thu, Dec 20, 2012 at 10:43 PM, <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> HI, guys,
>
> This patchset has been done scalability or performance tests
> by fs_mark, ffsb and compilebench.
> I have done the perf testing on Linux 3.7.0-rc8+ with Intel(R) Core(TM)
> i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G ram and 260G disk.
>
> Any comments or ideas are appreciated, thanks.
>
> NOTE:
>
> The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
> If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
> For more info, please check hot_tracking.txt in Documentation
>
> Below is the perf testing report:
>
> 1. fs_mark test
>
> w/o: without hot tracking
> w/ : with hot tracking
>
> Count Size FSUse% Files/sec App Overhead
>
> w/o w/ w/o w/ w/o w/
>
> 800000 1 2 3 13756.4 32144.9 5350627 5436291
> 1600000 1 4 5 1163.4 1799.3 20848119 21708216
> 2400000 1 6 6 1360.8 1252.5 6798705 8715322
> 3200000 1 8 8 1600.1 1196.3 5751129 6013792
> 4000000 1 9 9 1071.4 1191.2 17204725 26786369
> 4800000 1 10 10 1483.5 1447.9 19555541 8383046
> 5600000 1 11 11 1457.9 1699.5 5783588 10074681
> 6400000 1 12 13 1658.8 1628.5 6992697 6185551
> 7200000 1 14 14 1662.4 1857.1 5796793 13772592
> 8000000 1 15 15 2930.0 2653.8 12431682 6152573
> 8800000 1 16 17 1630.8 1665.0 7666719 13682765
> 9600000 1 18 18 1530.3 1583.9 5823644 10171644
> 10400000 1 19 19 1437.9 1798.6 20935224 6048083
> 11200000 1 20 20 1529.0 1550.6 6647450 6003151
> 12000000 1 21 22 1558.6 1501.8 12539509 18144939
> 12800000 1 23 23 1644.2 1432.1 7074419 28101975
> 13600000 1 24 24 1753.6 1650.2 7164297 20888972
> 14400000 1 25 25 2750.0 1483.9 12756692 7441225
> 15200000 1 27 27 1551.1 1514.3 5741066 8250443
> 16000000 1 28 28 1610.8 1635.9 72193860 8545285
> 16800000 1 29 29 1646.7 1907.7 8945856 11703513
> 17600000 1 30 31 1496.6 2722.3 5858961 8989393
> 18400000 1 32 32 1457.7 1565.7 10914475 26504660
> 19200000 1 33 33 1437.6 1518.7 6708975 213303618
> 20000000 1 34 34 1825.4 1521.1 5722086 12490907
> 20800000 1 36 35 1718.4 1611.5 5873290 17942534
> 21600000 1 37 37 2152.6 1536.9 113050627 8717940
> 22400000 1 38 38 2443.7 1788.2 7398122 19834765
> 23200000 1 39 39 1518.5 1587.6 5770959 10134882
> 24000000 1 41 41 1536.8 2164.0 5751248 7214626
> 24800000 1 42 42 1576.6 2939.4 7390314 6070271
> 25600000 1 43 43 1707.4 1535.9 11075939 6052896
> 26400000 1 44 44 1522.5 1563.1 10142987 22549898
> 27200000 1 46 46 1827.4 1608.5 11613016 24828125
> 28000000 1 47 47 3420.5 1741.9 8059985 16599156
> 28800000 1 48 48 1815.5 1944.4 7847931 9043277
> 29600000 1 50 49 1650.0 1596.6 5636323 7929164
> 30400000 1 51 51 1683.7 1573.3 5766323 19369146
> 31200000 1 52 52 1610.1 1669.8 9256111 9899107
> 32000000 1 53 53 1645.2 3081.0 7855010 6057257
> 32800000 1 54 55 1835.3 3122.0 6899141 6143875
> 33600000 1 56 56 1916.8 1734.8 10271967 6049509
> 34400000 1 57 57 3119.2 1630.8 11503274 13975417
> 35200000 1 58 58 1629.2 1695.7 6827225 6214248
> 36000000 1 60 60 1636.5 1695.4 38077664 16211067
> 36800000 1 61 61 1665.2 2069.1 19948817 9358494
> 37600000 1 62 62 1734.5 1931.5 26487196 8954836
> 38400000 1 63 63 1625.8 1654.0 6649289 9131844
> 39200000 1 65 65 1778.4 1663.3 11653376 7144960
> 40000000 1 66 66 1851.0 1935.6 8164470 11288753
> 40800000 1 67 67 3171.0 3431.6 12358380 6072820
> 41600000 1 69 69 1714.3 1954.3 13765035 9364495
> 42400000 1 70 70 1591.0 1681.8 18733304 7407689
> 43200000 1 71 71 1537.2 1642.8 19534908 6163018
> 44000000 1 72 72 1630.3 1641.2 23479883 10967509
> 44800000 1 74 74 1877.5 1651.9 8174965 9484587
> 45600000 1 75 75 3322.4 1653.6 14740938 7497831
> 46400000 1 76 76 1706.9 1840.6 10348550 23296562
> 47200000 1 77 78 1837.7 2515.3 13917543 14683192
> 48000000 1 79 79 1642.6 2368.6 14365759 6080942
> 48800000 1 80 80 1827.1 1655.2 9234312 7412406
> 49600000 1 81 81 1631.0 1858.7 7543970 18610881
> 50400000 1 82 82 1560.5 1865.0 21374219 6598771
>
As for the memory this test eats up: the 50 million files occupy only about
240MB of the 16GB total memory, based on the info below.
[root@localhost ~]# cat /proc/slabinfo | grep hot
hot_range_item 5051 5190 136 30 1 : tunables 0 0 0 : slabdata 173 173 0
hot_inode_item 1677552 1688960 144 28 1 : tunables 0 0 0 : slabdata 60320 60320 0
[root@localhost fs_mark-3.3]# slabtop
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1688960 1677552 99% 0.14K 60320 28 241280K hot_inode_item
[root@localhost ~]# free -m
total used free shared buffers cached
Mem: 15989 15353 636 0 100 11508
-/+ buffers/cache: 3744 12245
Swap: 18047 0 18047
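The ~240MB figure can be re-derived from the slab numbers above. A minimal sketch of the arithmetic, assuming 4KB pages (the page size is not shown in the output); both helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Cache footprint in KB: slabs * pages per slab * page size in KB. */
static uint64_t slab_cache_kb(uint64_t num_slabs, uint64_t pages_per_slab,
                              uint64_t page_kb)
{
        assert(page_kb > 0);
        return num_slabs * pages_per_slab * page_kb;
}

/* Total object slots in the cache: slabs * objects per slab. */
static uint64_t slab_total_objs(uint64_t num_slabs, uint64_t objs_per_slab)
{
        return num_slabs * objs_per_slab;
}
```

With the hot_inode_item values (60320 slabs, 1 page per slab, 28 objects per slab) this gives 241280K and 1688960 objects, matching the slabtop line.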
--
Regards,
Zhi Yong Wu
* Re: [PATCH RESEND v1 01/16] vfs: introduce some data structures
2012-12-20 14:43 ` [PATCH RESEND v1 01/16] vfs: introduce some data structures zwu.kernel
@ 2013-01-10 0:48 ` David Sterba
2013-01-10 6:24 ` Zhi Yong Wu
0 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2013-01-10 0:48 UTC (permalink / raw)
To: zwu.kernel
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Dec 20, 2012 at 10:43:20PM +0800, zwu.kernel@gmail.com wrote:
> --- /dev/null
> +++ b/fs/hot_tracking.c
> @@ -0,0 +1,109 @@
> +/*
> + * fs/hot_tracking.c
From what I've understood, the file name written here is not wanted, so
please drop it (and from the .h too).
> + *
> + * Copyright (C) 2012 IBM Corp. All rights reserved.
> + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
A short description of the hot tracking feature or pointer to the
Documentation/ file would be nice here.
> + */
> +
> +#include <linux/list.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/spinlock.h>
> +#include <linux/hardirq.h>
> +#include <linux/fs.h>
> +#include <linux/blkdev.h>
> +#include <linux/types.h>
> +#include <linux/limits.h>
> +#include "hot_tracking.h"
> +
> +/* kmem_cache pointers for slab caches */
This comment seems useless to me; it does not help in understanding the code, it
just repeats what the C already says. There are more such redundant comments in
the series, but I'm not going to point to all of them right now.
> +static struct kmem_cache *hot_inode_item_cachep __read_mostly;
> +static struct kmem_cache *hot_range_item_cachep __read_mostly;
> +
> --- /dev/null
> +++ b/include/linux/hot_tracking.h
> +/* The common info for both following structures */
> +struct hot_comm_item {
> + struct rb_node rb_node; /* rbtree index */
> + struct hot_freq_data hot_freq_data; /* frequency data */
> + spinlock_t lock; /* protects object data */
> + struct kref refs; /* prevents kfree */
> +};
> +
> +/* An item representing an inode and its access frequency */
> +struct hot_inode_item {
> + struct hot_comm_item hot_inode; /* node in hot_inode_tree */
> + struct hot_rb_tree hot_range_tree; /* tree of ranges */
> + spinlock_t lock; /* protect range tree */
> + struct hot_rb_tree *hot_inode_tree;
> + u64 i_ino; /* inode number from inode */
> +};
Please align the comments to something like this (or drop them if they seem
redundant):
/* The common info for both following structures */
struct hot_comm_item {
        struct rb_node          rb_node;        /* rbtree index */
        struct hot_freq_data    hot_freq_data;  /* frequency data */
        spinlock_t              lock;           /* protects object data */
        struct kref             refs;           /* prevents kfree */
        struct list_head        n_list;         /* list node index */
};

/* An item representing an inode and its access frequency */
struct hot_inode_item {
        struct hot_comm_item    hot_inode;      /* node in hot_inode_tree */
        struct hot_rb_tree      hot_range_tree; /* tree of ranges */
        spinlock_t              lock;           /* protect range tree */
        struct hot_rb_tree      *hot_inode_tree;
        u64                     i_ino;          /* inode number from inode */
};
> +extern void __init hot_cache_init(void);
this belongs to the private include fs/hot_tracking.h (because this is called
only once by vfs init and not by filesystems), there's
hot_track_init(superblock) for that purpose introduced later.
david
* Re: [PATCH RESEND v1 02/16] vfs: add init and cleanup functions
2012-12-20 14:43 ` [PATCH RESEND v1 02/16] vfs: add init and cleanup functions zwu.kernel
@ 2013-01-10 0:48 ` David Sterba
2013-01-11 7:21 ` Zhi Yong Wu
0 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2013-01-10 0:48 UTC (permalink / raw)
To: zwu.kernel
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Dec 20, 2012 at 10:43:21PM +0800, zwu.kernel@gmail.com wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -107,3 +189,38 @@ err:
> kmem_cache_destroy(hot_inode_item_cachep);
> }
> EXPORT_SYMBOL_GPL(hot_cache_init);
> +
> +/*
> + * Initialize the data structures for hot data tracking.
> + */
> +int hot_track_init(struct super_block *sb)
> +{
> + struct hot_info *root;
> + int ret = -ENOMEM;
> +
> + root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
> + if (!root) {
> + printk(KERN_ERR "%s: Failed to malloc memory for "
> + "hot_info\n", __func__);
> + return ret;
> + }
> +
> + hot_inode_tree_init(root);
This function is supposed to be called from the filesystem init, please
add a sanity check that would catch multiple initialization attempts.
> +
> + sb->s_hot_root = root;
> +
> + printk(KERN_INFO "VFS: Turning on hot data tracking\n");
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(hot_track_init);
> +
> +void hot_track_exit(struct super_block *sb)
> +{
> + struct hot_info *root = sb->s_hot_root;
Another sanity check is needed here to catch the opposite case.
Why? The option is parsed and enabled by the filesystems; given
unexpected bugs, e.g. with remounting or incorrectly handled error paths,
the vfs layer should IMHO rather warn than crash.
> +
> + hot_inode_tree_exit(root);
> + sb->s_hot_root = NULL;
> + kfree(root);
> +}
> +EXPORT_SYMBOL_GPL(hot_track_exit);
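The two sanity checks asked for above could look roughly like the following userspace sketch. The structures are stand-ins with only the fields used here, and the choice of -EBUSY for a repeated init and of a plain warning on a missing root are assumptions, not something the patch does:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for the kernel structures; only the fields used here. */
struct hot_info { int placeholder; };
struct super_block { struct hot_info *s_hot_root; };

/* Refuse a second initialization instead of leaking the first root. */
static int hot_track_init(struct super_block *sb)
{
        struct hot_info *root;

        assert(sb != NULL);
        if (sb->s_hot_root) {
                fprintf(stderr, "hot tracking already initialized\n");
                return -EBUSY;
        }

        root = calloc(1, sizeof(*root));
        if (!root)
                return -ENOMEM;

        sb->s_hot_root = root;
        return 0;
}

/* Warn and return rather than dereference a missing root. */
static void hot_track_exit(struct super_block *sb)
{
        struct hot_info *root;

        assert(sb != NULL);
        root = sb->s_hot_root;
        if (!root) {
                fprintf(stderr, "hot tracking was not initialized\n");
                return;
        }

        sb->s_hot_root = NULL;
        free(root);
}
```

In the kernel the warnings would more likely be WARN_ON-style checks; the point of the sketch is only that both entry points tolerate being called in the wrong state.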
david
* Re: [PATCH RESEND v1 03/16] vfs: add I/O frequency update function
2012-12-20 14:43 ` [PATCH RESEND v1 03/16] vfs: add I/O frequency update function zwu.kernel
@ 2013-01-10 0:51 ` David Sterba
2013-01-11 7:38 ` Zhi Yong Wu
0 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2013-01-10 0:51 UTC (permalink / raw)
To: zwu.kernel
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Dec 20, 2012 at 10:43:22PM +0800, zwu.kernel@gmail.com wrote:
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -164,6 +164,135 @@ static void hot_inode_tree_exit(struct hot_info *root)
> spin_unlock(&root->lock);
> }
>
> +struct hot_inode_item
> +*hot_inode_item_lookup(struct hot_info *root, u64 ino)
> +{
> + struct rb_node **p = &root->hot_inode_tree.map.rb_node;
> + struct rb_node *parent = NULL;
> + struct hot_comm_item *ci;
> + struct hot_inode_item *entry;
> +
> + /* walk tree to find insertion point */
> + spin_lock(&root->lock);
> + while (*p) {
> + parent = *p;
> + ci = rb_entry(parent, struct hot_comm_item, rb_node);
> + entry = container_of(ci, struct hot_inode_item, hot_inode);
> + if (ino < entry->i_ino)
> + p = &(*p)->rb_left;
> + else if (ino > entry->i_ino)
> + p = &(*p)->rb_right;
style comment: put { } around all the if/else blocks
> + else {
> + spin_unlock(&root->lock);
> + kref_get(&entry->hot_inode.refs);
Jumping forward in the series, the spin_unlock and kref_get get swapped
later, and I think that's the right order. Otherwise there's a small
window where the entry does not hold the reference and could
potentially be freed by a racing kref_put, no?

<lookup entry E>
spin_unlock(tree)
                        spin_lock(tree)
                        <lookup entry E>
                        kref_put(E) or via hot_inode_item_put(E)  (1)
kref_get(E)  (2)

If the reference count at (1) was 1, the entry is freed and (2) touches freed
memory. hot_inode_item_put can be called from the filesystem or via the seq
print of the respective /proc files, so I think there are chances to hit
the problem.
> + return entry;
> + }
> + }
> + spin_unlock(&root->lock);
> +
> + entry = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
> + if (!entry)
> + return ERR_PTR(-ENOMEM);
> +
> + spin_lock(&root->lock);
> + hot_inode_item_init(entry, ino, &root->hot_inode_tree);
> + rb_link_node(&entry->hot_inode.rb_node, parent, p);
> + rb_insert_color(&entry->hot_inode.rb_node,
> + &root->hot_inode_tree.map);
> + spin_unlock(&root->lock);
> +
> + kref_get(&entry->hot_inode.refs);
Similar here, the entry is inserted into the tree but there's no
refcount yet. And the order of spin_unlock/kref_get remains unchanged.
> + return entry;
> +}
> +EXPORT_SYMBOL_GPL(hot_inode_item_lookup);
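The ordering argued for above (take the reference before dropping the tree lock) can be sketched in userspace as follows. This is only an illustration under simplifying assumptions: a linked list stands in for the rbtree, a plain counter for the kref, and the toy lock merely records whether it is held; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock: records whether it is held so the ordering can be checked. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

struct item {
        unsigned long long ino;
        int refs;               /* stands in for struct kref */
        struct item *next;
};

struct tree {
        struct toy_lock lock;
        struct item *head;      /* a list stands in for the rbtree */
};

/* Take the reference while the lock still pins the entry in the tree. */
static struct item *item_lookup(struct tree *t, unsigned long long ino)
{
        struct item *it;

        toy_lock_acquire(&t->lock);
        for (it = t->head; it; it = it->next) {
                if (it->ino == ino) {
                        assert(t->lock.held);   /* get happens under the lock */
                        it->refs++;
                        break;
                }
        }
        toy_lock_release(&t->lock);

        return it;      /* NULL when no entry matched */
}
```

Because the increment happens before the release, a concurrent put that needs the same lock to remove the entry cannot drop the last reference in between.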
> +
> +static struct hot_range_item
> +*hot_range_item_lookup(struct hot_inode_item *he,
> + loff_t start)
> +{
> + struct rb_node **p = &he->hot_range_tree.map.rb_node;
> + struct rb_node *parent = NULL;
> + struct hot_comm_item *ci;
> + struct hot_range_item *entry;
> +
> + /* walk tree to find insertion point */
> + spin_lock(&he->lock);
> + while (*p) {
> + parent = *p;
> + ci = rb_entry(parent, struct hot_comm_item, rb_node);
> + entry = container_of(ci, struct hot_range_item, hot_range);
> + if (start < entry->start)
> + p = &(*p)->rb_left;
> + else if (start > hot_range_end(entry))
> + p = &(*p)->rb_right;
if { ...}
else if { ... }
> + else {
> + spin_unlock(&he->lock);
> + kref_get(&entry->hot_range.refs);
same here
> + return entry;
> + }
> + }
> + spin_unlock(&he->lock);
> +
> + entry = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
> + if (!entry)
> + return ERR_PTR(-ENOMEM);
> +
> + spin_lock(&he->lock);
> + hot_range_item_init(entry, start, he);
> + rb_link_node(&entry->hot_range.rb_node, parent, p);
> + rb_insert_color(&entry->hot_range.rb_node,
> + &he->hot_range_tree.map);
> + spin_unlock(&he->lock);
> +
> + kref_get(&entry->hot_range.refs);
and here
> + return entry;
> +}
> +
> +/*
> + * This function does the actual work of updating
> + * the frequency numbers, whatever they turn out to be.
Can this function be described a bit better? This comment did not help.
> + */
> +static void hot_rw_freq_calc(struct timespec old_atime,
> + struct timespec cur_time, u64 *avg)
> +{
> + struct timespec delta_ts;
> + u64 new_delta;
> +
> + delta_ts = timespec_sub(cur_time, old_atime);
> + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
> +
> + *avg = (*avg << FREQ_POWER) - *avg + new_delta;
> + *avg = *avg >> FREQ_POWER;
> +}
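For reference, the arithmetic in hot_rw_freq_calc() above reads as an exponentially weighted moving average done with shifts: with P = FREQ_POWER, each call computes avg' = (avg * (2^P - 1) + (delta >> P)) >> P, so the pre-scaled sample enters with weight 2^-P. A minimal userspace sketch of the same computation on a plain nanosecond delta; the value 4 for FREQ_POWER is an assumption (it is not shown in the quoted hunk) and freq_avg_update() is a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

#define FREQ_POWER 4    /* assumed value; not shown in the quoted hunk */

/* Same arithmetic as the quoted hot_rw_freq_calc(), on a delta in ns. */
static uint64_t freq_avg_update(uint64_t avg, uint64_t delta_ns)
{
        uint64_t new_delta = delta_ns >> FREQ_POWER;

        /* The left shift below must not overflow 64 bits. */
        assert((avg >> (64 - FREQ_POWER)) == 0);

        avg = (avg << FREQ_POWER) - avg + new_delta;
        return avg >> FREQ_POWER;
}
```

Under that assumption, an average of 1600 with a zero delta decays to 1500, i.e. by a factor of 15/16 per update.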
> +
> +static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
> +{
> + struct timespec cur_time = current_kernel_time();
> +
> + if (write) {
> + freq_data->nr_writes += 1;
The preferred style is
freq_data->nr_writes++
> + hot_rw_freq_calc(freq_data->last_write_time,
> + cur_time,
> + &freq_data->avg_delta_writes);
> + freq_data->last_write_time = cur_time;
> + } else {
> + freq_data->nr_reads += 1;
(...)
> + hot_rw_freq_calc(freq_data->last_read_time,
> + cur_time,
> + &freq_data->avg_delta_reads);
> + freq_data->last_read_time = cur_time;
> + }
> +}
> +
david
* Re: [PATCH RESEND v1 04/16] vfs: add two map arrays
2012-12-20 14:43 ` [PATCH RESEND v1 04/16] vfs: add two map arrays zwu.kernel
@ 2013-01-10 0:51 ` David Sterba
0 siblings, 0 replies; 35+ messages in thread
From: David Sterba @ 2013-01-10 0:51 UTC (permalink / raw)
To: zwu.kernel
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Dec 20, 2012 at 10:43:23PM +0800, zwu.kernel@gmail.com wrote:
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> +/* Free inode and range map info */
> +static void hot_map_exit(struct hot_info *root)
> +{
> + int i;
> + for (i = 0; i < HEAT_MAP_SIZE; i++) {
> + spin_lock(&root->heat_inode_map[i].lock);
> + hot_map_list_free(&root->heat_inode_map[i].node_list, root);
> + spin_unlock(&root->heat_inode_map[i].lock);
please insert an empty line here to improve readability
> + spin_lock(&root->heat_range_map[i].lock);
> + hot_map_list_free(&root->heat_range_map[i].node_list, root);
> + spin_unlock(&root->heat_range_map[i].lock);
> + }
> +}
> +
> +/*
> * Initialize kmem cache for hot_inode_item and hot_range_item.
> */
> void __init hot_cache_init(void)
> --- a/include/linux/hot_tracking.h
> +++ b/include/linux/hot_tracking.h
> @@ -71,6 +82,12 @@ struct hot_range_item {
> struct hot_info {
> struct hot_rb_tree hot_inode_tree;
> spinlock_t lock; /*protect inode tree */
> +
> + /* map of inode temperature */
> + struct hot_map_head heat_inode_map[HEAT_MAP_SIZE];
> + /* map of range temperature */
> + struct hot_map_head heat_range_map[HEAT_MAP_SIZE];
> + unsigned int hot_map_nr;
> };
Final layout of struct hot_info is
struct hot_info {
        struct hot_rb_tree         hot_inode_tree;       /*     0     8 */
        spinlock_t                 lock;                 /*     8    72 */
        /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
        struct hot_map_head        heat_inode_map[256];  /*    80 24576 */
        /* --- cacheline 385 boundary (24640 bytes) was 16 bytes ago --- */
        struct hot_map_head        heat_range_map[256];  /* 24656 24576 */
        /* --- cacheline 769 boundary (49216 bytes) was 16 bytes ago --- */
        unsigned int               hot_map_nr;           /* 49232     4 */

        /* XXX 4 bytes hole, try to pack */

        struct workqueue_struct *  update_wq;            /* 49240     8 */
        struct delayed_work        update_work;          /* 49248   216 */

        /* XXX last struct has 4 bytes of padding */

        /* --- cacheline 772 boundary (49408 bytes) was 56 bytes ago --- */
        struct hot_type *          hot_type;             /* 49464     8 */
        /* --- cacheline 773 boundary (49472 bytes) --- */
        struct shrinker            hot_shrink;           /* 49472    48 */
        struct dentry *            vol_dentry;           /* 49520     8 */

        /* size: 49528, cachelines: 774, members: 10 */
        /* sum members: 49524, holes: 1, sum holes: 4 */
        /* paddings: 1, sum paddings: 4 */
        /* last cacheline: 56 bytes */
};
that's an order-4 allocation and the heat_*_map[] themselves need order-3.
Also the structure
struct hot_map_head {
        struct list_head           node_list;            /*     0    16 */
        u8                         temp;                 /*    16     1 */

        /* XXX 7 bytes hole, try to pack */

        spinlock_t                 lock;                 /*    24    72 */
        /* --- cacheline 1 boundary (64 bytes) was 32 bytes ago --- */

        /* size: 96, cachelines: 2, members: 3 */
        /* sum members: 89, holes: 1, sum holes: 7 */
        /* last cacheline: 32 bytes */
};
is not packed efficiently and given the number of the array items, the wasted
space adds to the sum.
So, this needs to be fixed. Options I see:
1) try to allocate the structure with GFP_NOWARN and use vmalloc as a fallback
2) allocate heat_*_map arrays dynamically
An array of 256 pointers takes 2048 bytes, so when there are 2 of them plus
other struct items, overall size will go beyond a 4k page. Also, doing
kmalloc on each heat_*_map item could spread them over memory, although
hot_info is a long-term structure and it would make sense to keep the
data located at one place. For struct hot_map_head I suggest to create a
slab.
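The allocation orders quoted above can be checked with a small sketch. It assumes 4k pages and 8-byte pointers; alloc_order() is a hypothetical userspace helper that computes the smallest power-of-two number of pages covering a size, which is what the kernel's get_order() is used for:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed 4k pages, as in the text above */

/* Smallest order such that (PAGE_SIZE_BYTES << order) >= size. */
static unsigned int alloc_order(size_t size)
{
        unsigned int order = 0;
        size_t span = PAGE_SIZE_BYTES;

        assert(size > 0);
        while (span < size) {
                span <<= 1;
                order++;
        }
        return order;
}
```

Under those assumptions, 49528 bytes (the whole struct hot_info) needs order 4, 24576 bytes (one heat map of 256 entries at 96 bytes each) needs order 3, and two arrays of 256 pointers fill exactly one page, so any additional member pushes the structure to order 1.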
david
* Re: [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking
2012-12-20 14:43 ` [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking zwu.kernel
@ 2013-01-10 0:52 ` David Sterba
2013-01-11 7:47 ` Zhi Yong Wu
0 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2013-01-10 0:52 UTC (permalink / raw)
To: zwu.kernel
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Dec 20, 2012 at 10:43:24PM +0800, zwu.kernel@gmail.com wrote:
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -37,6 +37,7 @@
> #include <linux/uio.h>
> #include <linux/atomic.h>
> #include <linux/prefetch.h>
> +#include "hot_tracking.h"
>
> /*
> * How many user pages to map in one call to get_user_pages(). This determines
> @@ -1299,6 +1300,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> prefetch(bdev->bd_queue);
> prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
>
> + /* Hot data tracking */
> + hot_update_freqs(inode, offset,
> + iov_length(iov, nr_segs),
> + rw & WRITE);
hot_update_freqs takes an 'int rw' directly, so you should pass plain
'rw' here and do the 'rw & WRITE' check in hot_freq_data_update itself.
> +
> return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> nr_segs, get_block, end_io,
> submit_io, flags);
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -35,6 +35,7 @@
> #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
> #include <linux/pagevec.h>
> #include <linux/timer.h>
> +#include <linux/hot_tracking.h>
> #include <trace/events/writeback.h>
>
> /*
> @@ -1902,13 +1903,24 @@ EXPORT_SYMBOL(generic_writepages);
> int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> int ret;
> + loff_t start = 0;
> + size_t count = 0;
>
> if (wbc->nr_to_write <= 0)
> return 0;
> +
> + start = mapping->writeback_index << PAGE_CACHE_SHIFT;
> + count = wbc->nr_to_write;
> +
> if (mapping->a_ops->writepages)
> ret = mapping->a_ops->writepages(mapping, wbc);
> else
> ret = generic_writepages(mapping, wbc);
> +
> + /* Hot data tracking */
> + hot_update_freqs(mapping->host, start,
> + (count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
I think the frequencies should not be updated in case of error returned
from writepages.
> +
> return ret;
> }
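The suggestion above for do_writepages() amounts to gating the hook on the return value. A minimal userspace sketch that only mimics the control flow; hot_update_stub() and do_writepages_sketch() are stand-ins and none of these names exist in the patch:

```c
#include <assert.h>

static int updates;     /* counts calls to the stand-in hook */

/* Stand-in for hot_update_freqs(); only counts invocations. */
static void hot_update_stub(long pages_written)
{
        assert(pages_written >= 0);
        updates++;
}

/* 'wp_ret' plays the role of the ->writepages() return value. */
static int do_writepages_sketch(int wp_ret, long nr_requested, long nr_left)
{
        if (nr_requested <= 0)
                return 0;

        /* Account the I/O only when writeback reported success. */
        if (wp_ret == 0)
                hot_update_stub(nr_requested - nr_left);

        return wp_ret;
}
```

Whether a partially completed writeback that returns an error should still be accounted is a policy question the sketch does not try to answer.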
>
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -138,6 +139,12 @@ static int read_pages(struct address_space *mapping, struct file *filp,
> out:
> blk_finish_plug(&plug);
>
> + /* Hot data tracking */
> + hot_update_freqs(mapping->host,
> + (loff_t)(list_entry(pages->prev, struct page, lru)->index)
> + << PAGE_CACHE_SHIFT,
> + (size_t)nr_pages * PAGE_CACHE_SIZE, 0);
same comment here
> +
> return ret;
> }
david
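The comment about not updating the frequencies on error can be sketched in userspace as follows. This is only an illustration, assuming a negative return value means failure. The names sketch_stats, sketch_account_writeback() and SKETCH_PAGE_SIZE are hypothetical and not part of the patchset, where the hook calls hot_update_freqs().

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed; stands in for PAGE_CACHE_SIZE */

/* Hypothetical stand-in for the per-inode frequency data. */
struct sketch_stats {
	unsigned long nr_writes;
	size_t bytes_written;
};

/*
 * Record a writeback only if it succeeded (ret >= 0) and wrote at
 * least one page. nr_before/nr_after model wbc->nr_to_write sampled
 * before and after the writepages call.
 */
static void sketch_account_writeback(struct sketch_stats *st, int ret,
				     long nr_before, long nr_after)
{
	if (ret < 0)
		return;		/* error: leave the frequencies untouched */
	if (nr_after >= nr_before)
		return;		/* nothing was written */

	st->nr_writes++;
	st->bytes_written += (size_t)(nr_before - nr_after) * SKETCH_PAGE_SIZE;
}
```

The second check also avoids recording a zero-length write when the call returns without making progress.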
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 06/16] vfs: add temp calculation function
2012-12-20 14:43 ` [PATCH RESEND v1 06/16] vfs: add temp calculation function zwu.kernel
@ 2013-01-10 0:53 ` David Sterba
2013-01-11 8:08 ` Zhi Yong Wu
0 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2013-01-10 0:53 UTC (permalink / raw)
To: zwu.kernel
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Dec 20, 2012 at 10:43:25PM +0800, zwu.kernel@gmail.com wrote:
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -25,6 +25,14 @@
> static struct kmem_cache *hot_inode_item_cachep __read_mostly;
> static struct kmem_cache *hot_range_item_cachep __read_mostly;
>
> +static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
> +{
> + if (dir)
> + return counter << bits;
> + else
> + return counter >> bits;
> +}
I don't understand the purpose of this function, it obscures a simple
bitwise shift.
> +
> /*
> * Initialize the inode tree. Should be called for each new inode
> * access or other user of the hot_inode interface.
> @@ -315,6 +323,72 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
> }
>
> /*
> + * hot_temp_calc() is responsible for distilling the six heat
> + * criteria down into a single temperature value for the data,
> + * which is an integer between 0 and HEAT_MAX_VALUE.
I didn't find HEAT_MAX_VALUE defined anywhere.
> + */
> +static u32 hot_temp_calc(struct hot_freq_data *freq_data)
> +{
> + u32 result = 0;
> +
> + struct timespec ckt = current_kernel_time();
> + u64 cur_time = timespec_to_ns(&ckt);
> +
> + u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
> + NRR_MULTIPLIER_POWER, true);
> + u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
> + NRW_MULTIPLIER_POWER, true);
So many typecasts, some of them unnecessary and in connection with
hot_raw_shift this is hard to read and understand.
u32 nrr_heat = (u32)((u64)freq_data->nr_reads << NRR_MULTIPLIER_POWER);
is not much better without a comment why this is doing the right thing.
> +
> + u64 ltr_heat =
> + hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
> + LTR_DIVIDER_POWER, false);
> + u64 ltw_heat =
> + hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
> + LTW_DIVIDER_POWER, false);
> +
> + u64 avr_heat =
> + hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
> + AVR_DIVIDER_POWER, false);
> + u64 avw_heat =
> + hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
> + AVW_DIVIDER_POWER, false);
> +
> + /* ltr_heat is now guaranteed to be u32 safe */
> + if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
> + ltr_heat = 0;
> + else
> + ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
> +
> + /* ltw_heat is now guaranteed to be u32 safe */
> + if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
> + ltw_heat = 0;
> + else
> + ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
> +
> + /* avr_heat is now guaranteed to be u32 safe */
> + if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
> + avr_heat = (u32) -1;
> +
> + /* avw_heat is now guaranteed to be u32 safe */
> + if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
> + avw_heat = (u32) -1;
> +
> + nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
> + (3 - NRR_COEFF_POWER), false);
> + nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
> + (3 - NRW_COEFF_POWER), false);
> + ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
> + ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
> + avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
> + avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
> +
> + result = nrr_heat + nrw_heat + (u32) ltr_heat +
> + (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
Reading through the function up to here, I got so lost in the shifts that
I don't see the meaning of the resulting value or how I should interpret it
if I watch it change over time. What are the expected weights of the
number and time factors? There are more details in the documentation, but
the big picture is blurred by implementation details.
Let's keep the implementation details here and write better user
documentation, with a few examples, in the docs. Is it possible to describe
some common access patterns and how they affect the temperature?
You've been benchmarking this patchset, I'm sure you can write up a few
examples based on that.
> +
> + return result;
> +}
david
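To illustrate the point about hot_raw_shift() hiding a simple shift, here is a userspace sketch of the "time since last access" term written with plain shift operators. It is not the patch's code: sketch_recency_heat() is a hypothetical name, and divider_power stands in for LTR_DIVIDER_POWER / LTW_DIVIDER_POWER, whose real values are not shown in this thread.

```c
#include <stdint.h>

/*
 * Map the age of the last access to a heat term: a recent access
 * gives a large value, a very old one gives 0. divider_power must be
 * smaller than 64.
 */
static uint64_t sketch_recency_heat(uint64_t now_ns, uint64_t last_ns,
				    unsigned int divider_power)
{
	const uint64_t limit = (uint64_t)1 << 32;
	uint64_t age = (now_ns - last_ns) >> divider_power;

	if (age >= limit)
		return 0;	/* too old to contribute any heat */
	return limit - age;	/* in the range 1 .. 2^32 */
}
```

Note that the result is exactly 2^32 when the scaled age is 0, so it only fits in 32 bits after a further right shift of at least one bit. A comment stating that kind of range assumption is what makes such code reviewable.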
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 01/16] vfs: introduce some data structures
2013-01-10 0:48 ` David Sterba
@ 2013-01-10 6:24 ` Zhi Yong Wu
0 siblings, 0 replies; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-10 6:24 UTC (permalink / raw)
To: dsterba
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Jan 10, 2013 at 8:48 AM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Dec 20, 2012 at 10:43:20PM +0800, zwu.kernel@gmail.com wrote:
>> --- /dev/null
>> +++ b/fs/hot_tracking.c
>> @@ -0,0 +1,109 @@
>> +/*
>> + * fs/hot_tracking.c
>
> From what I've understood, the file name written here is not wanted, so
> please drop it (and from the .h too)
Done.
>
>> + *
>> + * Copyright (C) 2012 IBM Corp. All rights reserved.
>> + * Written by Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>
> A short description of the hot tracking feature or pointer to the
> Documentation/ file would be nice here.
ok, Done
>
>> + */
>> +
>> +#include <linux/list.h>
>> +#include <linux/err.h>
>> +#include <linux/slab.h>
>> +#include <linux/module.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/hardirq.h>
>> +#include <linux/fs.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/types.h>
>> +#include <linux/limits.h>
>> +#include "hot_tracking.h"
>> +
>> +/* kmem_cache pointers for slab caches */
>
> This comment seems useless to me. It does not help in understanding the code,
> it just repeats what the C already says. There are more such redundant
> comments in the series, but I'm not going to point to all of them right now.
Removed.
>
>> +static struct kmem_cache *hot_inode_item_cachep __read_mostly;
>> +static struct kmem_cache *hot_range_item_cachep __read_mostly;
>> +
>
>> --- /dev/null
>> +++ b/include/linux/hot_tracking.h
>> +/* The common info for both following structures */
>> +struct hot_comm_item {
>> + struct rb_node rb_node; /* rbtree index */
>> + struct hot_freq_data hot_freq_data; /* frequency data */
>> + spinlock_t lock; /* protects object data */
>> + struct kref refs; /* prevents kfree */
>> +};
>> +
>> +/* An item representing an inode and its access frequency */
>> +struct hot_inode_item {
>> + struct hot_comm_item hot_inode; /* node in hot_inode_tree */
>> + struct hot_rb_tree hot_range_tree; /* tree of ranges */
>> + spinlock_t lock; /* protect range tree */
>> + struct hot_rb_tree *hot_inode_tree;
>> + u64 i_ino; /* inode number from inode */
>> +};
>
> Please align the comments to something like this (or drop them if they seem
> redundant):
Done
>
> /* The common info for both following structures */
> struct hot_comm_item {
>         struct rb_node rb_node;              /* rbtree index */
>         struct hot_freq_data hot_freq_data;  /* frequency data */
>         spinlock_t lock;                     /* protects object data */
>         struct kref refs;                    /* prevents kfree */
>         struct list_head n_list;             /* list node index */
> };
>
> /* An item representing an inode and its access frequency */
> struct hot_inode_item {
>         struct hot_comm_item hot_inode;      /* node in hot_inode_tree */
>         struct hot_rb_tree hot_range_tree;   /* tree of ranges */
>         spinlock_t lock;                     /* protect range tree */
>         struct hot_rb_tree *hot_inode_tree;
>         u64 i_ino;                           /* inode number from inode */
> };
>
>> +extern void __init hot_cache_init(void);
>
> this belongs to the private include fs/hot_tracking.h (because this is called
> only once by vfs init and not by filesystems), there's
> hot_track_init(superblock) for that purpose introduced later.
Done, moved it to fs/hot_tracking.h.
>
>
> david
--
Regards,
Zhi Yong Wu
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 02/16] vfs: add init and cleanup functions
2013-01-10 0:48 ` David Sterba
@ 2013-01-11 7:21 ` Zhi Yong Wu
0 siblings, 0 replies; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-11 7:21 UTC (permalink / raw)
To: dsterba
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Jan 10, 2013 at 8:48 AM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Dec 20, 2012 at 10:43:21PM +0800, zwu.kernel@gmail.com wrote:
>> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -107,3 +189,38 @@ err:
>> kmem_cache_destroy(hot_inode_item_cachep);
>> }
>> EXPORT_SYMBOL_GPL(hot_cache_init);
>> +
>> +/*
>> + * Initialize the data structures for hot data tracking.
>> + */
>> +int hot_track_init(struct super_block *sb)
>> +{
>> + struct hot_info *root;
>> + int ret = -ENOMEM;
>> +
>> + root = kzalloc(sizeof(struct hot_info), GFP_NOFS);
>> + if (!root) {
>> + printk(KERN_ERR "%s: Failed to malloc memory for "
>> + "hot_info\n", __func__);
>> + return ret;
>> + }
>> +
>> + hot_inode_tree_init(root);
>
> This function is supposed to be called from the filesystem init, please
> add a sanity check that would catch multiple initialization attempts.
Good catch, thanks. Done.
>
>> +
>> + sb->s_hot_root = root;
>> +
>> + printk(KERN_INFO "VFS: Turning on hot data tracking\n");
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(hot_track_init);
>> +
>> +void hot_track_exit(struct super_block *sb)
>> +{
>> + struct hot_info *root = sb->s_hot_root;
>
> another sanity check to catch the opposite.
ditto.
>
> Why? The option is parsed and enabled from the filesystems, due to
> unexpected bugs eg with remounting or incorrectly handled error paths,
> vfs layer should IMHO rather warn than crash.
Thanks for your explanation.
>
>> +
>> + hot_inode_tree_exit(root);
>> + sb->s_hot_root = NULL;
>> + kfree(root);
>> +}
>> +EXPORT_SYMBOL_GPL(hot_track_exit);
>
>
> david
--
Regards,
Zhi Yong Wu
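The two sanity checks requested above could look roughly like the following userspace sketch. It is an illustration under stated assumptions, not the follow-up patch: the sketch_* names are hypothetical stand-ins for struct hot_info, struct super_block, hot_track_init() and hot_track_exit(), and -EBUSY is an error code picked for the example.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct hot_info and struct super_block. */
struct sketch_hot_info { int dummy; };
struct sketch_sb { struct sketch_hot_info *s_hot_root; };

/* Refuse a second initialization instead of leaking the first root. */
static int sketch_track_init(struct sketch_sb *sb)
{
	struct sketch_hot_info *root;

	if (sb->s_hot_root) {
		fprintf(stderr, "hot tracking already initialized\n");
		return -EBUSY;	/* error code chosen for illustration */
	}

	root = calloc(1, sizeof(*root));
	if (!root)
		return -ENOMEM;

	sb->s_hot_root = root;
	return 0;
}

/* Warn and return if tracking was never set up or is already gone. */
static void sketch_track_exit(struct sketch_sb *sb)
{
	if (!sb->s_hot_root) {
		fprintf(stderr, "hot tracking not initialized\n");
		return;
	}

	free(sb->s_hot_root);
	sb->s_hot_root = NULL;
}
```

Both paths warn rather than crash, which matches the reasoning given for remount and error-path bugs in the filesystems.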
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 03/16] vfs: add I/O frequency update function
2013-01-10 0:51 ` David Sterba
@ 2013-01-11 7:38 ` Zhi Yong Wu
2013-01-11 14:27 ` David Sterba
0 siblings, 1 reply; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-11 7:38 UTC (permalink / raw)
To: dsterba
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Jan 10, 2013 at 8:51 AM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Dec 20, 2012 at 10:43:22PM +0800, zwu.kernel@gmail.com wrote:
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -164,6 +164,135 @@ static void hot_inode_tree_exit(struct hot_info *root)
>> spin_unlock(&root->lock);
>> }
>>
>> +struct hot_inode_item
>> +*hot_inode_item_lookup(struct hot_info *root, u64 ino)
>> +{
>> + struct rb_node **p = &root->hot_inode_tree.map.rb_node;
>> + struct rb_node *parent = NULL;
>> + struct hot_comm_item *ci;
>> + struct hot_inode_item *entry;
>> +
>> + /* walk tree to find insertion point */
>> + spin_lock(&root->lock);
>> + while (*p) {
>> + parent = *p;
>> + ci = rb_entry(parent, struct hot_comm_item, rb_node);
>> + entry = container_of(ci, struct hot_inode_item, hot_inode);
>> + if (ino < entry->i_ino)
>> + p = &(*p)->rb_left;
>> + else if (ino > entry->i_ino)
>> + p = &(*p)->rb_right;
>
> style comment: put { } around all the if/else blocks,
No, that would violate checkpatch.pl. If an if/else block contains only
one line of code, we should not put {} around it.
>
>> + else {
>> + spin_unlock(&root->lock);
>> + kref_get(&entry->hot_inode.refs);
>
> jumping forwards in the series, the spin_unlock and kref_get get swapped
> later, and I think that's the right order. Otherwise there's a small
> window where the entry does not get the reference and could be
> potentially freed by racing kref_put, no?
yes, good catch, thanks, done
>
> <lookup entry E>
> spin_unlock(tree)
>                         spin_lock(tree)
>                         <lookup entry E>
>                         kref_put(E) or via hot_inode_item_put(E)   (1)
> kref_get(E)                                                        (2)
>
>
> If the reference count at (1) was 1, the entry is freed and (2) touches
> freed memory. hot_inode_item_put can be called from the filesystem or via
> the seq print of the respective /proc files, so I think there are chances
> to hit the problem.
Great.
>
>> + return entry;
>> + }
>> + }
>> + spin_unlock(&root->lock);
>> +
>> + entry = kmem_cache_zalloc(hot_inode_item_cachep, GFP_NOFS);
>> + if (!entry)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + spin_lock(&root->lock);
>> + hot_inode_item_init(entry, ino, &root->hot_inode_tree);
>> + rb_link_node(&entry->hot_inode.rb_node, parent, p);
>> + rb_insert_color(&entry->hot_inode.rb_node,
>> + &root->hot_inode_tree.map);
>> + spin_unlock(&root->lock);
>> +
>> + kref_get(&entry->hot_inode.refs);
>
> Similar here, the entry is inserted into the tree but there's no
> refcount yet. And the order of spin_unlock/kref_get remains unchanged.
ditto
>
>> + return entry;
>> +}
>> +EXPORT_SYMBOL_GPL(hot_inode_item_lookup);
>> +
>> +static struct hot_range_item
>> +*hot_range_item_lookup(struct hot_inode_item *he,
>> + loff_t start)
>> +{
>> + struct rb_node **p = &he->hot_range_tree.map.rb_node;
>> + struct rb_node *parent = NULL;
>> + struct hot_comm_item *ci;
>> + struct hot_range_item *entry;
>> +
>> + /* walk tree to find insertion point */
>> + spin_lock(&he->lock);
>> + while (*p) {
>> + parent = *p;
>> + ci = rb_entry(parent, struct hot_comm_item, rb_node);
>> + entry = container_of(ci, struct hot_range_item, hot_range);
>> + if (start < entry->start)
>> + p = &(*p)->rb_left;
>> + else if (start > hot_range_end(entry))
>> + p = &(*p)->rb_right;
>
> if { ...}
> else if { ... }
We should not put {} around them, as I explained above.
>
>> + else {
>> + spin_unlock(&he->lock);
>> + kref_get(&entry->hot_range.refs);
>
> same here
Done
>
>> + return entry;
>> + }
>> + }
>> + spin_unlock(&he->lock);
>> +
>> + entry = kmem_cache_zalloc(hot_range_item_cachep, GFP_NOFS);
>> + if (!entry)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + spin_lock(&he->lock);
>> + hot_range_item_init(entry, start, he);
>> + rb_link_node(&entry->hot_range.rb_node, parent, p);
>> + rb_insert_color(&entry->hot_range.rb_node,
>> + &he->hot_range_tree.map);
>> + spin_unlock(&he->lock);
>> +
>> + kref_get(&entry->hot_range.refs);
>
> and here
Done
>
>> + return entry;
>> +}
>> +
>> +/*
>> + * This function does the actual work of updating
>> + * the frequency numbers, whatever they turn out to be.
>
> Can this function be described a bit better? This comment did not help.
OK, i will
>
>> + */
>> +static void hot_rw_freq_calc(struct timespec old_atime,
>> + struct timespec cur_time, u64 *avg)
>> +{
>> + struct timespec delta_ts;
>> + u64 new_delta;
>> +
>> + delta_ts = timespec_sub(cur_time, old_atime);
>> + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER;
>> +
>> + *avg = (*avg << FREQ_POWER) - *avg + new_delta;
>> + *avg = *avg >> FREQ_POWER;
>> +}
>> +
>> +static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
>> +{
>> + struct timespec cur_time = current_kernel_time();
>> +
>> + if (write) {
>> + freq_data->nr_writes += 1;
>
> The preferred style is
>
> freq_data->nr_writes++
OK, done.
>
>> + hot_rw_freq_calc(freq_data->last_write_time,
>> + cur_time,
>> + &freq_data->avg_delta_writes);
>> + freq_data->last_write_time = cur_time;
>> + } else {
>> + freq_data->nr_reads += 1;
>
> (...)
>
>> + hot_rw_freq_calc(freq_data->last_read_time,
>> + cur_time,
>> + &freq_data->avg_delta_reads);
>> + freq_data->last_read_time = cur_time;
>> + }
>> +}
>> +
>
> david
--
Regards,
Zhi Yong Wu
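The ordering issue discussed in this message (take the reference before dropping the lock) can be shown with a small userspace sketch. It uses a pthread mutex and a plain counter in place of the kernel spinlock and kref, and it assumes that whoever drops the last reference takes the same lock before freeing the item. All sketch_* names are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical miniature of a refcounted item in a locked container. */
struct sketch_item {
	unsigned long key;
	int refs;
};

struct sketch_table {
	pthread_mutex_t lock;
	struct sketch_item *items;
	size_t nr_items;
};

/*
 * Look up an item and return it with an extra reference. The
 * reference is taken before the lock is dropped, so a concurrent put
 * that takes the same lock cannot free the item in between.
 */
static struct sketch_item *sketch_lookup_get(struct sketch_table *t,
					     unsigned long key)
{
	struct sketch_item *found = NULL;
	size_t i;

	pthread_mutex_lock(&t->lock);
	for (i = 0; i < t->nr_items; i++) {
		if (t->items[i].key == key) {
			found = &t->items[i];
			found->refs++;	/* get first ... */
			break;
		}
	}
	pthread_mutex_unlock(&t->lock);	/* ... then unlock */

	return found;
}
```

The same reasoning applies to the insertion path: a newly linked entry should already hold the caller's reference when the lock is released.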
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking
2013-01-10 0:52 ` David Sterba
@ 2013-01-11 7:47 ` Zhi Yong Wu
0 siblings, 0 replies; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-11 7:47 UTC (permalink / raw)
To: dsterba
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Jan 10, 2013 at 8:52 AM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Dec 20, 2012 at 10:43:24PM +0800, zwu.kernel@gmail.com wrote:
>> --- a/fs/direct-io.c
>> +++ b/fs/direct-io.c
>> @@ -37,6 +37,7 @@
>> #include <linux/uio.h>
>> #include <linux/atomic.h>
>> #include <linux/prefetch.h>
>> +#include "hot_tracking.h"
>>
>> /*
>> * How many user pages to map in one call to get_user_pages(). This determines
>> @@ -1299,6 +1300,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>> prefetch(bdev->bd_queue);
>> prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
>>
>> + /* Hot data tracking */
>> + hot_update_freqs(inode, offset,
>> + iov_length(iov, nr_segs),
>> + rw & WRITE);
>
> hot_update_freqs takes an 'int rw' directly, so you should pass plain
> 'rw' here and do the 'rw & WRITE' check in hot_freq_data_update itself.
OK, done.
>
>> +
>> return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
>> nr_segs, get_block, end_io,
>> submit_io, flags);
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -35,6 +35,7 @@
>> #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
>> #include <linux/pagevec.h>
>> #include <linux/timer.h>
>> +#include <linux/hot_tracking.h>
>> #include <trace/events/writeback.h>
>>
>> /*
>> @@ -1902,13 +1903,24 @@ EXPORT_SYMBOL(generic_writepages);
>> int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
>> {
>> int ret;
>> + loff_t start = 0;
>> + size_t count = 0;
>>
>> if (wbc->nr_to_write <= 0)
>> return 0;
>> +
>> + start = mapping->writeback_index << PAGE_CACHE_SHIFT;
>> + count = wbc->nr_to_write;
>> +
>> if (mapping->a_ops->writepages)
>> ret = mapping->a_ops->writepages(mapping, wbc);
>> else
>> ret = generic_writepages(mapping, wbc);
>> +
>> + /* Hot data tracking */
>> + hot_update_freqs(mapping->host, start,
>> + (count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1);
>
> I think the frequencies should not be updated in case of error returned
> from writepages.
OK, Done.
>
>> +
>> return ret;
>> }
>>
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -138,6 +139,12 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>> out:
>> blk_finish_plug(&plug);
>>
>> + /* Hot data tracking */
>> + hot_update_freqs(mapping->host,
>> + (loff_t)(list_entry(pages->prev, struct page, lru)->index)
>> + << PAGE_CACHE_SHIFT,
>> + (size_t)nr_pages * PAGE_CACHE_SIZE, 0);
>
> same comment here
Ditto. thanks.
>
>> +
>> return ret;
>> }
>
>
> david
--
Regards,
Zhi Yong Wu
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 06/16] vfs: add temp calculation function
2013-01-10 0:53 ` David Sterba
@ 2013-01-11 8:08 ` Zhi Yong Wu
0 siblings, 0 replies; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-11 8:08 UTC (permalink / raw)
To: dsterba
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Thu, Jan 10, 2013 at 8:53 AM, David Sterba <dsterba@suse.cz> wrote:
> On Thu, Dec 20, 2012 at 10:43:25PM +0800, zwu.kernel@gmail.com wrote:
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -25,6 +25,14 @@
>> static struct kmem_cache *hot_inode_item_cachep __read_mostly;
>> static struct kmem_cache *hot_range_item_cachep __read_mostly;
>>
>> +static u64 hot_raw_shift(u64 counter, u32 bits, bool dir)
>> +{
>> + if (dir)
>> + return counter << bits;
>> + else
>> + return counter >> bits;
>> +}
>
> I don't understand the purpose of this function, it obscures a simple
> bitwise shift.
Do your comments below mean that you would prefer removing this
shift function?
>
>> +
>> /*
>> * Initialize the inode tree. Should be called for each new inode
>> * access or other user of the hot_inode interface.
>> @@ -315,6 +323,72 @@ static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write)
>> }
>>
>> /*
>> + * hot_temp_calc() is responsible for distilling the six heat
>> + * criteria down into a single temperature value for the data,
>> + * which is an integer between 0 and HEAT_MAX_VALUE.
>
> I didn't find HEAT_MAX_VALUE defined anywhere.
This macro is only used in some comments. OK, I will replace it with 255.
>
>> + */
>> +static u32 hot_temp_calc(struct hot_freq_data *freq_data)
>> +{
>> + u32 result = 0;
>> +
>> + struct timespec ckt = current_kernel_time();
>> + u64 cur_time = timespec_to_ns(&ckt);
>> +
>> + u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads,
>> + NRR_MULTIPLIER_POWER, true);
>> + u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes,
>> + NRW_MULTIPLIER_POWER, true);
>
> So many typecasts, some of them unnecessary and in connection with
> hot_raw_shift this is hard to read and understand.
>
> u32 nrr_heat = (u32)((u64)freq_data->nr_reads << NRR_MULTIPLIER_POWER);
Do you prefer this format instead?
>
> is not much better without a comment why this is doing the right thing.
>
>> +
>> + u64 ltr_heat =
>> + hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)),
>> + LTR_DIVIDER_POWER, false);
>> + u64 ltw_heat =
>> + hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)),
>> + LTW_DIVIDER_POWER, false);
>> +
>> + u64 avr_heat =
>> + hot_raw_shift((((u64) -1) - freq_data->avg_delta_reads),
>> + AVR_DIVIDER_POWER, false);
>> + u64 avw_heat =
>> + hot_raw_shift((((u64) -1) - freq_data->avg_delta_writes),
>> + AVW_DIVIDER_POWER, false);
>> +
>> + /* ltr_heat is now guaranteed to be u32 safe */
>> + if (ltr_heat >= hot_raw_shift((u64) 1, 32, true))
>> + ltr_heat = 0;
>> + else
>> + ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat;
>> +
>> + /* ltw_heat is now guaranteed to be u32 safe */
>> + if (ltw_heat >= hot_raw_shift((u64) 1, 32, true))
>> + ltw_heat = 0;
>> + else
>> + ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat;
>> +
>> + /* avr_heat is now guaranteed to be u32 safe */
>> + if (avr_heat >= hot_raw_shift((u64) 1, 32, true))
>> + avr_heat = (u32) -1;
>> +
>> + /* avw_heat is now guaranteed to be u32 safe */
>> + if (avw_heat >= hot_raw_shift((u64) 1, 32, true))
>> + avw_heat = (u32) -1;
>> +
>> + nrr_heat = (u32)hot_raw_shift((u64)nrr_heat,
>> + (3 - NRR_COEFF_POWER), false);
>> + nrw_heat = (u32)hot_raw_shift((u64)nrw_heat,
>> + (3 - NRW_COEFF_POWER), false);
>> + ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false);
>> + ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false);
>> + avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false);
>> + avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false);
>> +
>> + result = nrr_heat + nrw_heat + (u32) ltr_heat +
>> + (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat;
>
> Reading through the function up to here, I got so lost in the shifts that
> I don't see the meaning of the resulting value or how I should interpret it
> if I watch it change over time. What are the expected weights of the
> number and time factors? There are more details in the documentation, but
> the big picture is blurred by implementation details.
>
> Let's keep the implementation details here and write better user
> documentation, with a few examples, in the docs. Is it possible to describe
> some common access patterns and how they affect the temperature?
>
> You've been benchmarking this patchset, I'm sure you can write up a few
> examples based on that.
>
>> +
>> + return result;
>> +}
>
> david
--
Regards,
Zhi Yong Wu
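On the question of the casts: the quoted expression widens the counter to 64 bits, shifts it, and then keeps only the low 32 bits, so a large counter silently wraps. The userspace sketch below contrasts that with a clamping variant. It assumes nr_reads is a 32-bit counter and uses 20 as a made-up value for NRR_MULTIPLIER_POWER. The sketch_* names are hypothetical, and the saturating version is an alternative shown for comparison, not what the patch does.

```c
#include <stdint.h>

#define SKETCH_MULTIPLIER_POWER 20	/* assumed; stands in for NRR_MULTIPLIER_POWER */

/* What the quoted cast does: keep only the low 32 bits of the product. */
static uint32_t sketch_count_heat_truncating(uint32_t nr)
{
	return (uint32_t)((uint64_t)nr << SKETCH_MULTIPLIER_POWER);
}

/* Alternative: clamp to the largest u32 instead of wrapping around. */
static uint32_t sketch_count_heat_saturating(uint32_t nr)
{
	uint64_t scaled = (uint64_t)nr << SKETCH_MULTIPLIER_POWER;

	return scaled > UINT32_MAX ? UINT32_MAX : (uint32_t)scaled;
}
```

With these assumed values a counter of 4096 scales to exactly 2^32, which truncates to 0 in the first variant and clamps to the maximum in the second. Whichever behaviour is intended, a short comment saying so would answer the review question.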
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 03/16] vfs: add I/O frequency update function
2013-01-11 7:38 ` Zhi Yong Wu
@ 2013-01-11 14:27 ` David Sterba
2013-01-11 14:54 ` Zhi Yong Wu
0 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2013-01-11 14:27 UTC (permalink / raw)
To: Zhi Yong Wu
Cc: dsterba, linux-fsdevel, linux-kernel, viro, david, darrick.wong,
andi, hch, linuxram, wuzhy
On Fri, Jan 11, 2013 at 03:38:31PM +0800, Zhi Yong Wu wrote:
> On Thu, Jan 10, 2013 at 8:51 AM, David Sterba <dsterba@suse.cz> wrote:
> > On Thu, Dec 20, 2012 at 10:43:22PM +0800, zwu.kernel@gmail.com wrote:
> >> --- a/fs/hot_tracking.c
> >> +++ b/fs/hot_tracking.c
> >> @@ -164,6 +164,135 @@ static void hot_inode_tree_exit(struct hot_info *root)
> >> spin_unlock(&root->lock);
> >> }
> >>
> >> +struct hot_inode_item
> >> +*hot_inode_item_lookup(struct hot_info *root, u64 ino)
> >> +{
> >> + struct rb_node **p = &root->hot_inode_tree.map.rb_node;
> >> + struct rb_node *parent = NULL;
> >> + struct hot_comm_item *ci;
> >> + struct hot_inode_item *entry;
> >> +
> >> + /* walk tree to find insertion point */
> >> + spin_lock(&root->lock);
> >> + while (*p) {
> >> + parent = *p;
> >> + ci = rb_entry(parent, struct hot_comm_item, rb_node);
> >> + entry = container_of(ci, struct hot_inode_item, hot_inode);
> >> + if (ino < entry->i_ino)
> >> + p = &(*p)->rb_left;
> >> + else if (ino > entry->i_ino)
> >> + p = &(*p)->rb_right;
> >
> > style comment: put { } around all the if/else blocks,
> No, that would violate checkpatch.pl. If an if/else block contains only
> one line of code, we should not put {} around it.
Unless it's in an if / else if / else sequence; see
Documentation/CodingStyle chapter 3.1. This is what I learned a long
time ago and have been using intuitively. I don't know to what extent
checkpatch sticks to that document. The code is meant to be read by
people, so if there is one style prevailing in the subsystem (and there
is, judging by random fs/*.c files) it's wise to keep the style
consistent.
david
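As I read Documentation/CodingStyle, braces are omitted where a single statement will do, except when only some branches of a conditional are single statements, in which case all branches get braces. The lookup loop under review has a multi-statement final else, so it falls under the exception. A minimal standalone sketch of that shape, with hypothetical names (sketch_walk, sketch_step):

```c
/* Hypothetical bookkeeping for one step of a tree walk. */
struct sketch_walk {
	int went_left;
	int went_right;
	int hits;
	int last_key;
};

/*
 * The last branch needs braces because it has two statements, so the
 * single-statement branches before it get braces as well.
 */
static void sketch_step(struct sketch_walk *w, int key, int node_key)
{
	if (key < node_key) {
		w->went_left++;
	} else if (key > node_key) {
		w->went_right++;
	} else {
		w->hits++;
		w->last_key = key;
	}
}
```

A chain in which every branch is a single statement can stay without braces under the same rule.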
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 03/16] vfs: add I/O frequency update function
2013-01-11 14:27 ` David Sterba
@ 2013-01-11 14:54 ` Zhi Yong Wu
0 siblings, 0 replies; 35+ messages in thread
From: Zhi Yong Wu @ 2013-01-11 14:54 UTC (permalink / raw)
To: dsterba
Cc: linux-fsdevel, linux-kernel, viro, david, darrick.wong, andi, hch,
linuxram, wuzhy
On Fri, Jan 11, 2013 at 10:27 PM, David Sterba <dsterba@suse.cz> wrote:
> On Fri, Jan 11, 2013 at 03:38:31PM +0800, Zhi Yong Wu wrote:
>> On Thu, Jan 10, 2013 at 8:51 AM, David Sterba <dsterba@suse.cz> wrote:
>> > On Thu, Dec 20, 2012 at 10:43:22PM +0800, zwu.kernel@gmail.com wrote:
>> >> --- a/fs/hot_tracking.c
>> >> +++ b/fs/hot_tracking.c
>> >> @@ -164,6 +164,135 @@ static void hot_inode_tree_exit(struct hot_info *root)
>> >> spin_unlock(&root->lock);
>> >> }
>> >>
>> >> +struct hot_inode_item
>> >> +*hot_inode_item_lookup(struct hot_info *root, u64 ino)
>> >> +{
>> >> + struct rb_node **p = &root->hot_inode_tree.map.rb_node;
>> >> + struct rb_node *parent = NULL;
>> >> + struct hot_comm_item *ci;
>> >> + struct hot_inode_item *entry;
>> >> +
>> >> + /* walk tree to find insertion point */
>> >> + spin_lock(&root->lock);
>> >> + while (*p) {
>> >> + parent = *p;
>> >> + ci = rb_entry(parent, struct hot_comm_item, rb_node);
>> >> + entry = container_of(ci, struct hot_inode_item, hot_inode);
>> >> + if (ino < entry->i_ino)
>> >> + p = &(*p)->rb_left;
>> >> + else if (ino > entry->i_ino)
>> >> + p = &(*p)->rb_right;
>> >
>> > style comment: put { } around all the if/else blocks,
>> No, that would violate checkpatch.pl. If an if/else block contains only
>> one line of code, we should not put {} around it.
>
> Unless it's in an if / else if / else sequence; see
> Documentation/CodingStyle chapter 3.1. This is what I learned a long
> time ago and have been using intuitively. I don't know to what extent
> checkpatch sticks to that document. The code is meant to be read by
> people, so if there is one style prevailing in the subsystem (and there
> is, judging by random fs/*.c files) it's wise to keep the style
> consistent.
I checked the *.c files in fs/; their coding style for if/else if is
consistent with what I said. :)
e.g. in direct-io.c:
if (sdio->final_block_in_bio != sdio->cur_page_block ||
cur_offset != bio_next_offset)
dio_bio_submit(dio, sdio);
/*
* Submit now if the underlying fs is about to perform a
* metadata read
*/
else if (sdio->boundary)
dio_bio_submit(dio, sdio);
>
> david
--
Regards,
Zhi Yong Wu
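Earlier in this subthread a better description of hot_rw_freq_calc() was requested. The userspace sketch below reproduces the arithmetic of the quoted function so its effect can be tried with concrete numbers. It assumes FREQ_POWER is 4, which is a made-up value since the real one is not shown in this thread, and sketch_avg_update() is a hypothetical name.

```c
#include <stdint.h>

#define SKETCH_FREQ_POWER 4	/* assumed value, for illustration only */

/*
 * Same arithmetic as the quoted hot_rw_freq_calc(): the old average
 * keeps a weight of (2^P - 1) / 2^P, and the new delta, already
 * scaled down by 2^P, is added before the final division by 2^P.
 */
static uint64_t sketch_avg_update(uint64_t avg, uint64_t delta_ns)
{
	uint64_t new_delta = delta_ns >> SKETCH_FREQ_POWER;

	avg = (avg << SKETCH_FREQ_POWER) - avg + new_delta;
	return avg >> SKETCH_FREQ_POWER;
}
```

With these numbers, an average of 1600 and a new delta of 256 ns give 1501: the old average decays by 1/16 to 1500 and the delta contributes 256 / 256 = 1. The function therefore behaves like a slowly decaying moving average of the time between accesses.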
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
` (18 preceding siblings ...)
2013-01-08 7:52 ` Zhi Yong Wu
@ 2013-02-22 0:32 ` Zhi Yong Wu
2013-02-25 11:40 ` David Sterba
19 siblings, 1 reply; 35+ messages in thread
From: Zhi Yong Wu @ 2013-02-22 0:32 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, viro, david, dave, darrick.wong, andi, hch,
linuxram, zwu.kernel, wuzhy
ping? any comments?
On Thu, Dec 20, 2012 at 10:43 PM, <zwu.kernel@gmail.com> wrote:
> From: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>
> Hi guys,
>
> This patchset has been through scalability and performance tests
> with fs_mark, ffsb and compilebench.
> I have done the perf testing on Linux 3.7.0-rc8+ with Intel(R) Core(TM)
> i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G ram and 260G disk.
>
> Any comments or ideas are appreciated, thanks.
>
> NOTE:
>
> The patchset can be obtained via my kernel dev git on github:
> git://github.com/wuzhy/kernel.git hot_tracking
> If you're interested, you can also review them via
> https://github.com/wuzhy/kernel/commits/hot_tracking
>
> For more info, please check hot_tracking.txt in Documentation
>
> Below is the perf testing report:
>
> 1. fs_mark test
>
> w/o: without hot tracking
> w/ : with hot tracking
>
> Count  Size  FSUse% (w/o, w/)  Files/sec (w/o, w/)  App Overhead (w/o, w/)
>
>
> 800000 1 2 3 13756.4 32144.9 5350627 5436291
> 1600000 1 4 5 1163.4 1799.3 20848119 21708216
> 2400000 1 6 6 1360.8 1252.5 6798705 8715322
> 3200000 1 8 8 1600.1 1196.3 5751129 6013792
> 4000000 1 9 9 1071.4 1191.2 17204725 26786369
> 4800000 1 10 10 1483.5 1447.9 19555541 8383046
> 5600000 1 11 11 1457.9 1699.5 5783588 10074681
> 6400000 1 12 13 1658.8 1628.5 6992697 6185551
> 7200000 1 14 14 1662.4 1857.1 5796793 13772592
> 8000000 1 15 15 2930.0 2653.8 12431682 6152573
> 8800000 1 16 17 1630.8 1665.0 7666719 13682765
> 9600000 1 18 18 1530.3 1583.9 5823644 10171644
> 10400000 1 19 19 1437.9 1798.6 20935224 6048083
> 11200000 1 20 20 1529.0 1550.6 6647450 6003151
> 12000000 1 21 22 1558.6 1501.8 12539509 18144939
> 12800000 1 23 23 1644.2 1432.1 7074419 28101975
> 13600000 1 24 24 1753.6 1650.2 7164297 20888972
> 14400000 1 25 25 2750.0 1483.9 12756692 7441225
> 15200000 1 27 27 1551.1 1514.3 5741066 8250443
> 16000000 1 28 28 1610.8 1635.9 72193860 8545285
> 16800000 1 29 29 1646.7 1907.7 8945856 11703513
> 17600000 1 30 31 1496.6 2722.3 5858961 8989393
> 18400000 1 32 32 1457.7 1565.7 10914475 26504660
> 19200000 1 33 33 1437.6 1518.7 6708975 213303618
> 20000000 1 34 34 1825.4 1521.1 5722086 12490907
> 20800000 1 36 35 1718.4 1611.5 5873290 17942534
> 21600000 1 37 37 2152.6 1536.9 113050627 8717940
> 22400000 1 38 38 2443.7 1788.2 7398122 19834765
> 23200000 1 39 39 1518.5 1587.6 5770959 10134882
> 24000000 1 41 41 1536.8 2164.0 5751248 7214626
> 24800000 1 42 42 1576.6 2939.4 7390314 6070271
> 25600000 1 43 43 1707.4 1535.9 11075939 6052896
> 26400000 1 44 44 1522.5 1563.1 10142987 22549898
> 27200000 1 46 46 1827.4 1608.5 11613016 24828125
> 28000000 1 47 47 3420.5 1741.9 8059985 16599156
> 28800000 1 48 48 1815.5 1944.4 7847931 9043277
> 29600000 1 50 49 1650.0 1596.6 5636323 7929164
> 30400000 1 51 51 1683.7 1573.3 5766323 19369146
> 31200000 1 52 52 1610.1 1669.8 9256111 9899107
> 32000000 1 53 53 1645.2 3081.0 7855010 6057257
> 32800000 1 54 55 1835.3 3122.0 6899141 6143875
> 33600000 1 56 56 1916.8 1734.8 10271967 6049509
> 34400000 1 57 57 3119.2 1630.8 11503274 13975417
> 35200000 1 58 58 1629.2 1695.7 6827225 6214248
> 36000000 1 60 60 1636.5 1695.4 38077664 16211067
> 36800000 1 61 61 1665.2 2069.1 19948817 9358494
> 37600000 1 62 62 1734.5 1931.5 26487196 8954836
> 38400000 1 63 63 1625.8 1654.0 6649289 9131844
> 39200000 1 65 65 1778.4 1663.3 11653376 7144960
> 40000000 1 66 66 1851.0 1935.6 8164470 11288753
> 40800000 1 67 67 3171.0 3431.6 12358380 6072820
> 41600000 1 69 69 1714.3 1954.3 13765035 9364495
> 42400000 1 70 70 1591.0 1681.8 18733304 7407689
> 43200000 1 71 71 1537.2 1642.8 19534908 6163018
> 44000000 1 72 72 1630.3 1641.2 23479883 10967509
> 44800000 1 74 74 1877.5 1651.9 8174965 9484587
> 45600000 1 75 75 3322.4 1653.6 14740938 7497831
> 46400000 1 76 76 1706.9 1840.6 10348550 23296562
> 47200000 1 77 78 1837.7 2515.3 13917543 14683192
> 48000000 1 79 79 1642.6 2368.6 14365759 6080942
> 48800000 1 80 80 1827.1 1655.2 9234312 7412406
> 49600000 1 81 81 1631.0 1858.7 7543970 18610881
> 50400000 1 82 82 1560.5 1865.0 21374219 6598771
>
>
> From the above table, when the same number of files of the same size is
> created, filesystem usage (FSUse%) is basically the same with and without
> hot tracking.
>
> 2. FFSB test
> Metric w/o hot tracking (v1) w/ hot tracking (v2) ratio ((v2-v1)/v1)
> large_file_create
> 1 thread
> - Trans/sec 28918.75 29014.48 +0.33%
> - Throughput 113MB/sec 113MB/sec +0.0%
> - %CPU 4.8% 5.1% +6.3%
> - Trans/%CPU 602473.96 568911.37 -5.6%
> 8 threads
> - Trans/sec 28480.37 28541.25 +0.2%
> - Throughput 111MB/sec 111MB/sec +0.0%
> - %CPU 5.6% 5.9% +5.4%
> - Trans/%CPU 508578.04 483750 -4.9%
> 32 threads
> - Trans/sec 25011.86 26992.32 +7.9%
> - Throughput 97.7MB/sec 105MB/sec +7.5%
> - %CPU 6.2% 7.1% +14.8%
> - Trans/%CPU 403417.10 380173.52 -5.8%
>
> large_file_seq_read
> 1 thread
> - Trans/sec 35303.23 34838.02 -1.3%
> - Throughput 138MB/sec 136MB/sec -1.4%
> - %CPU 5.4% 5.4% +0.0%
> - Trans/%CPU 653763.52 645148.52 -1.3%
> 8 threads
> - Trans/sec 11902.82 11205.22 -5.9%
> - Throughput 46.5MB/sec 43.8MB/sec -5.8%
> - %CPU 2.1% 2.0% -4.8%
> - Trans/%CPU 566800.95 560261 -1.2%
> 32 threads
> - Trans/sec 5068.48 5316.36 +4.9%
> - Throughput 19.8MB/sec 20.8MB/sec +5.1%
> - %CPU 0.9% 1.0% +11.1%
> - Trans/%CPU 563164.45 531636 -5.6%
>
> random_write
> 1 thread
> - Trans/sec 729.01 738.89 +1.4%
> - Throughput 99.7MB/sec 101MB/sec +1.3%
> - %CPU 0.1% 0.1% +0.0%
> - Trans/%CPU 72901 73889 +1.4%
> 8 threads
> - Trans/sec 714.56 714.57 +0.0%
> - Throughput 97.7MB/sec 97.7MB/sec +0.0%
> - %CPU 0.2% 0.2% +0.0%
> - Trans/%CPU 35728 35728.5 +0.0%
> 32 threads
> - Trans/sec 698.62 692.59 -0.9%
> - Throughput 95.5MB/sec 94.7MB/sec -0.8%
> - %CPU 0.2% 0.2% +0.0%
> - Trans/%CPU 34931 34629.5 -0.9%
>
> random_read
> 1 thread
> - Trans/sec 225.49 227.03 +0.7%
> - Throughput 902KB/sec 908KB/sec +0.7%
> - %CPU 1.1% 1.1% +0.0%
> - Trans/%CPU 20499.10 20639.10 +0.7%
> 8 threads
> - Trans/sec 106.72 105.76 -0.9%
> - Throughput 427KB/sec 423KB/sec -0.9%
> - %CPU 0.5% 0.5% +0.0%
> - Trans/%CPU 2134.4 2115.2 -0.9%
> 32 threads
> - Trans/sec 107.44 108.26 +0.8%
> - Throughput 430KB/sec 433KB/sec +0.7%
> - %CPU 0.5% 0.5% +0.0%
> - Trans/%CPU 2148.8 2165.2 +0.8%
>
> mail_server
> 1 thread
> - Trans/sec 681.67 732.66 +7.5%
> - Throughput [read] 1.77MB/sec 1.99MB/sec +12.4%
> - Throughput [write] 858KB/sec 887KB/sec +3.4%
> - %CPU 0.6% 0.6% +0.0%
> - Trans/%CPU 11361.17 12211 +7.5%
> 8 threads
> - Trans/sec 630.48 597.08 -5.3%
> - Throughput [read] 1.64MB/sec 1.54MB/sec -6.1%
> - Throughput [write] 814KB/sec 784KB/sec -3.7%
> - %CPU 0.6% 0.5% -16.7%
> - Trans/%CPU 10508 11941.6 +13.6%
> 32 threads
> - Trans/sec 598.68 566.05 -5.5%
> - Throughput [read] 1.53MB/sec 1.5MB/sec -2.0%
> - Throughput [write] 804KB/sec 705KB/sec -12.3%
> - %CPU 0.7% 0.6% -14.2%
> - Trans/%CPU 8552.57 9434.17 +10.3%
>
> 3. Compilebench test
>
> Test w/o hot tracking (v1) w/ hot tracking (v2) ratio ((v2-v1)/v1)
> initial create 114.81 MB/s 118.32 MB/s +3.1%
> create 11.98 MB/s 12.26 MB/s +2.3%
> patch 3.61 MB/s 3.66 MB/s +1.4%
> compile 46.40 MB/s 48.07 MB/s +3.6%
> clean 126.33 MB/s 128.75 MB/s +1.9%
> read tree 9.93 MB/s 9.71 MB/s -2.2%
> read compiled tree 17.19 MB/s 17.52 MB/s +1.9%
> delete tree 12.23 seconds 11.13 seconds -9.0%
> delete compiled tree 12.98 seconds 16.05 seconds +26.7%
> stat tree 7.03 seconds 5.51 seconds -21.6%
> stat compiled tree 12.19 seconds 9.06 seconds -25.7%
>
> Changelog:
>
> - Solved the 64-bit inode number issue. [David Sterba]
> - Embed struct hot_type in struct file_system_type [Darrick J. Wong]
> - Clean up some issues [David Sterba]
> - Use a static hot debugfs root [Greg KH]
> - Rewrote debugfs support based on seq_file operations. [Dave Chinner]
> - Refactored workqueue support. [Dave Chinner]
> - Turn some macros into tunables: TIME_TO_KICK and HEAT_UPDATE_DELAY
> [Zhiyong, Zheng Liu]
> - Introduce hot func registering framework [Zhiyong]
> - Remove global variable for hot tracking [Zhiyong]
> - Add xfs hot tracking support [Dave Chinner]
> - Add ext4 hot tracking support [Zheng Liu]
> - Cleaned up a lot of other issues [Dave Chinner]
> - Added memory shrinker [Dave Chinner]
> - Converted to one workqueue to update map info periodically [Dave Chinner]
> - Cleaned up a lot of other issues [Dave Chinner]
> - Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner]
> - Add btrfs hot tracking support [Zhiyong]
> - The first three patches can probably just be flattened into one.
> [Marco Stornelli, Dave Chinner]
>
> Zhi Yong Wu (16):
> vfs: introduce some data structures
> vfs: add init and cleanup functions
> vfs: add I/O frequency update function
> vfs: add two map arrays
> vfs: add hooks to enable hot tracking
> vfs: add temp calculation function
> vfs: add map info update function
> vfs: add aging function
> vfs: add one work queue
> vfs: add FS hot type support
> vfs: register one shrinker
> vfs: add one ioctl interface
> vfs: add debugfs support
> proc: add two hot_track proc files
> btrfs: add hot tracking support
> vfs: add documentation
>
> Documentation/filesystems/00-INDEX | 2 +
> Documentation/filesystems/hot_tracking.txt | 255 ++++++
> fs/Makefile | 2 +-
> fs/btrfs/ctree.h | 1 +
> fs/btrfs/super.c | 22 +-
> fs/compat_ioctl.c | 5 +
> fs/dcache.c | 2 +
> fs/direct-io.c | 6 +
> fs/hot_tracking.c | 1345 ++++++++++++++++++++++++++++
> fs/hot_tracking.h | 52 ++
> fs/ioctl.c | 74 ++
> include/linux/fs.h | 5 +
> include/linux/hot_tracking.h | 152 ++++
> kernel/sysctl.c | 14 +
> mm/filemap.c | 6 +
> mm/page-writeback.c | 12 +
> mm/readahead.c | 7 +
> 17 files changed, 1960 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/filesystems/hot_tracking.txt
> create mode 100644 fs/hot_tracking.c
> create mode 100644 fs/hot_tracking.h
> create mode 100644 include/linux/hot_tracking.h
>
> --
> 1.7.6.5
>
--
Regards,
Zhi Yong Wu
* Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
2013-02-22 0:32 ` Zhi Yong Wu
@ 2013-02-25 11:40 ` David Sterba
0 siblings, 0 replies; 35+ messages in thread
From: David Sterba @ 2013-02-25 11:40 UTC (permalink / raw)
To: Zhi Yong Wu; +Cc: linux-fsdevel, linux-kernel, linuxram, wuzhy
On Fri, Feb 22, 2013 at 08:32:10AM +0800, Zhi Yong Wu wrote:
> ping? any comments?
Sorry, I don't have as much time as I'd like to continue with this
patchset.
Thread overview: 35+ messages
2012-12-20 14:43 [PATCH RESEND v1 00/16] vfs: hot data tracking zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 01/16] vfs: introduce some data structures zwu.kernel
2013-01-10 0:48 ` David Sterba
2013-01-10 6:24 ` Zhi Yong Wu
2012-12-20 14:43 ` [PATCH RESEND v1 02/16] vfs: add init and cleanup functions zwu.kernel
2013-01-10 0:48 ` David Sterba
2013-01-11 7:21 ` Zhi Yong Wu
2012-12-20 14:43 ` [PATCH RESEND v1 03/16] vfs: add I/O frequency update function zwu.kernel
2013-01-10 0:51 ` David Sterba
2013-01-11 7:38 ` Zhi Yong Wu
2013-01-11 14:27 ` David Sterba
2013-01-11 14:54 ` Zhi Yong Wu
2012-12-20 14:43 ` [PATCH RESEND v1 04/16] vfs: add two map arrays zwu.kernel
2013-01-10 0:51 ` David Sterba
2012-12-20 14:43 ` [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking zwu.kernel
2013-01-10 0:52 ` David Sterba
2013-01-11 7:47 ` Zhi Yong Wu
2012-12-20 14:43 ` [PATCH RESEND v1 06/16] vfs: add temp calculation function zwu.kernel
2013-01-10 0:53 ` David Sterba
2013-01-11 8:08 ` Zhi Yong Wu
2012-12-20 14:43 ` [PATCH RESEND v1 07/16] vfs: add map info update function zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 08/16] vfs: add aging function zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 09/16] vfs: add one work queue zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 10/16] vfs: add FS hot type support zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 11/16] vfs: register one shrinker zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 12/16] vfs: add one ioctl interface zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 13/16] vfs: add debugfs support zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 14/16] proc: add two hot_track proc files zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 15/16] btrfs: add hot tracking support zwu.kernel
2012-12-20 14:43 ` [PATCH RESEND v1 16/16] vfs: add documentation zwu.kernel
2012-12-20 14:55 ` [PATCH RESEND v1 00/16] vfs: hot data tracking Zhi Yong Wu
2013-01-07 13:49 ` Zhi Yong Wu
2013-01-08 7:52 ` Zhi Yong Wu
2013-02-22 0:32 ` Zhi Yong Wu
2013-02-25 11:40 ` David Sterba