* [for-3.19 PATCH 00/17] dm thin: performance improvements
@ 2014-10-17 6:06 Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
` (17 more replies)
0 siblings, 18 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Here is a patchset that Joe and I have been working on for the past 2
weeks to address various performance problems reported against DM thin
provisioning. DM thinp is now much more capable in the face of heavy
IO (be it random or sequential, multithreaded, etc).
I've just added these patches to linux-dm.git's 'for-next' branch so
that these changes get early exposure/testing for 3.19 inclusion.
Of all the changes the "dm bufio: evict buffers that are past the max
age but retain some buffers" is in the most need of review relative to
upstream code (given it is so different than the RHEL code we also
support -- due to shrinker differences).
Any review/testing is always appreciated, thanks.
Mike
BTW, Mikulas: I had a quick look at your dm-bufio
'max_kept_cache_size_bytes' patch but I need to review it closer (it
seemed a bit too elaborate but I'll revisit). I also still need to
review/stage your dm-bufio-ratio.patch
Joe Thornber (13):
dm bufio: switch from a huge hash table to an rbtree
dm bufio: evict buffers that are past the max age but retain some buffers
dm bio prison: switch to using a red black tree
dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO
dm transaction manager: add support for prefetching blocks of metadata
dm thin: prefetch missing metadata pages
dm thin: throttle incoming IO
dm thin: grab a virtual cell before looking up the mapping
dm thin: performance improvement to discard processing
dm thin: defer whole cells rather than individual bios
dm thin: remap the bios in a cell immediately
dm thin: direct dispatch when breaking sharing
dm thin: sort the deferred cells
Mike Snitzer (4):
dm thin: adjust max_sectors_kb based on thinp blocksize
dm: improve documentation and code clarity in dm_merge_bvec
dm thin: implement thin_merge
dm thin: factor out remap_and_issue_overwrite
drivers/md/dm-bio-prison.c | 186 ++++---
drivers/md/dm-bio-prison.h | 16 +-
drivers/md/dm-bufio.c | 206 +++++---
drivers/md/dm-cache-target.c | 3 +-
drivers/md/dm-thin-metadata.c | 35 +-
drivers/md/dm-thin-metadata.h | 9 +-
drivers/md/dm-thin.c | 574 +++++++++++++++++----
drivers/md/dm.c | 12 +-
.../md/persistent-data/dm-transaction-manager.c | 77 ++-
.../md/persistent-data/dm-transaction-manager.h | 7 +
10 files changed, 804 insertions(+), 321 deletions(-)
--
1.9.3
^ permalink raw reply [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 01/17] dm bufio: switch from a huge hash table to an rbtree
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
` (16 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Converting over to using an rbtree eliminates a fixed 8MB allocation
from vmalloc space for the hash table.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bufio.c | 97 ++++++++++++++++++++++++++++-----------------------
1 file changed, 54 insertions(+), 43 deletions(-)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 0be200b..dcaa1d9 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -14,6 +14,7 @@
#include <linux/vmalloc.h>
#include <linux/shrinker.h>
#include <linux/module.h>
+#include <linux/rbtree.h>
#define DM_MSG_PREFIX "bufio"
@@ -48,14 +49,6 @@
#define DM_BUFIO_INLINE_VECS 16
/*
- * Buffer hash
- */
-#define DM_BUFIO_HASH_BITS 20
-#define DM_BUFIO_HASH(block) \
- ((((block) >> DM_BUFIO_HASH_BITS) ^ (block)) & \
- ((1 << DM_BUFIO_HASH_BITS) - 1))
-
-/*
* Don't try to use kmem_cache_alloc for blocks larger than this.
* For explanation, see alloc_buffer_data below.
*/
@@ -106,7 +99,7 @@ struct dm_bufio_client {
unsigned minimum_buffers;
- struct hlist_head *cache_hash;
+ struct rb_root buffer_tree;
wait_queue_head_t free_buffer_wait;
int async_write_error;
@@ -135,7 +128,7 @@ enum data_mode {
};
struct dm_buffer {
- struct hlist_node hash_list;
+ struct rb_node node;
struct list_head lru_list;
sector_t block;
void *data;
@@ -253,6 +246,53 @@ static LIST_HEAD(dm_bufio_all_clients);
*/
static DEFINE_MUTEX(dm_bufio_clients_lock);
+/*----------------------------------------------------------------
+ * A red/black tree acts as an index for all the buffers.
+ *--------------------------------------------------------------*/
+static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
+{
+ struct rb_node *n = c->buffer_tree.rb_node;
+ struct dm_buffer *b;
+
+ while (n) {
+ b = container_of(n, struct dm_buffer, node);
+
+ if (b->block == block)
+ return b;
+
+ n = (b->block < block) ? n->rb_left : n->rb_right;
+ }
+
+ return NULL;
+}
+
+static void __insert(struct dm_bufio_client *c, struct dm_buffer *b)
+{
+ struct rb_node **new = &c->buffer_tree.rb_node, *parent = NULL;
+ struct dm_buffer *found;
+
+ while (*new) {
+ found = container_of(*new, struct dm_buffer, node);
+
+ if (found->block == b->block) {
+ BUG_ON(found != b);
+ return;
+ }
+
+ parent = *new;
+ new = (found->block < b->block) ?
+ &((*new)->rb_left) : &((*new)->rb_right);
+ }
+
+ rb_link_node(&b->node, parent, new);
+ rb_insert_color(&b->node, &c->buffer_tree);
+}
+
+static void __remove(struct dm_bufio_client *c, struct dm_buffer *b)
+{
+ rb_erase(&b->node, &c->buffer_tree);
+}
+
/*----------------------------------------------------------------*/
static void adjust_total_allocated(enum data_mode data_mode, long diff)
@@ -434,7 +474,7 @@ static void __link_buffer(struct dm_buffer *b, sector_t block, int dirty)
b->block = block;
b->list_mode = dirty;
list_add(&b->lru_list, &c->lru[dirty]);
- hlist_add_head(&b->hash_list, &c->cache_hash[DM_BUFIO_HASH(block)]);
+ __insert(b->c, b);
b->last_accessed = jiffies;
}
@@ -448,7 +488,7 @@ static void __unlink_buffer(struct dm_buffer *b)
BUG_ON(!c->n_buffers[b->list_mode]);
c->n_buffers[b->list_mode]--;
- hlist_del(&b->hash_list);
+ __remove(b->c, b);
list_del(&b->lru_list);
}
@@ -888,23 +928,6 @@ static void __check_watermark(struct dm_bufio_client *c,
__write_dirty_buffers_async(c, 1, write_list);
}
-/*
- * Find a buffer in the hash.
- */
-static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
-{
- struct dm_buffer *b;
-
- hlist_for_each_entry(b, &c->cache_hash[DM_BUFIO_HASH(block)],
- hash_list) {
- dm_bufio_cond_resched();
- if (b->block == block)
- return b;
- }
-
- return NULL;
-}
-
/*----------------------------------------------------------------
* Getting a buffer
*--------------------------------------------------------------*/
@@ -1534,11 +1557,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
r = -ENOMEM;
goto bad_client;
}
- c->cache_hash = vmalloc(sizeof(struct hlist_head) << DM_BUFIO_HASH_BITS);
- if (!c->cache_hash) {
- r = -ENOMEM;
- goto bad_hash;
- }
+ c->buffer_tree = RB_ROOT;
c->bdev = bdev;
c->block_size = block_size;
@@ -1557,9 +1576,6 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
c->n_buffers[i] = 0;
}
- for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
- INIT_HLIST_HEAD(&c->cache_hash[i]);
-
mutex_init(&c->lock);
INIT_LIST_HEAD(&c->reserved_buffers);
c->need_reserved_buffers = reserved_buffers;
@@ -1633,8 +1649,6 @@ bad_cache:
}
dm_io_client_destroy(c->dm_io);
bad_dm_io:
- vfree(c->cache_hash);
-bad_hash:
kfree(c);
bad_client:
return ERR_PTR(r);
@@ -1661,9 +1675,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
mutex_unlock(&dm_bufio_clients_lock);
- for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
- BUG_ON(!hlist_empty(&c->cache_hash[i]));
-
+ BUG_ON(!RB_EMPTY_ROOT(&c->buffer_tree));
BUG_ON(c->need_reserved_buffers);
while (!list_empty(&c->reserved_buffers)) {
@@ -1681,7 +1693,6 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
BUG_ON(c->n_buffers[i]);
dm_io_client_destroy(c->dm_io);
- vfree(c->cache_hash);
kfree(c);
}
EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 02/17] dm bufio: evict buffers that are past the max age but retain some buffers
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 03/17] dm bio prison: switch to using a red black tree Mike Snitzer
` (15 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
These changes help keep metadata backed by dm-bufio in-core longer which
fixes reports of metadata churn in the face of heavy random IO workloads.
Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
because associated buffers had been idle for some time is unfriendly.
Now, the user may now configure the number of bytes that bufio retains
using the 'retain_bytes' module parameter. The default is 256K.
Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
defaults were quite low so increase them (to 30 and 300 respectively).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bufio.c | 109 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 75 insertions(+), 34 deletions(-)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index dcaa1d9..99579c8 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -35,12 +35,17 @@
/*
* Check buffer ages in this interval (seconds)
*/
-#define DM_BUFIO_WORK_TIMER_SECS 10
+#define DM_BUFIO_WORK_TIMER_SECS 30
/*
* Free buffers when they are older than this (seconds)
*/
-#define DM_BUFIO_DEFAULT_AGE_SECS 60
+#define DM_BUFIO_DEFAULT_AGE_SECS 300
+
+/*
+ * The nr of bytes of cached data to keep around.
+ */
+#define DM_BUFIO_DEFAULT_RETAIN_BYTES (256 * 1024)
/*
* The number of bvec entries that are embedded directly in the buffer.
@@ -216,6 +221,7 @@ static DEFINE_SPINLOCK(param_spinlock);
* Buffers are freed after this timeout
*/
static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS;
+static unsigned dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES;
static unsigned long dm_bufio_peak_allocated;
static unsigned long dm_bufio_allocated_kmem_cache;
@@ -1457,45 +1463,52 @@ static void drop_buffers(struct dm_bufio_client *c)
}
/*
- * Test if the buffer is unused and too old, and commit it.
+ * We may not be able to evict this buffer if IO pending or the client
+ * is still using it. Caller is expected to know buffer is too old.
+ *
* And if GFP_NOFS is used, we must not do any I/O because we hold
* dm_bufio_clients_lock and we would risk deadlock if the I/O gets
* rerouted to different bufio client.
*/
-static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
- unsigned long max_jiffies)
+static bool __try_evict_buffer(struct dm_buffer *b, gfp_t gfp)
{
- if (jiffies - b->last_accessed < max_jiffies)
- return 0;
-
if (!(gfp & __GFP_FS)) {
if (test_bit(B_READING, &b->state) ||
test_bit(B_WRITING, &b->state) ||
test_bit(B_DIRTY, &b->state))
- return 0;
+ return false;
}
if (b->hold_count)
- return 0;
+ return false;
__make_buffer_clean(b);
__unlink_buffer(b);
__free_buffer_wake(b);
- return 1;
+ return true;
}
-static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
- gfp_t gfp_mask)
+static unsigned get_retain_buffers(struct dm_bufio_client *c)
+{
+ unsigned retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes);
+ return retain_bytes / c->block_size;
+}
+
+static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
+ gfp_t gfp_mask)
{
int l;
struct dm_buffer *b, *tmp;
- long freed = 0;
+ unsigned long freed = 0;
+ unsigned long count = nr_to_scan;
+ unsigned retain_target = get_retain_buffers(c);
for (l = 0; l < LIST_SIZE; l++) {
list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
- freed += __cleanup_old_buffer(b, gfp_mask, 0);
- if (!--nr_to_scan)
+ if (__try_evict_buffer(b, gfp_mask))
+ freed++;
+ if (!--nr_to_scan || ((count - freed) <= retain_target))
return freed;
dm_bufio_cond_resched();
}
@@ -1697,31 +1710,56 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
}
EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
-static void cleanup_old_buffers(void)
+static unsigned get_max_age_hz(void)
{
- unsigned long max_age = ACCESS_ONCE(dm_bufio_max_age);
- struct dm_bufio_client *c;
+ unsigned max_age = ACCESS_ONCE(dm_bufio_max_age);
- if (max_age > ULONG_MAX / HZ)
- max_age = ULONG_MAX / HZ;
+ if (max_age > UINT_MAX / HZ)
+ max_age = UINT_MAX / HZ;
- mutex_lock(&dm_bufio_clients_lock);
- list_for_each_entry(c, &dm_bufio_all_clients, client_list) {
- if (!dm_bufio_trylock(c))
- continue;
+ return max_age * HZ;
+}
- while (!list_empty(&c->lru[LIST_CLEAN])) {
- struct dm_buffer *b;
- b = list_entry(c->lru[LIST_CLEAN].prev,
- struct dm_buffer, lru_list);
- if (!__cleanup_old_buffer(b, 0, max_age * HZ))
- break;
- dm_bufio_cond_resched();
- }
+static bool older_than(struct dm_buffer *b, unsigned long age_hz)
+{
+ return (jiffies - b->last_accessed) >= age_hz;
+}
+
+static void __evict_old_buffers(struct dm_bufio_client *c, unsigned long age_hz)
+{
+ struct dm_buffer *b, *tmp;
+ unsigned retain_target = get_retain_buffers(c);
+ unsigned count;
+
+ dm_bufio_lock(c);
+
+ count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
+ list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_CLEAN], lru_list) {
+ if (count <= retain_target)
+ break;
+
+ if (!older_than(b, age_hz))
+ break;
+
+ if (__try_evict_buffer(b, 0))
+ count--;
- dm_bufio_unlock(c);
dm_bufio_cond_resched();
}
+
+ dm_bufio_unlock(c);
+}
+
+static void cleanup_old_buffers(void)
+{
+ unsigned long max_age_hz = get_max_age_hz();
+ struct dm_bufio_client *c;
+
+ mutex_lock(&dm_bufio_clients_lock);
+
+ list_for_each_entry(c, &dm_bufio_all_clients, client_list)
+ __evict_old_buffers(c, max_age_hz);
+
mutex_unlock(&dm_bufio_clients_lock);
}
@@ -1846,6 +1884,9 @@ MODULE_PARM_DESC(max_cache_size_bytes, "Size of metadata cache");
module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(max_age_seconds, "Max age of a buffer in seconds");
+module_param_named(retain_bytes, dm_bufio_retain_bytes, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(retain_bytes, "Try to keep at least this many bytes cached in memory");
+
module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(peak_allocated_bytes, "Tracks the maximum allocated memory");
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 03/17] dm bio prison: switch to using a red black tree
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO Mike Snitzer
` (14 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Previously it was using a fixed sized hash table. There are times
when very many concurrent cells are held (such as when processing a very
large discard). When this happens the hash table performance becomes
very poor.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bio-prison.c | 172 ++++++++++++++++++-------------------------
drivers/md/dm-bio-prison.h | 7 +-
drivers/md/dm-cache-target.c | 3 +-
drivers/md/dm-thin.c | 3 +-
4 files changed, 79 insertions(+), 106 deletions(-)
diff --git a/drivers/md/dm-bio-prison.c b/drivers/md/dm-bio-prison.c
index f752d12..90a5662 100644
--- a/drivers/md/dm-bio-prison.c
+++ b/drivers/md/dm-bio-prison.c
@@ -14,68 +14,38 @@
/*----------------------------------------------------------------*/
-struct bucket {
- spinlock_t lock;
- struct hlist_head cells;
-};
+#define MIN_CELLS 1024
struct dm_bio_prison {
+ spinlock_t lock;
mempool_t *cell_pool;
-
- unsigned nr_buckets;
- unsigned hash_mask;
- struct bucket *buckets;
+ struct rb_root cells;
};
-/*----------------------------------------------------------------*/
-
-static uint32_t calc_nr_buckets(unsigned nr_cells)
-{
- uint32_t n = 128;
-
- nr_cells /= 4;
- nr_cells = min(nr_cells, 8192u);
-
- while (n < nr_cells)
- n <<= 1;
-
- return n;
-}
-
static struct kmem_cache *_cell_cache;
-static void init_bucket(struct bucket *b)
-{
- spin_lock_init(&b->lock);
- INIT_HLIST_HEAD(&b->cells);
-}
+/*----------------------------------------------------------------*/
/*
* @nr_cells should be the number of cells you want in use _concurrently_.
* Don't confuse it with the number of distinct keys.
*/
-struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells)
+struct dm_bio_prison *dm_bio_prison_create(void)
{
- unsigned i;
- uint32_t nr_buckets = calc_nr_buckets(nr_cells);
- size_t len = sizeof(struct dm_bio_prison) +
- (sizeof(struct bucket) * nr_buckets);
- struct dm_bio_prison *prison = kmalloc(len, GFP_KERNEL);
+ struct dm_bio_prison *prison = kmalloc(sizeof(*prison), GFP_KERNEL);
if (!prison)
return NULL;
- prison->cell_pool = mempool_create_slab_pool(nr_cells, _cell_cache);
+ spin_lock_init(&prison->lock);
+
+ prison->cell_pool = mempool_create_slab_pool(MIN_CELLS, _cell_cache);
if (!prison->cell_pool) {
kfree(prison);
return NULL;
}
- prison->nr_buckets = nr_buckets;
- prison->hash_mask = nr_buckets - 1;
- prison->buckets = (struct bucket *) (prison + 1);
- for (i = 0; i < nr_buckets; i++)
- init_bucket(prison->buckets + i);
+ prison->cells = RB_ROOT;
return prison;
}
@@ -101,68 +71,73 @@ void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
}
EXPORT_SYMBOL_GPL(dm_bio_prison_free_cell);
-static uint32_t hash_key(struct dm_bio_prison *prison, struct dm_cell_key *key)
+static void __setup_new_cell(struct dm_cell_key *key,
+ struct bio *holder,
+ struct dm_bio_prison_cell *cell)
{
- const unsigned long BIG_PRIME = 4294967291UL;
- uint64_t hash = key->block * BIG_PRIME;
-
- return (uint32_t) (hash & prison->hash_mask);
+ memcpy(&cell->key, key, sizeof(cell->key));
+ cell->holder = holder;
+ bio_list_init(&cell->bios);
}
-static int keys_equal(struct dm_cell_key *lhs, struct dm_cell_key *rhs)
+static int cmp_keys(struct dm_cell_key *lhs,
+ struct dm_cell_key *rhs)
{
- return (lhs->virtual == rhs->virtual) &&
- (lhs->dev == rhs->dev) &&
- (lhs->block == rhs->block);
-}
+ if (lhs->virtual < rhs->virtual)
+ return -1;
-static struct bucket *get_bucket(struct dm_bio_prison *prison,
- struct dm_cell_key *key)
-{
- return prison->buckets + hash_key(prison, key);
-}
+ if (lhs->virtual > rhs->virtual)
+ return 1;
-static struct dm_bio_prison_cell *__search_bucket(struct bucket *b,
- struct dm_cell_key *key)
-{
- struct dm_bio_prison_cell *cell;
+ if (lhs->dev < rhs->dev)
+ return -1;
- hlist_for_each_entry(cell, &b->cells, list)
- if (keys_equal(&cell->key, key))
- return cell;
+ if (lhs->dev > rhs->dev)
+ return 1;
- return NULL;
-}
+ if (lhs->block < rhs->block)
+ return -1;
-static void __setup_new_cell(struct bucket *b,
- struct dm_cell_key *key,
- struct bio *holder,
- struct dm_bio_prison_cell *cell)
-{
- memcpy(&cell->key, key, sizeof(cell->key));
- cell->holder = holder;
- bio_list_init(&cell->bios);
- hlist_add_head(&cell->list, &b->cells);
+ if (lhs->block > rhs->block)
+ return 1;
+
+ return 0;
}
-static int __bio_detain(struct bucket *b,
+static int __bio_detain(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct bio *inmate,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result)
{
- struct dm_bio_prison_cell *cell;
-
- cell = __search_bucket(b, key);
- if (cell) {
- if (inmate)
- bio_list_add(&cell->bios, inmate);
- *cell_result = cell;
- return 1;
+ int r;
+ struct rb_node **new = &prison->cells.rb_node, *parent = NULL;
+
+ while (*new) {
+ struct dm_bio_prison_cell *cell =
+ container_of(*new, struct dm_bio_prison_cell, node);
+
+ r = cmp_keys(key, &cell->key);
+
+ parent = *new;
+ if (r < 0)
+ new = &((*new)->rb_left);
+ else if (r > 0)
+ new = &((*new)->rb_right);
+ else {
+ if (inmate)
+ bio_list_add(&cell->bios, inmate);
+ *cell_result = cell;
+ return 1;
+ }
}
- __setup_new_cell(b, key, inmate, cell_prealloc);
+ __setup_new_cell(key, inmate, cell_prealloc);
*cell_result = cell_prealloc;
+
+ rb_link_node(&cell_prealloc->node, parent, new);
+ rb_insert_color(&cell_prealloc->node, &prison->cells);
+
return 0;
}
@@ -174,11 +149,10 @@ static int bio_detain(struct dm_bio_prison *prison,
{
int r;
unsigned long flags;
- struct bucket *b = get_bucket(prison, key);
- spin_lock_irqsave(&b->lock, flags);
- r = __bio_detain(b, key, inmate, cell_prealloc, cell_result);
- spin_unlock_irqrestore(&b->lock, flags);
+ spin_lock_irqsave(&prison->lock, flags);
+ r = __bio_detain(prison, key, inmate, cell_prealloc, cell_result);
+ spin_unlock_irqrestore(&prison->lock, flags);
return r;
}
@@ -205,10 +179,11 @@ EXPORT_SYMBOL_GPL(dm_get_cell);
/*
* @inmates must have been initialised prior to this call
*/
-static void __cell_release(struct dm_bio_prison_cell *cell,
+static void __cell_release(struct dm_bio_prison *prison,
+ struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
- hlist_del(&cell->list);
+ rb_erase(&cell->node, &prison->cells);
if (inmates) {
if (cell->holder)
@@ -222,21 +197,21 @@ void dm_cell_release(struct dm_bio_prison *prison,
struct bio_list *bios)
{
unsigned long flags;
- struct bucket *b = get_bucket(prison, &cell->key);
- spin_lock_irqsave(&b->lock, flags);
- __cell_release(cell, bios);
- spin_unlock_irqrestore(&b->lock, flags);
+ spin_lock_irqsave(&prison->lock, flags);
+ __cell_release(prison, cell, bios);
+ spin_unlock_irqrestore(&prison->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_cell_release);
/*
* Sometimes we don't want the holder, just the additional bios.
*/
-static void __cell_release_no_holder(struct dm_bio_prison_cell *cell,
+static void __cell_release_no_holder(struct dm_bio_prison *prison,
+ struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
- hlist_del(&cell->list);
+ rb_erase(&cell->node, &prison->cells);
bio_list_merge(inmates, &cell->bios);
}
@@ -245,11 +220,10 @@ void dm_cell_release_no_holder(struct dm_bio_prison *prison,
struct bio_list *inmates)
{
unsigned long flags;
- struct bucket *b = get_bucket(prison, &cell->key);
- spin_lock_irqsave(&b->lock, flags);
- __cell_release_no_holder(cell, inmates);
- spin_unlock_irqrestore(&b->lock, flags);
+ spin_lock_irqsave(&prison->lock, flags);
+ __cell_release_no_holder(prison, cell, inmates);
+ spin_unlock_irqrestore(&prison->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_cell_release_no_holder);
diff --git a/drivers/md/dm-bio-prison.h b/drivers/md/dm-bio-prison.h
index 6805a14..997a439 100644
--- a/drivers/md/dm-bio-prison.h
+++ b/drivers/md/dm-bio-prison.h
@@ -10,8 +10,8 @@
#include "persistent-data/dm-block-manager.h" /* FIXME: for dm_block_t */
#include "dm-thin-metadata.h" /* FIXME: for dm_thin_id */
-#include <linux/list.h>
#include <linux/bio.h>
+#include <linux/rbtree.h>
/*----------------------------------------------------------------*/
@@ -35,13 +35,14 @@ struct dm_cell_key {
* themselves.
*/
struct dm_bio_prison_cell {
- struct hlist_node list;
+ struct rb_node node;
+
struct dm_cell_key key;
struct bio *holder;
struct bio_list bios;
};
-struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells);
+struct dm_bio_prison *dm_bio_prison_create(void);
void dm_bio_prison_destroy(struct dm_bio_prison *prison);
/*
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 7130505..69de8b4 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -95,7 +95,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
/*----------------------------------------------------------------*/
-#define PRISON_CELLS 1024
#define MIGRATION_POOL_SIZE 128
#define COMMIT_PERIOD HZ
#define MIGRATION_COUNT_WINDOW 10
@@ -2327,7 +2326,7 @@ static int cache_create(struct cache_args *ca, struct cache **result)
INIT_DELAYED_WORK(&cache->waker, do_waker);
cache->last_commit_jiffies = jiffies;
- cache->prison = dm_bio_prison_create(PRISON_CELLS);
+ cache->prison = dm_bio_prison_create();
if (!cache->prison) {
*error = "could not create bio prison";
goto bad;
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4843801..2459628 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -25,7 +25,6 @@
*/
#define ENDIO_HOOK_POOL_SIZE 1024
#define MAPPING_POOL_SIZE 1024
-#define PRISON_CELLS 1024
#define COMMIT_PERIOD HZ
#define NO_SPACE_TIMEOUT_SECS 60
@@ -2185,7 +2184,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
pool->sectors_per_block_shift = __ffs(block_size);
pool->low_water_blocks = 0;
pool_features_init(&pool->pf);
- pool->prison = dm_bio_prison_create(PRISON_CELLS);
+ pool->prison = dm_bio_prison_create();
if (!pool->prison) {
*error = "Error creating pool's bio prison";
err_p = ERR_PTR(-ENOMEM);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (2 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 03/17] dm bio prison: switch to using a red black tree Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 05/17] dm transaction manager: add support for prefetching blocks of metadata Mike Snitzer
` (13 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This change is a prerequisite for allowing metadata to be prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin-metadata.c | 30 +++++++++++++-----------------
drivers/md/dm-thin-metadata.h | 4 ++--
2 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index e9d33ad..ee42d1c 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1384,42 +1384,38 @@ static bool __snapshotted_since(struct dm_thin_device *td, uint32_t time)
}
int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
- int can_block, struct dm_thin_lookup_result *result)
+ int can_issue_io, struct dm_thin_lookup_result *result)
{
- int r = -EINVAL;
- uint64_t block_time = 0;
+ int r;
__le64 value;
struct dm_pool_metadata *pmd = td->pmd;
dm_block_t keys[2] = { td->id, block };
struct dm_btree_info *info;
- if (can_block) {
- down_read(&pmd->root_lock);
- info = &pmd->info;
- } else if (down_read_trylock(&pmd->root_lock))
- info = &pmd->nb_info;
- else
- return -EWOULDBLOCK;
-
if (pmd->fail_io)
- goto out;
+ return -EINVAL;
- r = dm_btree_lookup(info, pmd->root, keys, &value);
- if (!r)
- block_time = le64_to_cpu(value);
+ down_read(&pmd->root_lock);
-out:
- up_read(&pmd->root_lock);
+ if (can_issue_io) {
+ info = &pmd->info;
+ } else
+ info = &pmd->nb_info;
+ r = dm_btree_lookup(info, pmd->root, keys, &value);
if (!r) {
+ uint64_t block_time = 0;
dm_block_t exception_block;
uint32_t exception_time;
+
+ block_time = le64_to_cpu(value);
unpack_block_time(block_time, &exception_block,
&exception_time);
result->block = exception_block;
result->shared = __snapshotted_since(td, exception_time);
}
+ up_read(&pmd->root_lock);
return r;
}
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index e3c857d..efedd5a 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -139,12 +139,12 @@ struct dm_thin_lookup_result {
/*
* Returns:
- * -EWOULDBLOCK iff @can_block is set and would block.
+ * -EWOULDBLOCK iff @can_issue_io is set and would issue IO
* -ENODATA iff that mapping is not present.
* 0 success
*/
int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
- int can_block, struct dm_thin_lookup_result *result);
+ int can_issue_io, struct dm_thin_lookup_result *result);
/*
* Obtain an unused block.
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 05/17] dm transaction manager: add support for prefetching blocks of metadata
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (3 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 06/17] dm thin: prefetch missing metadata pages Mike Snitzer
` (12 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Introduce the dm_tm_issue_prefetches interface. If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core. dm_tm_issue_prefetches will request those blocks to be
prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
.../md/persistent-data/dm-transaction-manager.c | 77 +++++++++++++++++++++-
.../md/persistent-data/dm-transaction-manager.h | 7 ++
2 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/drivers/md/persistent-data/dm-transaction-manager.c b/drivers/md/persistent-data/dm-transaction-manager.c
index 3bc30a0..9cb797d 100644
--- a/drivers/md/persistent-data/dm-transaction-manager.c
+++ b/drivers/md/persistent-data/dm-transaction-manager.c
@@ -10,6 +10,8 @@
#include "dm-persistent-data-internal.h"
#include <linux/export.h>
+#include <linux/mutex.h>
+#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>
@@ -17,6 +19,61 @@
/*----------------------------------------------------------------*/
+#define PREFETCH_SIZE 128
+#define PREFETCH_BITS 7
+#define PREFETCH_SENTINEL ((dm_block_t) -1ULL)
+
+struct prefetch_set {
+ struct mutex lock;
+ dm_block_t blocks[PREFETCH_SIZE];
+};
+
+static unsigned prefetch_hash(dm_block_t b)
+{
+ return hash_64(b, PREFETCH_BITS);
+}
+
+static void prefetch_wipe(struct prefetch_set *p)
+{
+ unsigned i;
+ for (i = 0; i < PREFETCH_SIZE; i++)
+ p->blocks[i] = PREFETCH_SENTINEL;
+}
+
+static void prefetch_init(struct prefetch_set *p)
+{
+ mutex_init(&p->lock);
+ prefetch_wipe(p);
+}
+
+static void prefetch_add(struct prefetch_set *p, dm_block_t b)
+{
+ unsigned h = prefetch_hash(b);
+
+ mutex_lock(&p->lock);
+ if (p->blocks[h] == PREFETCH_SENTINEL)
+ p->blocks[h] = b;
+
+ mutex_unlock(&p->lock);
+}
+
+static void prefetch_issue(struct prefetch_set *p, struct dm_block_manager *bm)
+{
+ unsigned i;
+
+ mutex_lock(&p->lock);
+
+ for (i = 0; i < PREFETCH_SIZE; i++)
+ if (p->blocks[i] != PREFETCH_SENTINEL) {
+ dm_bm_prefetch(bm, p->blocks[i]);
+ p->blocks[i] = PREFETCH_SENTINEL;
+ }
+
+ mutex_unlock(&p->lock);
+}
+
+/*----------------------------------------------------------------*/
+
struct shadow_info {
struct hlist_node hlist;
dm_block_t where;
@@ -37,6 +94,8 @@ struct dm_transaction_manager {
spinlock_t lock;
struct hlist_head buckets[DM_HASH_SIZE];
+
+ struct prefetch_set prefetches;
};
/*----------------------------------------------------------------*/
@@ -117,6 +176,8 @@ static struct dm_transaction_manager *dm_tm_create(struct dm_block_manager *bm,
for (i = 0; i < DM_HASH_SIZE; i++)
INIT_HLIST_HEAD(tm->buckets + i);
+ prefetch_init(&tm->prefetches);
+
return tm;
}
@@ -268,8 +329,14 @@ int dm_tm_read_lock(struct dm_transaction_manager *tm, dm_block_t b,
struct dm_block_validator *v,
struct dm_block **blk)
{
- if (tm->is_clone)
- return dm_bm_read_try_lock(tm->real->bm, b, v, blk);
+ if (tm->is_clone) {
+ int r = dm_bm_read_try_lock(tm->real->bm, b, v, blk);
+
+ if (r == -EWOULDBLOCK)
+ prefetch_add(&tm->real->prefetches, b);
+
+ return r;
+ }
return dm_bm_read_lock(tm->bm, b, v, blk);
}
@@ -317,6 +384,12 @@ struct dm_block_manager *dm_tm_get_bm(struct dm_transaction_manager *tm)
return tm->bm;
}
+void dm_tm_issue_prefetches(struct dm_transaction_manager *tm)
+{
+ prefetch_issue(&tm->prefetches, tm->bm);
+}
+EXPORT_SYMBOL_GPL(dm_tm_issue_prefetches);
+
/*----------------------------------------------------------------*/
static int dm_tm_create_internal(struct dm_block_manager *bm,
diff --git a/drivers/md/persistent-data/dm-transaction-manager.h b/drivers/md/persistent-data/dm-transaction-manager.h
index 2772ed2..2e0d4d6 100644
--- a/drivers/md/persistent-data/dm-transaction-manager.h
+++ b/drivers/md/persistent-data/dm-transaction-manager.h
@@ -109,6 +109,13 @@ int dm_tm_ref(struct dm_transaction_manager *tm, dm_block_t b,
struct dm_block_manager *dm_tm_get_bm(struct dm_transaction_manager *tm);
/*
+ * If you're using a non-blocking clone the tm will build up a list of
+ * requested blocks that weren't in core. This call will request those
+ * blocks to be prefetched.
+ */
+void dm_tm_issue_prefetches(struct dm_transaction_manager *tm);
+
+/*
* A little utility that ties the knot by producing a transaction manager
* that has a space map managed by the transaction manager...
*
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 06/17] dm thin: prefetch missing metadata pages
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (4 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 05/17] dm transaction manager: add support for prefetching blocks of metadata Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 07/17] dm thin: throttle incoming IO Mike Snitzer
` (11 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Prefetch metadata at the start of the worker thread and then again every
128th bio processed from the deferred list.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin-metadata.c | 5 +++++
drivers/md/dm-thin-metadata.h | 5 +++++
drivers/md/dm-thin.c | 10 ++++++----
3 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index ee42d1c..43adbb8 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1809,3 +1809,8 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd)
return needs_check;
}
+
+void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd)
+{
+ dm_tm_issue_prefetches(pmd->tm);
+}
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index efedd5a..921d15e 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -213,6 +213,11 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd);
bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
+/*
+ * Issue any prefetches that may be useful.
+ */
+void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
+
/*----------------------------------------------------------------*/
#endif
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 2459628..1d25b51 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1526,6 +1526,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
struct bio *bio;
struct bio_list bios;
struct blk_plug plug;
+ unsigned count = 0;
if (tc->requeue_mode) {
requeue_bio_list(tc, &tc->deferred_bio_list);
@@ -1567,6 +1568,10 @@ static void process_thin_deferred_bios(struct thin_c *tc)
pool->process_discard(tc, bio);
else
pool->process_bio(tc, bio);
+
+ if ((count++ & 127) == 0) {
+ dm_pool_issue_prefetches(pool->pmd);
+ }
}
blk_finish_plug(&plug);
}
@@ -1652,6 +1657,7 @@ static void do_worker(struct work_struct *ws)
{
struct pool *pool = container_of(ws, struct pool, worker);
+ dm_pool_issue_prefetches(pool->pmd);
process_prepared(pool, &pool->prepared_mappings, &pool->process_prepared_mapping);
process_prepared(pool, &pool->prepared_discards, &pool->process_prepared_discard);
process_deferred_bios(pool);
@@ -1990,10 +1996,6 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
/* fall through */
case -EWOULDBLOCK:
- /*
- * In future, the failed dm_thin_find_block above could
- * provide the hint to load the metadata into cache.
- */
thin_defer_bio(tc, bio);
return DM_MAPIO_SUBMITTED;
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 07/17] dm thin: throttle incoming IO
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (5 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 06/17] dm thin: prefetch missing metadata pages Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize Mike Snitzer
` (10 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Throttle IO based on the time it's taking the worker to do one loop.
There were reports of hung task timeouts occuring and it was observed
that the excessively long avgqu-sz (as reported by iostat) was
contributing to these hung tasks.
Throttling definitely helps dm-thinp perform better under heavy IO load
(without being detremental by being overzealous). It reduces avgqu-sz
drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata
is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2. And
avgqu-sz stays at or below 30K even with dirty_ratio=20,
dirty_background_ratio=10.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1d25b51..d451ce5 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -126,6 +126,53 @@ static void build_virtual_key(struct dm_thin_device *td, dm_block_t b,
/*----------------------------------------------------------------*/
+#define THROTTLE_THRESHOLD (1 * HZ)
+
+struct throttle {
+ struct rw_semaphore lock;
+ unsigned long threshold;
+ bool throttle_applied;
+};
+
+static void throttle_init(struct throttle *t)
+{
+ init_rwsem(&t->lock);
+ t->throttle_applied = false;
+}
+
+static void throttle_work_start(struct throttle *t)
+{
+ t->threshold = jiffies + THROTTLE_THRESHOLD;
+}
+
+static void throttle_work_update(struct throttle *t)
+{
+ if (!t->throttle_applied && jiffies > t->threshold) {
+ down_write(&t->lock);
+ t->throttle_applied = true;
+ }
+}
+
+static void throttle_work_complete(struct throttle *t)
+{
+ if (t->throttle_applied) {
+ t->throttle_applied = false;
+ up_write(&t->lock);
+ }
+}
+
+static void throttle_lock(struct throttle *t)
+{
+ down_read(&t->lock);
+}
+
+static void throttle_unlock(struct throttle *t)
+{
+ up_read(&t->lock);
+}
+
+/*----------------------------------------------------------------*/
+
/*
* A pool device ties together a metadata device and a data device. It
* also provides the interface for creating and destroying internal
@@ -175,6 +222,7 @@ struct pool {
struct dm_kcopyd_client *copier;
struct workqueue_struct *wq;
+ struct throttle throttle;
struct work_struct worker;
struct delayed_work waker;
struct delayed_work no_space_timeout;
@@ -1570,6 +1618,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
pool->process_bio(tc, bio);
if ((count++ & 127) == 0) {
+ throttle_work_update(&pool->throttle);
dm_pool_issue_prefetches(pool->pmd);
}
}
@@ -1657,10 +1706,15 @@ static void do_worker(struct work_struct *ws)
{
struct pool *pool = container_of(ws, struct pool, worker);
+ throttle_work_start(&pool->throttle);
dm_pool_issue_prefetches(pool->pmd);
+ throttle_work_update(&pool->throttle);
process_prepared(pool, &pool->prepared_mappings, &pool->process_prepared_mapping);
+ throttle_work_update(&pool->throttle);
process_prepared(pool, &pool->prepared_discards, &pool->process_prepared_discard);
+ throttle_work_update(&pool->throttle);
process_deferred_bios(pool);
+ throttle_work_complete(&pool->throttle);
}
/*
@@ -1893,9 +1947,11 @@ static void thin_defer_bio(struct thin_c *tc, struct bio *bio)
unsigned long flags;
struct pool *pool = tc->pool;
+ throttle_lock(&pool->throttle);
spin_lock_irqsave(&tc->lock, flags);
bio_list_add(&tc->deferred_bio_list, bio);
spin_unlock_irqrestore(&tc->lock, flags);
+ throttle_unlock(&pool->throttle);
wake_worker(pool);
}
@@ -2212,6 +2268,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
goto bad_wq;
}
+ throttle_init(&pool->throttle);
INIT_WORK(&pool->worker, do_worker);
INIT_DELAYED_WORK(&pool->waker, do_waker);
INIT_DELAYED_WORK(&pool->no_space_timeout, do_no_space_timeout);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (6 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 07/17] dm thin: throttle incoming IO Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 09/17] dm: improve documentation and code clarity in dm_merge_bvec Mike Snitzer
` (9 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Allows for filesystems to submit bios that are a factor of the thinp
blocksize, improving dm-thinp efficiency (particularly when the data
volume is RAID).
Also set io_min to max_sectors_kb if it is a factor of the thinp
blocksize.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index d451ce5..6c216b3 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -11,6 +11,7 @@
#include <linux/device-mapper.h>
#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>
+#include <linux/log2.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/init.h>
@@ -3227,15 +3228,36 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct pool_c *pt = ti->private;
struct pool *pool = pt->pool;
- uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
+ sector_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
+
+ /*
+ * Adjust max_sectors_kb to highest possible power-of-2
+ * factor of pool->sectors_per_block.
+ */
+ if (limits->max_hw_sectors & (limits->max_hw_sectors - 1))
+ limits->max_sectors = rounddown_pow_of_two(limits->max_hw_sectors);
+ else
+ limits->max_sectors = limits->max_hw_sectors;
+
+ if (limits->max_sectors < pool->sectors_per_block) {
+ while (!is_factor(pool->sectors_per_block, limits->max_sectors))
+ limits->max_sectors = rounddown_pow_of_two(limits->max_sectors);
+ } else if (block_size_is_power_of_two(pool)) {
+ /* max_sectors_kb is >= power-of-2 thinp blocksize */
+ while (!is_factor(limits->max_sectors, pool->sectors_per_block))
+ limits->max_sectors = rounddown_pow_of_two(limits->max_sectors);
+ }
/*
* If the system-determined stacked limits are compatible with the
* pool's blocksize (io_opt is a factor) do not override them.
*/
if (io_opt_sectors < pool->sectors_per_block ||
- do_div(io_opt_sectors, pool->sectors_per_block)) {
- blk_limits_io_min(limits, pool->sectors_per_block << SECTOR_SHIFT);
+ !is_factor(io_opt_sectors, pool->sectors_per_block)) {
+ if (is_factor(pool->sectors_per_block, limits->max_sectors))
+ blk_limits_io_min(limits, limits->max_sectors << SECTOR_SHIFT);
+ else
+ blk_limits_io_min(limits, pool->sectors_per_block << SECTOR_SHIFT);
blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
}
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 09/17] dm: improve documentation and code clarity in dm_merge_bvec
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (7 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize Mike Snitzer
@ 2014-10-17 6:06 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 10/17] dm thin: implement thin_merge Mike Snitzer
` (8 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:06 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
These code changes do not introduce a functional change.
But bio_add_page() will never attempt to build up a bio larger than
queue_max_sectors(). Similarly, bio_get_nr_vecs() is also bound by
queue_max_sectors(). Therefore, there is no point in allowing
dm_merge_bvec() to answer "how many sectors can a bio have at this
offset?" with anything larger than queue_max_sectors(). Using
queue_max_sectors() rather than BIO_MAX_SECTORS serves to more
accurately convey the limits that are being imposed.
Also, use unlikely() to clarify the fact that the defensive code in
dm_merge_bvec() relative to max_size going negative shouldn't ever
happen -- if it does happen there is a bug in the block layer for
requesting larger than dm_merge_bvec()'s initial response for a given
offset. Also, update a comment in dm_merge_bvec() relative to
max_hw_sectors_kb. And fix empty newline whitespace.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 58f3927..0fee0e5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1607,9 +1607,9 @@ static int dm_merge_bvec(struct request_queue *q,
* Find maximum amount of I/O that won't need splitting
*/
max_sectors = min(max_io_len(bvm->bi_sector, ti),
- (sector_t) BIO_MAX_SECTORS);
+ (sector_t) queue_max_sectors(q));
max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
- if (max_size < 0)
+ if (unlikely(max_size < 0)) /* this shouldn't _ever_ happen */
max_size = 0;
/*
@@ -1621,10 +1621,10 @@ static int dm_merge_bvec(struct request_queue *q,
max_size = ti->type->merge(ti, bvm, biovec, max_size);
/*
* If the target doesn't support merge method and some of the devices
- * provided their merge_bvec method (we know this by looking at
- * queue_max_hw_sectors), then we can't allow bios with multiple vector
- * entries. So always set max_size to 0, and the code below allows
- * just one page.
+ * provided their merge_bvec method (we know this by looking for the
+ * max_hw_sectors that dm_set_device_limits may set), then we can't
+ * allow bios with multiple vector entries. So always set max_size
+ * to 0, and the code below allows just one page.
*/
else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
max_size = 0;
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 10/17] dm thin: implement thin_merge
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (8 preceding siblings ...)
2014-10-17 6:06 ` [for-3.19 PATCH 09/17] dm: improve documentation and code clarity in dm_merge_bvec Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 11/17] dm thin: grab a virtual cell before looking up the mapping Mike Snitzer
` (7 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Introduce thin_merge so that any additional constraints from the data
volume may be taken into account when determing the maximum number of
sectors that can be issued relative to the specified logical offset.
This is particularly important if/when the data volume is layered ontop
of a more sophisticated device (e.g. dm-raid or some other DM target).
Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 6c216b3..dc3efac 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3613,6 +3613,21 @@ err:
DMEMIT("Error");
}
+static int thin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+ struct bio_vec *biovec, int max_size)
+{
+ struct thin_c *tc = ti->private;
+ struct request_queue *q = bdev_get_queue(tc->pool_dev->bdev);
+
+ if (!q->merge_bvec_fn)
+ return max_size;
+
+ bvm->bi_bdev = tc->pool_dev->bdev;
+ bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
+
+ return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
static int thin_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -3637,7 +3652,7 @@ static int thin_iterate_devices(struct dm_target *ti,
static struct target_type thin_target = {
.name = "thin",
- .version = {1, 13, 0},
+ .version = {1, 14, 0},
.module = THIS_MODULE,
.ctr = thin_ctr,
.dtr = thin_dtr,
@@ -3647,6 +3662,7 @@ static struct target_type thin_target = {
.presuspend = thin_presuspend,
.postsuspend = thin_postsuspend,
.status = thin_status,
+ .merge = thin_merge,
.iterate_devices = thin_iterate_devices,
};
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 11/17] dm thin: grab a virtual cell before looking up the mapping
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (9 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 10/17] dm thin: implement thin_merge Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 12/17] dm thin: performance improvement to discard processing Mike Snitzer
` (6 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Avoids normal IO racing with discard.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
---
drivers/md/dm-thin.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index dc3efac..b9b6afb 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1998,6 +1998,14 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_SUBMITTED;
}
+ /*
+ * We must hold the virtual cell before doing the lookup, otherwise
+ * there's a race with discard.
+ */
+ build_virtual_key(tc->td, block, &key);
+ if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result))
+ return DM_MAPIO_SUBMITTED;
+
r = dm_thin_find_block(td, block, 0, &result);
/*
@@ -2021,13 +2029,10 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* shared flag will be set in their case.
*/
thin_defer_bio(tc, bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
}
- build_virtual_key(tc->td, block, &key);
- if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result))
- return DM_MAPIO_SUBMITTED;
-
build_data_key(tc->td, result.block, &key);
if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2, &cell_result)) {
cell_defer_no_holder_no_free(tc, &cell1);
@@ -2048,12 +2053,14 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* of doing so.
*/
handle_unserviceable_bio(tc->pool, bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
}
/* fall through */
case -EWOULDBLOCK:
thin_defer_bio(tc, bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
default:
@@ -2063,6 +2070,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* pool is switched to fail-io mode.
*/
bio_io_error(bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
}
}
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 12/17] dm thin: performance improvement to discard processing
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (10 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 11/17] dm thin: grab a virtual cell before looking up the mapping Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 13/17] dm thin: factor out remap_and_issue_overwrite Mike Snitzer
` (5 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
When processing a discard bio, if the block is already quiesced do the
discard immediately rather than adding the mapping to a list for the
next iteration of the worker thread.
Discarding a fully provisioned 100G thin volume with 64k block size goes
from 860s to 95s with this change.
Clearly there's something wrong with the worker architecture, more
investigation needed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index b9b6afb..67a11f3 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1194,7 +1194,6 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
static void process_discard(struct thin_c *tc, struct bio *bio)
{
int r;
- unsigned long flags;
struct pool *pool = tc->pool;
struct dm_bio_prison_cell *cell, *cell2;
struct dm_cell_key key, key2;
@@ -1235,12 +1234,9 @@ static void process_discard(struct thin_c *tc, struct bio *bio)
m->cell2 = cell2;
m->bio = bio;
- if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list)) {
- spin_lock_irqsave(&pool->lock, flags);
- list_add_tail(&m->list, &pool->prepared_discards);
- spin_unlock_irqrestore(&pool->lock, flags);
- wake_worker(pool);
- }
+ if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
+ pool->process_prepared_discard(m);
+
} else {
inc_all_io_entry(pool, bio);
cell_defer_no_holder(tc, cell);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 13/17] dm thin: factor out remap_and_issue_overwrite
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (11 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 12/17] dm thin: performance improvement to discard processing Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 14/17] dm thin: defer whole cells rather than individual bios Mike Snitzer
` (4 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Purely cleanup of duplicated code, no functional change.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 67a11f3..176362e 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -890,6 +890,20 @@ static void ll_zero(struct thin_c *tc, struct dm_thin_new_mapping *m,
}
}
+static void remap_and_issue_overwrite(struct thin_c *tc, struct bio *bio,
+ dm_block_t data_block,
+ struct dm_thin_new_mapping *m)
+{
+ struct pool *pool = tc->pool;
+ struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
+
+ h->overwrite_mapping = m;
+ m->bio = bio;
+ save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
+ inc_all_io_entry(pool, bio);
+ remap_and_issue(tc, bio, data_block);
+}
+
/*
* A partial copy also needs to zero the uncopied region.
*/
@@ -924,15 +938,9 @@ static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
* If the whole block of data is being overwritten, we can issue the
* bio immediately. Otherwise we use kcopyd to clone the data first.
*/
- if (io_overwrites_block(pool, bio)) {
- struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
-
- h->overwrite_mapping = m;
- m->bio = bio;
- save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
- inc_all_io_entry(pool, bio);
- remap_and_issue(tc, bio, data_dest);
- } else {
+ if (io_overwrites_block(pool, bio))
+ remap_and_issue_overwrite(tc, bio, data_dest, m);
+ else {
struct dm_io_region from, to;
from.bdev = origin->bdev;
@@ -1001,16 +1009,10 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
if (!pool->pf.zero_new_blocks)
process_prepared_mapping(m);
- else if (io_overwrites_block(pool, bio)) {
- struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
-
- h->overwrite_mapping = m;
- m->bio = bio;
- save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
- inc_all_io_entry(pool, bio);
- remap_and_issue(tc, bio, data_block);
+ else if (io_overwrites_block(pool, bio))
+ remap_and_issue_overwrite(tc, bio, data_block, m);
- } else
+ else
ll_zero(tc, m,
data_block * pool->sectors_per_block,
(data_block + 1) * pool->sectors_per_block);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 14/17] dm thin: defer whole cells rather than individual bios
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (12 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 13/17] dm thin: factor out remap_and_issue_overwrite Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 15/17] dm thin: remap the bios in a cell immediately Mike Snitzer
` (3 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This avoids dropping the cell, so increases the probability that other
bios will collect within the cell, rather than being passed individually
to the worker.
Also add required process_cell and process_discard_cell error handling
wrappers and set associated pool-mode function pointers accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bio-prison.h | 1 +
drivers/md/dm-thin.c | 233 +++++++++++++++++++++++++++++++++++----------
2 files changed, 183 insertions(+), 51 deletions(-)
diff --git a/drivers/md/dm-bio-prison.h b/drivers/md/dm-bio-prison.h
index 997a439..c0cddb1 100644
--- a/drivers/md/dm-bio-prison.h
+++ b/drivers/md/dm-bio-prison.h
@@ -35,6 +35,7 @@ struct dm_cell_key {
* themselves.
*/
struct dm_bio_prison_cell {
+ struct list_head user_list; /* for client use */
struct rb_node node;
struct dm_cell_key key;
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 176362e..507e674 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -202,6 +202,7 @@ struct pool_features {
struct thin_c;
typedef void (*process_bio_fn)(struct thin_c *tc, struct bio *bio);
+typedef void (*process_cell_fn)(struct thin_c *tc, struct dm_bio_prison_cell *cell);
typedef void (*process_mapping_fn)(struct dm_thin_new_mapping *m);
struct pool {
@@ -246,6 +247,9 @@ struct pool {
process_bio_fn process_bio;
process_bio_fn process_discard;
+ process_cell_fn process_cell;
+ process_cell_fn process_discard_cell;
+
process_mapping_fn process_prepared_mapping;
process_mapping_fn process_prepared_discard;
};
@@ -282,6 +286,7 @@ struct thin_c {
struct dm_thin_device *td;
bool requeue_mode:1;
spinlock_t lock;
+ struct list_head deferred_cells;
struct bio_list deferred_bio_list;
struct bio_list retry_on_resume_list;
struct rb_root sort_bio_list; /* sorted list of deferred bios */
@@ -346,19 +351,6 @@ static void cell_release_no_holder(struct pool *pool,
dm_bio_prison_free_cell(pool->prison, cell);
}
-static void cell_defer_no_holder_no_free(struct thin_c *tc,
- struct dm_bio_prison_cell *cell)
-{
- struct pool *pool = tc->pool;
- unsigned long flags;
-
- spin_lock_irqsave(&tc->lock, flags);
- dm_cell_release_no_holder(pool->prison, cell, &tc->deferred_bio_list);
- spin_unlock_irqrestore(&tc->lock, flags);
-
- wake_worker(pool);
-}
-
static void cell_error_with_code(struct pool *pool,
struct dm_bio_prison_cell *cell, int error_code)
{
@@ -371,6 +363,11 @@ static void cell_error(struct pool *pool, struct dm_bio_prison_cell *cell)
cell_error_with_code(pool, cell, -EIO);
}
+static void cell_success(struct pool *pool, struct dm_bio_prison_cell *cell)
+{
+ cell_error_with_code(pool, cell, 0);
+}
+
/*----------------------------------------------------------------*/
/*
@@ -706,6 +703,28 @@ static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *c
wake_worker(pool);
}
+static void thin_defer_bio(struct thin_c *tc, struct bio *bio);
+
+static void inc_remap_and_issue_cell(struct thin_c *tc,
+ struct dm_bio_prison_cell *cell,
+ dm_block_t block)
+{
+ struct bio *bio;
+ struct bio_list bios;
+
+ bio_list_init(&bios);
+ cell_release_no_holder(tc->pool, cell, &bios);
+
+ while ((bio = bio_list_pop(&bios))) {
+ if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA))
+ thin_defer_bio(tc, bio);
+ else {
+ inc_all_io_entry(tc->pool, bio);
+ remap_and_issue(tc, bio, block);
+ }
+ }
+}
+
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
{
if (m->bio) {
@@ -1193,20 +1212,17 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
retry_on_resume(bio);
}
-static void process_discard(struct thin_c *tc, struct bio *bio)
+static void process_discard_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
int r;
+ struct bio *bio = cell->holder;
struct pool *pool = tc->pool;
- struct dm_bio_prison_cell *cell, *cell2;
- struct dm_cell_key key, key2;
+ struct dm_bio_prison_cell *cell2;
+ struct dm_cell_key key2;
dm_block_t block = get_bio_block(tc, bio);
struct dm_thin_lookup_result lookup_result;
struct dm_thin_new_mapping *m;
- build_virtual_key(tc->td, block, &key);
- if (bio_detain(tc->pool, &key, bio, &cell))
- return;
-
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
@@ -1273,6 +1289,19 @@ static void process_discard(struct thin_c *tc, struct bio *bio)
}
}
+static void process_discard_bio(struct thin_c *tc, struct bio *bio)
+{
+ struct dm_bio_prison_cell *cell;
+ struct dm_cell_key key;
+ dm_block_t block = get_bio_block(tc, bio);
+
+ build_virtual_key(tc->td, block, &key);
+ if (bio_detain(tc->pool, &key, bio, &cell))
+ return;
+
+ process_discard_cell(tc, cell);
+}
+
static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
struct dm_cell_key *key,
struct dm_thin_lookup_result *lookup_result,
@@ -1379,34 +1408,25 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
}
}
-static void process_bio(struct thin_c *tc, struct bio *bio)
+static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
int r;
struct pool *pool = tc->pool;
+ struct bio *bio = cell->holder;
dm_block_t block = get_bio_block(tc, bio);
- struct dm_bio_prison_cell *cell;
- struct dm_cell_key key;
struct dm_thin_lookup_result lookup_result;
- /*
- * If cell is already occupied, then the block is already
- * being provisioned so we have nothing further to do here.
- */
- build_virtual_key(tc->td, block, &key);
- if (bio_detain(pool, &key, bio, &cell))
- return;
-
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
if (lookup_result.shared) {
process_shared_bio(tc, bio, block, &lookup_result);
+ // FIXME: we can't remap because we're waiting on a commit
cell_defer_no_holder(tc, cell); /* FIXME: pass this cell into process_shared? */
} else {
inc_all_io_entry(pool, bio);
- cell_defer_no_holder(tc, cell);
-
remap_and_issue(tc, bio, lookup_result.block);
+ inc_remap_and_issue_cell(tc, cell, lookup_result.block);
}
break;
@@ -1440,7 +1460,26 @@ static void process_bio(struct thin_c *tc, struct bio *bio)
}
}
-static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
+static void process_bio(struct thin_c *tc, struct bio *bio)
+{
+ struct pool *pool = tc->pool;
+ dm_block_t block = get_bio_block(tc, bio);
+ struct dm_bio_prison_cell *cell;
+ struct dm_cell_key key;
+
+ /*
+ * If cell is already occupied, then the block is already
+ * being provisioned so we have nothing further to do here.
+ */
+ build_virtual_key(tc->td, block, &key);
+ if (bio_detain(pool, &key, bio, &cell))
+ return;
+
+ process_cell(tc, cell);
+}
+
+static void __process_bio_read_only(struct thin_c *tc, struct bio *bio,
+ struct dm_bio_prison_cell *cell)
{
int r;
int rw = bio_data_dir(bio);
@@ -1450,15 +1489,21 @@ static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
- if (lookup_result.shared && (rw == WRITE) && bio->bi_iter.bi_size)
+ if (lookup_result.shared && (rw == WRITE) && bio->bi_iter.bi_size) {
handle_unserviceable_bio(tc->pool, bio);
- else {
+ if (cell)
+ cell_defer_no_holder(tc, cell);
+ } else {
inc_all_io_entry(tc->pool, bio);
remap_and_issue(tc, bio, lookup_result.block);
+ if (cell)
+ inc_remap_and_issue_cell(tc, cell, lookup_result.block);
}
break;
case -ENODATA:
+ if (cell)
+ cell_defer_no_holder(tc, cell);
if (rw != READ) {
handle_unserviceable_bio(tc->pool, bio);
break;
@@ -1477,11 +1522,23 @@ static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
default:
DMERR_LIMIT("%s: dm_thin_find_block() failed: error = %d",
__func__, r);
+ if (cell)
+ cell_defer_no_holder(tc, cell);
bio_io_error(bio);
break;
}
}
+static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
+{
+ __process_bio_read_only(tc, bio, NULL);
+}
+
+static void process_cell_read_only(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ __process_bio_read_only(tc, cell->holder, cell);
+}
+
static void process_bio_success(struct thin_c *tc, struct bio *bio)
{
bio_endio(bio, 0);
@@ -1492,6 +1549,16 @@ static void process_bio_fail(struct thin_c *tc, struct bio *bio)
bio_io_error(bio);
}
+static void process_cell_success(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ cell_success(tc->pool, cell);
+}
+
+static void process_cell_fail(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ cell_error(tc->pool, cell);
+}
+
/*
* FIXME: should we also commit due to size of transaction, measured in
* metadata blocks?
@@ -1624,6 +1691,51 @@ static void process_thin_deferred_bios(struct thin_c *tc)
blk_finish_plug(&plug);
}
+static void process_thin_deferred_cells(struct thin_c *tc)
+{
+ struct pool *pool = tc->pool;
+ unsigned long flags;
+ struct list_head cells;
+ struct dm_bio_prison_cell *cell, *tmp;
+
+ // FIXME: push down into per cell processing
+ if (tc->requeue_mode) {
+ requeue_bio_list(tc, &tc->deferred_bio_list);
+ return;
+ }
+
+ INIT_LIST_HEAD(&cells);
+
+ spin_lock_irqsave(&tc->lock, flags);
+ list_splice_init(&tc->deferred_cells, &cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+
+ if (list_empty(&cells))
+ return;
+
+ list_for_each_entry_safe(cell, tmp, &cells, user_list) {
+ BUG_ON(!cell->holder);
+
+ /*
+ * If we've got no free new_mapping structs, and processing
+ * this bio might require one, we pause until there are some
+ * prepared mappings to process.
+ */
+ if (ensure_next_mapping(pool)) {
+ spin_lock_irqsave(&tc->lock, flags);
+ list_add(&cell->user_list, &tc->deferred_cells);
+ list_splice(&cells, &tc->deferred_cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+ break;
+ }
+
+ if (cell->holder->bi_rw & REQ_DISCARD)
+ pool->process_discard_cell(tc, cell);
+ else
+ pool->process_cell(tc, cell);
+ }
+}
+
static void thin_get(struct thin_c *tc);
static void thin_put(struct thin_c *tc);
@@ -1672,6 +1784,7 @@ static void process_deferred_bios(struct pool *pool)
tc = get_first_thin(pool);
while (tc) {
+ process_thin_deferred_cells(tc);
process_thin_deferred_bios(tc);
tc = get_next_thin(pool, tc);
}
@@ -1850,6 +1963,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_fail;
pool->process_discard = process_bio_fail;
+ pool->process_cell = process_cell_fail;
+ pool->process_discard_cell = process_cell_fail;
pool->process_prepared_mapping = process_prepared_mapping_fail;
pool->process_prepared_discard = process_prepared_discard_fail;
@@ -1862,6 +1977,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_read_only;
pool->process_discard = process_bio_success;
+ pool->process_cell = process_cell_read_only;
+ pool->process_discard_cell = process_cell_success;
pool->process_prepared_mapping = process_prepared_mapping_fail;
pool->process_prepared_discard = process_prepared_discard_passdown;
@@ -1880,7 +1997,9 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
if (old_mode != new_mode)
notify_of_pool_mode_change(pool, "out-of-data-space");
pool->process_bio = process_bio_read_only;
- pool->process_discard = process_discard;
+ pool->process_discard = process_discard_bio;
+ pool->process_cell = process_cell_read_only;
+ pool->process_discard_cell = process_discard_cell;
pool->process_prepared_mapping = process_prepared_mapping;
pool->process_prepared_discard = process_prepared_discard_passdown;
@@ -1893,7 +2012,9 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
notify_of_pool_mode_change(pool, "write");
dm_pool_metadata_read_write(pool->pmd);
pool->process_bio = process_bio;
- pool->process_discard = process_discard;
+ pool->process_discard = process_discard_bio;
+ pool->process_cell = process_cell;
+ pool->process_discard_cell = process_discard_cell;
pool->process_prepared_mapping = process_prepared_mapping;
pool->process_prepared_discard = process_prepared_discard;
break;
@@ -1955,6 +2076,18 @@ static void thin_defer_bio(struct thin_c *tc, struct bio *bio)
wake_worker(pool);
}
+static void thin_defer_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ unsigned long flags;
+ struct pool *pool = tc->pool;
+
+ spin_lock_irqsave(&tc->lock, flags);
+ list_add_tail(&cell->user_list, &tc->deferred_cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+
+ wake_worker(pool);
+}
+
static void thin_hook_bio(struct thin_c *tc, struct bio *bio)
{
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
@@ -1975,8 +2108,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
dm_block_t block = get_bio_block(tc, bio);
struct dm_thin_device *td = tc->td;
struct dm_thin_lookup_result result;
- struct dm_bio_prison_cell cell1, cell2;
- struct dm_bio_prison_cell *cell_result;
+ struct dm_bio_prison_cell *virt_cell, *data_cell;
struct dm_cell_key key;
thin_hook_bio(tc, bio);
@@ -2001,7 +2133,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* there's a race with discard.
*/
build_virtual_key(tc->td, block, &key);
- if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result))
+ if (bio_detain(tc->pool, &key, bio, &virt_cell))
return DM_MAPIO_SUBMITTED;
r = dm_thin_find_block(td, block, 0, &result);
@@ -2026,20 +2158,19 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* More distant ancestors are irrelevant. The
* shared flag will be set in their case.
*/
- thin_defer_bio(tc, bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ thin_defer_cell(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
build_data_key(tc->td, result.block, &key);
- if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2, &cell_result)) {
- cell_defer_no_holder_no_free(tc, &cell1);
+ if (bio_detain(tc->pool, &key, bio, &data_cell)) {
+ cell_defer_no_holder(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
inc_all_io_entry(tc->pool, bio);
- cell_defer_no_holder_no_free(tc, &cell2);
- cell_defer_no_holder_no_free(tc, &cell1);
+ cell_defer_no_holder(tc, data_cell);
+ cell_defer_no_holder(tc, virt_cell);
remap(tc, bio, result.block);
return DM_MAPIO_REMAPPED;
@@ -2051,14 +2182,13 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* of doing so.
*/
handle_unserviceable_bio(tc->pool, bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ cell_defer_no_holder(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
/* fall through */
case -EWOULDBLOCK:
- thin_defer_bio(tc, bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ thin_defer_cell(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
default:
@@ -2068,7 +2198,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* pool is switched to fail-io mode.
*/
bio_io_error(bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ cell_defer_no_holder(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
}
@@ -3381,6 +3511,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
goto out_unlock;
}
spin_lock_init(&tc->lock);
+ INIT_LIST_HEAD(&tc->deferred_cells);
bio_list_init(&tc->deferred_bio_list);
bio_list_init(&tc->retry_on_resume_list);
tc->sort_bio_list = RB_ROOT;
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 15/17] dm thin: remap the bios in a cell immediately
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (13 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 14/17] dm thin: defer whole cells rather than individual bios Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 16/17] dm thin: direct dispatch when breaking sharing Mike Snitzer
` (2 subsequent siblings)
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This use of direct submission in process_prepared_mapping() reduces
latency for submitting bios in a cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
But this direct submission exposes the potential for a race between
releasing a cell and incrementing deferred set. Fix this by introducing
dm_cell_visit_release() and refactoring inc_remap_and_issue_cell()
accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bio-prison.c | 14 ++++++++
drivers/md/dm-bio-prison.h | 8 +++++
drivers/md/dm-thin.c | 85 ++++++++++++++++++++++++++++++----------------
3 files changed, 78 insertions(+), 29 deletions(-)
diff --git a/drivers/md/dm-bio-prison.c b/drivers/md/dm-bio-prison.c
index 90a5662..bbe22a5 100644
--- a/drivers/md/dm-bio-prison.c
+++ b/drivers/md/dm-bio-prison.c
@@ -241,6 +241,20 @@ void dm_cell_error(struct dm_bio_prison *prison,
}
EXPORT_SYMBOL_GPL(dm_cell_error);
+void dm_cell_visit_release(struct dm_bio_prison *prison,
+ void (*visit_fn)(void *, struct dm_bio_prison_cell *),
+ void *context,
+ struct dm_bio_prison_cell *cell)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&prison->lock, flags);
+ visit_fn(context, cell);
+ rb_erase(&cell->node, &prison->cells);
+ spin_unlock_irqrestore(&prison->lock, flags);
+}
+EXPORT_SYMBOL_GPL(dm_cell_visit_release);
+
/*----------------------------------------------------------------*/
#define DEFERRED_SET_SIZE 64
diff --git a/drivers/md/dm-bio-prison.h b/drivers/md/dm-bio-prison.h
index c0cddb1..b039886 100644
--- a/drivers/md/dm-bio-prison.h
+++ b/drivers/md/dm-bio-prison.h
@@ -89,6 +89,14 @@ void dm_cell_release_no_holder(struct dm_bio_prison *prison,
void dm_cell_error(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell, int error);
+/*
+ * Visits the cell and then releases. Guarantees no new inmates are
+ * inserted between the visit and release.
+ */
+void dm_cell_visit_release(struct dm_bio_prison *prison,
+ void (*visit_fn)(void *, struct dm_bio_prison_cell *),
+ void *context, struct dm_bio_prison_cell *cell);
+
/*----------------------------------------------------------------*/
/*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 507e674..3e1f40d 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -343,6 +343,15 @@ static void cell_release(struct pool *pool,
dm_bio_prison_free_cell(pool->prison, cell);
}
+static void cell_visit_release(struct pool *pool,
+ void (*fn)(void *, struct dm_bio_prison_cell *),
+ void *context,
+ struct dm_bio_prison_cell *cell)
+{
+ dm_cell_visit_release(pool->prison, fn, context, cell);
+ dm_bio_prison_free_cell(pool->prison, cell);
+}
+
static void cell_release_no_holder(struct pool *pool,
struct dm_bio_prison_cell *cell,
struct bio_list *bios)
@@ -674,55 +683,70 @@ static void overwrite_endio(struct bio *bio, int err)
*/
/*
- * This sends the bios in the cell back to the deferred_bios list.
+ * This sends the bios in the cell, except the original holder, back
+ * to the deferred_bios list.
*/
-static void cell_defer(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
struct pool *pool = tc->pool;
unsigned long flags;
spin_lock_irqsave(&tc->lock, flags);
- cell_release(pool, cell, &tc->deferred_bio_list);
+ cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
spin_unlock_irqrestore(&tc->lock, flags);
wake_worker(pool);
}
-/*
- * Same as cell_defer above, except it omits the original holder of the cell.
- */
-static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+static void thin_defer_bio(struct thin_c *tc, struct bio *bio);
+
+struct remap_info {
+ struct thin_c *tc;
+ struct bio_list bios;
+};
+
+static void __inc_remap_and_issue_cell(void *context,
+ struct dm_bio_prison_cell *cell)
{
- struct pool *pool = tc->pool;
- unsigned long flags;
+ struct remap_info *info = context;
+ struct bio *bio;
- spin_lock_irqsave(&tc->lock, flags);
- cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
- spin_unlock_irqrestore(&tc->lock, flags);
+ while ((bio = bio_list_pop(&cell->bios))) {
+ if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA))
+ thin_defer_bio(info->tc, bio);
+ else {
+ inc_all_io_entry(info->tc->pool, bio);
- wake_worker(pool);
+ /*
+ * We can't issue the bios with the bio prison lock
+ * held, so we add them to a list to issue on
+ * return from this function.
+ */
+ bio_list_add(&info->bios, bio);
+ }
+ }
}
-static void thin_defer_bio(struct thin_c *tc, struct bio *bio);
-
static void inc_remap_and_issue_cell(struct thin_c *tc,
struct dm_bio_prison_cell *cell,
dm_block_t block)
{
struct bio *bio;
- struct bio_list bios;
+ struct remap_info info;
- bio_list_init(&bios);
- cell_release_no_holder(tc->pool, cell, &bios);
+ info.tc = tc;
+ bio_list_init(&info.bios);
- while ((bio = bio_list_pop(&bios))) {
- if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA))
- thin_defer_bio(tc, bio);
- else {
- inc_all_io_entry(tc->pool, bio);
- remap_and_issue(tc, bio, block);
- }
- }
+ /*
+ * We have to be careful to inc any bios we're about to issue
+ * before the cell is released, and avoid a race with new bios
+ * being added to the cell.
+ */
+ cell_visit_release(tc->pool, __inc_remap_and_issue_cell,
+ &info, cell);
+
+ while ((bio = bio_list_pop(&info.bios)))
+ remap_and_issue(info.tc, bio, block);
}
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
@@ -773,10 +797,13 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
* the bios in the cell.
*/
if (bio) {
- cell_defer_no_holder(tc, m->cell);
+ inc_remap_and_issue_cell(tc, m->cell, m->data_block);
bio_endio(bio, 0);
- } else
- cell_defer(tc, m->cell);
+ } else {
+ inc_all_io_entry(tc->pool, m->cell->holder);
+ remap_and_issue(tc, m->cell->holder, m->data_block);
+ inc_remap_and_issue_cell(tc, m->cell, m->data_block);
+ }
out:
list_del(&m->list);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 16/17] dm thin: direct dispatch when breaking sharing
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (14 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 15/17] dm thin: remap the bios in a cell immediately Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 17/17] dm thin: sort the deferred cells Mike Snitzer
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 66 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 13 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 3e1f40d..b585c0d 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1357,11 +1357,49 @@ static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
}
}
+static void __remap_and_issue_shared_cell(void *context,
+ struct dm_bio_prison_cell *cell)
+{
+ struct remap_info *info = context;
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&cell->bios))) {
+ if ((bio_data_dir(bio) == WRITE) ||
+ (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA)))
+ thin_defer_bio(info->tc, bio);
+ else {
+ struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));;
+
+ h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds);
+ inc_all_io_entry(info->tc->pool, bio);
+ bio_list_add(&info->bios, bio);
+ }
+ }
+}
+
+static void remap_and_issue_shared_cell(struct thin_c *tc,
+ struct dm_bio_prison_cell *cell,
+ dm_block_t block)
+{
+ struct bio *bio;
+ struct remap_info info;
+
+ info.tc = tc;
+ bio_list_init(&info.bios);
+
+ cell_visit_release(tc->pool, __remap_and_issue_shared_cell,
+ &info, cell);
+
+ while ((bio = bio_list_pop(&info.bios)))
+ remap_and_issue(tc, bio, block);
+}
+
static void process_shared_bio(struct thin_c *tc, struct bio *bio,
dm_block_t block,
- struct dm_thin_lookup_result *lookup_result)
+ struct dm_thin_lookup_result *lookup_result,
+ struct dm_bio_prison_cell *virt_cell)
{
- struct dm_bio_prison_cell *cell;
+ struct dm_bio_prison_cell *data_cell;
struct pool *pool = tc->pool;
struct dm_cell_key key;
@@ -1370,19 +1408,23 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
* of being broken so we have nothing further to do here.
*/
build_data_key(tc->td, lookup_result->block, &key);
- if (bio_detain(pool, &key, bio, &cell))
+ if (bio_detain(pool, &key, bio, &data_cell)) {
+ cell_defer_no_holder(tc, virt_cell);
return;
+ }
- if (bio_data_dir(bio) == WRITE && bio->bi_iter.bi_size)
- break_sharing(tc, bio, block, &key, lookup_result, cell);
- else {
+ if (bio_data_dir(bio) == WRITE && bio->bi_iter.bi_size) {
+ break_sharing(tc, bio, block, &key, lookup_result, data_cell);
+ cell_defer_no_holder(tc, virt_cell);
+ } else {
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
h->shared_read_entry = dm_deferred_entry_inc(pool->shared_read_ds);
inc_all_io_entry(pool, bio);
- cell_defer_no_holder(tc, cell);
+ remap_and_issue(tc, bio, block);
- remap_and_issue(tc, bio, lookup_result->block);
+ remap_and_issue_shared_cell(tc, data_cell, lookup_result->block);
+ remap_and_issue_shared_cell(tc, virt_cell, lookup_result->block);
}
}
@@ -1446,11 +1488,9 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
- if (lookup_result.shared) {
- process_shared_bio(tc, bio, block, &lookup_result);
- // FIXME: we can't remap because we're waiting on a commit
- cell_defer_no_holder(tc, cell); /* FIXME: pass this cell into process_shared? */
- } else {
+ if (lookup_result.shared)
+ process_shared_bio(tc, bio, block, &lookup_result, cell);
+ else {
inc_all_io_entry(pool, bio);
remap_and_issue(tc, bio, lookup_result.block);
inc_remap_and_issue_cell(tc, cell, lookup_result.block);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH 17/17] dm thin: sort the deferred cells
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (15 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 16/17] dm thin: direct dispatch when breaking sharing Mike Snitzer
@ 2014-10-17 6:07 ` Mike Snitzer
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
17 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 6:07 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells(). This significantly improves the ondisk
layout on rotational storage, whereby improving read performance.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 88 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 68 insertions(+), 20 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index b585c0d..16be238 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -17,6 +17,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/sort.h>
#include <linux/rbtree.h>
#define DM_MSG_PREFIX "thin"
@@ -205,6 +206,8 @@ typedef void (*process_bio_fn)(struct thin_c *tc, struct bio *bio);
typedef void (*process_cell_fn)(struct thin_c *tc, struct dm_bio_prison_cell *cell);
typedef void (*process_mapping_fn)(struct dm_thin_new_mapping *m);
+#define CELL_SORT_ARRAY_SIZE 8192
+
struct pool {
struct list_head list;
struct dm_target *ti; /* Only set if a pool target is bound */
@@ -252,6 +255,8 @@ struct pool {
process_mapping_fn process_prepared_mapping;
process_mapping_fn process_prepared_discard;
+
+ struct dm_bio_prison_cell *cell_sort_array[CELL_SORT_ARRAY_SIZE];
};
static enum pool_mode get_pool_mode(struct pool *pool);
@@ -1758,12 +1763,48 @@ static void process_thin_deferred_bios(struct thin_c *tc)
blk_finish_plug(&plug);
}
+static int cmp_cells(const void *lhs, const void *rhs)
+{
+ struct dm_bio_prison_cell *lhs_cell = *((struct dm_bio_prison_cell **) lhs);
+ struct dm_bio_prison_cell *rhs_cell = *((struct dm_bio_prison_cell **) rhs);
+
+ BUG_ON(!lhs_cell->holder);
+ BUG_ON(!rhs_cell->holder);
+
+ if (lhs_cell->holder->bi_iter.bi_sector < rhs_cell->holder->bi_iter.bi_sector)
+ return -1;
+
+ if (lhs_cell->holder->bi_iter.bi_sector > rhs_cell->holder->bi_iter.bi_sector)
+ return 1;
+
+ return 0;
+}
+
+static unsigned sort_cells(struct pool *pool, struct list_head *cells)
+{
+ unsigned count = 0;
+ struct dm_bio_prison_cell *cell, *tmp;
+
+ list_for_each_entry_safe(cell, tmp, cells, user_list) {
+ if (count >= CELL_SORT_ARRAY_SIZE)
+ break;
+
+ pool->cell_sort_array[count++] = cell;
+ list_del(&cell->user_list);
+ }
+
+ sort(pool->cell_sort_array, count, sizeof(cell), cmp_cells, NULL);
+
+ return count;
+}
+
static void process_thin_deferred_cells(struct thin_c *tc)
{
struct pool *pool = tc->pool;
unsigned long flags;
struct list_head cells;
- struct dm_bio_prison_cell *cell, *tmp;
+ struct dm_bio_prison_cell *cell;
+ unsigned i, j, count;
// FIXME: push down into per cell processing
if (tc->requeue_mode) {
@@ -1780,27 +1821,34 @@ static void process_thin_deferred_cells(struct thin_c *tc)
if (list_empty(&cells))
return;
- list_for_each_entry_safe(cell, tmp, &cells, user_list) {
- BUG_ON(!cell->holder);
+ do {
+ count = sort_cells(tc->pool, &cells);
- /*
- * If we've got no free new_mapping structs, and processing
- * this bio might require one, we pause until there are some
- * prepared mappings to process.
- */
- if (ensure_next_mapping(pool)) {
- spin_lock_irqsave(&tc->lock, flags);
- list_add(&cell->user_list, &tc->deferred_cells);
- list_splice(&cells, &tc->deferred_cells);
- spin_unlock_irqrestore(&tc->lock, flags);
- break;
- }
+ for (i = 0; i < count; i++) {
+ cell = pool->cell_sort_array[i];
+ BUG_ON(!cell->holder);
- if (cell->holder->bi_rw & REQ_DISCARD)
- pool->process_discard_cell(tc, cell);
- else
- pool->process_cell(tc, cell);
- }
+ /*
+ * If we've got no free new_mapping structs, and processing
+ * this bio might require one, we pause until there are some
+ * prepared mappings to process.
+ */
+ if (ensure_next_mapping(pool)) {
+ for (j = i; j < count; j++)
+ list_add(&pool->cell_sort_array[j]->user_list, &cells);
+
+ spin_lock_irqsave(&tc->lock, flags);
+ list_splice(&cells, &tc->deferred_cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+ return;
+ }
+
+ if (cell->holder->bi_rw & REQ_DISCARD)
+ pool->process_discard_cell(tc, cell);
+ else
+ pool->process_cell(tc, cell);
+ }
+ } while (!list_empty(&cells));
}
static void thin_get(struct thin_c *tc);
--
1.9.3
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [for-3.19 PATCH 00/17] dm thin: performance improvements
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (16 preceding siblings ...)
2014-10-17 6:07 ` [for-3.19 PATCH 17/17] dm thin: sort the deferred cells Mike Snitzer
@ 2014-10-17 18:31 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
` (20 more replies)
17 siblings, 21 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:31 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
On Fri, Oct 17 2014 at 2:06am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:
> Here is a patchset that Joe and I have been working on for the past 2
> weeks to address various performance problems reported against DM thin
> provisioning. DM thinp is now much more capable in the face of heavy
> IO (be it random or sequential, multithreaded, etc).
>
> I've just added these patches to linux-dm.git's 'for-next' branch so
> that these changes get early exposure/testing for 3.19 inclusion.
Testing with device-mapper-test-suite on really fast storage uncovered
various races. Joe quickly fixed these and I've folded the fixes into
the patch series.
v2 will be staged staged in linux-dm.git's 'for-next' branch (after
rebase) very soon. But I'll respond with the full patchset again just
to help anyone who cares to review/respond on list.
Thanks,
Mike
^ permalink raw reply [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-21 22:48 ` Mikulas Patocka
2014-10-17 18:37 ` [for-3.19 PATCH v2 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
` (19 subsequent siblings)
20 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Converting over to using an rbtree eliminates a fixed 8MB allocation
from vmalloc space for the hash table.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bufio.c | 97 ++++++++++++++++++++++++++++-----------------------
1 file changed, 54 insertions(+), 43 deletions(-)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 0be200b..dcaa1d9 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -14,6 +14,7 @@
#include <linux/vmalloc.h>
#include <linux/shrinker.h>
#include <linux/module.h>
+#include <linux/rbtree.h>
#define DM_MSG_PREFIX "bufio"
@@ -48,14 +49,6 @@
#define DM_BUFIO_INLINE_VECS 16
/*
- * Buffer hash
- */
-#define DM_BUFIO_HASH_BITS 20
-#define DM_BUFIO_HASH(block) \
- ((((block) >> DM_BUFIO_HASH_BITS) ^ (block)) & \
- ((1 << DM_BUFIO_HASH_BITS) - 1))
-
-/*
* Don't try to use kmem_cache_alloc for blocks larger than this.
* For explanation, see alloc_buffer_data below.
*/
@@ -106,7 +99,7 @@ struct dm_bufio_client {
unsigned minimum_buffers;
- struct hlist_head *cache_hash;
+ struct rb_root buffer_tree;
wait_queue_head_t free_buffer_wait;
int async_write_error;
@@ -135,7 +128,7 @@ enum data_mode {
};
struct dm_buffer {
- struct hlist_node hash_list;
+ struct rb_node node;
struct list_head lru_list;
sector_t block;
void *data;
@@ -253,6 +246,53 @@ static LIST_HEAD(dm_bufio_all_clients);
*/
static DEFINE_MUTEX(dm_bufio_clients_lock);
+/*----------------------------------------------------------------
+ * A red/black tree acts as an index for all the buffers.
+ *--------------------------------------------------------------*/
+static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
+{
+ struct rb_node *n = c->buffer_tree.rb_node;
+ struct dm_buffer *b;
+
+ while (n) {
+ b = container_of(n, struct dm_buffer, node);
+
+ if (b->block == block)
+ return b;
+
+ n = (b->block < block) ? n->rb_left : n->rb_right;
+ }
+
+ return NULL;
+}
+
+static void __insert(struct dm_bufio_client *c, struct dm_buffer *b)
+{
+ struct rb_node **new = &c->buffer_tree.rb_node, *parent = NULL;
+ struct dm_buffer *found;
+
+ while (*new) {
+ found = container_of(*new, struct dm_buffer, node);
+
+ if (found->block == b->block) {
+ BUG_ON(found != b);
+ return;
+ }
+
+ parent = *new;
+ new = (found->block < b->block) ?
+ &((*new)->rb_left) : &((*new)->rb_right);
+ }
+
+ rb_link_node(&b->node, parent, new);
+ rb_insert_color(&b->node, &c->buffer_tree);
+}
+
+static void __remove(struct dm_bufio_client *c, struct dm_buffer *b)
+{
+ rb_erase(&b->node, &c->buffer_tree);
+}
+
/*----------------------------------------------------------------*/
static void adjust_total_allocated(enum data_mode data_mode, long diff)
@@ -434,7 +474,7 @@ static void __link_buffer(struct dm_buffer *b, sector_t block, int dirty)
b->block = block;
b->list_mode = dirty;
list_add(&b->lru_list, &c->lru[dirty]);
- hlist_add_head(&b->hash_list, &c->cache_hash[DM_BUFIO_HASH(block)]);
+ __insert(b->c, b);
b->last_accessed = jiffies;
}
@@ -448,7 +488,7 @@ static void __unlink_buffer(struct dm_buffer *b)
BUG_ON(!c->n_buffers[b->list_mode]);
c->n_buffers[b->list_mode]--;
- hlist_del(&b->hash_list);
+ __remove(b->c, b);
list_del(&b->lru_list);
}
@@ -888,23 +928,6 @@ static void __check_watermark(struct dm_bufio_client *c,
__write_dirty_buffers_async(c, 1, write_list);
}
-/*
- * Find a buffer in the hash.
- */
-static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
-{
- struct dm_buffer *b;
-
- hlist_for_each_entry(b, &c->cache_hash[DM_BUFIO_HASH(block)],
- hash_list) {
- dm_bufio_cond_resched();
- if (b->block == block)
- return b;
- }
-
- return NULL;
-}
-
/*----------------------------------------------------------------
* Getting a buffer
*--------------------------------------------------------------*/
@@ -1534,11 +1557,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
r = -ENOMEM;
goto bad_client;
}
- c->cache_hash = vmalloc(sizeof(struct hlist_head) << DM_BUFIO_HASH_BITS);
- if (!c->cache_hash) {
- r = -ENOMEM;
- goto bad_hash;
- }
+ c->buffer_tree = RB_ROOT;
c->bdev = bdev;
c->block_size = block_size;
@@ -1557,9 +1576,6 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
c->n_buffers[i] = 0;
}
- for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
- INIT_HLIST_HEAD(&c->cache_hash[i]);
-
mutex_init(&c->lock);
INIT_LIST_HEAD(&c->reserved_buffers);
c->need_reserved_buffers = reserved_buffers;
@@ -1633,8 +1649,6 @@ bad_cache:
}
dm_io_client_destroy(c->dm_io);
bad_dm_io:
- vfree(c->cache_hash);
-bad_hash:
kfree(c);
bad_client:
return ERR_PTR(r);
@@ -1661,9 +1675,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
mutex_unlock(&dm_bufio_clients_lock);
- for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
- BUG_ON(!hlist_empty(&c->cache_hash[i]));
-
+ BUG_ON(!RB_EMPTY_ROOT(&c->buffer_tree));
BUG_ON(c->need_reserved_buffers);
while (!list_empty(&c->reserved_buffers)) {
@@ -1681,7 +1693,6 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
BUG_ON(c->n_buffers[i]);
dm_io_client_destroy(c->dm_io);
- vfree(c->cache_hash);
kfree(c);
}
EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 02/17] dm bufio: evict buffers that are past the max age but retain some buffers
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-31 16:37 ` Mikulas Patocka
2014-10-17 18:37 ` [for-3.19 PATCH v2 03/17] dm bio prison: switch to using a red black tree Mike Snitzer
` (18 subsequent siblings)
20 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
These changes help keep metadata backed by dm-bufio in-core longer which
fixes reports of metadata churn in the face of heavy random IO workloads.
Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
because associated buffers had been idle for some time is unfriendly.
Now, the user may now configure the number of bytes that bufio retains
using the 'retain_bytes' module parameter. The default is 256K.
Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
defaults were quite low so increase them (to 30 and 300 respectively).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bufio.c | 109 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 75 insertions(+), 34 deletions(-)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index dcaa1d9..99579c8 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -35,12 +35,17 @@
/*
* Check buffer ages in this interval (seconds)
*/
-#define DM_BUFIO_WORK_TIMER_SECS 10
+#define DM_BUFIO_WORK_TIMER_SECS 30
/*
* Free buffers when they are older than this (seconds)
*/
-#define DM_BUFIO_DEFAULT_AGE_SECS 60
+#define DM_BUFIO_DEFAULT_AGE_SECS 300
+
+/*
+ * The nr of bytes of cached data to keep around.
+ */
+#define DM_BUFIO_DEFAULT_RETAIN_BYTES (256 * 1024)
/*
* The number of bvec entries that are embedded directly in the buffer.
@@ -216,6 +221,7 @@ static DEFINE_SPINLOCK(param_spinlock);
* Buffers are freed after this timeout
*/
static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS;
+static unsigned dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES;
static unsigned long dm_bufio_peak_allocated;
static unsigned long dm_bufio_allocated_kmem_cache;
@@ -1457,45 +1463,52 @@ static void drop_buffers(struct dm_bufio_client *c)
}
/*
- * Test if the buffer is unused and too old, and commit it.
+ * We may not be able to evict this buffer if IO pending or the client
+ * is still using it. Caller is expected to know buffer is too old.
+ *
* And if GFP_NOFS is used, we must not do any I/O because we hold
* dm_bufio_clients_lock and we would risk deadlock if the I/O gets
* rerouted to different bufio client.
*/
-static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
- unsigned long max_jiffies)
+static bool __try_evict_buffer(struct dm_buffer *b, gfp_t gfp)
{
- if (jiffies - b->last_accessed < max_jiffies)
- return 0;
-
if (!(gfp & __GFP_FS)) {
if (test_bit(B_READING, &b->state) ||
test_bit(B_WRITING, &b->state) ||
test_bit(B_DIRTY, &b->state))
- return 0;
+ return false;
}
if (b->hold_count)
- return 0;
+ return false;
__make_buffer_clean(b);
__unlink_buffer(b);
__free_buffer_wake(b);
- return 1;
+ return true;
}
-static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
- gfp_t gfp_mask)
+static unsigned get_retain_buffers(struct dm_bufio_client *c)
+{
+ unsigned retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes);
+ return retain_bytes / c->block_size;
+}
+
+static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
+ gfp_t gfp_mask)
{
int l;
struct dm_buffer *b, *tmp;
- long freed = 0;
+ unsigned long freed = 0;
+ unsigned long count = nr_to_scan;
+ unsigned retain_target = get_retain_buffers(c);
for (l = 0; l < LIST_SIZE; l++) {
list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
- freed += __cleanup_old_buffer(b, gfp_mask, 0);
- if (!--nr_to_scan)
+ if (__try_evict_buffer(b, gfp_mask))
+ freed++;
+ if (!--nr_to_scan || ((count - freed) <= retain_target))
return freed;
dm_bufio_cond_resched();
}
@@ -1697,31 +1710,56 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
}
EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
-static void cleanup_old_buffers(void)
+static unsigned get_max_age_hz(void)
{
- unsigned long max_age = ACCESS_ONCE(dm_bufio_max_age);
- struct dm_bufio_client *c;
+ unsigned max_age = ACCESS_ONCE(dm_bufio_max_age);
- if (max_age > ULONG_MAX / HZ)
- max_age = ULONG_MAX / HZ;
+ if (max_age > UINT_MAX / HZ)
+ max_age = UINT_MAX / HZ;
- mutex_lock(&dm_bufio_clients_lock);
- list_for_each_entry(c, &dm_bufio_all_clients, client_list) {
- if (!dm_bufio_trylock(c))
- continue;
+ return max_age * HZ;
+}
- while (!list_empty(&c->lru[LIST_CLEAN])) {
- struct dm_buffer *b;
- b = list_entry(c->lru[LIST_CLEAN].prev,
- struct dm_buffer, lru_list);
- if (!__cleanup_old_buffer(b, 0, max_age * HZ))
- break;
- dm_bufio_cond_resched();
- }
+static bool older_than(struct dm_buffer *b, unsigned long age_hz)
+{
+ return (jiffies - b->last_accessed) >= age_hz;
+}
+
+static void __evict_old_buffers(struct dm_bufio_client *c, unsigned long age_hz)
+{
+ struct dm_buffer *b, *tmp;
+ unsigned retain_target = get_retain_buffers(c);
+ unsigned count;
+
+ dm_bufio_lock(c);
+
+ count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
+ list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_CLEAN], lru_list) {
+ if (count <= retain_target)
+ break;
+
+ if (!older_than(b, age_hz))
+ break;
+
+ if (__try_evict_buffer(b, 0))
+ count--;
- dm_bufio_unlock(c);
dm_bufio_cond_resched();
}
+
+ dm_bufio_unlock(c);
+}
+
+static void cleanup_old_buffers(void)
+{
+ unsigned long max_age_hz = get_max_age_hz();
+ struct dm_bufio_client *c;
+
+ mutex_lock(&dm_bufio_clients_lock);
+
+ list_for_each_entry(c, &dm_bufio_all_clients, client_list)
+ __evict_old_buffers(c, max_age_hz);
+
mutex_unlock(&dm_bufio_clients_lock);
}
@@ -1846,6 +1884,9 @@ MODULE_PARM_DESC(max_cache_size_bytes, "Size of metadata cache");
module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(max_age_seconds, "Max age of a buffer in seconds");
+module_param_named(retain_bytes, dm_bufio_retain_bytes, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(retain_bytes, "Try to keep at least this many bytes cached in memory");
+
module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(peak_allocated_bytes, "Tracks the maximum allocated memory");
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 03/17] dm bio prison: switch to using a red black tree
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO Mike Snitzer
` (17 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Previously it was using a fixed sized hash table. There are times
when very many concurrent cells are held (such as when processing a very
large discard). When this happens the hash table performance becomes
very poor.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bio-prison.c | 172 ++++++++++++++++++-------------------------
drivers/md/dm-bio-prison.h | 7 +-
drivers/md/dm-cache-target.c | 3 +-
drivers/md/dm-thin.c | 3 +-
4 files changed, 79 insertions(+), 106 deletions(-)
diff --git a/drivers/md/dm-bio-prison.c b/drivers/md/dm-bio-prison.c
index f752d12..90a5662 100644
--- a/drivers/md/dm-bio-prison.c
+++ b/drivers/md/dm-bio-prison.c
@@ -14,68 +14,38 @@
/*----------------------------------------------------------------*/
-struct bucket {
- spinlock_t lock;
- struct hlist_head cells;
-};
+#define MIN_CELLS 1024
struct dm_bio_prison {
+ spinlock_t lock;
mempool_t *cell_pool;
-
- unsigned nr_buckets;
- unsigned hash_mask;
- struct bucket *buckets;
+ struct rb_root cells;
};
-/*----------------------------------------------------------------*/
-
-static uint32_t calc_nr_buckets(unsigned nr_cells)
-{
- uint32_t n = 128;
-
- nr_cells /= 4;
- nr_cells = min(nr_cells, 8192u);
-
- while (n < nr_cells)
- n <<= 1;
-
- return n;
-}
-
static struct kmem_cache *_cell_cache;
-static void init_bucket(struct bucket *b)
-{
- spin_lock_init(&b->lock);
- INIT_HLIST_HEAD(&b->cells);
-}
+/*----------------------------------------------------------------*/
/*
* @nr_cells should be the number of cells you want in use _concurrently_.
* Don't confuse it with the number of distinct keys.
*/
-struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells)
+struct dm_bio_prison *dm_bio_prison_create(void)
{
- unsigned i;
- uint32_t nr_buckets = calc_nr_buckets(nr_cells);
- size_t len = sizeof(struct dm_bio_prison) +
- (sizeof(struct bucket) * nr_buckets);
- struct dm_bio_prison *prison = kmalloc(len, GFP_KERNEL);
+ struct dm_bio_prison *prison = kmalloc(sizeof(*prison), GFP_KERNEL);
if (!prison)
return NULL;
- prison->cell_pool = mempool_create_slab_pool(nr_cells, _cell_cache);
+ spin_lock_init(&prison->lock);
+
+ prison->cell_pool = mempool_create_slab_pool(MIN_CELLS, _cell_cache);
if (!prison->cell_pool) {
kfree(prison);
return NULL;
}
- prison->nr_buckets = nr_buckets;
- prison->hash_mask = nr_buckets - 1;
- prison->buckets = (struct bucket *) (prison + 1);
- for (i = 0; i < nr_buckets; i++)
- init_bucket(prison->buckets + i);
+ prison->cells = RB_ROOT;
return prison;
}
@@ -101,68 +71,73 @@ void dm_bio_prison_free_cell(struct dm_bio_prison *prison,
}
EXPORT_SYMBOL_GPL(dm_bio_prison_free_cell);
-static uint32_t hash_key(struct dm_bio_prison *prison, struct dm_cell_key *key)
+static void __setup_new_cell(struct dm_cell_key *key,
+ struct bio *holder,
+ struct dm_bio_prison_cell *cell)
{
- const unsigned long BIG_PRIME = 4294967291UL;
- uint64_t hash = key->block * BIG_PRIME;
-
- return (uint32_t) (hash & prison->hash_mask);
+ memcpy(&cell->key, key, sizeof(cell->key));
+ cell->holder = holder;
+ bio_list_init(&cell->bios);
}
-static int keys_equal(struct dm_cell_key *lhs, struct dm_cell_key *rhs)
+static int cmp_keys(struct dm_cell_key *lhs,
+ struct dm_cell_key *rhs)
{
- return (lhs->virtual == rhs->virtual) &&
- (lhs->dev == rhs->dev) &&
- (lhs->block == rhs->block);
-}
+ if (lhs->virtual < rhs->virtual)
+ return -1;
-static struct bucket *get_bucket(struct dm_bio_prison *prison,
- struct dm_cell_key *key)
-{
- return prison->buckets + hash_key(prison, key);
-}
+ if (lhs->virtual > rhs->virtual)
+ return 1;
-static struct dm_bio_prison_cell *__search_bucket(struct bucket *b,
- struct dm_cell_key *key)
-{
- struct dm_bio_prison_cell *cell;
+ if (lhs->dev < rhs->dev)
+ return -1;
- hlist_for_each_entry(cell, &b->cells, list)
- if (keys_equal(&cell->key, key))
- return cell;
+ if (lhs->dev > rhs->dev)
+ return 1;
- return NULL;
-}
+ if (lhs->block < rhs->block)
+ return -1;
-static void __setup_new_cell(struct bucket *b,
- struct dm_cell_key *key,
- struct bio *holder,
- struct dm_bio_prison_cell *cell)
-{
- memcpy(&cell->key, key, sizeof(cell->key));
- cell->holder = holder;
- bio_list_init(&cell->bios);
- hlist_add_head(&cell->list, &b->cells);
+ if (lhs->block > rhs->block)
+ return 1;
+
+ return 0;
}
-static int __bio_detain(struct bucket *b,
+static int __bio_detain(struct dm_bio_prison *prison,
struct dm_cell_key *key,
struct bio *inmate,
struct dm_bio_prison_cell *cell_prealloc,
struct dm_bio_prison_cell **cell_result)
{
- struct dm_bio_prison_cell *cell;
-
- cell = __search_bucket(b, key);
- if (cell) {
- if (inmate)
- bio_list_add(&cell->bios, inmate);
- *cell_result = cell;
- return 1;
+ int r;
+ struct rb_node **new = &prison->cells.rb_node, *parent = NULL;
+
+ while (*new) {
+ struct dm_bio_prison_cell *cell =
+ container_of(*new, struct dm_bio_prison_cell, node);
+
+ r = cmp_keys(key, &cell->key);
+
+ parent = *new;
+ if (r < 0)
+ new = &((*new)->rb_left);
+ else if (r > 0)
+ new = &((*new)->rb_right);
+ else {
+ if (inmate)
+ bio_list_add(&cell->bios, inmate);
+ *cell_result = cell;
+ return 1;
+ }
}
- __setup_new_cell(b, key, inmate, cell_prealloc);
+ __setup_new_cell(key, inmate, cell_prealloc);
*cell_result = cell_prealloc;
+
+ rb_link_node(&cell_prealloc->node, parent, new);
+ rb_insert_color(&cell_prealloc->node, &prison->cells);
+
return 0;
}
@@ -174,11 +149,10 @@ static int bio_detain(struct dm_bio_prison *prison,
{
int r;
unsigned long flags;
- struct bucket *b = get_bucket(prison, key);
- spin_lock_irqsave(&b->lock, flags);
- r = __bio_detain(b, key, inmate, cell_prealloc, cell_result);
- spin_unlock_irqrestore(&b->lock, flags);
+ spin_lock_irqsave(&prison->lock, flags);
+ r = __bio_detain(prison, key, inmate, cell_prealloc, cell_result);
+ spin_unlock_irqrestore(&prison->lock, flags);
return r;
}
@@ -205,10 +179,11 @@ EXPORT_SYMBOL_GPL(dm_get_cell);
/*
* @inmates must have been initialised prior to this call
*/
-static void __cell_release(struct dm_bio_prison_cell *cell,
+static void __cell_release(struct dm_bio_prison *prison,
+ struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
- hlist_del(&cell->list);
+ rb_erase(&cell->node, &prison->cells);
if (inmates) {
if (cell->holder)
@@ -222,21 +197,21 @@ void dm_cell_release(struct dm_bio_prison *prison,
struct bio_list *bios)
{
unsigned long flags;
- struct bucket *b = get_bucket(prison, &cell->key);
- spin_lock_irqsave(&b->lock, flags);
- __cell_release(cell, bios);
- spin_unlock_irqrestore(&b->lock, flags);
+ spin_lock_irqsave(&prison->lock, flags);
+ __cell_release(prison, cell, bios);
+ spin_unlock_irqrestore(&prison->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_cell_release);
/*
* Sometimes we don't want the holder, just the additional bios.
*/
-static void __cell_release_no_holder(struct dm_bio_prison_cell *cell,
+static void __cell_release_no_holder(struct dm_bio_prison *prison,
+ struct dm_bio_prison_cell *cell,
struct bio_list *inmates)
{
- hlist_del(&cell->list);
+ rb_erase(&cell->node, &prison->cells);
bio_list_merge(inmates, &cell->bios);
}
@@ -245,11 +220,10 @@ void dm_cell_release_no_holder(struct dm_bio_prison *prison,
struct bio_list *inmates)
{
unsigned long flags;
- struct bucket *b = get_bucket(prison, &cell->key);
- spin_lock_irqsave(&b->lock, flags);
- __cell_release_no_holder(cell, inmates);
- spin_unlock_irqrestore(&b->lock, flags);
+ spin_lock_irqsave(&prison->lock, flags);
+ __cell_release_no_holder(prison, cell, inmates);
+ spin_unlock_irqrestore(&prison->lock, flags);
}
EXPORT_SYMBOL_GPL(dm_cell_release_no_holder);
diff --git a/drivers/md/dm-bio-prison.h b/drivers/md/dm-bio-prison.h
index 6805a14..997a439 100644
--- a/drivers/md/dm-bio-prison.h
+++ b/drivers/md/dm-bio-prison.h
@@ -10,8 +10,8 @@
#include "persistent-data/dm-block-manager.h" /* FIXME: for dm_block_t */
#include "dm-thin-metadata.h" /* FIXME: for dm_thin_id */
-#include <linux/list.h>
#include <linux/bio.h>
+#include <linux/rbtree.h>
/*----------------------------------------------------------------*/
@@ -35,13 +35,14 @@ struct dm_cell_key {
* themselves.
*/
struct dm_bio_prison_cell {
- struct hlist_node list;
+ struct rb_node node;
+
struct dm_cell_key key;
struct bio *holder;
struct bio_list bios;
};
-struct dm_bio_prison *dm_bio_prison_create(unsigned nr_cells);
+struct dm_bio_prison *dm_bio_prison_create(void);
void dm_bio_prison_destroy(struct dm_bio_prison *prison);
/*
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 7130505..69de8b4 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -95,7 +95,6 @@ static void dm_unhook_bio(struct dm_hook_info *h, struct bio *bio)
/*----------------------------------------------------------------*/
-#define PRISON_CELLS 1024
#define MIGRATION_POOL_SIZE 128
#define COMMIT_PERIOD HZ
#define MIGRATION_COUNT_WINDOW 10
@@ -2327,7 +2326,7 @@ static int cache_create(struct cache_args *ca, struct cache **result)
INIT_DELAYED_WORK(&cache->waker, do_waker);
cache->last_commit_jiffies = jiffies;
- cache->prison = dm_bio_prison_create(PRISON_CELLS);
+ cache->prison = dm_bio_prison_create();
if (!cache->prison) {
*error = "could not create bio prison";
goto bad;
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4843801..2459628 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -25,7 +25,6 @@
*/
#define ENDIO_HOOK_POOL_SIZE 1024
#define MAPPING_POOL_SIZE 1024
-#define PRISON_CELLS 1024
#define COMMIT_PERIOD HZ
#define NO_SPACE_TIMEOUT_SECS 60
@@ -2185,7 +2184,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
pool->sectors_per_block_shift = __ffs(block_size);
pool->low_water_blocks = 0;
pool_features_init(&pool->pf);
- pool->prison = dm_bio_prison_create(PRISON_CELLS);
+ pool->prison = dm_bio_prison_create();
if (!pool->prison) {
*error = "Error creating pool's bio prison";
err_p = ERR_PTR(-ENOMEM);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (2 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 03/17] dm bio prison: switch to using a red black tree Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 05/17] dm transaction manager: add support for prefetching blocks of metadata Mike Snitzer
` (16 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This change is a prerequisite for allowing metadata to be prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin-metadata.c | 30 +++++++++++++-----------------
drivers/md/dm-thin-metadata.h | 4 ++--
2 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index e9d33ad..ee42d1c 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1384,42 +1384,38 @@ static bool __snapshotted_since(struct dm_thin_device *td, uint32_t time)
}
int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
- int can_block, struct dm_thin_lookup_result *result)
+ int can_issue_io, struct dm_thin_lookup_result *result)
{
- int r = -EINVAL;
- uint64_t block_time = 0;
+ int r;
__le64 value;
struct dm_pool_metadata *pmd = td->pmd;
dm_block_t keys[2] = { td->id, block };
struct dm_btree_info *info;
- if (can_block) {
- down_read(&pmd->root_lock);
- info = &pmd->info;
- } else if (down_read_trylock(&pmd->root_lock))
- info = &pmd->nb_info;
- else
- return -EWOULDBLOCK;
-
if (pmd->fail_io)
- goto out;
+ return -EINVAL;
- r = dm_btree_lookup(info, pmd->root, keys, &value);
- if (!r)
- block_time = le64_to_cpu(value);
+ down_read(&pmd->root_lock);
-out:
- up_read(&pmd->root_lock);
+ if (can_issue_io) {
+ info = &pmd->info;
+ } else
+ info = &pmd->nb_info;
+ r = dm_btree_lookup(info, pmd->root, keys, &value);
if (!r) {
+ uint64_t block_time = 0;
dm_block_t exception_block;
uint32_t exception_time;
+
+ block_time = le64_to_cpu(value);
unpack_block_time(block_time, &exception_block,
&exception_time);
result->block = exception_block;
result->shared = __snapshotted_since(td, exception_time);
}
+ up_read(&pmd->root_lock);
return r;
}
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index e3c857d..efedd5a 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -139,12 +139,12 @@ struct dm_thin_lookup_result {
/*
* Returns:
- * -EWOULDBLOCK iff @can_block is set and would block.
+ * -EWOULDBLOCK iff @can_issue_io is set and would issue IO
* -ENODATA iff that mapping is not present.
* 0 success
*/
int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
- int can_block, struct dm_thin_lookup_result *result);
+ int can_issue_io, struct dm_thin_lookup_result *result);
/*
* Obtain an unused block.
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 05/17] dm transaction manager: add support for prefetching blocks of metadata
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (3 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 06/17] dm thin: prefetch missing metadata pages Mike Snitzer
` (15 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Introduce the dm_tm_issue_prefetches interface. If you're using a
non-blocking clone the tm will build up a list of requested blocks that
weren't in core. dm_tm_issue_prefetches will request those blocks to be
prefetched.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
.../md/persistent-data/dm-transaction-manager.c | 77 +++++++++++++++++++++-
.../md/persistent-data/dm-transaction-manager.h | 7 ++
2 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/drivers/md/persistent-data/dm-transaction-manager.c b/drivers/md/persistent-data/dm-transaction-manager.c
index 3bc30a0..9cb797d 100644
--- a/drivers/md/persistent-data/dm-transaction-manager.c
+++ b/drivers/md/persistent-data/dm-transaction-manager.c
@@ -10,6 +10,8 @@
#include "dm-persistent-data-internal.h"
#include <linux/export.h>
+#include <linux/mutex.h>
+#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/device-mapper.h>
@@ -17,6 +19,61 @@
/*----------------------------------------------------------------*/
+#define PREFETCH_SIZE 128
+#define PREFETCH_BITS 7
+#define PREFETCH_SENTINEL ((dm_block_t) -1ULL)
+
+struct prefetch_set {
+ struct mutex lock;
+ dm_block_t blocks[PREFETCH_SIZE];
+};
+
+static unsigned prefetch_hash(dm_block_t b)
+{
+ return hash_64(b, PREFETCH_BITS);
+}
+
+static void prefetch_wipe(struct prefetch_set *p)
+{
+ unsigned i;
+ for (i = 0; i < PREFETCH_SIZE; i++)
+ p->blocks[i] = PREFETCH_SENTINEL;
+}
+
+static void prefetch_init(struct prefetch_set *p)
+{
+ mutex_init(&p->lock);
+ prefetch_wipe(p);
+}
+
+static void prefetch_add(struct prefetch_set *p, dm_block_t b)
+{
+ unsigned h = prefetch_hash(b);
+
+ mutex_lock(&p->lock);
+ if (p->blocks[h] == PREFETCH_SENTINEL)
+ p->blocks[h] = b;
+
+ mutex_unlock(&p->lock);
+}
+
+static void prefetch_issue(struct prefetch_set *p, struct dm_block_manager *bm)
+{
+ unsigned i;
+
+ mutex_lock(&p->lock);
+
+ for (i = 0; i < PREFETCH_SIZE; i++)
+ if (p->blocks[i] != PREFETCH_SENTINEL) {
+ dm_bm_prefetch(bm, p->blocks[i]);
+ p->blocks[i] = PREFETCH_SENTINEL;
+ }
+
+ mutex_unlock(&p->lock);
+}
+
+/*----------------------------------------------------------------*/
+
struct shadow_info {
struct hlist_node hlist;
dm_block_t where;
@@ -37,6 +94,8 @@ struct dm_transaction_manager {
spinlock_t lock;
struct hlist_head buckets[DM_HASH_SIZE];
+
+ struct prefetch_set prefetches;
};
/*----------------------------------------------------------------*/
@@ -117,6 +176,8 @@ static struct dm_transaction_manager *dm_tm_create(struct dm_block_manager *bm,
for (i = 0; i < DM_HASH_SIZE; i++)
INIT_HLIST_HEAD(tm->buckets + i);
+ prefetch_init(&tm->prefetches);
+
return tm;
}
@@ -268,8 +329,14 @@ int dm_tm_read_lock(struct dm_transaction_manager *tm, dm_block_t b,
struct dm_block_validator *v,
struct dm_block **blk)
{
- if (tm->is_clone)
- return dm_bm_read_try_lock(tm->real->bm, b, v, blk);
+ if (tm->is_clone) {
+ int r = dm_bm_read_try_lock(tm->real->bm, b, v, blk);
+
+ if (r == -EWOULDBLOCK)
+ prefetch_add(&tm->real->prefetches, b);
+
+ return r;
+ }
return dm_bm_read_lock(tm->bm, b, v, blk);
}
@@ -317,6 +384,12 @@ struct dm_block_manager *dm_tm_get_bm(struct dm_transaction_manager *tm)
return tm->bm;
}
+void dm_tm_issue_prefetches(struct dm_transaction_manager *tm)
+{
+ prefetch_issue(&tm->prefetches, tm->bm);
+}
+EXPORT_SYMBOL_GPL(dm_tm_issue_prefetches);
+
/*----------------------------------------------------------------*/
static int dm_tm_create_internal(struct dm_block_manager *bm,
diff --git a/drivers/md/persistent-data/dm-transaction-manager.h b/drivers/md/persistent-data/dm-transaction-manager.h
index 2772ed2..2e0d4d6 100644
--- a/drivers/md/persistent-data/dm-transaction-manager.h
+++ b/drivers/md/persistent-data/dm-transaction-manager.h
@@ -109,6 +109,13 @@ int dm_tm_ref(struct dm_transaction_manager *tm, dm_block_t b,
struct dm_block_manager *dm_tm_get_bm(struct dm_transaction_manager *tm);
/*
+ * If you're using a non-blocking clone the tm will build up a list of
+ * requested blocks that weren't in core. This call will request those
+ * blocks to be prefetched.
+ */
+void dm_tm_issue_prefetches(struct dm_transaction_manager *tm);
+
+/*
* A little utility that ties the knot by producing a transaction manager
* that has a space map managed by the transaction manager...
*
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 06/17] dm thin: prefetch missing metadata pages
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (4 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 05/17] dm transaction manager: add support for prefetching blocks of metadata Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 07/17] dm thin: throttle incoming IO Mike Snitzer
` (14 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Prefetch metadata at the start of the worker thread and then again every
128th bio processed from the deferred list.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin-metadata.c | 5 +++++
drivers/md/dm-thin-metadata.h | 5 +++++
drivers/md/dm-thin.c | 10 ++++++----
3 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index ee42d1c..43adbb8 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -1809,3 +1809,8 @@ bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd)
return needs_check;
}
+
+void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd)
+{
+ dm_tm_issue_prefetches(pmd->tm);
+}
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index efedd5a..921d15e 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -213,6 +213,11 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd);
bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
+/*
+ * Issue any prefetches that may be useful.
+ */
+void dm_pool_issue_prefetches(struct dm_pool_metadata *pmd);
+
/*----------------------------------------------------------------*/
#endif
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 2459628..1d25b51 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1526,6 +1526,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
struct bio *bio;
struct bio_list bios;
struct blk_plug plug;
+ unsigned count = 0;
if (tc->requeue_mode) {
requeue_bio_list(tc, &tc->deferred_bio_list);
@@ -1567,6 +1568,10 @@ static void process_thin_deferred_bios(struct thin_c *tc)
pool->process_discard(tc, bio);
else
pool->process_bio(tc, bio);
+
+ if ((count++ & 127) == 0) {
+ dm_pool_issue_prefetches(pool->pmd);
+ }
}
blk_finish_plug(&plug);
}
@@ -1652,6 +1657,7 @@ static void do_worker(struct work_struct *ws)
{
struct pool *pool = container_of(ws, struct pool, worker);
+ dm_pool_issue_prefetches(pool->pmd);
process_prepared(pool, &pool->prepared_mappings, &pool->process_prepared_mapping);
process_prepared(pool, &pool->prepared_discards, &pool->process_prepared_discard);
process_deferred_bios(pool);
@@ -1990,10 +1996,6 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
/* fall through */
case -EWOULDBLOCK:
- /*
- * In future, the failed dm_thin_find_block above could
- * provide the hint to load the metadata into cache.
- */
thin_defer_bio(tc, bio);
return DM_MAPIO_SUBMITTED;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 07/17] dm thin: throttle incoming IO
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (5 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 06/17] dm thin: prefetch missing metadata pages Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize Mike Snitzer
` (13 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Throttle IO based on the time it's taking the worker to do one loop.
There were reports of hung task timeouts occuring and it was observed
that the excessively long avgqu-sz (as reported by iostat) was
contributing to these hung tasks.
Throttling definitely helps dm-thinp perform better under heavy IO load
(without being detremental by being overzealous). It reduces avgqu-sz
drastically, e.g.: from 60K to ~6K, and even as low as 150 once metadata
is cached by bufio, when dirty_ratio=5, dirty_background_ratio=2. And
avgqu-sz stays at or below 30K even with dirty_ratio=20,
dirty_background_ratio=10.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 65 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1d25b51..6057441 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -126,6 +126,53 @@ static void build_virtual_key(struct dm_thin_device *td, dm_block_t b,
/*----------------------------------------------------------------*/
+#define THROTTLE_THRESHOLD (1 * HZ)
+
+struct throttle {
+ struct rw_semaphore lock;
+ unsigned long threshold;
+ bool throttle_applied;
+};
+
+static void throttle_init(struct throttle *t)
+{
+ init_rwsem(&t->lock);
+ t->throttle_applied = false;
+}
+
+static void throttle_work_start(struct throttle *t)
+{
+ t->threshold = jiffies + THROTTLE_THRESHOLD;
+}
+
+static void throttle_work_update(struct throttle *t)
+{
+ if (!t->throttle_applied && jiffies > t->threshold) {
+ down_write(&t->lock);
+ t->throttle_applied = true;
+ }
+}
+
+static void throttle_work_complete(struct throttle *t)
+{
+ if (t->throttle_applied) {
+ t->throttle_applied = false;
+ up_write(&t->lock);
+ }
+}
+
+static void throttle_lock(struct throttle *t)
+{
+ down_read(&t->lock);
+}
+
+static void throttle_unlock(struct throttle *t)
+{
+ up_read(&t->lock);
+}
+
+/*----------------------------------------------------------------*/
+
/*
* A pool device ties together a metadata device and a data device. It
* also provides the interface for creating and destroying internal
@@ -175,6 +222,7 @@ struct pool {
struct dm_kcopyd_client *copier;
struct workqueue_struct *wq;
+ struct throttle throttle;
struct work_struct worker;
struct delayed_work waker;
struct delayed_work no_space_timeout;
@@ -1570,6 +1618,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
pool->process_bio(tc, bio);
if ((count++ & 127) == 0) {
+ throttle_work_update(&pool->throttle);
dm_pool_issue_prefetches(pool->pmd);
}
}
@@ -1657,10 +1706,15 @@ static void do_worker(struct work_struct *ws)
{
struct pool *pool = container_of(ws, struct pool, worker);
+ throttle_work_start(&pool->throttle);
dm_pool_issue_prefetches(pool->pmd);
+ throttle_work_update(&pool->throttle);
process_prepared(pool, &pool->prepared_mappings, &pool->process_prepared_mapping);
+ throttle_work_update(&pool->throttle);
process_prepared(pool, &pool->prepared_discards, &pool->process_prepared_discard);
+ throttle_work_update(&pool->throttle);
process_deferred_bios(pool);
+ throttle_work_complete(&pool->throttle);
}
/*
@@ -1900,6 +1954,15 @@ static void thin_defer_bio(struct thin_c *tc, struct bio *bio)
wake_worker(pool);
}
+static void thin_defer_bio_with_throttle(struct thin_c *tc, struct bio *bio)
+{
+ struct pool *pool = tc->pool;
+
+ throttle_lock(&pool->throttle);
+ thin_defer_bio(tc, bio);
+ throttle_unlock(&pool->throttle);
+}
+
static void thin_hook_bio(struct thin_c *tc, struct bio *bio)
{
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
@@ -1937,7 +2000,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
}
if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA)) {
- thin_defer_bio(tc, bio);
+ thin_defer_bio_with_throttle(tc, bio);
return DM_MAPIO_SUBMITTED;
}
@@ -2212,6 +2275,7 @@ static struct pool *pool_create(struct mapped_device *pool_md,
goto bad_wq;
}
+ throttle_init(&pool->throttle);
INIT_WORK(&pool->worker, do_worker);
INIT_DELAYED_WORK(&pool->waker, do_waker);
INIT_DELAYED_WORK(&pool->no_space_timeout, do_no_space_timeout);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (6 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 07/17] dm thin: throttle incoming IO Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 09/17] dm: improve documentation and code clarity in dm_merge_bvec Mike Snitzer
` (12 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Allows for filesystems to submit bios that are a factor of the thinp
blocksize, improving dm-thinp efficiency (particularly when the data
volume is RAID).
Also set io_min to max_sectors_kb if it is a factor of the thinp
blocksize.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 28 +++++++++++++++++++++++++---
1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 6057441..3f2b1d1 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -11,6 +11,7 @@
#include <linux/device-mapper.h>
#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>
+#include <linux/log2.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/init.h>
@@ -3234,15 +3235,36 @@ static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
struct pool_c *pt = ti->private;
struct pool *pool = pt->pool;
- uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
+ sector_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
+
+ /*
+ * Adjust max_sectors_kb to highest possible power-of-2
+ * factor of pool->sectors_per_block.
+ */
+ if (limits->max_hw_sectors & (limits->max_hw_sectors - 1))
+ limits->max_sectors = rounddown_pow_of_two(limits->max_hw_sectors);
+ else
+ limits->max_sectors = limits->max_hw_sectors;
+
+ if (limits->max_sectors < pool->sectors_per_block) {
+ while (!is_factor(pool->sectors_per_block, limits->max_sectors))
+ limits->max_sectors = rounddown_pow_of_two(limits->max_sectors);
+ } else if (block_size_is_power_of_two(pool)) {
+ /* max_sectors_kb is >= power-of-2 thinp blocksize */
+ while (!is_factor(limits->max_sectors, pool->sectors_per_block))
+ limits->max_sectors = rounddown_pow_of_two(limits->max_sectors);
+ }
/*
* If the system-determined stacked limits are compatible with the
* pool's blocksize (io_opt is a factor) do not override them.
*/
if (io_opt_sectors < pool->sectors_per_block ||
- do_div(io_opt_sectors, pool->sectors_per_block)) {
- blk_limits_io_min(limits, pool->sectors_per_block << SECTOR_SHIFT);
+ !is_factor(io_opt_sectors, pool->sectors_per_block)) {
+ if (is_factor(pool->sectors_per_block, limits->max_sectors))
+ blk_limits_io_min(limits, limits->max_sectors << SECTOR_SHIFT);
+ else
+ blk_limits_io_min(limits, pool->sectors_per_block << SECTOR_SHIFT);
blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 09/17] dm: improve documentation and code clarity in dm_merge_bvec
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (7 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 10/17] dm thin: implement thin_merge Mike Snitzer
` (11 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
These code changes do not introduce a functional change.
But bio_add_page() will never attempt to build up a bio larger than
queue_max_sectors(). Similarly, bio_get_nr_vecs() is also bound by
queue_max_sectors(). Therefore, there is no point in allowing
dm_merge_bvec() to answer "how many sectors can a bio have at this
offset?" with anything larger than queue_max_sectors(). Using
queue_max_sectors() rather than BIO_MAX_SECTORS serves to more
accurately convey the limits that are being imposed.
Also, use unlikely() to clarify the fact that the defensive code in
dm_merge_bvec() relative to max_size going negative shouldn't ever
happen -- if it does happen there is a bug in the block layer for
requesting larger than dm_merge_bvec()'s initial response for a given
offset. Also, update a comment in dm_merge_bvec() relative to
max_hw_sectors_kb. And fix empty newline whitespace.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 58f3927..0fee0e5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1607,9 +1607,9 @@ static int dm_merge_bvec(struct request_queue *q,
* Find maximum amount of I/O that won't need splitting
*/
max_sectors = min(max_io_len(bvm->bi_sector, ti),
- (sector_t) BIO_MAX_SECTORS);
+ (sector_t) queue_max_sectors(q));
max_size = (max_sectors << SECTOR_SHIFT) - bvm->bi_size;
- if (max_size < 0)
+ if (unlikely(max_size < 0)) /* this shouldn't _ever_ happen */
max_size = 0;
/*
@@ -1621,10 +1621,10 @@ static int dm_merge_bvec(struct request_queue *q,
max_size = ti->type->merge(ti, bvm, biovec, max_size);
/*
* If the target doesn't support merge method and some of the devices
- * provided their merge_bvec method (we know this by looking at
- * queue_max_hw_sectors), then we can't allow bios with multiple vector
- * entries. So always set max_size to 0, and the code below allows
- * just one page.
+ * provided their merge_bvec method (we know this by looking for the
+ * max_hw_sectors that dm_set_device_limits may set), then we can't
+ * allow bios with multiple vector entries. So always set max_size
+ * to 0, and the code below allows just one page.
*/
else if (queue_max_hw_sectors(q) <= PAGE_SIZE >> 9)
max_size = 0;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 10/17] dm thin: implement thin_merge
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (8 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 09/17] dm: improve documentation and code clarity in dm_merge_bvec Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 11/17] dm thin: grab a virtual cell before looking up the mapping Mike Snitzer
` (10 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Introduce thin_merge so that any additional constraints from the data
volume may be taken into account when determing the maximum number of
sectors that can be issued relative to the specified logical offset.
This is particularly important if/when the data volume is layered ontop
of a more sophisticated device (e.g. dm-raid or some other DM target).
Reviewed-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 3f2b1d1..34e42a2 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3620,6 +3620,21 @@ err:
DMEMIT("Error");
}
+static int thin_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+ struct bio_vec *biovec, int max_size)
+{
+ struct thin_c *tc = ti->private;
+ struct request_queue *q = bdev_get_queue(tc->pool_dev->bdev);
+
+ if (!q->merge_bvec_fn)
+ return max_size;
+
+ bvm->bi_bdev = tc->pool_dev->bdev;
+ bvm->bi_sector = dm_target_offset(ti, bvm->bi_sector);
+
+ return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
+}
+
static int thin_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -3644,7 +3659,7 @@ static int thin_iterate_devices(struct dm_target *ti,
static struct target_type thin_target = {
.name = "thin",
- .version = {1, 13, 0},
+ .version = {1, 14, 0},
.module = THIS_MODULE,
.ctr = thin_ctr,
.dtr = thin_dtr,
@@ -3654,6 +3669,7 @@ static struct target_type thin_target = {
.presuspend = thin_presuspend,
.postsuspend = thin_postsuspend,
.status = thin_status,
+ .merge = thin_merge,
.iterate_devices = thin_iterate_devices,
};
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 11/17] dm thin: grab a virtual cell before looking up the mapping
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (9 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 10/17] dm thin: implement thin_merge Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 12/17] dm thin: performance improvement to discard processing Mike Snitzer
` (9 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Avoids normal IO racing with discard.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
---
drivers/md/dm-thin.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 34e42a2..73a1c96 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2005,6 +2005,14 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_SUBMITTED;
}
+ /*
+ * We must hold the virtual cell before doing the lookup, otherwise
+ * there's a race with discard.
+ */
+ build_virtual_key(tc->td, block, &key);
+ if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result))
+ return DM_MAPIO_SUBMITTED;
+
r = dm_thin_find_block(td, block, 0, &result);
/*
@@ -2028,13 +2036,10 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* shared flag will be set in their case.
*/
thin_defer_bio(tc, bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
}
- build_virtual_key(tc->td, block, &key);
- if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result))
- return DM_MAPIO_SUBMITTED;
-
build_data_key(tc->td, result.block, &key);
if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2, &cell_result)) {
cell_defer_no_holder_no_free(tc, &cell1);
@@ -2055,12 +2060,14 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* of doing so.
*/
handle_unserviceable_bio(tc->pool, bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
}
/* fall through */
case -EWOULDBLOCK:
thin_defer_bio(tc, bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
default:
@@ -2070,6 +2077,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* pool is switched to fail-io mode.
*/
bio_io_error(bio);
+ cell_defer_no_holder_no_free(tc, &cell1);
return DM_MAPIO_SUBMITTED;
}
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 12/17] dm thin: performance improvement to discard processing
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (10 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 11/17] dm thin: grab a virtual cell before looking up the mapping Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 13/17] dm thin: factor out remap_and_issue_overwrite Mike Snitzer
` (8 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
When processing a discard bio, if the block is already quiesced do the
discard immediately rather than adding the mapping to a list for the
next iteration of the worker thread.
Discarding a fully provisioned 100G thin volume with 64k block size goes
from 860s to 95s with this change.
Clearly there's something wrong with the worker architecture, more
investigation needed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 73a1c96..fffdebc 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1194,7 +1194,6 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
static void process_discard(struct thin_c *tc, struct bio *bio)
{
int r;
- unsigned long flags;
struct pool *pool = tc->pool;
struct dm_bio_prison_cell *cell, *cell2;
struct dm_cell_key key, key2;
@@ -1235,12 +1234,9 @@ static void process_discard(struct thin_c *tc, struct bio *bio)
m->cell2 = cell2;
m->bio = bio;
- if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list)) {
- spin_lock_irqsave(&pool->lock, flags);
- list_add_tail(&m->list, &pool->prepared_discards);
- spin_unlock_irqrestore(&pool->lock, flags);
- wake_worker(pool);
- }
+ if (!dm_deferred_set_add_work(pool->all_io_ds, &m->list))
+ pool->process_prepared_discard(m);
+
} else {
inc_all_io_entry(pool, bio);
cell_defer_no_holder(tc, cell);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 13/17] dm thin: factor out remap_and_issue_overwrite
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (11 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 12/17] dm thin: performance improvement to discard processing Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 14/17] dm thin: defer whole cells rather than individual bios Mike Snitzer
` (7 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Purely cleanup of duplicated code, no functional change.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index fffdebc..4658635 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -890,6 +890,20 @@ static void ll_zero(struct thin_c *tc, struct dm_thin_new_mapping *m,
}
}
+static void remap_and_issue_overwrite(struct thin_c *tc, struct bio *bio,
+ dm_block_t data_block,
+ struct dm_thin_new_mapping *m)
+{
+ struct pool *pool = tc->pool;
+ struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
+
+ h->overwrite_mapping = m;
+ m->bio = bio;
+ save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
+ inc_all_io_entry(pool, bio);
+ remap_and_issue(tc, bio, data_block);
+}
+
/*
* A partial copy also needs to zero the uncopied region.
*/
@@ -924,15 +938,9 @@ static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
* If the whole block of data is being overwritten, we can issue the
* bio immediately. Otherwise we use kcopyd to clone the data first.
*/
- if (io_overwrites_block(pool, bio)) {
- struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
-
- h->overwrite_mapping = m;
- m->bio = bio;
- save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
- inc_all_io_entry(pool, bio);
- remap_and_issue(tc, bio, data_dest);
- } else {
+ if (io_overwrites_block(pool, bio))
+ remap_and_issue_overwrite(tc, bio, data_dest, m);
+ else {
struct dm_io_region from, to;
from.bdev = origin->bdev;
@@ -1001,16 +1009,10 @@ static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
if (!pool->pf.zero_new_blocks)
process_prepared_mapping(m);
- else if (io_overwrites_block(pool, bio)) {
- struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
-
- h->overwrite_mapping = m;
- m->bio = bio;
- save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
- inc_all_io_entry(pool, bio);
- remap_and_issue(tc, bio, data_block);
+ else if (io_overwrites_block(pool, bio))
+ remap_and_issue_overwrite(tc, bio, data_block, m);
- } else
+ else
ll_zero(tc, m,
data_block * pool->sectors_per_block,
(data_block + 1) * pool->sectors_per_block);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 14/17] dm thin: defer whole cells rather than individual bios
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (12 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 13/17] dm thin: factor out remap_and_issue_overwrite Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 15/17] dm thin: remap the bios in a cell immediately Mike Snitzer
` (6 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This avoids dropping the cell, so increases the probability that other
bios will collect within the cell, rather than being passed individually
to the worker.
Also add required process_cell and process_discard_cell error handling
wrappers and set associated pool-mode function pointers accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bio-prison.h | 1 +
drivers/md/dm-thin.c | 235 +++++++++++++++++++++++++++++++++++----------
2 files changed, 185 insertions(+), 51 deletions(-)
diff --git a/drivers/md/dm-bio-prison.h b/drivers/md/dm-bio-prison.h
index 997a439..c0cddb1 100644
--- a/drivers/md/dm-bio-prison.h
+++ b/drivers/md/dm-bio-prison.h
@@ -35,6 +35,7 @@ struct dm_cell_key {
* themselves.
*/
struct dm_bio_prison_cell {
+ struct list_head user_list; /* for client use */
struct rb_node node;
struct dm_cell_key key;
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4658635..a7b188f 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -202,6 +202,7 @@ struct pool_features {
struct thin_c;
typedef void (*process_bio_fn)(struct thin_c *tc, struct bio *bio);
+typedef void (*process_cell_fn)(struct thin_c *tc, struct dm_bio_prison_cell *cell);
typedef void (*process_mapping_fn)(struct dm_thin_new_mapping *m);
struct pool {
@@ -246,6 +247,9 @@ struct pool {
process_bio_fn process_bio;
process_bio_fn process_discard;
+ process_cell_fn process_cell;
+ process_cell_fn process_discard_cell;
+
process_mapping_fn process_prepared_mapping;
process_mapping_fn process_prepared_discard;
};
@@ -282,6 +286,7 @@ struct thin_c {
struct dm_thin_device *td;
bool requeue_mode:1;
spinlock_t lock;
+ struct list_head deferred_cells;
struct bio_list deferred_bio_list;
struct bio_list retry_on_resume_list;
struct rb_root sort_bio_list; /* sorted list of deferred bios */
@@ -346,19 +351,6 @@ static void cell_release_no_holder(struct pool *pool,
dm_bio_prison_free_cell(pool->prison, cell);
}
-static void cell_defer_no_holder_no_free(struct thin_c *tc,
- struct dm_bio_prison_cell *cell)
-{
- struct pool *pool = tc->pool;
- unsigned long flags;
-
- spin_lock_irqsave(&tc->lock, flags);
- dm_cell_release_no_holder(pool->prison, cell, &tc->deferred_bio_list);
- spin_unlock_irqrestore(&tc->lock, flags);
-
- wake_worker(pool);
-}
-
static void cell_error_with_code(struct pool *pool,
struct dm_bio_prison_cell *cell, int error_code)
{
@@ -371,6 +363,11 @@ static void cell_error(struct pool *pool, struct dm_bio_prison_cell *cell)
cell_error_with_code(pool, cell, -EIO);
}
+static void cell_success(struct pool *pool, struct dm_bio_prison_cell *cell)
+{
+ cell_error_with_code(pool, cell, 0);
+}
+
/*----------------------------------------------------------------*/
/*
@@ -706,6 +703,28 @@ static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *c
wake_worker(pool);
}
+static void thin_defer_bio(struct thin_c *tc, struct bio *bio);
+
+static void inc_remap_and_issue_cell(struct thin_c *tc,
+ struct dm_bio_prison_cell *cell,
+ dm_block_t block)
+{
+ struct bio *bio;
+ struct bio_list bios;
+
+ bio_list_init(&bios);
+ cell_release_no_holder(tc->pool, cell, &bios);
+
+ while ((bio = bio_list_pop(&bios))) {
+ if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA))
+ thin_defer_bio(tc, bio);
+ else {
+ inc_all_io_entry(tc->pool, bio);
+ remap_and_issue(tc, bio, block);
+ }
+ }
+}
+
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
{
if (m->bio) {
@@ -1193,20 +1212,17 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
retry_on_resume(bio);
}
-static void process_discard(struct thin_c *tc, struct bio *bio)
+static void process_discard_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
int r;
+ struct bio *bio = cell->holder;
struct pool *pool = tc->pool;
- struct dm_bio_prison_cell *cell, *cell2;
- struct dm_cell_key key, key2;
+ struct dm_bio_prison_cell *cell2;
+ struct dm_cell_key key2;
dm_block_t block = get_bio_block(tc, bio);
struct dm_thin_lookup_result lookup_result;
struct dm_thin_new_mapping *m;
- build_virtual_key(tc->td, block, &key);
- if (bio_detain(tc->pool, &key, bio, &cell))
- return;
-
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
@@ -1273,6 +1289,19 @@ static void process_discard(struct thin_c *tc, struct bio *bio)
}
}
+static void process_discard_bio(struct thin_c *tc, struct bio *bio)
+{
+ struct dm_bio_prison_cell *cell;
+ struct dm_cell_key key;
+ dm_block_t block = get_bio_block(tc, bio);
+
+ build_virtual_key(tc->td, block, &key);
+ if (bio_detain(tc->pool, &key, bio, &cell))
+ return;
+
+ process_discard_cell(tc, cell);
+}
+
static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
struct dm_cell_key *key,
struct dm_thin_lookup_result *lookup_result,
@@ -1379,34 +1408,25 @@ static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block
}
}
-static void process_bio(struct thin_c *tc, struct bio *bio)
+static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
int r;
struct pool *pool = tc->pool;
+ struct bio *bio = cell->holder;
dm_block_t block = get_bio_block(tc, bio);
- struct dm_bio_prison_cell *cell;
- struct dm_cell_key key;
struct dm_thin_lookup_result lookup_result;
- /*
- * If cell is already occupied, then the block is already
- * being provisioned so we have nothing further to do here.
- */
- build_virtual_key(tc->td, block, &key);
- if (bio_detain(pool, &key, bio, &cell))
- return;
-
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
if (lookup_result.shared) {
process_shared_bio(tc, bio, block, &lookup_result);
+ // FIXME: we can't remap because we're waiting on a commit
cell_defer_no_holder(tc, cell); /* FIXME: pass this cell into process_shared? */
} else {
inc_all_io_entry(pool, bio);
- cell_defer_no_holder(tc, cell);
-
remap_and_issue(tc, bio, lookup_result.block);
+ inc_remap_and_issue_cell(tc, cell, lookup_result.block);
}
break;
@@ -1440,7 +1460,26 @@ static void process_bio(struct thin_c *tc, struct bio *bio)
}
}
-static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
+static void process_bio(struct thin_c *tc, struct bio *bio)
+{
+ struct pool *pool = tc->pool;
+ dm_block_t block = get_bio_block(tc, bio);
+ struct dm_bio_prison_cell *cell;
+ struct dm_cell_key key;
+
+ /*
+ * If cell is already occupied, then the block is already
+ * being provisioned so we have nothing further to do here.
+ */
+ build_virtual_key(tc->td, block, &key);
+ if (bio_detain(pool, &key, bio, &cell))
+ return;
+
+ process_cell(tc, cell);
+}
+
+static void __process_bio_read_only(struct thin_c *tc, struct bio *bio,
+ struct dm_bio_prison_cell *cell)
{
int r;
int rw = bio_data_dir(bio);
@@ -1450,15 +1489,21 @@ static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
- if (lookup_result.shared && (rw == WRITE) && bio->bi_iter.bi_size)
+ if (lookup_result.shared && (rw == WRITE) && bio->bi_iter.bi_size) {
handle_unserviceable_bio(tc->pool, bio);
- else {
+ if (cell)
+ cell_defer_no_holder(tc, cell);
+ } else {
inc_all_io_entry(tc->pool, bio);
remap_and_issue(tc, bio, lookup_result.block);
+ if (cell)
+ inc_remap_and_issue_cell(tc, cell, lookup_result.block);
}
break;
case -ENODATA:
+ if (cell)
+ cell_defer_no_holder(tc, cell);
if (rw != READ) {
handle_unserviceable_bio(tc->pool, bio);
break;
@@ -1477,11 +1522,23 @@ static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
default:
DMERR_LIMIT("%s: dm_thin_find_block() failed: error = %d",
__func__, r);
+ if (cell)
+ cell_defer_no_holder(tc, cell);
bio_io_error(bio);
break;
}
}
+static void process_bio_read_only(struct thin_c *tc, struct bio *bio)
+{
+ __process_bio_read_only(tc, bio, NULL);
+}
+
+static void process_cell_read_only(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ __process_bio_read_only(tc, cell->holder, cell);
+}
+
static void process_bio_success(struct thin_c *tc, struct bio *bio)
{
bio_endio(bio, 0);
@@ -1492,6 +1549,16 @@ static void process_bio_fail(struct thin_c *tc, struct bio *bio)
bio_io_error(bio);
}
+static void process_cell_success(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ cell_success(tc->pool, cell);
+}
+
+static void process_cell_fail(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ cell_error(tc->pool, cell);
+}
+
/*
* FIXME: should we also commit due to size of transaction, measured in
* metadata blocks?
@@ -1624,6 +1691,51 @@ static void process_thin_deferred_bios(struct thin_c *tc)
blk_finish_plug(&plug);
}
+static void process_thin_deferred_cells(struct thin_c *tc)
+{
+ struct pool *pool = tc->pool;
+ unsigned long flags;
+ struct list_head cells;
+ struct dm_bio_prison_cell *cell, *tmp;
+
+ // FIXME: push down into per cell processing
+ if (tc->requeue_mode) {
+ requeue_bio_list(tc, &tc->deferred_bio_list);
+ return;
+ }
+
+ INIT_LIST_HEAD(&cells);
+
+ spin_lock_irqsave(&tc->lock, flags);
+ list_splice_init(&tc->deferred_cells, &cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+
+ if (list_empty(&cells))
+ return;
+
+ list_for_each_entry_safe(cell, tmp, &cells, user_list) {
+ BUG_ON(!cell->holder);
+
+ /*
+ * If we've got no free new_mapping structs, and processing
+ * this bio might require one, we pause until there are some
+ * prepared mappings to process.
+ */
+ if (ensure_next_mapping(pool)) {
+ spin_lock_irqsave(&tc->lock, flags);
+ list_add(&cell->user_list, &tc->deferred_cells);
+ list_splice(&cells, &tc->deferred_cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+ break;
+ }
+
+ if (cell->holder->bi_rw & REQ_DISCARD)
+ pool->process_discard_cell(tc, cell);
+ else
+ pool->process_cell(tc, cell);
+ }
+}
+
static void thin_get(struct thin_c *tc);
static void thin_put(struct thin_c *tc);
@@ -1672,6 +1784,7 @@ static void process_deferred_bios(struct pool *pool)
tc = get_first_thin(pool);
while (tc) {
+ process_thin_deferred_cells(tc);
process_thin_deferred_bios(tc);
tc = get_next_thin(pool, tc);
}
@@ -1850,6 +1963,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_fail;
pool->process_discard = process_bio_fail;
+ pool->process_cell = process_cell_fail;
+ pool->process_discard_cell = process_cell_fail;
pool->process_prepared_mapping = process_prepared_mapping_fail;
pool->process_prepared_discard = process_prepared_discard_fail;
@@ -1862,6 +1977,8 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
dm_pool_metadata_read_only(pool->pmd);
pool->process_bio = process_bio_read_only;
pool->process_discard = process_bio_success;
+ pool->process_cell = process_cell_read_only;
+ pool->process_discard_cell = process_cell_success;
pool->process_prepared_mapping = process_prepared_mapping_fail;
pool->process_prepared_discard = process_prepared_discard_passdown;
@@ -1880,7 +1997,9 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
if (old_mode != new_mode)
notify_of_pool_mode_change(pool, "out-of-data-space");
pool->process_bio = process_bio_read_only;
- pool->process_discard = process_discard;
+ pool->process_discard = process_discard_bio;
+ pool->process_cell = process_cell_read_only;
+ pool->process_discard_cell = process_discard_cell;
pool->process_prepared_mapping = process_prepared_mapping;
pool->process_prepared_discard = process_prepared_discard_passdown;
@@ -1893,7 +2012,9 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
notify_of_pool_mode_change(pool, "write");
dm_pool_metadata_read_write(pool->pmd);
pool->process_bio = process_bio;
- pool->process_discard = process_discard;
+ pool->process_discard = process_discard_bio;
+ pool->process_cell = process_cell;
+ pool->process_discard_cell = process_discard_cell;
pool->process_prepared_mapping = process_prepared_mapping;
pool->process_prepared_discard = process_prepared_discard;
break;
@@ -1962,6 +2083,20 @@ static void thin_defer_bio_with_throttle(struct thin_c *tc, struct bio *bio)
throttle_unlock(&pool->throttle);
}
+static void thin_defer_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+{
+ unsigned long flags;
+ struct pool *pool = tc->pool;
+
+ throttle_lock(&pool->throttle);
+ spin_lock_irqsave(&tc->lock, flags);
+ list_add_tail(&cell->user_list, &tc->deferred_cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+ throttle_unlock(&pool->throttle);
+
+ wake_worker(pool);
+}
+
static void thin_hook_bio(struct thin_c *tc, struct bio *bio)
{
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
@@ -1982,8 +2117,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
dm_block_t block = get_bio_block(tc, bio);
struct dm_thin_device *td = tc->td;
struct dm_thin_lookup_result result;
- struct dm_bio_prison_cell cell1, cell2;
- struct dm_bio_prison_cell *cell_result;
+ struct dm_bio_prison_cell *virt_cell, *data_cell;
struct dm_cell_key key;
thin_hook_bio(tc, bio);
@@ -2008,7 +2142,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* there's a race with discard.
*/
build_virtual_key(tc->td, block, &key);
- if (dm_bio_detain(tc->pool->prison, &key, bio, &cell1, &cell_result))
+ if (bio_detain(tc->pool, &key, bio, &virt_cell))
return DM_MAPIO_SUBMITTED;
r = dm_thin_find_block(td, block, 0, &result);
@@ -2033,20 +2167,19 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* More distant ancestors are irrelevant. The
* shared flag will be set in their case.
*/
- thin_defer_bio(tc, bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ thin_defer_cell(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
build_data_key(tc->td, result.block, &key);
- if (dm_bio_detain(tc->pool->prison, &key, bio, &cell2, &cell_result)) {
- cell_defer_no_holder_no_free(tc, &cell1);
+ if (bio_detain(tc->pool, &key, bio, &data_cell)) {
+ cell_defer_no_holder(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
inc_all_io_entry(tc->pool, bio);
- cell_defer_no_holder_no_free(tc, &cell2);
- cell_defer_no_holder_no_free(tc, &cell1);
+ cell_defer_no_holder(tc, data_cell);
+ cell_defer_no_holder(tc, virt_cell);
remap(tc, bio, result.block);
return DM_MAPIO_REMAPPED;
@@ -2058,14 +2191,13 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* of doing so.
*/
handle_unserviceable_bio(tc->pool, bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ cell_defer_no_holder(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
/* fall through */
case -EWOULDBLOCK:
- thin_defer_bio(tc, bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ thin_defer_cell(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
default:
@@ -2075,7 +2207,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
* pool is switched to fail-io mode.
*/
bio_io_error(bio);
- cell_defer_no_holder_no_free(tc, &cell1);
+ cell_defer_no_holder(tc, virt_cell);
return DM_MAPIO_SUBMITTED;
}
}
@@ -3388,6 +3520,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
goto out_unlock;
}
spin_lock_init(&tc->lock);
+ INIT_LIST_HEAD(&tc->deferred_cells);
bio_list_init(&tc->deferred_bio_list);
bio_list_init(&tc->retry_on_resume_list);
tc->sort_bio_list = RB_ROOT;
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 15/17] dm thin: remap the bios in a cell immediately
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (13 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 14/17] dm thin: defer whole cells rather than individual bios Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 16/17] dm thin: direct dispatch when breaking sharing Mike Snitzer
` (5 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This use of direct submission in process_prepared_mapping() reduces
latency for submitting bios in a cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
But this direct submission exposes the potential for a race between
releasing a cell and incrementing deferred set. Fix this by introducing
dm_cell_visit_release() and refactoring inc_remap_and_issue_cell()
accordingly.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-bio-prison.c | 14 ++++++++
drivers/md/dm-bio-prison.h | 8 +++++
drivers/md/dm-thin.c | 90 +++++++++++++++++++++++++++++++---------------
3 files changed, 83 insertions(+), 29 deletions(-)
diff --git a/drivers/md/dm-bio-prison.c b/drivers/md/dm-bio-prison.c
index 90a5662..bbe22a5 100644
--- a/drivers/md/dm-bio-prison.c
+++ b/drivers/md/dm-bio-prison.c
@@ -241,6 +241,20 @@ void dm_cell_error(struct dm_bio_prison *prison,
}
EXPORT_SYMBOL_GPL(dm_cell_error);
+void dm_cell_visit_release(struct dm_bio_prison *prison,
+ void (*visit_fn)(void *, struct dm_bio_prison_cell *),
+ void *context,
+ struct dm_bio_prison_cell *cell)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&prison->lock, flags);
+ visit_fn(context, cell);
+ rb_erase(&cell->node, &prison->cells);
+ spin_unlock_irqrestore(&prison->lock, flags);
+}
+EXPORT_SYMBOL_GPL(dm_cell_visit_release);
+
/*----------------------------------------------------------------*/
#define DEFERRED_SET_SIZE 64
diff --git a/drivers/md/dm-bio-prison.h b/drivers/md/dm-bio-prison.h
index c0cddb1..b039886 100644
--- a/drivers/md/dm-bio-prison.h
+++ b/drivers/md/dm-bio-prison.h
@@ -89,6 +89,14 @@ void dm_cell_release_no_holder(struct dm_bio_prison *prison,
void dm_cell_error(struct dm_bio_prison *prison,
struct dm_bio_prison_cell *cell, int error);
+/*
+ * Visits the cell and then releases. Guarantees no new inmates are
+ * inserted between the visit and release.
+ */
+void dm_cell_visit_release(struct dm_bio_prison *prison,
+ void (*visit_fn)(void *, struct dm_bio_prison_cell *),
+ void *context, struct dm_bio_prison_cell *cell);
+
/*----------------------------------------------------------------*/
/*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index a7b188f..e89e41f 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -343,6 +343,15 @@ static void cell_release(struct pool *pool,
dm_bio_prison_free_cell(pool->prison, cell);
}
+static void cell_visit_release(struct pool *pool,
+ void (*fn)(void *, struct dm_bio_prison_cell *),
+ void *context,
+ struct dm_bio_prison_cell *cell)
+{
+ dm_cell_visit_release(pool->prison, fn, context, cell);
+ dm_bio_prison_free_cell(pool->prison, cell);
+}
+
static void cell_release_no_holder(struct pool *pool,
struct dm_bio_prison_cell *cell,
struct bio_list *bios)
@@ -674,55 +683,75 @@ static void overwrite_endio(struct bio *bio, int err)
*/
/*
- * This sends the bios in the cell back to the deferred_bios list.
+ * This sends the bios in the cell, except the original holder, back
+ * to the deferred_bios list.
*/
-static void cell_defer(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *cell)
{
struct pool *pool = tc->pool;
unsigned long flags;
spin_lock_irqsave(&tc->lock, flags);
- cell_release(pool, cell, &tc->deferred_bio_list);
+ cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
spin_unlock_irqrestore(&tc->lock, flags);
wake_worker(pool);
}
-/*
- * Same as cell_defer above, except it omits the original holder of the cell.
- */
-static void cell_defer_no_holder(struct thin_c *tc, struct dm_bio_prison_cell *cell)
+static void thin_defer_bio(struct thin_c *tc, struct bio *bio);
+
+struct remap_info {
+ struct thin_c *tc;
+ struct bio_list defer_bios;
+ struct bio_list issue_bios;
+};
+
+static void __inc_remap_and_issue_cell(void *context,
+ struct dm_bio_prison_cell *cell)
{
- struct pool *pool = tc->pool;
- unsigned long flags;
+ struct remap_info *info = context;
+ struct bio *bio;
- spin_lock_irqsave(&tc->lock, flags);
- cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
- spin_unlock_irqrestore(&tc->lock, flags);
+ while ((bio = bio_list_pop(&cell->bios))) {
+ if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA))
+ bio_list_add(&info->defer_bios, bio);
+ else {
+ inc_all_io_entry(info->tc->pool, bio);
- wake_worker(pool);
+ /*
+ * We can't issue the bios with the bio prison lock
+ * held, so we add them to a list to issue on
+ * return from this function.
+ */
+ bio_list_add(&info->issue_bios, bio);
+ }
+ }
}
-static void thin_defer_bio(struct thin_c *tc, struct bio *bio);
-
static void inc_remap_and_issue_cell(struct thin_c *tc,
struct dm_bio_prison_cell *cell,
dm_block_t block)
{
struct bio *bio;
- struct bio_list bios;
+ struct remap_info info;
- bio_list_init(&bios);
- cell_release_no_holder(tc->pool, cell, &bios);
+ info.tc = tc;
+ bio_list_init(&info.defer_bios);
+ bio_list_init(&info.issue_bios);
- while ((bio = bio_list_pop(&bios))) {
- if (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA))
- thin_defer_bio(tc, bio);
- else {
- inc_all_io_entry(tc->pool, bio);
- remap_and_issue(tc, bio, block);
- }
- }
+ /*
+ * We have to be careful to inc any bios we're about to issue
+ * before the cell is released, and avoid a race with new bios
+ * being added to the cell.
+ */
+ cell_visit_release(tc->pool, __inc_remap_and_issue_cell,
+ &info, cell);
+
+ while ((bio = bio_list_pop(&info.defer_bios)))
+ thin_defer_bio(tc, bio);
+
+ while ((bio = bio_list_pop(&info.issue_bios)))
+ remap_and_issue(info.tc, bio, block);
}
static void process_prepared_mapping_fail(struct dm_thin_new_mapping *m)
@@ -773,10 +802,13 @@ static void process_prepared_mapping(struct dm_thin_new_mapping *m)
* the bios in the cell.
*/
if (bio) {
- cell_defer_no_holder(tc, m->cell);
+ inc_remap_and_issue_cell(tc, m->cell, m->data_block);
bio_endio(bio, 0);
- } else
- cell_defer(tc, m->cell);
+ } else {
+ inc_all_io_entry(tc->pool, m->cell->holder);
+ remap_and_issue(tc, m->cell->holder, m->data_block);
+ inc_remap_and_issue_cell(tc, m->cell, m->data_block);
+ }
out:
list_del(&m->list);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 16/17] dm thin: direct dispatch when breaking sharing
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (14 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 15/17] dm thin: remap the bios in a cell immediately Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 17/17] dm thin: sort the deferred cells Mike Snitzer
` (4 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
This use of direct submission in process_shared_bio() reduces latency
for submitting bios in the shared cell by avoiding adding those bios to
the deferred list and waiting for the next iteration of the worker.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 70 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 57 insertions(+), 13 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index e89e41f..0a202f7 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1362,11 +1362,53 @@ static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
}
}
+static void __remap_and_issue_shared_cell(void *context,
+ struct dm_bio_prison_cell *cell)
+{
+ struct remap_info *info = context;
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&cell->bios))) {
+ if ((bio_data_dir(bio) == WRITE) ||
+ (bio->bi_rw & (REQ_DISCARD | REQ_FLUSH | REQ_FUA)))
+ bio_list_add(&info->defer_bios, bio);
+ else {
+ struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));;
+
+ h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds);
+ inc_all_io_entry(info->tc->pool, bio);
+ bio_list_add(&info->issue_bios, bio);
+ }
+ }
+}
+
+static void remap_and_issue_shared_cell(struct thin_c *tc,
+ struct dm_bio_prison_cell *cell,
+ dm_block_t block)
+{
+ struct bio *bio;
+ struct remap_info info;
+
+ info.tc = tc;
+ bio_list_init(&info.defer_bios);
+ bio_list_init(&info.issue_bios);
+
+ cell_visit_release(tc->pool, __remap_and_issue_shared_cell,
+ &info, cell);
+
+ while ((bio = bio_list_pop(&info.defer_bios)))
+ thin_defer_bio(tc, bio);
+
+ while ((bio = bio_list_pop(&info.issue_bios)))
+ remap_and_issue(tc, bio, block);
+}
+
static void process_shared_bio(struct thin_c *tc, struct bio *bio,
dm_block_t block,
- struct dm_thin_lookup_result *lookup_result)
+ struct dm_thin_lookup_result *lookup_result,
+ struct dm_bio_prison_cell *virt_cell)
{
- struct dm_bio_prison_cell *cell;
+ struct dm_bio_prison_cell *data_cell;
struct pool *pool = tc->pool;
struct dm_cell_key key;
@@ -1375,19 +1417,23 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
* of being broken so we have nothing further to do here.
*/
build_data_key(tc->td, lookup_result->block, &key);
- if (bio_detain(pool, &key, bio, &cell))
+ if (bio_detain(pool, &key, bio, &data_cell)) {
+ cell_defer_no_holder(tc, virt_cell);
return;
+ }
- if (bio_data_dir(bio) == WRITE && bio->bi_iter.bi_size)
- break_sharing(tc, bio, block, &key, lookup_result, cell);
- else {
+ if (bio_data_dir(bio) == WRITE && bio->bi_iter.bi_size) {
+ break_sharing(tc, bio, block, &key, lookup_result, data_cell);
+ cell_defer_no_holder(tc, virt_cell);
+ } else {
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
h->shared_read_entry = dm_deferred_entry_inc(pool->shared_read_ds);
inc_all_io_entry(pool, bio);
- cell_defer_no_holder(tc, cell);
+ remap_and_issue(tc, bio, block);
- remap_and_issue(tc, bio, lookup_result->block);
+ remap_and_issue_shared_cell(tc, data_cell, lookup_result->block);
+ remap_and_issue_shared_cell(tc, virt_cell, lookup_result->block);
}
}
@@ -1451,11 +1497,9 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
- if (lookup_result.shared) {
- process_shared_bio(tc, bio, block, &lookup_result);
- // FIXME: we can't remap because we're waiting on a commit
- cell_defer_no_holder(tc, cell); /* FIXME: pass this cell into process_shared? */
- } else {
+ if (lookup_result.shared)
+ process_shared_bio(tc, bio, block, &lookup_result, cell);
+ else {
inc_all_io_entry(pool, bio);
remap_and_issue(tc, bio, lookup_result.block);
inc_remap_and_issue_cell(tc, cell, lookup_result.block);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 17/17] dm thin: sort the deferred cells
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (15 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 16/17] dm thin: direct dispatch when breaking sharing Mike Snitzer
@ 2014-10-17 18:37 ` Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 fix 18/17] dm thin: fix process_shared_bio (fixes SnapshotTests) Mike Snitzer
` (3 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-17 18:37 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
From: Joe Thornber <ejt@redhat.com>
Sort the cells in logical block order before processing each cell in
process_thin_deferred_cells(). This significantly improves the ondisk
layout on rotational storage, whereby improving read performance.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 88 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 68 insertions(+), 20 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 0a202f7..cd63180 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -17,6 +17,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/sort.h>
#include <linux/rbtree.h>
#define DM_MSG_PREFIX "thin"
@@ -205,6 +206,8 @@ typedef void (*process_bio_fn)(struct thin_c *tc, struct bio *bio);
typedef void (*process_cell_fn)(struct thin_c *tc, struct dm_bio_prison_cell *cell);
typedef void (*process_mapping_fn)(struct dm_thin_new_mapping *m);
+#define CELL_SORT_ARRAY_SIZE 8192
+
struct pool {
struct list_head list;
struct dm_target *ti; /* Only set if a pool target is bound */
@@ -252,6 +255,8 @@ struct pool {
process_mapping_fn process_prepared_mapping;
process_mapping_fn process_prepared_discard;
+
+ struct dm_bio_prison_cell *cell_sort_array[CELL_SORT_ARRAY_SIZE];
};
static enum pool_mode get_pool_mode(struct pool *pool);
@@ -1767,12 +1772,48 @@ static void process_thin_deferred_bios(struct thin_c *tc)
blk_finish_plug(&plug);
}
+static int cmp_cells(const void *lhs, const void *rhs)
+{
+ struct dm_bio_prison_cell *lhs_cell = *((struct dm_bio_prison_cell **) lhs);
+ struct dm_bio_prison_cell *rhs_cell = *((struct dm_bio_prison_cell **) rhs);
+
+ BUG_ON(!lhs_cell->holder);
+ BUG_ON(!rhs_cell->holder);
+
+ if (lhs_cell->holder->bi_iter.bi_sector < rhs_cell->holder->bi_iter.bi_sector)
+ return -1;
+
+ if (lhs_cell->holder->bi_iter.bi_sector > rhs_cell->holder->bi_iter.bi_sector)
+ return 1;
+
+ return 0;
+}
+
+static unsigned sort_cells(struct pool *pool, struct list_head *cells)
+{
+ unsigned count = 0;
+ struct dm_bio_prison_cell *cell, *tmp;
+
+ list_for_each_entry_safe(cell, tmp, cells, user_list) {
+ if (count >= CELL_SORT_ARRAY_SIZE)
+ break;
+
+ pool->cell_sort_array[count++] = cell;
+ list_del(&cell->user_list);
+ }
+
+ sort(pool->cell_sort_array, count, sizeof(cell), cmp_cells, NULL);
+
+ return count;
+}
+
static void process_thin_deferred_cells(struct thin_c *tc)
{
struct pool *pool = tc->pool;
unsigned long flags;
struct list_head cells;
- struct dm_bio_prison_cell *cell, *tmp;
+ struct dm_bio_prison_cell *cell;
+ unsigned i, j, count;
// FIXME: push down into per cell processing
if (tc->requeue_mode) {
@@ -1789,27 +1830,34 @@ static void process_thin_deferred_cells(struct thin_c *tc)
if (list_empty(&cells))
return;
- list_for_each_entry_safe(cell, tmp, &cells, user_list) {
- BUG_ON(!cell->holder);
+ do {
+ count = sort_cells(tc->pool, &cells);
- /*
- * If we've got no free new_mapping structs, and processing
- * this bio might require one, we pause until there are some
- * prepared mappings to process.
- */
- if (ensure_next_mapping(pool)) {
- spin_lock_irqsave(&tc->lock, flags);
- list_add(&cell->user_list, &tc->deferred_cells);
- list_splice(&cells, &tc->deferred_cells);
- spin_unlock_irqrestore(&tc->lock, flags);
- break;
- }
+ for (i = 0; i < count; i++) {
+ cell = pool->cell_sort_array[i];
+ BUG_ON(!cell->holder);
- if (cell->holder->bi_rw & REQ_DISCARD)
- pool->process_discard_cell(tc, cell);
- else
- pool->process_cell(tc, cell);
- }
+ /*
+ * If we've got no free new_mapping structs, and processing
+ * this bio might require one, we pause until there are some
+ * prepared mappings to process.
+ */
+ if (ensure_next_mapping(pool)) {
+ for (j = i; j < count; j++)
+ list_add(&pool->cell_sort_array[j]->user_list, &cells);
+
+ spin_lock_irqsave(&tc->lock, flags);
+ list_splice(&cells, &tc->deferred_cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+ return;
+ }
+
+ if (cell->holder->bi_rw & REQ_DISCARD)
+ pool->process_discard_cell(tc, cell);
+ else
+ pool->process_cell(tc, cell);
+ }
+ } while (!list_empty(&cells));
}
static void thin_get(struct thin_c *tc);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 fix 18/17] dm thin: fix process_shared_bio (fixes SnapshotTests)
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (16 preceding siblings ...)
2014-10-17 18:37 ` [for-3.19 PATCH v2 17/17] dm thin: sort the deferred cells Mike Snitzer
@ 2014-10-19 23:02 ` Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 fix 19/17] dm thin: requeue deferred_cells when in requeue_mode Mike Snitzer
` (2 subsequent siblings)
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-19 23:02 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Fold into "dm thin: direct dispatch when breaking sharing"
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index cd63180..f5c59cf 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1435,7 +1435,7 @@ static void process_shared_bio(struct thin_c *tc, struct bio *bio,
h->shared_read_entry = dm_deferred_entry_inc(pool->shared_read_ds);
inc_all_io_entry(pool, bio);
- remap_and_issue(tc, bio, block);
+ remap_and_issue(tc, bio, lookup_result->block);
remap_and_issue_shared_cell(tc, data_cell, lookup_result->block);
remap_and_issue_shared_cell(tc, virt_cell, lookup_result->block);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 fix 19/17] dm thin: requeue deferred_cells when in requeue_mode
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (17 preceding siblings ...)
2014-10-19 23:02 ` [for-3.19 PATCH v2 fix 18/17] dm thin: fix process_shared_bio (fixes SnapshotTests) Mike Snitzer
@ 2014-10-19 23:02 ` Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 20/17] dm thin: optimize retry_bios_on_resume Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 21/17] dm thin: refactor requeue_io to eliminate spinlock bouncing Mike Snitzer
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-19 23:02 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Fold into "dm thin: defer whole cells rather than individual bios"
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 39 +++++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index f5c59cf..1351df0 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -382,6 +382,11 @@ static void cell_success(struct pool *pool, struct dm_bio_prison_cell *cell)
cell_error_with_code(pool, cell, 0);
}
+static void cell_requeue(struct pool *pool, struct dm_bio_prison_cell *cell)
+{
+ cell_error_with_code(pool, cell, DM_ENDIO_REQUEUE);
+}
+
/*----------------------------------------------------------------*/
/*
@@ -469,10 +474,28 @@ static void requeue_bio_list(struct thin_c *tc, struct bio_list *master)
bio_endio(bio, DM_ENDIO_REQUEUE);
}
+static void requeue_deferred_cells(struct thin_c *tc)
+{
+ struct pool *pool = tc->pool;
+ unsigned long flags;
+ struct list_head cells;
+ struct dm_bio_prison_cell *cell, *tmp;
+
+ INIT_LIST_HEAD(&cells);
+
+ spin_lock_irqsave(&tc->lock, flags);
+ list_splice_init(&tc->deferred_cells, &cells);
+ spin_unlock_irqrestore(&tc->lock, flags);
+
+ list_for_each_entry_safe(cell, tmp, &cells, user_list)
+ cell_requeue(pool, cell);
+}
+
static void requeue_io(struct thin_c *tc)
{
requeue_bio_list(tc, &tc->deferred_bio_list);
requeue_bio_list(tc, &tc->retry_on_resume_list);
+ requeue_deferred_cells(tc);
}
static void error_thin_retry_list(struct thin_c *tc)
@@ -1260,6 +1283,11 @@ static void process_discard_cell(struct thin_c *tc, struct dm_bio_prison_cell *c
struct dm_thin_lookup_result lookup_result;
struct dm_thin_new_mapping *m;
+ if (tc->requeue_mode) {
+ cell_requeue(pool, cell);
+ return;
+ }
+
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
@@ -1499,6 +1527,11 @@ static void process_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
dm_block_t block = get_bio_block(tc, bio);
struct dm_thin_lookup_result lookup_result;
+ if (tc->requeue_mode) {
+ cell_requeue(pool, cell);
+ return;
+ }
+
r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
switch (r) {
case 0:
@@ -1815,12 +1848,6 @@ static void process_thin_deferred_cells(struct thin_c *tc)
struct dm_bio_prison_cell *cell;
unsigned i, j, count;
- // FIXME: push down into per cell processing
- if (tc->requeue_mode) {
- requeue_bio_list(tc, &tc->deferred_bio_list);
- return;
- }
-
INIT_LIST_HEAD(&cells);
spin_lock_irqsave(&tc->lock, flags);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 20/17] dm thin: optimize retry_bios_on_resume
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (18 preceding siblings ...)
2014-10-19 23:02 ` [for-3.19 PATCH v2 fix 19/17] dm thin: requeue deferred_cells when in requeue_mode Mike Snitzer
@ 2014-10-19 23:02 ` Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 21/17] dm thin: refactor requeue_io to eliminate spinlock bouncing Mike Snitzer
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-19 23:02 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Eliminate redundant should_error_unserviceable_bio check and error
loop.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1351df0..917ade5 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1263,13 +1263,8 @@ static void retry_bios_on_resume(struct pool *pool, struct dm_bio_prison_cell *c
bio_list_init(&bios);
cell_release(pool, cell, &bios);
- error = should_error_unserviceable_bio(pool);
- if (error)
- while ((bio = bio_list_pop(&bios)))
- bio_endio(bio, error);
- else
- while ((bio = bio_list_pop(&bios)))
- retry_on_resume(bio);
+ while ((bio = bio_list_pop(&bios)))
+ retry_on_resume(bio);
}
static void process_discard_cell(struct thin_c *tc, struct dm_bio_prison_cell *cell)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [for-3.19 PATCH v2 21/17] dm thin: refactor requeue_io to eliminate spinlock bouncing
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
` (19 preceding siblings ...)
2014-10-19 23:02 ` [for-3.19 PATCH v2 20/17] dm thin: optimize retry_bios_on_resume Mike Snitzer
@ 2014-10-19 23:02 ` Mike Snitzer
20 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2014-10-19 23:02 UTC (permalink / raw)
To: dm-devel; +Cc: ejt
Also refactor some other bio_list erroring helpers.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
drivers/md/dm-thin.c | 43 +++++++++++++++++++++++--------------------
1 file changed, 23 insertions(+), 20 deletions(-)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 917ade5..5fb5df1 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -457,21 +457,32 @@ struct dm_thin_endio_hook {
struct rb_node rb_node;
};
-static void requeue_bio_list(struct thin_c *tc, struct bio_list *master)
+static void __merge_bio_list(struct bio_list *bios, struct bio_list *master)
+{
+ bio_list_merge(bios, master);
+ bio_list_init(master);
+}
+
+static void error_bio_list(struct bio_list *bios, int error)
{
struct bio *bio;
+
+ while ((bio = bio_list_pop(bios)))
+ bio_endio(bio, error);
+}
+
+static void error_thin_bio_list(struct thin_c *tc, struct bio_list *master, int error)
+{
struct bio_list bios;
unsigned long flags;
bio_list_init(&bios);
spin_lock_irqsave(&tc->lock, flags);
- bio_list_merge(&bios, master);
- bio_list_init(master);
+ __merge_bio_list(&bios, master);
spin_unlock_irqrestore(&tc->lock, flags);
- while ((bio = bio_list_pop(&bios)))
- bio_endio(bio, DM_ENDIO_REQUEUE);
+ error_bio_list(&bios, error);
}
static void requeue_deferred_cells(struct thin_c *tc)
@@ -493,26 +504,18 @@ static void requeue_deferred_cells(struct thin_c *tc)
static void requeue_io(struct thin_c *tc)
{
- requeue_bio_list(tc, &tc->deferred_bio_list);
- requeue_bio_list(tc, &tc->retry_on_resume_list);
- requeue_deferred_cells(tc);
-}
-
-static void error_thin_retry_list(struct thin_c *tc)
-{
- struct bio *bio;
- unsigned long flags;
struct bio_list bios;
+ unsigned long flags;
bio_list_init(&bios);
spin_lock_irqsave(&tc->lock, flags);
- bio_list_merge(&bios, &tc->retry_on_resume_list);
- bio_list_init(&tc->retry_on_resume_list);
+ __merge_bio_list(&bios, &tc->deferred_bio_list);
+ __merge_bio_list(&bios, &tc->retry_on_resume_list);
spin_unlock_irqrestore(&tc->lock, flags);
- while ((bio = bio_list_pop(&bios)))
- bio_io_error(bio);
+ error_bio_list(&bios, DM_ENDIO_REQUEUE);
+ requeue_deferred_cells(tc);
}
static void error_retry_list(struct pool *pool)
@@ -521,7 +524,7 @@ static void error_retry_list(struct pool *pool)
rcu_read_lock();
list_for_each_entry_rcu(tc, &pool->active_thins, list)
- error_thin_retry_list(tc);
+ error_thin_bio_list(tc, &tc->retry_on_resume_list, -EIO);
rcu_read_unlock();
}
@@ -1752,7 +1755,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
unsigned count = 0;
if (tc->requeue_mode) {
- requeue_bio_list(tc, &tc->deferred_bio_list);
+ error_thin_bio_list(tc, &tc->deferred_bio_list, DM_ENDIO_REQUEUE);
return;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree
2014-10-17 18:37 ` [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
@ 2014-10-21 22:48 ` Mikulas Patocka
2014-10-22 9:41 ` Joe Thornber
0 siblings, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2014-10-21 22:48 UTC (permalink / raw)
To: Mike Snitzer; +Cc: dm-devel, ejt
I wonder if we need this patch if it doesn't really help performance.
We don't have any measurement that shows that this is benefical. Rbtree
causes more cache hits and more branch mispredictions than hash, so I
doubt there is an advantage.
In case you observe performance degradation due to hash collisions, we can
- use better hash function (from include/linux/hash.h)
- use hashing with rb-tree root at each hash location (that would keep low
overhead of hashing and guaranteed worst complexity log n)
Mikulas
On Fri, 17 Oct 2014, Mike Snitzer wrote:
> From: Joe Thornber <ejt@redhat.com>
>
> Converting over to using an rbtree eliminates a fixed 8MB allocation
> from vmalloc space for the hash table.
>
> Signed-off-by: Joe Thornber <ejt@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
> drivers/md/dm-bufio.c | 97 ++++++++++++++++++++++++++++-----------------------
> 1 file changed, 54 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 0be200b..dcaa1d9 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -14,6 +14,7 @@
> #include <linux/vmalloc.h>
> #include <linux/shrinker.h>
> #include <linux/module.h>
> +#include <linux/rbtree.h>
>
> #define DM_MSG_PREFIX "bufio"
>
> @@ -48,14 +49,6 @@
> #define DM_BUFIO_INLINE_VECS 16
>
> /*
> - * Buffer hash
> - */
> -#define DM_BUFIO_HASH_BITS 20
> -#define DM_BUFIO_HASH(block) \
> - ((((block) >> DM_BUFIO_HASH_BITS) ^ (block)) & \
> - ((1 << DM_BUFIO_HASH_BITS) - 1))
> -
> -/*
> * Don't try to use kmem_cache_alloc for blocks larger than this.
> * For explanation, see alloc_buffer_data below.
> */
> @@ -106,7 +99,7 @@ struct dm_bufio_client {
>
> unsigned minimum_buffers;
>
> - struct hlist_head *cache_hash;
> + struct rb_root buffer_tree;
> wait_queue_head_t free_buffer_wait;
>
> int async_write_error;
> @@ -135,7 +128,7 @@ enum data_mode {
> };
>
> struct dm_buffer {
> - struct hlist_node hash_list;
> + struct rb_node node;
> struct list_head lru_list;
> sector_t block;
> void *data;
> @@ -253,6 +246,53 @@ static LIST_HEAD(dm_bufio_all_clients);
> */
> static DEFINE_MUTEX(dm_bufio_clients_lock);
>
> +/*----------------------------------------------------------------
> + * A red/black tree acts as an index for all the buffers.
> + *--------------------------------------------------------------*/
> +static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
> +{
> + struct rb_node *n = c->buffer_tree.rb_node;
> + struct dm_buffer *b;
> +
> + while (n) {
> + b = container_of(n, struct dm_buffer, node);
> +
> + if (b->block == block)
> + return b;
> +
> + n = (b->block < block) ? n->rb_left : n->rb_right;
> + }
> +
> + return NULL;
> +}
> +
> +static void __insert(struct dm_bufio_client *c, struct dm_buffer *b)
> +{
> + struct rb_node **new = &c->buffer_tree.rb_node, *parent = NULL;
> + struct dm_buffer *found;
> +
> + while (*new) {
> + found = container_of(*new, struct dm_buffer, node);
> +
> + if (found->block == b->block) {
> + BUG_ON(found != b);
> + return;
> + }
> +
> + parent = *new;
> + new = (found->block < b->block) ?
> + &((*new)->rb_left) : &((*new)->rb_right);
> + }
> +
> + rb_link_node(&b->node, parent, new);
> + rb_insert_color(&b->node, &c->buffer_tree);
> +}
> +
> +static void __remove(struct dm_bufio_client *c, struct dm_buffer *b)
> +{
> + rb_erase(&b->node, &c->buffer_tree);
> +}
> +
> /*----------------------------------------------------------------*/
>
> static void adjust_total_allocated(enum data_mode data_mode, long diff)
> @@ -434,7 +474,7 @@ static void __link_buffer(struct dm_buffer *b, sector_t block, int dirty)
> b->block = block;
> b->list_mode = dirty;
> list_add(&b->lru_list, &c->lru[dirty]);
> - hlist_add_head(&b->hash_list, &c->cache_hash[DM_BUFIO_HASH(block)]);
> + __insert(b->c, b);
> b->last_accessed = jiffies;
> }
>
> @@ -448,7 +488,7 @@ static void __unlink_buffer(struct dm_buffer *b)
> BUG_ON(!c->n_buffers[b->list_mode]);
>
> c->n_buffers[b->list_mode]--;
> - hlist_del(&b->hash_list);
> + __remove(b->c, b);
> list_del(&b->lru_list);
> }
>
> @@ -888,23 +928,6 @@ static void __check_watermark(struct dm_bufio_client *c,
> __write_dirty_buffers_async(c, 1, write_list);
> }
>
> -/*
> - * Find a buffer in the hash.
> - */
> -static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
> -{
> - struct dm_buffer *b;
> -
> - hlist_for_each_entry(b, &c->cache_hash[DM_BUFIO_HASH(block)],
> - hash_list) {
> - dm_bufio_cond_resched();
> - if (b->block == block)
> - return b;
> - }
> -
> - return NULL;
> -}
> -
> /*----------------------------------------------------------------
> * Getting a buffer
> *--------------------------------------------------------------*/
> @@ -1534,11 +1557,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
> r = -ENOMEM;
> goto bad_client;
> }
> - c->cache_hash = vmalloc(sizeof(struct hlist_head) << DM_BUFIO_HASH_BITS);
> - if (!c->cache_hash) {
> - r = -ENOMEM;
> - goto bad_hash;
> - }
> + c->buffer_tree = RB_ROOT;
>
> c->bdev = bdev;
> c->block_size = block_size;
> @@ -1557,9 +1576,6 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
> c->n_buffers[i] = 0;
> }
>
> - for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
> - INIT_HLIST_HEAD(&c->cache_hash[i]);
> -
> mutex_init(&c->lock);
> INIT_LIST_HEAD(&c->reserved_buffers);
> c->need_reserved_buffers = reserved_buffers;
> @@ -1633,8 +1649,6 @@ bad_cache:
> }
> dm_io_client_destroy(c->dm_io);
> bad_dm_io:
> - vfree(c->cache_hash);
> -bad_hash:
> kfree(c);
> bad_client:
> return ERR_PTR(r);
> @@ -1661,9 +1675,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
>
> mutex_unlock(&dm_bufio_clients_lock);
>
> - for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
> - BUG_ON(!hlist_empty(&c->cache_hash[i]));
> -
> + BUG_ON(!RB_EMPTY_ROOT(&c->buffer_tree));
> BUG_ON(c->need_reserved_buffers);
>
> while (!list_empty(&c->reserved_buffers)) {
> @@ -1681,7 +1693,6 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
> BUG_ON(c->n_buffers[i]);
>
> dm_io_client_destroy(c->dm_io);
> - vfree(c->cache_hash);
> kfree(c);
> }
> EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
> --
> 1.8.3.1
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree
2014-10-21 22:48 ` Mikulas Patocka
@ 2014-10-22 9:41 ` Joe Thornber
0 siblings, 0 replies; 43+ messages in thread
From: Joe Thornber @ 2014-10-22 9:41 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, ejt, Mike Snitzer
On Tue, Oct 21, 2014 at 06:48:00PM -0400, Mikulas Patocka wrote:
> I wonder if we need this patch if it doesn't really help performance.
I wasn't concerned with performance. I just thought it absurd that
bufio was evicting all my metadata yet using 8m per client for a hash
table.
- Joe
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [for-3.19 PATCH v2 02/17] dm bufio: evict buffers that are past the max age but retain some buffers
2014-10-17 18:37 ` [for-3.19 PATCH v2 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
@ 2014-10-31 16:37 ` Mikulas Patocka
0 siblings, 0 replies; 43+ messages in thread
From: Mikulas Patocka @ 2014-10-31 16:37 UTC (permalink / raw)
To: Mike Snitzer; +Cc: dm-devel, ejt
Hi
I've already created a patch for this purpose, see patches
dm-bufio-ratio.patch and dm-bufio-minimum-buffers.patch here:
http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-cache/series.html
The difference is that my patch sets the default amount of memory to keep
dynamically, as 1/500 of total memory.
Mikulas
On Fri, 17 Oct 2014, Mike Snitzer wrote:
> From: Joe Thornber <ejt@redhat.com>
>
> These changes help keep metadata backed by dm-bufio in-core longer which
> fixes reports of metadata churn in the face of heavy random IO workloads.
>
> Before, bufio evicted all buffers older than DM_BUFIO_DEFAULT_AGE_SECS.
> Having a device (e.g. dm-thinp or dm-cache) lose all metadata just
> because associated buffers had been idle for some time is unfriendly.
>
> Now, the user may now configure the number of bytes that bufio retains
> using the 'retain_bytes' module parameter. The default is 256K.
>
> Also, the DM_BUFIO_WORK_TIMER_SECS and DM_BUFIO_DEFAULT_AGE_SECS
> defaults were quite low so increase them (to 30 and 300 respectively).
>
> Signed-off-by: Joe Thornber <ejt@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
> drivers/md/dm-bufio.c | 109 ++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 75 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index dcaa1d9..99579c8 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -35,12 +35,17 @@
> /*
> * Check buffer ages in this interval (seconds)
> */
> -#define DM_BUFIO_WORK_TIMER_SECS 10
> +#define DM_BUFIO_WORK_TIMER_SECS 30
>
> /*
> * Free buffers when they are older than this (seconds)
> */
> -#define DM_BUFIO_DEFAULT_AGE_SECS 60
> +#define DM_BUFIO_DEFAULT_AGE_SECS 300
> +
> +/*
> + * The nr of bytes of cached data to keep around.
> + */
> +#define DM_BUFIO_DEFAULT_RETAIN_BYTES (256 * 1024)
>
> /*
> * The number of bvec entries that are embedded directly in the buffer.
> @@ -216,6 +221,7 @@ static DEFINE_SPINLOCK(param_spinlock);
> * Buffers are freed after this timeout
> */
> static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS;
> +static unsigned dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES;
>
> static unsigned long dm_bufio_peak_allocated;
> static unsigned long dm_bufio_allocated_kmem_cache;
> @@ -1457,45 +1463,52 @@ static void drop_buffers(struct dm_bufio_client *c)
> }
>
> /*
> - * Test if the buffer is unused and too old, and commit it.
> + * We may not be able to evict this buffer if IO pending or the client
> + * is still using it. Caller is expected to know buffer is too old.
> + *
> * And if GFP_NOFS is used, we must not do any I/O because we hold
> * dm_bufio_clients_lock and we would risk deadlock if the I/O gets
> * rerouted to different bufio client.
> */
> -static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
> - unsigned long max_jiffies)
> +static bool __try_evict_buffer(struct dm_buffer *b, gfp_t gfp)
> {
> - if (jiffies - b->last_accessed < max_jiffies)
> - return 0;
> -
> if (!(gfp & __GFP_FS)) {
> if (test_bit(B_READING, &b->state) ||
> test_bit(B_WRITING, &b->state) ||
> test_bit(B_DIRTY, &b->state))
> - return 0;
> + return false;
> }
>
> if (b->hold_count)
> - return 0;
> + return false;
>
> __make_buffer_clean(b);
> __unlink_buffer(b);
> __free_buffer_wake(b);
>
> - return 1;
> + return true;
> }
>
> -static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> - gfp_t gfp_mask)
> +static unsigned get_retain_buffers(struct dm_bufio_client *c)
> +{
> + unsigned retain_bytes = ACCESS_ONCE(dm_bufio_retain_bytes);
> + return retain_bytes / c->block_size;
> +}
> +
> +static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> + gfp_t gfp_mask)
> {
> int l;
> struct dm_buffer *b, *tmp;
> - long freed = 0;
> + unsigned long freed = 0;
> + unsigned long count = nr_to_scan;
> + unsigned retain_target = get_retain_buffers(c);
>
> for (l = 0; l < LIST_SIZE; l++) {
> list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
> - freed += __cleanup_old_buffer(b, gfp_mask, 0);
> - if (!--nr_to_scan)
> + if (__try_evict_buffer(b, gfp_mask))
> + freed++;
> + if (!--nr_to_scan || ((count - freed) <= retain_target))
> return freed;
> dm_bufio_cond_resched();
> }
> @@ -1697,31 +1710,56 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
> }
> EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
>
> -static void cleanup_old_buffers(void)
> +static unsigned get_max_age_hz(void)
> {
> - unsigned long max_age = ACCESS_ONCE(dm_bufio_max_age);
> - struct dm_bufio_client *c;
> + unsigned max_age = ACCESS_ONCE(dm_bufio_max_age);
>
> - if (max_age > ULONG_MAX / HZ)
> - max_age = ULONG_MAX / HZ;
> + if (max_age > UINT_MAX / HZ)
> + max_age = UINT_MAX / HZ;
>
> - mutex_lock(&dm_bufio_clients_lock);
> - list_for_each_entry(c, &dm_bufio_all_clients, client_list) {
> - if (!dm_bufio_trylock(c))
> - continue;
> + return max_age * HZ;
> +}
>
> - while (!list_empty(&c->lru[LIST_CLEAN])) {
> - struct dm_buffer *b;
> - b = list_entry(c->lru[LIST_CLEAN].prev,
> - struct dm_buffer, lru_list);
> - if (!__cleanup_old_buffer(b, 0, max_age * HZ))
> - break;
> - dm_bufio_cond_resched();
> - }
> +static bool older_than(struct dm_buffer *b, unsigned long age_hz)
> +{
> + return (jiffies - b->last_accessed) >= age_hz;
> +}
> +
> +static void __evict_old_buffers(struct dm_bufio_client *c, unsigned long age_hz)
> +{
> + struct dm_buffer *b, *tmp;
> + unsigned retain_target = get_retain_buffers(c);
> + unsigned count;
> +
> + dm_bufio_lock(c);
> +
> + count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
> + list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_CLEAN], lru_list) {
> + if (count <= retain_target)
> + break;
> +
> + if (!older_than(b, age_hz))
> + break;
> +
> + if (__try_evict_buffer(b, 0))
> + count--;
>
> - dm_bufio_unlock(c);
> dm_bufio_cond_resched();
> }
> +
> + dm_bufio_unlock(c);
> +}
> +
> +static void cleanup_old_buffers(void)
> +{
> + unsigned long max_age_hz = get_max_age_hz();
> + struct dm_bufio_client *c;
> +
> + mutex_lock(&dm_bufio_clients_lock);
> +
> + list_for_each_entry(c, &dm_bufio_all_clients, client_list)
> + __evict_old_buffers(c, max_age_hz);
> +
> mutex_unlock(&dm_bufio_clients_lock);
> }
>
> @@ -1846,6 +1884,9 @@ MODULE_PARM_DESC(max_cache_size_bytes, "Size of metadata cache");
> module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(max_age_seconds, "Max age of a buffer in seconds");
>
> +module_param_named(retain_bytes, dm_bufio_retain_bytes, uint, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(retain_bytes, "Try to keep at least this many bytes cached in memory");
> +
> module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, S_IRUGO | S_IWUSR);
> MODULE_PARM_DESC(peak_allocated_bytes, "Tracks the maximum allocated memory");
>
> --
> 1.8.3.1
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2014-10-31 16:37 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-17 6:06 [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 03/17] dm bio prison: switch to using a red black tree Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 05/17] dm transaction manager: add support for prefetching blocks of metadata Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 06/17] dm thin: prefetch missing metadata pages Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 07/17] dm thin: throttle incoming IO Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize Mike Snitzer
2014-10-17 6:06 ` [for-3.19 PATCH 09/17] dm: improve documentation and code clarity in dm_merge_bvec Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 10/17] dm thin: implement thin_merge Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 11/17] dm thin: grab a virtual cell before looking up the mapping Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 12/17] dm thin: performance improvement to discard processing Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 13/17] dm thin: factor out remap_and_issue_overwrite Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 14/17] dm thin: defer whole cells rather than individual bios Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 15/17] dm thin: remap the bios in a cell immediately Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 16/17] dm thin: direct dispatch when breaking sharing Mike Snitzer
2014-10-17 6:07 ` [for-3.19 PATCH 17/17] dm thin: sort the deferred cells Mike Snitzer
2014-10-17 18:31 ` [for-3.19 PATCH 00/17] dm thin: performance improvements Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 01/17] dm bufio: switch from a huge hash table to an rbtree Mike Snitzer
2014-10-21 22:48 ` Mikulas Patocka
2014-10-22 9:41 ` Joe Thornber
2014-10-17 18:37 ` [for-3.19 PATCH v2 02/17] dm bufio: evict buffers that are past the max age but retain some buffers Mike Snitzer
2014-10-31 16:37 ` Mikulas Patocka
2014-10-17 18:37 ` [for-3.19 PATCH v2 03/17] dm bio prison: switch to using a red black tree Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 04/17] dm thin metadata: change dm_thin_find_block to allow blocking, but not issuing, IO Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 05/17] dm transaction manager: add support for prefetching blocks of metadata Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 06/17] dm thin: prefetch missing metadata pages Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 07/17] dm thin: throttle incoming IO Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 08/17] dm thin: adjust max_sectors_kb based on thinp blocksize Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 09/17] dm: improve documentation and code clarity in dm_merge_bvec Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 10/17] dm thin: implement thin_merge Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 11/17] dm thin: grab a virtual cell before looking up the mapping Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 12/17] dm thin: performance improvement to discard processing Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 13/17] dm thin: factor out remap_and_issue_overwrite Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 14/17] dm thin: defer whole cells rather than individual bios Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 15/17] dm thin: remap the bios in a cell immediately Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 16/17] dm thin: direct dispatch when breaking sharing Mike Snitzer
2014-10-17 18:37 ` [for-3.19 PATCH v2 17/17] dm thin: sort the deferred cells Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 fix 18/17] dm thin: fix process_shared_bio (fixes SnapshotTests) Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 fix 19/17] dm thin: requeue deferred_cells when in requeue_mode Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 20/17] dm thin: optimize retry_bios_on_resume Mike Snitzer
2014-10-19 23:02 ` [for-3.19 PATCH v2 21/17] dm thin: refactor requeue_io to eliminate spinlock bouncing Mike Snitzer
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.