* problem w/ read caching..
From: Brad Walker @ 2012-09-12 20:01 UTC
To: linux-bcache

I am having a problem with BCache.
I’ve followed the documentation and have my cache attached. Here is
what dmesg tells me:
[ 372.622905] bcache: invalidating existing data
[ 372.637517] bcache: registered cache device rssda1
[ 400.704672] bcache: Caching dm-2 as bcache0 on set 16fd7139-f018-461c-9d9e-daa
I warmed up the cache by using an application (vdbench) to do random
reads over a 10GB region.
Everything looks good: the response time comes down as the cache warms
up. But for some reason the cache_hit_ratio is showing roughly 90%, and
yet I'm still seeing heavy activity on the disk device.
bwalker@nellis:~> cat
/sys/fs/bcache/16fd7139-f018-461c-9d9e-daa7666c7f1e/stats_total/cache_hit_ratio
94
bwalker@nellis:~>
Any ideas on why this might be happening are appreciated.
-brad w.
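
To pin down where that disk activity is going, per-device statistics are
the quickest check. A minimal sketch using sysstat's iostat, assuming the
device names from the dmesg output above (dm-2 is the backing device,
rssda1 the cache SSD); exact invocation may vary with the sysstat version:

# extended stats at 1-second intervals for the backing device,
# the cache SSD, and the composite bcache device
iostat -x dm-2 rssda1 bcache0 1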
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-09-13 18:43 UTC
To: Brad Walker; +Cc: linux-bcache

Sounds like a good portion of your IO is bypassing the cache. That will
happen if some of it's sequential, or if the SSD latency goes over a
threshold - sequential_cutoff, congested_read_threshold_us and
congested_write_threshold_us (if I'm remembering the names correctly)
are the settings that control all that. 0 disables all of them.
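
For reference, disabling those heuristics amounts to writing 0 into the
corresponding sysfs files. A minimal sketch, assuming the backing device
registered as bcache0 and the cache-set UUID from the dmesg output above
(exact paths can differ slightly between bcache versions):

# per backing device:
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
# per cache set:
echo 0 > /sys/fs/bcache/16fd7139-f018-461c-9d9e-daa7666c7f1e/congested_read_threshold_us
echo 0 > /sys/fs/bcache/16fd7139-f018-461c-9d9e-daa7666c7f1e/congested_write_threshold_us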
* Re: problem w/ read caching..
From: Brad Walker @ 2012-09-27 23:28 UTC
To: linux-bcache

Kent Overstreet <koverstreet@...> writes:
>
> Sounds like a good portion of your IO is bypassing the cache. That will
> happen if some of it's sequential, or if the SSD latency goes over a
> threshold - sequential_cutoff, congested_read_threshold_us and
> congested_write_threshold_us (if I'm remembering the names correctly)
> are the settings that control all that. 0 disables all of them.

So I set sequential_cutoff and congested_read_threshold_us both to 0.
Since I was only doing reads, I figured there was no need to mess with
the write option.

But I'm still seeing a problem.

My hardware is:
1 - Dell PowerEdge R710 w/ 24 x Xeon processors, 96GB of RAM
2 - Micron P320H SSD
3 - LSI storage device connected over a SAS interface

What I see is that when I do random reads over a 10GB region, the cache
warms up but hits a read-response plateau at about 7ms. I still see a
LOT (i.e. 32000 IOPS) of I/O to the disk.

If I run the same test over a 1GB region, it runs really fast - pretty
close to the max IOPS rate of the SSD.

So I'm thinking there is either a problem here or I have a bcache config
issue. I'm willing to try things, but I need some guidance on what to
look for, as it seems like a bcache issue.

Thanks for the help.

-brad w.
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-09-28 18:59 UTC
To: Brad Walker; +Cc: linux-bcache

On Thu, Sep 27, 2012 at 11:28:20PM +0000, Brad Walker wrote:
> What I see is that when I do random reads over a 10GB region, the cache
> warms up but hits a read-response plateau at about 7ms. I still see a
> LOT (i.e. 32000 IOPS) of I/O to the disk.

By disk do you mean the spinning disk, or just the bcache device?

I'm wondering if your storage array really is that fast (which would
explain the 7ms) or if something weird is going on.

Cache hit ratio or iostat would tell you.

> If I run the same test over a 1GB region, it runs really fast - pretty
> close to the max IOPS rate of the SSD.
>
> So I'm thinking there is either a problem here or I have a bcache config
> issue.

Sounds like some sort of bcache problem, hrm.

The most likely cause is that something is keeping the cache from warming
up, and some IO is still going to disk. That used to be an issue with the
old synchronization for updating the cache on a cache miss, but it
shouldn't be anymore...

Check the number of cache misses after a run - if it's going up when all
the data should be in the cache, that's one bug. If there are no cache
misses and you're still seeing 7ms latency... well, that would be weird.
Queueing delays, maybe.
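
One simple way to track those counters between runs is to dump the whole
stats directory before and after a test; a sketch, with the cache-set UUID
left as a placeholder to substitute:

# one value per line, prefixed by the counter (file) name
grep . /sys/fs/bcache/<cache-set-uuid>/stats_total/*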
* Re: problem w/ read caching..
From: Brad Walker @ 2012-10-01 19:18 UTC
To: linux-bcache

Kent Overstreet <koverstreet@...> writes:
>
> By disk do you mean the spinning disk, or just the bcache device?
>
> I'm wondering if your storage array really is that fast (which would
> explain the 7ms) or if something weird is going on.
>
> Cache hit ratio or iostat would tell you.

The cache_hit_ratio is 99%. And yet, iostat still shows I/O running to
the RAID array.

> Check the number of cache misses after a run - if it's going up when all
> the data should be in the cache, that's one bug. If there are no cache
> misses and you're still seeing 7ms latency... well, that would be weird.
> Queueing delays, maybe.

After running my tests with the cache fully warmed, the cache_hit_ratio
goes to 99%.

Yet cache misses are stable and not changing. Cache hits are increasing,
and iostat is still showing 32K blocks being read from disk.

Any ideas on how to debug this?

Thanks for the help.

-brad w.
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-10-01 19:38 UTC
To: Brad Walker; +Cc: linux-bcache

On Mon, Oct 01, 2012 at 07:18:40PM +0000, Brad Walker wrote:
> Yet cache misses are stable and not changing. Cache hits are increasing,
> and iostat is still showing 32K blocks being read from disk.
>
> Any ideas on how to debug this?

What about cache_bypass_hits, cache_bypass_misses?
* Re: problem w/ read caching..
From: Brad Walker @ 2012-10-01 20:05 UTC
To: linux-bcache

Kent Overstreet <koverstreet@...> writes:
>
> What about cache_bypass_hits, cache_bypass_misses?

cache_bypass_hits = 0
cache_bypass_misses = 0
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-10-01 20:37 UTC
To: Brad Walker; +Cc: linux-bcache

On Mon, Oct 01, 2012 at 08:05:14PM +0000, Brad Walker wrote:
> cache_bypass_hits = 0
> cache_bypass_misses = 0

I should've just asked you for all the stats - what about
cache_miss_collisions? Also, internal/cache_read_races?

Perhaps stuff is getting evicted from the cache for some reason... How
big is the SSD?

Is cache_replacement_policy lru? (The default - cache_replacement_policy
is in cache0/.)

What does cache0/priority_stats say?
* Re: problem w/ read caching..
From: Brad Walker @ 2012-10-01 20:56 UTC
To: linux-bcache

Kent Overstreet <koverstreet@...> writes:
>
> I should've just asked you for all the stats - what about
> cache_miss_collisions?

bwalker:/sys/fs/bcache/dd10d09c-0605-462c-af85-8466b0aa2017/stats_total> ls
bypassed           cache_bypass_misses  cache_hits             cache_misses
cache_bypass_hits  cache_hit_ratio      cache_miss_collisions  cache_readaheads
bwalker:/sys/fs/bcache/dd10d09c-0605-462c-af85-8466b0aa2017/stats_total> cat *
0
0
0
98
162315691
0
2329081
0
bwalker:/sys/fs/bcache/dd10d09c-0605-462c-af85-8466b0aa2017/stats_total>

> Also, internal/cache_read_races?

cat /sys/fs/bcache/dd10d09c-0605-462c-af85-8466b0aa2017/internal/cache_read_races
0

> Perhaps stuff is getting evicted from the cache for some reason... How
> big is the SSD?

nellis: # sg_inq /dev/rssda
ATA device: model, serial number and firmware revision:
      Micron P320h-MTFDGAR350SAH   000000001143020287B2   B1490300
nellis: # fdisk /dev/rssda

The device presents a logical sector size that is smaller than
the physical sector size. Aligning to a physical sector (or optimal
I/O) size boundary is recommended, or performance may be impacted.

Command (m for help): p

Disk /dev/rssda: 350.1 GB, 350078754816 bytes
210 heads, 56 sectors/track, 58141 cylinders, total 683747568 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x16e1596f

     Device Boot      Start         End      Blocks   Id  System
/dev/rssda1            2048   683747567   341872760   83  Linux

Command (m for help):

> Is cache_replacement_policy lru? (The default - cache_replacement_policy
> is in cache0/.)

LRU

> What does cache0/priority_stats say?

nellis: # cat priority_stats
Unused:         97%
Metadata:       0%
Average:        0
Sectors per Q:  582336
Quantiles:      [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
nellis: #
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-10-01 21:14 UTC
To: Brad Walker; +Cc: linux-bcache

So cache_miss_collisions and cache_read_races are 0...

----

I was just browsing around the code, and I bet I know what it is -
btree_insert_check_key() is failing because the btree node is full.

The way the code works is that on a cache miss, we can't just blindly
insert that data into the cache: if a write happens to the same location
after the cache miss but before the data from the cache miss gets
inserted, we'd overwrite the write with stale data.

So btree_insert_check_key() inserts a fake key atomically with the cache
miss - we don't need that key to be persisted, so we can skip journalling
and all the normal btree insert code, which is how we can insert this
fake key atomically. Then, when we go to insert the real key that points
to the data from the cache miss, we check whether the fake key we
inserted is still present, and fail the insert if it's not. It's
cmpxchg(), but for the btree.

Anyways... since we're skipping all the normal btree_insert() code,
btree_insert_check_key() can't split the btree node if it's full - if the
btree node is full it just fails. This would be perfectly fine in any
normal workload where you've got some mix of reads and writes: if the
btree node is full, a write will come along to split it. But the
synthetic workload is a bit of a pathological case here :)

But we should confirm this really is what's going on... Can you apply
this patch and rerun to test my theory? See if the number of times the
printk fires lines up with the number of cache misses.

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 4102267..d5c5313 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1875,9 +1875,13 @@ bool bch_btree_insert_check_key(struct btree *b, struct btree_op *op,
 	rw_unlock(false, b);
 	rw_lock(true, b, b->level);
 
+	if (should_split(b)) {
+		printk(KERN_DEBUG "bcache: bch_btree_insert_check_key() failed because btree node full\n");
+		goto out;
+	}
+
 	if (b->key.ptr[0] != btree_ptr ||
-	    b->seq != seq + 1 ||
-	    should_split(b))
+	    b->seq != seq + 1)
 		goto out;
 
 	op->replace = KEY(op->inode, bio_end(bio), bio_sectors(bio));
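
A quick way to compare the two counts after a run, sketched on the
assumption that the debug patch above is applied and its message text is
left unchanged:

# how many times the new printk has fired since boot
dmesg | grep -c 'bch_btree_insert_check_key() failed'
# cache misses over the same interval (substitute your cache-set UUID)
cat /sys/fs/bcache/<cache-set-uuid>/stats_total/cache_misses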
* Re: problem w/ read caching..
From: Brad Walker @ 2012-10-01 22:26 UTC
To: linux-bcache

Kent Overstreet <koverstreet@...> writes:
>
> I was just browsing around the code, and I bet I know what it is -
> btree_insert_check_key() is failing because the btree node is full.
>
> But we should confirm this really is what's going on... Can you apply
> this patch and rerun to test my theory? See if the number of times the
> printk fires lines up with the number of cache misses.

I applied this change and I see a LOT of the messages.

And the rate seems to be increasing.

-brad w.
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-10-01 23:00 UTC
To: Brad Walker; +Cc: linux-bcache

On Mon, Oct 01, 2012 at 10:26:43PM +0000, Brad Walker wrote:
> I applied this change and I see a LOT of the messages.
>
> And the rate seems to be increasing.

Sweet, we know what it is then.

So, like I mentioned, this won't be an issue on any workload with mixed
reads/writes, so if that's what your production workloads are then this
may not matter to you.

For warming up the cache, doing a few random writes (just enough that
you hit all the btree nodes - and there aren't many btree nodes; cat
internal/btree_nodes) will fix it.

A real fix for this shouldn't be too hard, but it's not exactly trivial
and it'll be a pain to test... not quite sure when I'll get to it, but
it would be good to have it fixed.
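
One way to mix in those writes is to give the warm-up job a small write
fraction - a sketch using fio as a stand-in for the vdbench workload
described earlier, with assumed sizing and device names:

# 99% random reads / 1% random writes over the 10GB test region, 4k blocks;
# the occasional write gives full btree nodes a chance to split.
# Note: this writes to the raw bcache device, so only use it on a test setup.
fio --name=warmup --filename=/dev/bcache0 --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=99 --bs=4k --iodepth=32 --size=10G --runtime=300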
* Re: problem w/ read caching..
From: Kent Overstreet @ 2012-10-03 4:44 UTC
To: Brad Walker; +Cc: linux-bcache

Man, this is a frustrating issue :)

I haven't been able to come up with a way of fixing it right without
rewriting a bunch of code. I do have a workaround figured out, though,
if this is a real issue for you - it'll just increase internal
fragmentation in the btree a bit, but with the large btree nodes you
normally want, not enough to matter.

Though like I said, as long as your workload isn't 100% reads this
shouldn't be an issue.
* Re: problem w/ read caching..
From: Brad Walker @ 2012-10-08 16:39 UTC
To: linux-bcache

Kent Overstreet <koverstreet@...> writes:
>
> I haven't been able to come up with a way of fixing it right without
> rewriting a bunch of code. I do have a workaround figured out, though,
> if this is a real issue for you - it'll just increase internal
> fragmentation in the btree a bit, but with the large btree nodes you
> normally want, not enough to matter.
>
> Though like I said, as long as your workload isn't 100% reads this
> shouldn't be an issue.

Sorry to be away for a few days.

I can imagine it is frustrating. I've been looking at the code and it
seems pretty tough. I think it would be worthwhile to have this fixed if
possible, as it benefits small caches (i.e. 1GB in size) as well.

Basically, I made my mix of reads versus writes 99 to 1, like you said to
do. So 99% of the time I'm doing a read. When I do this, I never see my
IOPS rate get over approximately 45K.

Also, previously, when I ran my test over a 1GB region with 100% reads,
once the cache warmed I would see 0.350ms response time for a read. Once
I changed the read/write mixture to 99% reads, the response time went
down to 0.150ms.

So I think it would be beneficial to both small and large workloads if we
could fix this.

I'm happy to test this for you or help out in any way possible.

Thanks.

-brad w.
* Re: problem w/ read caching..
From: Brad Walker @ 2012-09-28 16:31 UTC
To: jason; +Cc: linux-bcache

On Thu, Sep 27, 2012 at 9:41 PM, jason wrote:
> One thing that comes to mind for me is your 1gb test may be mostly hitting
> the ram cache on the ssd. That is generally where ssd makers, consumer and
> to a lesser extent enterprise, get the peak I/O ops numbers from.

I understand your thinking. But in tests using Facebook FlashCache, as
well as our internally developed caching software (I work for a big
storage company), that is not what I see. What I see is that once the
cache warms up, regardless of the size (i.e. 1GB, 10GB, 100GB), access
times are sub-millisecond.

> Can you do the same tests, 10gb and 1gb, using only the ssd as your block
> device to get a baseline for it without bcache in the picture?

Good suggestion! I have done this, and the data shows that my SSD has a
baseline of 800K IOPS when using a 4k block size.

> I would also do some mixed r/w tests just to sort of get a real world
> profile for your ssd and system.

I'll take a look at doing this over the weekend.

> You also don't mention what type and how many spindles you have in the
> mechanical array.

I'm using a RAID array which has lots of horsepower; the RAID 5
configuration has 6 drives, each a 15K rpm type. So the array has very
good performance.

One data point: when I run Facebook flashcache, once the cache is fully
warmed, I see about 0.60ms response time.

Any suggestions?

-brad w.