* Re: Local SSD cache for ceph on each compute node.
[not found] ` <ce98776b.9Ro.9Gf.cS.1kvo9AAN6V-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org>
@ 2016-03-29 13:39 ` Ric Wheeler
[not found] ` <56FA859B.5050608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Ric Wheeler @ 2016-03-29 13:39 UTC (permalink / raw)
To: Nick Fisk, 'Sage Weil'
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw, device-mapper development
On 03/29/2016 04:35 PM, Nick Fisk wrote:
> One thing I picked up on when looking at dm-cache for doing caching with
> RBDs is that it wasn't really designed to be used as a writeback cache for
> new writes, as in how you would expect a traditional writeback cache to
> work. It seems all the policies are designed around the idea that writes go
> to cache only if the block is already in the cache (through reads) or it's
> hot enough to promote. Although there did seem to be some tunables to alter
> this behaviour, posts on the mailing list seemed to suggest this wasn't how
> it was designed to be used. I'm not sure if this has been addressed since I
> last looked at it though.
>
> Depending on if you are trying to accelerate all writes, or just your "hot"
> blocks, this may or may not matter. Even <1GB local caches can make a huge
> difference to sync writes.
Hi Nick,
Some of the caching policies have changed recently as the team has looked at
different workloads.
Happy to introduce you to them if you want to discuss offline or post comments
over on their list: device-mapper development <dm-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
thanks!
Ric
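For reference, the promotion behaviour described above is controlled by the
mq policy's tunables, which can be changed at runtime with "dmsetup message".
A minimal sketch, assuming a cache device named "rbd-cached"; the device name
and values are illustrative, not taken from this thread:
  # promote blocks to the cache on (roughly) their first write instead of
  # waiting for them to become "hot" (the default is higher)
  dmsetup message rbd-cached 0 write_promote_adjustment 0
  # leave reads at the more conservative default
  dmsetup message rbd-cached 0 read_promote_adjustment 4
  # sequential-stream detection threshold (0 in the status output quoted
  # later in this thread)
  dmsetup message rbd-cached 0 sequential_threshold 0
  # the active policy arguments are echoed at the end of the status line
  dmsetup status rbd-cached
Documentation/device-mapper/cache-policies.txt in the kernel tree has the
authoritative description of these tunables.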
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Local SSD cache for ceph on each compute node.
[not found] ` <56FA859B.5050608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-03-29 13:53 ` Nick Fisk
0 siblings, 0 replies; 6+ messages in thread
From: Nick Fisk @ 2016-03-29 13:53 UTC (permalink / raw)
To: 'Ric Wheeler', 'Sage Weil'
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
'device-mapper development'
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On Behalf Of
> Ric Wheeler
> Sent: 29 March 2016 14:40
> To: Nick Fisk <nick-ksME7r3P/wO1Qrn1Bg8BZw@public.gmane.org>; 'Sage Weil' <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; device-mapper development <dm-
> devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>
> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> > One thing I picked up on when looking at dm-cache for doing caching
> > with RBDs is that it wasn't really designed to be used as a writeback
> > cache for new writes, as in how you would expect a traditional
> > writeback cache to work. It seems all the policies are designed around
> > the idea that writes go to cache only if the block is already in the
> > cache (through reads) or it's hot enough to promote. Although there did
> > seem to be some tunables to alter this behaviour, posts on the mailing
> > list seemed to suggest this wasn't how it was designed to be used. I'm
> > not sure if this has been addressed since I last looked at it though.
> >
> > Depending on if you are trying to accelerate all writes, or just your
"hot"
> > blocks, this may or may not matter. Even <1GB local caches can make a
> > huge difference to sync writes.
>
> Hi Nick,
>
> Some of the caching policies have changed recently as the team has looked
> at different workloads.
>
> Happy to introduce you to them if you want to discuss offline or post
> comments over on their list: device-mapper development <dm-
> devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> thanks!
>
> Ric
Hi Ric,
Thanks for the heads up, just from a quick flick through I can see there are
now separate read and write promotion thresholds, so I can see just from
that it would be a lot more suitable for what I intended. I might try and
find some time to give it another test.
Nick
>
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [ceph-users] Local SSD cache for ceph on each compute node.
2016-03-29 13:39 ` Local SSD cache for ceph on each compute node Ric Wheeler
[not found] ` <56FA859B.5050608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-03-29 13:53 ` Nick Fisk
[not found] ` <9be96412.9Ro.9Gf.hp.1e3JYGxgKk@mailjet.com>
2 siblings, 0 replies; 6+ messages in thread
From: Nick Fisk @ 2016-03-29 13:53 UTC (permalink / raw)
To: 'Ric Wheeler', 'Sage Weil'
Cc: ceph-users, 'device-mapper development'
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of
> Ric Wheeler
> Sent: 29 March 2016 14:40
> To: Nick Fisk <nick@fisk.me.uk>; 'Sage Weil' <sage@newdream.net>
> Cc: ceph-users@lists.ceph.com; device-mapper development <dm-
> devel@redhat.com>
> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>
> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> > One thing I picked up on when looking at dm-cache for doing caching
> > with RBDs is that it wasn't really designed to be used as a writeback
> > cache for new writes, as in how you would expect a traditional
> > writeback cache to work. It seems all the policies are designed around
> > the idea that writes go to cache only if the block is already in the
> > cache (through reads) or it's hot enough to promote. Although there did
> > seem to be some tunables to alter this behaviour, posts on the mailing
> > list seemed to suggest this wasn't how it was designed to be used. I'm
> > not sure if this has been addressed since I last looked at it though.
> >
> > Depending on if you are trying to accelerate all writes, or just your
"hot"
> > blocks, this may or may not matter. Even <1GB local caches can make a
> > huge difference to sync writes.
>
> Hi Nick,
>
> Some of the caching policies have changed recently as the team has looked
> at different workloads.
>
> Happy to introduce you to them if you want to discuss offline or post
> comments over on their list: device-mapper development <dm-
> devel@redhat.com>
>
> thanks!
>
> Ric
Hi Ric,
Thanks for the heads up, just from a quick flick through I can see there are
now separate read and write promotion thresholds, so I can see just from
that it would be a lot more suitable for what I intended. I might try and
find some time to give it another test.
Nick
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Local SSD cache for ceph on each compute node.
[not found] ` <9be96412.9Ro.9Gf.hp.1e3JYGxgKk-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org>
@ 2016-03-29 15:35 ` Ric Wheeler
[not found] ` <56FAA0A7.6050009-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-03-30 13:02 ` [ceph-users] " Nick Fisk
0 siblings, 2 replies; 6+ messages in thread
From: Ric Wheeler @ 2016-03-29 15:35 UTC (permalink / raw)
To: Nick Fisk, 'Sage Weil'
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
'device-mapper development'
On 03/29/2016 04:53 PM, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On Behalf Of
>> Ric Wheeler
>> Sent: 29 March 2016 14:40
>> To: Nick Fisk <nick-ksME7r3P/wO1Qrn1Bg8BZw@public.gmane.org>; 'Sage Weil' <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>
>> Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org; device-mapper development <dm-
>> devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.
>>
>> On 03/29/2016 04:35 PM, Nick Fisk wrote:
>>> One thing I picked up on when looking at dm-cache for doing caching
>>> with RBDs is that it wasn't really designed to be used as a writeback
>>> cache for new writes, as in how you would expect a traditional
>>> writeback cache to work. It seems all the policies are designed around
>>> the idea that writes go to cache only if the block is already in the
>>> cache (through reads) or it's hot enough to promote. Although there did
>>> seem to be some tunables to alter this behaviour, posts on the mailing
>>> list seemed to suggest this wasn't how it was designed to be used. I'm
>>> not sure if this has been addressed since I last looked at it though.
>>>
>>> Depending on if you are trying to accelerate all writes, or just your
> "hot"
>>> blocks, this may or may not matter. Even <1GB local caches can make a
>>> huge difference to sync writes.
>> Hi Nick,
>>
>> Some of the caching policies have changed recently as the team has looked
>> at different workloads.
>>
>> Happy to introduce you to them if you want to discuss offline or post
>> comments over on their list: device-mapper development <dm-
>> devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> thanks!
>>
>> Ric
> Hi Ric,
>
> Thanks for the heads up, just from a quick flick through I can see there are
> now separate read and write promotion thresholds, so I can see just from
> that it would be a lot more suitable for what I intended. I might try and
> find some time to give it another test.
>
> Nick
Let us know how it works out for you, I know that they are very interested in
making sure things are useful :)
ric
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Local SSD cache for ceph on each compute node.
[not found] ` <56FAA0A7.6050009-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-03-30 13:02 ` Nick Fisk
0 siblings, 0 replies; 6+ messages in thread
From: Nick Fisk @ 2016-03-30 13:02 UTC (permalink / raw)
To: 'Ric Wheeler'
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
'device-mapper development'
> >>
> >> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> >>> One thing I picked up on when looking at dm-cache for doing caching
> >>> with RBDs is that it wasn't really designed to be used as a
> >>> writeback cache for new writes, as in how you would expect a
> >>> traditional writeback cache to work. It seems all the policies are
> >>> designed around the idea that writes go to cache only if the block
> >>> is already in the cache (through reads) or it's hot enough to
> >>> promote. Although there did seem to be some tunables to alter this
> >>> behaviour, posts on the mailing list seemed to suggest this wasn't
> >>> how it was designed to be used. I'm not sure if this has been addressed
> since I last looked at it though.
> >>>
> >>> Depending on if you are trying to accelerate all writes, or just
> >>> your
> > "hot"
> >>> blocks, this may or may not matter. Even <1GB local caches can make
> >>> a huge difference to sync writes.
> >> Hi Nick,
> >>
> >> Some of the caching policies have changed recently as the team has
> >> looked at different workloads.
> >>
> >> Happy to introduce you to them if you want to discuss offline or post
> >> comments over on their list: device-mapper development <dm-
> >> devel@redhat.com>
> >>
> >> thanks!
> >>
> >> Ric
> > Hi Ric,
> >
> > Thanks for the heads up, just from a quick flick through I can see
> > there are now separate read and write promotion thresholds, so I can
> > see just from that it would be a lot more suitable for what I
> > intended. I might try and find some time to give it another test.
> >
> > Nick
>
> Let us know how it works out for you, I know that they are very interested in
> making sure things are useful :)
Hi Ric,
I have given it another test and unfortunately it seems it's still not giving the improvements that I was expecting.
Here is a rough description of my test:
10GB RBD
1GB ZRAM kernel device for cache (Testing only)
0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157 0 1 writeback 2 migration_threshold 8192 mq 10 random_threshold 0 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 0 rw -
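A device with a status line like that can be assembled with a table of
roughly the following shape; the device paths are placeholders (the thread
does not give them), and the metadata and data areas would normally be carved
out of the fast device with LVM or dm-linear:
  # dm-cache table: start len cache <metadata dev> <cache dev> <origin dev>
  #   <block size> <#feature args> <features> <policy> <#policy args> <policy args>
  table="0 20971520 cache /dev/mapper/cache-meta /dev/mapper/cache-data /dev/rbd0 64 1 writeback mq 4 sequential_threshold 0 write_promote_adjustment 0"
  dmsetup create rbd-cached --table "$table"
Here 20971520 sectors is the 10GB origin, 64 sectors is a 32kB cache block,
writeback mode, mq policy; this matches the status output above.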
I'm then running a direct I/O 64kB sequential-write benchmark with fio at QD=1 against the DM device.
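Something along these lines with fio would reproduce that pattern (the device
path matches the sketch above; the exact options are an assumption, not
quoted from the thread):
  fio --name=seq64k --filename=/dev/mapper/rbd-cached --ioengine=libaio \
      --direct=1 --rw=write --bs=64k --iodepth=1 --time_based --runtime=60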
What I expect to happen is for this sequential stream of 64kB IOs to be coalesced into 4MB IOs and written out to the RBD at as high a queue depth as is possible/required, effectively meaning my 64kB sequential bandwidth should match the 4MB sequential bandwidth limit of my cluster. I'm more interested in replicating the behaviour of a write cache on a battery-backed RAID card than a read/write SSD cache, if that makes sense?
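The 4MB figure is the default RBD object size; it can be confirmed for a
given image with, for example (pool and image name are placeholders):
  rbd info rbd/<image-name>   # look for "order 22 (4096 kB objects)"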
An example real-life scenario would be sitting underneath an iSCSI target; something like ESXi generates that IO pattern when moving VMs between datastores.
What I was seeing is that I get a sudden burst of speed at the start of the fio test, but then it quickly drops down to the speed of the underlying RBD device. The dirty-blocks counter never seems to go very high, so I don't think it's a cache-full problem. The counter is probably no more than about 40% when the slowdown starts, and then it drops to less than 10% for the remainder of the test as it crawls along. It feels like it hits some sort of throttle and never recovers.
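For anyone wanting to watch the same thing, the dirty-block count is the 14th
field of the cache target's "dmsetup status" line (counting from the
"0 20971520 cache ..." output shown above), so something like this tracks it
during the run, with the device name from the earlier sketch:
  watch -n1 "dmsetup status rbd-cached | awk '{ print \"dirty cache blocks:\", \$14 }'"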
I've done similar tests with flashcache and it gives more stable performance over a longer period, but the associative hit-set behaviour seems to cause write misses due to the sequential IO pattern, which limits overall top performance.
Nick
>
> ric
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [ceph-users] Local SSD cache for ceph on each compute node.
2016-03-29 15:35 ` Ric Wheeler
[not found] ` <56FAA0A7.6050009-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-03-30 13:02 ` Nick Fisk
1 sibling, 0 replies; 6+ messages in thread
From: Nick Fisk @ 2016-03-30 13:02 UTC (permalink / raw)
To: 'Ric Wheeler'; +Cc: ceph-users, 'device-mapper development'
> >>
> >> On 03/29/2016 04:35 PM, Nick Fisk wrote:
> >>> One thing I picked up on when looking at dm-cache for doing caching
> >>> with RBDs is that it wasn't really designed to be used as a
> >>> writeback cache for new writes, as in how you would expect a
> >>> traditional writeback cache to work. It seems all the policies are
> >>> designed around the idea that writes go to cache only if the block
> >>> is already in the cache (through reads) or it's hot enough to
> >>> promote. Although there did seem to be some tunables to alter this
> >>> behaviour, posts on the mailing list seemed to suggest this wasn't
> >>> how it was designed to be used. I'm not sure if this has been addressed
> since I last looked at it though.
> >>>
> >>> Depending on if you are trying to accelerate all writes, or just
> >>> your
> > "hot"
> >>> blocks, this may or may not matter. Even <1GB local caches can make
> >>> a huge difference to sync writes.
> >> Hi Nick,
> >>
> >> Some of the caching policies have changed recently as the team has
> >> looked at different workloads.
> >>
> >> Happy to introduce you to them if you want to discuss offline or post
> >> comments over on their list: device-mapper development <dm-
> >> devel@redhat.com>
> >>
> >> thanks!
> >>
> >> Ric
> > Hi Ric,
> >
> > Thanks for the heads up, just from a quick flick through I can see
> > there are now separate read and write promotion thresholds, so I can
> > see just from that it would be a lot more suitable for what I
> > intended. I might try and find some time to give it another test.
> >
> > Nick
>
> Let us know how it works out for you, I know that they are very interested in
> making sure things are useful :)
Hi Ric,
I have given it another test and unfortunately it seems it's still not giving the improvements that I was expecting.
Here is a rough description of my test:
10GB RBD
1GB ZRAM kernel device for cache (Testing only)
0 20971520 cache 8 106/4096 64 32768/32768 2492 1239 349993 113194 47157 47157 0 1 writeback 2 migration_threshold 8192 mq 10 random_threshold 0 sequential_threshold 0 discard_promote_adjustment 1 read_promote_adjustment 4 write_promote_adjustment 0 rw -
I'm then running a direct I/O 64kB sequential-write benchmark with fio at QD=1 against the DM device.
What I expect to happen is for this sequential stream of 64kB IOs to be coalesced into 4MB IOs and written out to the RBD at as high a queue depth as is possible/required, effectively meaning my 64kB sequential bandwidth should match the 4MB sequential bandwidth limit of my cluster. I'm more interested in replicating the behaviour of a write cache on a battery-backed RAID card than a read/write SSD cache, if that makes sense?
An example real-life scenario would be sitting underneath an iSCSI target; something like ESXi generates that IO pattern when moving VMs between datastores.
What I was seeing is that I get a sudden burst of speed at the start of the fio test, but then it quickly drops down to the speed of the underlying RBD device. The dirty-blocks counter never seems to go very high, so I don't think it's a cache-full problem. The counter is probably no more than about 40% when the slowdown starts, and then it drops to less than 10% for the remainder of the test as it crawls along. It feels like it hits some sort of throttle and never recovers.
I've done similar tests with flashcache and it gives more stable performance over a longer period, but the associative hit-set behaviour seems to cause write misses due to the sequential IO pattern, which limits overall top performance.
Nick
>
> ric
>
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread
Thread overview: 6+ messages
[not found] <VI1PR05MB16772049FF17DD2D4DE93EDEE08A0@VI1PR05MB1677.eurprd05.prod.outlook.com>
[not found] ` <1003879409.38672781.1458091493153.JavaMail.zimbra@redhat.com>
[not found] ` <VI1PR05MB1677B32D621D52ECFBD8C889E08A0@VI1PR05MB1677.eurprd05.prod.outlook.com>
[not found] ` <1517351774.38675790.1458092531934.JavaMail.zimbra@redhat.com>
[not found] ` <8C4FE234-AC6A-4F80-84DA-DCB69C58E874@ebay.com>
[not found] ` <VI1PR05MB1677291EA89E9C02BEE437C2E08A0@VI1PR05MB1677.eurprd05.prod.outlook.com>
[not found] ` <054DE45B-30E4-44B6-88CE-23FC207FE41E@ebay.com>
[not found] ` <56F792F6.9050400@redhat.com>
[not found] ` <DC9C5DB4-43C1-49DB-A37E-F7C6A9D3D172@ebay.com>
[not found] ` <56FA46C9.6090500@redhat.com>
[not found] ` <E5E25BC0-A526-4EB8-9EC0-80F6A7734B23@ebay.com>
[not found] ` <56FA6792.1050005@redhat.com>
[not found] ` <alpine.DEB.2.11.1603290827510.6473@cpach.fuggernut.com>
[not found] ` <56FA7DEA.9010005@redhat.com>
[not found] ` <ce98776b.9Ro.9Gf.cS.1kvo9AAN6V@mailjet.com>
[not found] ` <ce98776b.9Ro.9Gf.cS.1kvo9AAN6V-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org>
2016-03-29 13:39 ` Local SSD cache for ceph on each compute node Ric Wheeler
[not found] ` <56FA859B.5050608-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-03-29 13:53 ` Nick Fisk
2016-03-29 13:53 ` [ceph-users] " Nick Fisk
[not found] ` <9be96412.9Ro.9Gf.hp.1e3JYGxgKk@mailjet.com>
[not found] ` <9be96412.9Ro.9Gf.hp.1e3JYGxgKk-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org>
2016-03-29 15:35 ` Ric Wheeler
[not found] ` <56FAA0A7.6050009-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-03-30 13:02 ` Nick Fisk
2016-03-30 13:02 ` [ceph-users] " Nick Fisk