* question for the new ceph-osd key/value backend
@ 2013-12-11  2:57 Duan, Jiangang
  2013-12-11  6:09 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Duan, Jiangang @ 2013-12-11  2:57 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org, Sage Weil; +Cc: Duan, Jiangang

Sage,

I have some questions regarding the key/value backend work.

What is the motivation for working on this? (Or, what is the problem we want to solve?)
1) To use the new interface so that we can bypass the OS layers and get lower latency?
2) Or to leverage new primitives, e.g. atomic writes, to simplify the code?

There are several different ways to use future NVM technology: NVM.FILE, NVM.BLOCK, PM.XXX (http://snia.org/sites/default/files/NVMProgrammingModel_v1r10DRAFT.pdf).
Even for openNVM, there are usage models other than k/v.

Do you have any typical usage model for this? 

-jiangang

===================

From: Sage Weil <sage <at> inktank.com>
Subject: new ceph-osd key/value backend
Newsgroups: gmane.comp.file-systems.ceph.devel
Date: 2013-11-09 10:09:52 GMT

I've written up a blueprint with a rough sketch of how to take advantage
of alternative storage interfaces.  I am very happy to see that several of
them have emerged over the past year or two:

 - fusionio's NVMKV is a key/value interface for their flash products
 - seagate's kinetic is a key/value interface for their new ethernet-based
   drive

Also, leveldb is pretty great for many workloads when run on a
traditional disk/fs.

The good news is a lot of the existing work that went into supporting omap
looks to be reusable here.  Some new functionality and refactoring is 
needed, though, particularly when it comes to storing object data (the 
file-like bag of bytes portion) as key/value pairs.
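
As an illustration of that last point, here is a minimal sketch of one way an object's byte payload could be stored as key/value pairs: stripe it into fixed-size chunks keyed by chunk index. This is not taken from the blueprint. The chunk size, the key layout, and the names chunk_key, write_range, and read_range are assumptions made up for this example, and a plain dict stands in for the real store (leveldb, NVMKV, etc.).

```python
CHUNK_SIZE = 4096  # hypothetical stripe size, not a value from the blueprint


def chunk_key(object_id: str, index: int) -> str:
    # Zero-padded hex index so that keys sort in offset order.
    return f"{object_id}/data/{index:016x}"


def write_range(kv: dict, object_id: str, offset: int, data: bytes,
                chunk_size: int = CHUNK_SIZE) -> None:
    """Write data at a byte offset, touching only the affected chunks."""
    pos = 0
    while pos < len(data):
        index, within = divmod(offset + pos, chunk_size)
        n = min(chunk_size - within, len(data) - pos)
        key = chunk_key(object_id, index)
        # Read-modify-write: zero-pad a short or missing chunk up to the
        # write position, then splice the new bytes in.
        buf = bytearray(kv.get(key, b"").ljust(within, b"\0"))
        buf[within:within + n] = data[pos:pos + n]
        kv[key] = bytes(buf)
        pos += n


def read_range(kv: dict, object_id: str, offset: int, length: int,
               chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read length bytes at offset; holes and missing chunks read as zeros.

    The sketch does not track object size, so reads past the end of the
    object also return zeros.
    """
    out = bytearray()
    pos = 0
    while pos < length:
        index, within = divmod(offset + pos, chunk_size)
        n = min(chunk_size - within, length - pos)
        chunk = kv.get(chunk_key(object_id, index), b"")
        out += chunk[within:within + n].ljust(n, b"\0")
        pos += n
    return bytes(out)
```

With this layout a small random overwrite (the rbd case) rewrites only the chunks it touches, at the cost of a read-modify-write per partial chunk. How well that trades off against the chunk size is exactly the kind of thing that would need measuring.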

The blueprint is here:

  http://wiki.ceph.com/01Planning/02Blueprints/Firefly/osd%3A_new_key%2F%2Fvalue_backend



* Re: question for the new ceph-osd key/value backend
  2013-12-11  2:57 question for the new ceph-osd key/value backend Duan, Jiangang
@ 2013-12-11  6:09 ` Sage Weil
  2013-12-11  6:59   ` Duan, Jiangang
  2013-12-11  7:12   ` Mark Kirkwood
  0 siblings, 2 replies; 6+ messages in thread
From: Sage Weil @ 2013-12-11  6:09 UTC (permalink / raw)
  To: Duan, Jiangang; +Cc: ceph-devel@vger.kernel.org

Hi Jiangang,

On Wed, 11 Dec 2013, Duan, Jiangang wrote:
> Sage,
> 
> I have some questions regarding the key/value backend work.
> 
> What is the motivation for working on this? (Or, what is the problem we want to solve?)
> 1) To use the new interface so that we can bypass the OS layers and get lower latency?

That is one part.  The current strategy of layering on top of a file 
system and using a write-ahead journal makes sense given the existing 
linux fs building blocks, but is far from an optimal solution for many 
workloads.  A k/v interface based on something like leveldb probably performs
much better for many small-object use-cases.  Also, a k/v backend can take
advantage of emerging non-block storage interfaces like NVMKV, Kinetic,
new libraries like rocksdb, etc.

> 2) Or to leverage new primitives, e.g. atomic writes, to simplify the code?

That too.  Basically, we are currently doing a lot of work to get what we 
need out of posix, and are paying the price.

> There are several different ways to use future NVM technology:
> NVM.FILE, NVM.BLOCK, PM.XXX
> (http://snia.org/sites/default/files/NVMProgrammingModel_v1r10DRAFT.pdf).
> Even for openNVM, there are usage models other than k/v.
> 
> Do you have any typical usage model for this? 

I wasn't familiar with these; thanks for the reference!  Of these, 
NVM.FILE seems the most interesting (it maps most closely to an object).  
I am predisposed to skepticism when it comes to these sorts of 
standards/API docs that precede an actual implementation, but it is 
encouraging to see some effort here towards a common interface.

In the end, we want to support generic Ceph workloads.  These range from 
rbd block and file type workloads (objects are stripes of files, with 
random bytes rewritten) to omap type workloads (like rgw bucket indices 
that are purely key/value).

I think the first wins would be:

1- a backend that more efficiently handles rgw bucket index workloads
2- a backend that is more efficient for rgw in general (i.e., immutable 
objects)
3- a backend that can handle more general purpose workloads (like rbd and 
cephfs)

and separately,

4- a backend that lets you plug in a next-gen backend beneath it, like 
NVMKV and speedy flash.
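
The layering in point 4 can be sketched as a narrow interface that the OSD code targets, with interchangeable stores behind it. This is only an illustration under assumptions: KeyValueBackend, MemoryBackend, list_bucket, and the index/<bucket>/ key layout are hypothetical names for this example, not Ceph APIs.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Optional, Tuple


class KeyValueBackend(ABC):
    """Hypothetical minimal interface a pluggable k/v store would implement."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...

    @abstractmethod
    def scan(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]:
        """Yield (key, value) pairs whose key starts with prefix, in key order."""


class MemoryBackend(KeyValueBackend):
    """In-memory stand-in; a leveldb- or NVMKV-backed class would slot in here."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

    def scan(self, prefix: bytes) -> Iterator[Tuple[bytes, bytes]]:
        for key in sorted(self._data):
            if key.startswith(prefix):
                yield key, self._data[key]


def list_bucket(backend: KeyValueBackend, bucket: str) -> List[str]:
    """A bucket-index style listing (point 1) written only against the interface."""
    prefix = f"index/{bucket}/".encode()
    return [key[len(prefix):].decode() for key, _ in backend.scan(prefix)]
```

Because list_bucket only sees the abstract interface, swapping the store underneath does not change the calling code, which is the property point 4 is after.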

sage





* RE: question for the new ceph-osd key/value backend
  2013-12-11  6:09 ` Sage Weil
@ 2013-12-11  6:59   ` Duan, Jiangang
  2013-12-11 13:52     ` Mark Nelson
  2013-12-11  7:12   ` Mark Kirkwood
  1 sibling, 1 reply; 6+ messages in thread
From: Duan, Jiangang @ 2013-12-11  6:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Thanks. In general, I think finding one implementation suitable for all usage models (small vs. big, cold vs. hot) is very difficult, so I like the idea of "a backend that lets you plug in a next-gen backend beneath it".
K/V may be a better way to handle many small objects than XFS. However, I am not sure levelDB is the right choice (it is better for writes than for reads), and I am also not sure whether K/V will benefit the RBD workload (given that all objects are 4MB).
Will think more about this and talk with you again. 

-jiangang



* Re: question for the new ceph-osd key/value backend
  2013-12-11  6:09 ` Sage Weil
  2013-12-11  6:59   ` Duan, Jiangang
@ 2013-12-11  7:12   ` Mark Kirkwood
  1 sibling, 0 replies; 6+ messages in thread
From: Mark Kirkwood @ 2013-12-11  7:12 UTC (permalink / raw)
  To: Sage Weil, Duan, Jiangang; +Cc: ceph-devel@vger.kernel.org

On 11/12/13 19:09, Sage Weil wrote:

> That is one part.  The current strategy of layering on top of a file
> system and using a write-ahead journal makes sense given the existing
> linux fs building blocks, but is far from an optimal solution for many
> workloads.  A k/v interface based on something like leveldb probably performs
> much better for many small-object use-cases.  Also, a k/v backend can take
> advantage of emerging non-block storage interfaces like NVMKV, Kinetic,
> new libraries like rocksdb, etc.
>

Yeah, the pluggable backend approach seems like a really good plan.

Regards

Mark



* Re: question for the new ceph-osd key/value backend
  2013-12-11  6:59   ` Duan, Jiangang
@ 2013-12-11 13:52     ` Mark Nelson
  2013-12-12  1:17       ` Duan, Jiangang
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Nelson @ 2013-12-11 13:52 UTC (permalink / raw)
  To: Duan, Jiangang; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On 12/11/2013 12:59 AM, Duan, Jiangang wrote:
> Thanks. In general, I think finding one implementation suitable for all usage models (small vs. big, cold vs. hot) is very difficult, so I like the idea of "a backend that lets you plug in a next-gen backend beneath it".
> K/V may be a better way to handle many small objects than XFS. However, I am not sure levelDB is the right choice (it is better for writes than for reads), and I am also not sure whether K/V will benefit the RBD workload (given that all objects are 4MB).
> Will think more about this and talk with you again.

I have been very interested in this topic recently and have been doing 
some benchmarking with basho's leveldb, hyperdex, and stock leveldb 
implementations.  Each has certain advantages (usually related to 
whatever they advertise, e.g. crc32 for basho), but all seem to have
poor sync read/write performance, both sequential and random.  I want to 
take a look at rocksdb to see how it compares as well.

This whole area is very interesting as there are obvious tradeoffs 
regarding how we do things now vs what we potentially could do down the 
road.  Being able to eliminate POSIX entirely behind the scenes would 
obviously be nice for a lot of reasons.

Mark


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: question for the new ceph-osd key/value backend
  2013-12-11 13:52     ` Mark Nelson
@ 2013-12-12  1:17       ` Duan, Jiangang
  0 siblings, 0 replies; 6+ messages in thread
From: Duan, Jiangang @ 2013-12-12  1:17 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Sage Weil, ceph-devel@vger.kernel.org

Mark, would you like to share your early results with us? :) -jiangang


