How are you using Ceph?

* How are you using Ceph?
@ 2012-09-17 22:14 Ross Turk
  2012-09-17 22:47 ` Nick Couchman
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Ross Turk @ 2012-09-17 22:14 UTC
  To: ceph-devel

Hi, all!

One of the most important parts of Inktank's mission is to spread the
word about Ceph. We want everyone to know what it is and how to use
it.

In order to tell a better story to potential new users, I'm trying to
get a sense for today's deployments. We've spent the last few months
talking to folks around the world, but I'm sure there are a few great
stories we haven't heard yet!

If you've got a spare five minutes, I would love to hear what you're
up to. What kind of projects are you working on, and in what stage?
What is your workload? Are you using Ceph alongside other
technologies? How has your experience been?

This is also a good opportunity for me to introduce myself to those I
haven't met yet! Feel free to copy the list if you think others would
be interested (and you don't mind sharing).

Cheers,
Ross

--
Ross Turk
Ceph Community Guy

"Any sufficiently advanced technology is indistinguishable from magic."
-- Arthur C. Clarke


* Re: How are you using Ceph?
  2012-09-17 22:14 How are you using Ceph? Ross Turk
@ 2012-09-17 22:47 ` Nick Couchman
  2012-09-17 22:53   ` Mark Nelson
  2012-09-18  0:05 ` Smart Weblications GmbH - Florian Wiessner
  2012-09-18 16:01 ` Travis Rhoden
  2 siblings, 1 reply; 29+ messages in thread
From: Nick Couchman @ 2012-09-17 22:47 UTC
  To: Ross Turk, ceph-devel

My use of Ceph is probably fairly unusual in terms of where and how I'm using it.  I run an IT department for a medium-sized engineering firm.  One of my goals is to make the best possible use of the hardware we're deploying to users' desktops.  Users often cannot get by with a thin client and a VM somewhere; they actually need decent hardware on the desktop.  However, when the hardware isn't being used, it's nice to have access to some of the free disk space, I/O bandwidth, memory, and CPU cycles available on it.  So, Ceph is part of an overall strategy for making use of the hardware.  I'm guessing most folks run it on racked servers in datacenters, but I'm distributing it across desktops.

I've started by rolling out Linux to the desktop bare metal rather than Windows.  I run openSUSE 12.1, probably moving to 12.2 in the near future (I have Ceph packages built and available for openSUSE 11.4, 12.1, and 12.2 on my OBS project).  I run the Xen kernel on this hardware so that I can run VMs on top of it for various purposes.  For folks who need Windows, I use Windows-based VMs on Xen.  For those who are comfortable switching between Linux and Windows, I use a Windows VM and then rdesktop to connect from the Linux desktop/window manager.  For those who are only comfortable in Windows, I use VGA and PCI pass-through in Xen to pass the video card and the USB controllers to the Windows guest, making the Linux base install transparent to the end user.

To make use of free CPU cycles, in addition to VMs, I use the latest freely available version of the software formerly known as the Sun Grid Engine to make these desktop systems part of the batch system that allows engineers to run HPC jobs.  They mount various filesystems from our NFS servers, and jobs can execute on these systems on evenings and weekends.

Ceph is a pretty recent addition to these configurations.  I wanted to find an easy way to make use of the free disk space on these systems, but in a useful way that aggregates it all together.  After looking at several distributed filesystems, Ceph came up as the one with the feature set that made the most sense for me.  So, I've spent a bunch of time building packages and testing Ceph, and have finally rolled it out on these two dozen Linux desktops, aggregating 100GB from each desktop's 250GB drive into a single pool that adds up to roughly 2.2TB of raw storage.  I currently keep 3 replicas for all of my pools in Ceph to protect against a desktop machine going down, getting shut down, etc., which does happen from time to time.  So far this has worked out pretty well, and Ceph seems to recover well from these failures, moving blocks to different systems when necessary, then re-doing that when the systems come back online.
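
As a rough illustration of the capacity arithmetic in the paragraph above, here is a minimal Python sketch.  The figure of 24 desktops is an assumption standing in for "two dozen", the function names are hypothetical, and the calculation ignores filesystem overhead, journals, and any headroom Ceph needs for recovery, so treat the results as ballpark numbers only.

```python
def raw_capacity_gb(hosts: int, gb_per_host: float) -> float:
    """Total raw space contributed to the pool, in decimal gigabytes."""
    return hosts * gb_per_host


def usable_capacity_gb(raw_gb: float, replicas: int) -> float:
    """Space left for unique data when every object is stored `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_gb / replicas


def gb_to_tib(gb: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to binary tebibytes (2**40 bytes)."""
    return gb * 10**9 / 2**40


# Assumed inputs: 24 desktops ("two dozen"), 100 GB each, 3 replicas per pool.
raw = raw_capacity_gb(24, 100)
usable = usable_capacity_gb(raw, 3)

print(f"raw:    {raw:.0f} GB ({gb_to_tib(raw):.1f} TiB)")
print(f"usable: {usable:.0f} GB ({gb_to_tib(usable):.2f} TiB)")
```

With these assumed inputs the raw figure is 2400 GB, which is about 2.2 when expressed in binary terabytes; that is one possible reading of the "roughly 2.2TB" above, though the number may simply reflect a slightly different host count.  With 3 replicas, about a third of the raw space (800 GB here) is left for unique data.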

My next steps for this setup, including Ceph, really get into more of a private cloud infrastructure using desktop commodity hardware.  I'd like to be able to install something like OpenStack or the XAPI/XCP software on these systems and centrally manage the aggregated storage along with memory and CPU with a tool like that.  This would give me the ability to deploy these inexpensive systems across the organization while making sure they're used to their full capacity, and it would also allow for great flexibility when users move from machine to machine, or VMs need to move from place to place.  I do keep a lot of my critical infrastructure in my datacenter on more traditional compute systems - a SAN, XenServer, fileservers/NAS with NFS/CIFS, etc. - but this is a good way for me to prove out the usefulness and reliability of systems like Ceph and other cloud-computing concepts and then take those and apply them to increasingly complex and critical needs in my organization.

For Ceph improvements that would help me out, the ability to support POSIX and NFSv4 ACLs would be a fantastic addition.  We use these types of permissions on our main filesystems to control access better than the traditional UGO-style permissions allow, and I already miss them while using Ceph.  Also, I know the concept of deduplication has been discussed, and this, too, would be great.  I was actually wondering about the feasibility of implementing post-processing deduplication on Ceph first, rather than inline deduplication - obviously this increases disk space requirements, since there has to be enough space to store the duplicated data before it is deduplicated, but it still seems to beat no deduplication at all.  Not a huge requirement at this point, but playing with FSs that support deduplication makes me want it everywhere :-).

-Nick

> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--------

This e-mail may contain confidential and privileged material for the sole use of the intended recipient.  If this email is not intended for you, or you are not responsible for the delivery of this message to the intended recipient, please note that this message may contain SEAKR Engineering (SEAKR) Privileged/Proprietary Information.  In such a case, you are strictly prohibited from downloading, photocopying, distributing or otherwise using this message, its contents or attachments in any way.  If you have received this message in error, please notify us immediately by replying to this e-mail and delete the message from your mailbox.  Information contained in this message that does not relate to the business of SEAKR is neither endorsed by nor attributable to SEAKR.


* Re: How are you using Ceph?
  2012-09-17 22:47 ` Nick Couchman
@ 2012-09-17 22:53   ` Mark Nelson
  2012-09-17 23:26     ` John Axel Eriksson
  0 siblings, 1 reply; 29+ messages in thread
From: Mark Nelson @ 2012-09-17 22:53 UTC
  To: Nick Couchman; +Cc: Ross Turk, ceph-devel

Hi Nick,

All I have to say is that this is totally awesome and scary at the same time. :)

Glad to hear that it recovers well when people shut their desktops off!

Mark


* Re: How are you using Ceph?
  2012-09-17 22:53   ` Mark Nelson
@ 2012-09-17 23:26     ` John Axel Eriksson
  2012-09-18  7:47       ` Plaetinck, Dieter
  0 siblings, 1 reply; 29+ messages in thread
From: John Axel Eriksson @ 2012-09-17 23:26 UTC
  To: Mark Nelson; +Cc: Nick Couchman, Ross Turk, ceph-devel

Our use of Ceph started pretty recently (this summer). We only use
RADOS together with radosgw. We moved from another distributed
storage solution that had failed us more than once, causing us to lose
data. Since the old system had an HTTP interface (not S3-compatible,
though), we looked around for a similar system. In the end we chose
Ceph since it had been in development for quite some time, had been
incorporated into the kernel (well, the client for the fs, that is),
and recently got a company behind it. Ceph felt pretty solid, even
though it's still early days, I guess.

We obviously liked the fact that it has an S3-compatible interface,
especially since we started backing up data to Amazon S3 some time ago
- having the same interface simplified our client code tremendously.
We don't actually need extreme throughput (yet, anyway :-) but we do
need replication. We're quite happy with the performance so far since
it's better than our old system.

We store medical data for archival and conversion to and from
different formats. Since we (after previous failures in the old
storage system) store everything in Amazon S3 as well, we made a bet
on kernel 3.5 and Btrfs with compression for some quite dramatic space
savings - the data we store often compresses really well. So far we
haven't regretted that choice, but we've only been running it in
production for about two months while slowly phasing out the old
storage system.
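
To illustrate why a shared S3-compatible interface can keep client
code simple, here is a hedged Python sketch in which the same client
construction serves both Amazon S3 and a radosgw endpoint. It assumes
the third-party boto3 library, which is not mentioned in this thread;
the endpoint URL, credentials, and helper names are hypothetical
placeholders, and this is not the client code described above.

```python
from typing import Optional


def s3_client_kwargs(access_key: str, secret_key: str,
                     endpoint_url: Optional[str] = None) -> dict:
    """Build keyword arguments for boto3.client("s3", ...).

    With endpoint_url=None the client talks to Amazon S3; with a URL it
    talks to whatever S3-compatible service listens there (e.g. radosgw).
    """
    kwargs = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    if endpoint_url is not None:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def make_s3_client(access_key: str, secret_key: str,
                   endpoint_url: Optional[str] = None):
    """Create an S3 client; boto3 is imported lazily so the helper above
    can be used without the library installed."""
    import boto3  # third-party dependency (an assumption, not from the thread)
    return boto3.client(
        "s3", **s3_client_kwargs(access_key, secret_key, endpoint_url))


# Hypothetical placeholders: replace with real credentials and hostnames.
amazon_kwargs = s3_client_kwargs("AKIA_EXAMPLE", "secret")
radosgw_kwargs = s3_client_kwargs("RGW_EXAMPLE", "secret",
                                  endpoint_url="http://radosgw.example.com")
```

Everything after client creation (for example put_object and
get_object calls) can then be shared between the two targets, subject
to whichever S3 features the gateway implements.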

John



* Re: How are you using Ceph?
@ 2012-09-17 23:55 Nick Couchman
  0 siblings, 0 replies; 29+ messages in thread
From: Nick Couchman @ 2012-09-17 23:55 UTC
  To: mark.nelson; +Cc: ross, ceph-devel

We actually ask people not to shut off their desktops, so it doesn't happen very often :-).  Also, I run the MDS and MON systems inside my datacenter, so only the OSDs are out there on the desktops.

-Nick

>>> Mark Nelson  09/17/12 4:53 PM >>>
Hi Nick,

All I have to say, is that is totally awesome and scary at the same time. :)

Glad to hear that it recovers well when people shut their desktops off!

Mark

On 09/17/2012 05:47 PM, Nick Couchman wrote:
> My use of Ceph is probably pretty unique in some of the aspects of where/how I'm using it.  I run an IT department for a medium-sized engineering firm.  One of my goals is to try to make the best possible use of the hardware we're deploying to users' desktops.  Often times users cannot get by with a thin client and a VM somewhere, they actually need decent hardware on the desktop.  However, when the hardware isn't being used, it's nice to be able to have access to some of the free disk space, I/O bandwidth, memory, and CPU cycles available on the hardware.  So, Ceph is part of an overall strategy for making use of the hardware.  I'm guessing most folks run it on racked servers in datacenters, but I'm distributing it across desktops.
>
> I've started by rolling out Linux to the desktop bare metal rather than Windows.  I run openSuSE 12.1, probably moving to 12.2 here in the near-future (I have Ceph packages available and built for openSuSE 11.4, 12.1, and 12.2 on my OBS project).  I run the Xen kernel on this hardware so that I can run VMs on top of it for various purposes.  For folks who need Windows, I use Windows-based VMs on Xen.  For the types who are comfortable with switching between Linux and Windows, I use a Windows VM and then rdesktop to connect from the Linux desktop/window manager.  For the types who are only comfortable in Windows, I use VGA and PCI pass-through in Xen to pass the video card and the USB controllers to the Windows guest, making the Linux base install transparent to the end-user.
>
> To make use of free CPU cycles, in addition to VMs, I use the latest freely-available version of the software formerly known as the Sun Grid Engine to make these desktop systems part of the batching system that allows engineers to run HPC jobs.  They mount various filesystems from our NFS servers and jobs can execute on these systems on evenings and weekends.
>
> Ceph is a pretty recent addition to these configurations.  I wanted to find an easy way to make use of the free disk space on these systems, but in a useful way that aggregates it all together.  After looking at several distributed filesystems, Ceph came up as the one with the feature sets that made the most sense for me.  So, I've spent a bunch of time building packages, testing out Ceph, and have finally rolled it out on these two dozen Linux desktops, aggregating 100GB from each desktop's 250GB drive into a single pool that adds up to roughly 2.2TB of raw storage.  I currently do 3 replications for all of my pools in Ceph to try to protect against a desktop machine going down, getting shut down, etc., which does happen from time-to-time.  So far this has worked out pretty well, and Ce
 ph seems to recover pretty well from these failures, moving blocks to different systems when necessary, then re-doing that when the systems come back online.
>
> My next steps for this setup, including Ceph, really get into more of a private cloud infrastructure using desktop commodity hardware.  I'd like to be able to install something like Openstack or the XAPI/XCP software on these systems and centrally manage the aggregated storage along with memory and CPU with a tool like that.  This would give me the ability to deploy these inexpensive systems across the organization, but make sure they're used to their best capacity, and it also allows for great flexibility when users move from machine to machine, or VMs need to move from place to place.  I do keep a lot of my critical infrastructure in my datacenter on more traditional compute systems - a SAN, XenServer, fileservers/NAS with NFS/CIFS, etc. - but this is a good way for me to prove out the
  usefulness and reliability of systems like Ceph and other cloud-computing concepts and then take those and apply them to increasingly complex and critical needs in my organization.
>
> For Ceph improvements that would help me out, the ability to support POSIX and NFSv4 ACLs would be a fantastic addition.  We use these types of permissions on our main filesystems to control access better than the traditional UGO-style permissions, and I already miss it while using Ceph.  Also, I know the concept of deduplication has been discussed, and this, too, would be great.  I was actually wondering about the feasibility of implementing post-processing deduplication on Ceph, first, rather than inline deduplication - obviously this increases disk space requirements since there has to be enough to store the duplicated data, but still seems to beat no deduplication at all.  Not a huge requirement at this point, but playing with FSs that support deduplication makes me want it everywher
 e :-).
>
> -Nick
>
>>>> On 2012/09/17 at 16:14, Ross Turk  wrote:
>> Hi, all!
>>
>> One of the most important parts of Inktank's mission is to spread the
>> word about Ceph. We want everyone to know what it is and how to use
>> it.
>>
>> In order to tell a better story to potential new users, I'm trying to
>> get a sense for today's deployments. We've spent the last few months
>> talking to folks around the world, but I'm sure there are a few great
>> stories we haven't heard yet!
>>
>> If you've got a spare five minutes, I would love to hear what you're
>> up to. What kind of projects are you working on, and in what stage?
>> What is your workload? Are you using Ceph alongside other
>> technologies? How has your experience been?
>>
>> This is also a good opportunity for me to introduce myself to those I
>> haven't met yet! Feel free to copy the list if you think others would
>> be interested (and you don't mind sharing).
>>
>> Cheers,
>> Ross
>>
>> --
>> Ross Turk
>> Ceph Community Guy
>>
>> "Any sufficiently advanced technology is indistinguishable from magic."
>> -- Arthur C. Clarke




--------

This e-mail may contain confidential and privileged material for the sole use of the intended recipient.  If this email is not intended for you, or you are not responsible for the delivery of this message to the intended recipient, please note that this message may contain SEAKR Engineering (SEAKR) Privileged/Proprietary Information.  In such a case, you are strictly prohibited from downloading, photocopying, distributing or otherwise using this message, its contents or attachments in any way.  If you have received this message in error, please notify us immediately by replying to this e-mail and delete the message from your mailbox.  Information contained in this message that does not relate to the business of SEAKR is neither endorsed by nor attributable to SEAKR.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
@ 2012-09-17 23:57 Nick Couchman
  2012-09-18  6:35 ` John Axel Eriksson
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Couchman @ 2012-09-17 23:57 UTC (permalink / raw)
  To: mark.nelson, john; +Cc: ross, ceph-devel

John,
I'd be really interested to hear how Btrfs goes over time.  I tried it out a few kernel versions ago and regretted it - lost some data after using it.  Hopefully the stability is better than it was before, and inline compression is always great!

-Nick

>>> John Axel Eriksson  09/17/12 5:26 PM >>>
Our use of Ceph started pretty recently (this summer). We only use
rados together with the radosgw. We moved from another distributed
storage solution that had failed us more than once and we lost data.
Since the old system had an http interface (not S3 compatible though)
we looked around for another similar system. In the end we chose Ceph
since it had been in development for quite some time, had been
incorporated in the kernel (well, the client for the fs that is) and
recently got a company behind it. Ceph felt pretty solid, even though
it's still early days I guess.

We obviously liked the fact that it has an S3 compatible interface,
especially since we started backing up data to Amazon S3 some time ago
- having the same interface simplified our client code tremendously.
We don't actually need extreme throughput (yet anyway :-) but we do
need replication. We're quite happy with the performance so far since
it's better than our old system.
We store medical data for archival and conversion from and to
different formats. Since we (after previous failures in the old
storage system) store everything in Amazon S3 as well, we made a bet
on Kernel 3.5 and Btrfs with compression for some quite dramatic space
savings - the data we store often compresses really well. So far we
haven't regretted that choice, but we've only been running it in
production for about two months while slowly phasing out the old
storage system.

John

On Tue, Sep 18, 2012 at 12:53 AM, Mark Nelson  wrote:
> Hi Nick,
>
> All I have to say, is that is totally awesome and scary at the same time. :)
>
> Glad to hear that it recovers well when people shut their desktops off!
>
> Mark
>
>
> On 09/17/2012 05:47 PM, Nick Couchman wrote:
>>
>> My use of Ceph is probably pretty unique in some of the aspects of
>> where/how I'm using it.  I run an IT department for a medium-sized
>> engineering firm.  One of my goals is to try to make the best possible use
>> of the hardware we're deploying to users' desktops.  Often times users
>> cannot get by with a thin client and a VM somewhere, they actually need
>> decent hardware on the desktop.  However, when the hardware isn't being
>> used, it's nice to be able to have access to some of the free disk space,
>> I/O bandwidth, memory, and CPU cycles available on the hardware.  So, Ceph
>> is part of an overall strategy for making use of the hardware.  I'm guessing
>> most folks run it on racked servers in datacenters, but I'm distributing it
>> across desktops.
>>
>> I've started by rolling out Linux to the desktop bare metal rather than
>> Windows.  I run openSuSE 12.1, probably moving to 12.2 here in the
>> near-future (I have Ceph packages available and built for openSuSE 11.4,
>> 12.1, and 12.2 on my OBS project).  I run the Xen kernel on this hardware so
>> that I can run VMs on top of it for various purposes.  For folks who need
>> Windows, I use Windows-based VMs on Xen.  For the types who are comfortable
>> with switching between Linux and Windows, I use a Windows VM and then
>> rdesktop to connect from the Linux desktop/window manager.  For the types
>> who are only comfortable in Windows, I use VGA and PCI pass-through in Xen
>> to pass the video card and the USB controllers to the Windows guest, making
>> the Linux base install transparent to the end-user.
>>
>> To make use of free CPU cycles, in addition to VMs, I use the latest
>> freely-available version of the software formerly known as the Sun Grid
>> Engine to make these desktop systems part of the batching system that allows
>> engineers to run HPC jobs.  They mount various filesystems from our NFS
>> servers and jobs can execute on these systems on evenings and weekends.
>>
>> Ceph is a pretty recent addition to these configurations.  I wanted to
>> find an easy way to make use of the free disk space on these systems, but in
>> a useful way that aggregates it all together.  After looking at several
>> distributed filesystems, Ceph came up as the one with the feature sets that
>> made the most sense for me.  So, I've spent a bunch of time building
>> packages, testing out Ceph, and have finally rolled it out on these two
>> dozen Linux desktops, aggregating 100GB from each desktop's 250GB drive into
>> a single pool that adds up to roughly 2.2TB of raw storage.  I currently do
>> 3 replications for all of my pools in Ceph to try to protect against a
>> desktop machine going down, getting shut down, etc., which does happen from
>> time-to-time.  So far this has worked out pretty well, and Ceph seems to
>> recover pretty well from these failures, moving blocks to different systems
>> when necessary, then re-doing that when the systems come back online.
>>
>> My next steps for this setup, including Ceph, really get into more of a
>> private cloud infrastructure using desktop commodity hardware.  I'd like to
>> be able to install something like Openstack or the XAPI/XCP software on
>> these systems and centrally manage the aggregated storage along with memory
>> and CPU with a tool like that.  This would give me the ability to deploy
>> these inexpensive systems across the organization, but make sure they're
>> used to their best capacity, and it also allows for great flexibility when
>> users move from machine to machine, or VMs need to move from place to place.
>> I do keep a lot of my critical infrastructure in my datacenter on more
>> traditional compute systems - a SAN, XenServer, fileservers/NAS with
>> NFS/CIFS, etc. - but this is a good way for me to prove out the usefulness
>> and reliability of systems like Ceph and other cloud-computing concepts and
>> then take those and apply them to increasingly complex and critical needs in
>> my organization.
>>
>> For Ceph improvements that would help me out, the ability to support POSIX
>> and NFSv4 ACLs would be a fantastic addition.  We use these types of
>> permissions on our main filesystems to control access better than the
>> traditional UGO-style permissions, and I already miss it while using Ceph.
>> Also, I know the concept of deduplication has been discussed, and this, too,
>> would be great.  I was actually wondering about the feasibility of
>> implementing post-processing deduplication on Ceph, first, rather than
>> inline deduplication - obviously this increases disk space requirements
>> since there has to be enough to store the duplicated data, but still seems
>> to beat no deduplication at all.  Not a huge requirement at this point, but
>> playing with FSs that support deduplication makes me want it everywhere :-).
>>
>> -Nick
>>
>>>>> On 2012/09/17 at 16:14, Ross Turk  wrote:
>>>
>>> Hi, all!
>>>
>>> One of the most important parts of Inktank's mission is to spread the
>>> word about Ceph. We want everyone to know what it is and how to use
>>> it.
>>>
>>> In order to tell a better story to potential new users, I'm trying to
>>> get a sense for today's deployments. We've spent the last few months
>>> talking to folks around the world, but I'm sure there are a few great
>>> stories we haven't heard yet!
>>>
>>> If you've got a spare five minutes, I would love to hear what you're
>>> up to. What kind of projects are you working on, and in what stage?
>>> What is your workload? Are you using Ceph alongside other
>>> technologies? How has your experience been?
>>>
>>> This is also a good opportunity for me to introduce myself to those I
>>> haven't met yet! Feel free to copy the list if you think others would
>>> be interested (and you don't mind sharing).
>>>
>>> Cheers,
>>> Ross
>>>
>>> --
>>> Ross Turk
>>> Ceph Community Guy
>>>
>>> "Any sufficiently advanced technology is indistinguishable from magic."
>>> -- Arthur C. Clarke
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-17 22:14 How are you using Ceph? Ross Turk
  2012-09-17 22:47 ` Nick Couchman
@ 2012-09-18  0:05 ` Smart Weblications GmbH - Florian Wiessner
  2012-09-18  0:18   ` Tren Blackburn
  2012-09-18 16:01 ` Travis Rhoden
  2 siblings, 1 reply; 29+ messages in thread
From: Smart Weblications GmbH - Florian Wiessner @ 2012-09-18  0:05 UTC (permalink / raw)
  To: Ross Turk, ceph-devel


Hi,

I use Ceph to provide storage via RBD for our virtualization cluster, delivering
KVM-based high-availability virtual machines to my customers. I also use it
as an RBD device with OCFS2 on top of it for a 4-node webserver cluster as shared
storage. I do this because, unfortunately, cephfs is not ready yet ;)


On 18.09.2012 00:14, Ross Turk wrote:
> Hi, all!
> 
> One of the most important parts of Inktank's mission is to spread the
> word about Ceph. We want everyone to know what it is and how to use
> it.
> 
> In order to tell a better story to potential new users, I'm trying to
> get a sense for today's deployments. We've spent the last few months
> talking to folks around the world, but I'm sure there are a few great
> stories we haven't heard yet!
> 
> If you've got a spare five minutes, I would love to hear what you're
> up to. What kind of projects are you working on, and in what stage?
> What is your workload? Are you using Ceph alongside other
> technologies? How has your experience been?
> 
> This is also a good opportunity for me to introduce myself to those I
> haven't met yet! Feel free to copy the list if you think others would
> be interested (and you don't mind sharing).
> 



-- 

Best regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

phone: +49 9282 9638 200
fax:   +49 9282 9638 205
24/7:  +49 900 144 000 00 - 0.99 EUR/min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register no.: HRB 3840, Amtsgericht Hof (district court)
*from German landlines; prices from mobile networks may differ

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  0:05 ` Smart Weblications GmbH - Florian Wiessner
@ 2012-09-18  0:18   ` Tren Blackburn
  2012-09-18  2:32     ` Sage Weil
  0 siblings, 1 reply; 29+ messages in thread
From: Tren Blackburn @ 2012-09-18  0:18 UTC (permalink / raw)
  To: f.wiessner; +Cc: Ross Turk, ceph-devel

On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
Wiessner <f.wiessner@smart-weblications.de> wrote:
>
> Hi,
>
> I use Ceph to provide storage via RBD for our virtualization cluster, delivering
> KVM-based high-availability virtual machines to my customers. I also use it
> as an RBD device with OCFS2 on top of it for a 4-node webserver cluster as shared
> storage. I do this because, unfortunately, cephfs is not ready yet ;)
>
Hi Florian;

When you say "cephfs is not ready yet", what parts about it are not
ready? There are vague rumblings about that in general, but I'd love
to see specific issues. I understand multiple *active* mds's are not
supported, but what other issues are you aware of?

And if there's a page documenting this already, I apologize...and
would appreciate a link :)

t.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  0:18   ` Tren Blackburn
@ 2012-09-18  2:32     ` Sage Weil
  2012-09-18 11:48       ` Smart Weblications GmbH - Florian Wiessner
  2012-09-18 16:35       ` Tren Blackburn
  0 siblings, 2 replies; 29+ messages in thread
From: Sage Weil @ 2012-09-18  2:32 UTC (permalink / raw)
  To: Tren Blackburn; +Cc: f.wiessner, Ross Turk, ceph-devel

On Mon, 17 Sep 2012, Tren Blackburn wrote:
> On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
> Wiessner <f.wiessner@smart-weblications.de> wrote:
> >
> > Hi,
> >
> > I use Ceph to provide storage via RBD for our virtualization cluster, delivering
> > KVM-based high-availability virtual machines to my customers. I also use it
> > as an RBD device with OCFS2 on top of it for a 4-node webserver cluster as shared
> > storage. I do this because, unfortunately, cephfs is not ready yet ;)
> >
> Hi Florian;
> 
> When you say "cephfs is not ready yet", what parts about it are not
> ready? There are vague rumblings about that in general, but I'd love
> to see specific issues. I understand multiple *active* mds's are not
> supported, but what other issues are you aware of?

Inktank is not yet supporting it because we do not have the QA in place 
and general hardening that will make us feel comfortable recommending it 
for customers.  That said, it works pretty well for most workloads.  In 
particular, if you stay away from the snapshots and multi-mds, you should 
be quite stable.

The engineering team here is about to do a bit of a pivot and refocus on 
the file system now that the object store and RBD are in pretty good 
shape.  That will mean both core fs/mds stability and features as well as 
integration efforts (NFS/CIFS/Hadoop).

'Ready' is in the eye of the beholder.  There are a few people using the 
fs successfully in production, but not too many.

sage

 > 
> And if there's a page documenting this already, I apologize...and
> would appreciate a link :)
> 
> t.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
       [not found] <1784724793.100.1347938272315.JavaMail.root@thunderbeast.private.linuxbox.com>
@ 2012-09-18  3:19 ` Matt W. Benjamin
  2012-09-18  5:44   ` Ian Pye
  2012-09-18 16:13   ` Sage Weil
  0 siblings, 2 replies; 29+ messages in thread
From: Matt W. Benjamin @ 2012-09-18  3:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: f wiessner, Ross Turk, ceph-devel, Tren Blackburn

Hi

Just FYI, on the NFS integration front: a pNFS files-layout (RFC 5661) capable NFSv4 re-exporter for Ceph has been committed to the Ganesha NFSv4 server development branch.  We're continuing to enhance and extend it.  Contributing our Ceph client library changes back has been on our (full) plates for a while.  We've finished pulling up and rebasing these changes and are doing some final testing of a couple of things before pushing a branch for review.

Regards,

Matt

----- "Sage Weil" <sage@inktank.com> wrote:

> On Mon, 17 Sep 2012, Tren Blackburn wrote:
> > On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
> > Wiessner <f.wiessner@smart-weblications.de> wrote:
> > >
> > > Hi,
> > >
> > > I use Ceph to provide storage via RBD for our virtualization cluster, delivering
> > > KVM-based high-availability virtual machines to my customers. I also use it
> > > as an RBD device with OCFS2 on top of it for a 4-node webserver cluster as shared
> > > storage. I do this because, unfortunately, cephfs is not ready yet ;)
> > >
> > Hi Florian;
> > 
> > When you say "cephfs is not ready yet", what parts about it are not
> > ready? There are vague rumblings about that in general, but I'd love
> > to see specific issues. I understand multiple *active* mds's are not
> > supported, but what other issues are you aware of?
> 
> Inktank is not yet supporting it because we do not have the QA in place
> and general hardening that will make us feel comfortable recommending it
> for customers.  That said, it works pretty well for most workloads.  In
> particular, if you stay away from the snapshots and multi-mds, you should
> be quite stable.
>
> The engineering team here is about to do a bit of a pivot and refocus on
> the file system now that the object store and RBD are in pretty good
> shape.  That will mean both core fs/mds stability and features as well as
> integration efforts (NFS/CIFS/Hadoop).
>
> 'Ready' is in the eye of the beholder.  There are a few people using the
> fs successfully in production, but not too many.
> 
> sage
> 
> 
>  > 
> > And if there's a page documenting this already, I apologize...and
> > would appreciate a link :)
> > 
> > t.

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  3:19 ` Matt W. Benjamin
@ 2012-09-18  5:44   ` Ian Pye
  2012-09-18  6:06     ` Yehuda Sadeh
  2012-09-18 16:13   ` Sage Weil
  1 sibling, 1 reply; 29+ messages in thread
From: Ian Pye @ 2012-09-18  5:44 UTC (permalink / raw)
  To: ceph-devel

I'm looking at building an HBase/Bigtable-style key-value store on top
of Ceph's omap abstraction of LevelDB. The plan is to use this for log
storage at first. Writes use libradospp, with individual log lines
serialized via MessagePack and then stored as omap values.  Omap keys
are strings that group related data together under a shared prefix.

Reads are exposed through a custom FUSE integration that supports query
parameters separated by the # token, like so:

cat /cf/adefs/logger/pg/data/2012-08-28/OID/2/1346191920#pr=N:1015438#lm=1#fr=json

[{"bcktime":"0.000","bcktype":"BCK_C1","bytes_bck":"2196","bytes_dlv":"2196","cachestat":"HIT","chktime":"0.000","chktimestamp":"1346192219","country":"CA","dc_old":"IMAGE","dlvtime":"0.002","doctype":"IMAGE","domuid":"df475bc52ab9f7b546ef60a8e2803bca61343075938","dw_key":"N:1015438:208.69.:IMAGE:14f1-1343075940.119-10-115680413","host":"www.forum.immigrer.com","hoststat":"200","http_method":"GET","http_proto":"HTTP/1.1","id":"14f1-1343075940.119-10-115680413","iptype":"CLEAN","ownerid":"226010","path_op":"WL","path_src":"MACRO","path_stat":"NR","rmoteip":"208.69.11.150","seclvl":"eoff","servnmdlv":"14f1","servnmflc":"14f1","uag":"Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0.1","url":"/icon-16.png","zone_plan":"pro","zoneid":"1015438","zonename":"immigrer.com"},{"bytes_bck":2196,"bytes_dlv":2196,"cfbb":0,"flstat":470,"hoststat":200,"isbot":0,"missing_dlv":0,"ownerid":226010,"s404":0,"upstat":0,"zoneid":1015438}]

Passing in a pr parameter downloads only those keys matching the
prefix specified. fr is a format (json or msgpack) and lm is a limit.
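
As a rough illustration of the query syntax described above (this is not
the actual FUSE code, which is not shown in this thread), the sketch below
splits a path on the # token and applies the pr, lm and fr parameters to an
in-memory dict standing in for an object's omap. The names parse_query and
read_omap, the sample keys, and the choice to hold already-decoded values
instead of msgpack blobs are all assumptions made for the example.

```python
# Illustrative only: a plain dict stands in for an object's omap, and
# parse_query/read_omap are hypothetical names, not the real FUSE layer.
import json


def parse_query(path):
    """Split 'object/path#pr=...#lm=...#fr=...' into (object_path, params)."""
    object_path, *tokens = path.split("#")
    params = dict(token.split("=", 1) for token in tokens if token)
    return object_path, params


def read_omap(omap, params):
    """Return values whose key starts with 'pr', capped at 'lm' entries."""
    prefix = params.get("pr", "")
    limit = int(params["lm"]) if "lm" in params else None
    # Keys are visited in sorted order so the result is deterministic.
    matches = [omap[key] for key in sorted(omap) if key.startswith(prefix)]
    if limit is not None:
        matches = matches[:limit]
    if params.get("fr", "json") != "json":
        raise ValueError("only fr=json is handled in this sketch")
    return json.dumps(matches)


# Made-up sample data; real values would be msgpack-encoded log lines.
omap = {
    "N:1015438:a": {"bytes_dlv": 2196},
    "N:1015438:b": {"bytes_dlv": 512},
    "N:2000000:a": {"bytes_dlv": 100},
}
path, params = parse_query(
    "/cf/adefs/logger/pg/data/2012-08-28/OID/2/1346191920#pr=N:1015438#lm=1#fr=json"
)
result = read_omap(omap, params)
```

Keeping the parameters in the path itself means ordinary tools such as cat
can issue a filtered read without any special client.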

I'm also working with the CLS framework to compute aggregate values
(for example the average request size) on the OSDs directly.
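
To show why computing aggregates on the OSDs is attractive, here is a small
sketch of the data flow only. Real Ceph object classes are compiled code
loaded by the OSD, so this Python is purely a stand-in, and partial_sum,
merge_average and the sample records are hypothetical. Each storage node
reduces its own records to a (total, count) pair, and the client merges
those pairs instead of pulling every log line over the network.

```python
# Hypothetical sketch: each "OSD" reduces its own records to (total, count)
# so only two numbers per OSD travel back to the client.
def partial_sum(records, field="bytes_dlv"):
    """Reduce one node's records to (sum of field, number of records with it)."""
    values = [int(record[field]) for record in records if field in record]
    return sum(values), len(values)


def merge_average(partials):
    """Combine per-node (total, count) pairs into one overall average."""
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Made-up records; values may arrive as strings or integers.
osd_a = [{"bytes_dlv": "2196"}, {"bytes_dlv": "1000"}]
osd_b = [{"bytes_dlv": 804}]
average = merge_average([partial_sum(osd_a), partial_sum(osd_b)])
```

Averaging the per-node averages would give the wrong answer when nodes hold
different numbers of records, which is why the sketch ships sums and counts.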

A further level of abstraction is provided by a Postgres-to-Ceph
binding that exposes omap values as Postgres hstores. This allows
Postgres queries like:

select kc_hstore->'uag' as user_agent, count(*) as cn from
kc_hstore('1346191920', '2012-08-28/OID/1', 'N:') group by user_agent
order by cn desc limit 10;

                                                      user_agent                                                       |  cn
-----------------------------------------------------------------------------------------------------------------------+------
 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11          | 1717
 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1                                            |  862
 Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)                                                |  837
 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11                 |  504
 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11                 |  332
 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1                                            |  312
 Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11                 |  256
 Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)                                                       |  220
 Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0.1                                                   |  178
 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 |  172
(10 rows)

Here, we get the top 10 most common user agents seen for a given time
range and data shard.
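
For readers less familiar with SQL, the grouping in the query above can be
approximated in a few lines of Python. This is only a sketch of the
group-and-count step under assumed inputs: top_user_agents and the sample
rows are made up, and the real query runs inside Postgres against the
kc_hstore binding. One difference to note is that the sketch skips rows
without a uag key, whereas the SQL would count them under a NULL group.

```python
# Sketch of "group by user agent, order by count desc, limit N".
from collections import Counter


def top_user_agents(rows, limit=10):
    """Count rows per 'uag' value and return the most frequent ones."""
    counts = Counter(row["uag"] for row in rows if "uag" in row)
    return counts.most_common(limit)


# Made-up rows; each dict plays the role of one decoded omap value.
rows = [
    {"uag": "Firefox/14.0.1"},
    {"uag": "Chrome/20.0.1132.57"},
    {"uag": "Firefox/14.0.1"},
    {"zoneid": 1015438},
]
top = top_user_agents(rows, limit=1)
```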

Currently using xfs, as I too have been bitten by btrfs.



On Mon, Sep 17, 2012 at 8:19 PM, Matt W. Benjamin <matt@linuxbox.com> wrote:
> Hi
>
> Just FYI, on the NFS integration front: a pNFS files-layout (RFC 5661) capable NFSv4 re-exporter for Ceph has been committed to the Ganesha NFSv4 server development branch.  We're continuing to enhance and extend it.  Contributing our Ceph client library changes back has been on our (full) plates for a while.  We've finished pulling up and rebasing these changes and are doing some final testing of a couple of things before pushing a branch for review.
>
> Regards,
>
> Matt
>
> ----- "Sage Weil" <sage@inktank.com> wrote:
>
>> On Mon, 17 Sep 2012, Tren Blackburn wrote:
>> > On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
>> > Wiessner <f.wiessner@smart-weblications.de> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I use Ceph to provide storage via RBD for our virtualization cluster, delivering
>> > > KVM-based high-availability virtual machines to my customers. I also use it
>> > > as an RBD device with OCFS2 on top of it for a 4-node webserver cluster as shared
>> > > storage. I do this because, unfortunately, cephfs is not ready yet ;)
>> > >
>> > Hi Florian;
>> >
>> > When you say "cephfs is not ready yet", what parts about it are not
>> > ready? There are vague rumblings about that in general, but I'd love
>> > to see specific issues. I understand multiple *active* mds's are not
>> > supported, but what other issues are you aware of?
>>
>> Inktank is not yet supporting it because we do not have the QA in place
>> and general hardening that will make us feel comfortable recommending it
>> for customers.  That said, it works pretty well for most workloads.  In
>> particular, if you stay away from the snapshots and multi-mds, you should
>> be quite stable.
>>
>> The engineering team here is about to do a bit of a pivot and refocus on
>> the file system now that the object store and RBD are in pretty good
>> shape.  That will mean both core fs/mds stability and features as well as
>> integration efforts (NFS/CIFS/Hadoop).
>>
>> 'Ready' is in the eye of the beholder.  There are a few people using the
>> fs successfully in production, but not too many.
>>
>> sage
>>
>>
>>  >
>> > And if there's a page documenting this already, I apologize...and
>> > would appreciate a link :)
>> >
>> > t.
>
> --
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
>
> http://linuxbox.com
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  5:44   ` Ian Pye
@ 2012-09-18  6:06     ` Yehuda Sadeh
  0 siblings, 0 replies; 29+ messages in thread
From: Yehuda Sadeh @ 2012-09-18  6:06 UTC (permalink / raw)
  To: Ian Pye; +Cc: ceph-devel

We've had a similar idea in mind for a while now. The thought was to
add key-value support that leverages omaps and expose it
through the RESTful RADOS gateway. Having a real-world use for it will
certainly help in understanding the requirements.

Yehuda

On Mon, Sep 17, 2012 at 10:44 PM, Ian Pye <ianpye@gmail.com> wrote:
> I'm looking at building an HBase/Bigtable-style key-value store on top
> of Ceph's omap abstraction of LevelDB. The plan is to use this for log
> storage at first. Writes use libradospp, with individual log lines
> serialized via MessagePack and then stored as omap values.  Omap keys
> are strings that group related data together under a shared prefix.
>
> Reads are exposed through a custom FUSE integration that supports query
> parameters separated by the # token, like so:
>
> cat /cf/adefs/logger/pg/data/2012-08-28/OID/2/1346191920#pr=N:1015438#lm=1#fr=json
>
> [{"bcktime":"0.000","bcktype":"BCK_C1","bytes_bck":"2196","bytes_dlv":"2196","cachestat":"HIT","chktime":"0.000","chktimestamp":"1346192219","country":"CA","dc_old":"IMAGE","dlvtime":"0.002","doctype":"IMAGE","domuid":"df475bc52ab9f7b546ef60a8e2803bca61343075938","dw_key":"N:1015438:208.69.:IMAGE:14f1-1343075940.119-10-115680413","host":"www.forum.immigrer.com","hoststat":"200","http_method":"GET","http_proto":"HTTP/1.1","id":"14f1-1343075940.119-10-115680413","iptype":"CLEAN","ownerid":"226010","path_op":"WL","path_src":"MACRO","path_stat":"NR","rmoteip":"208.69.11.150","seclvl":"eoff","servnmdlv":"14f1","servnmflc":"14f1","uag":"Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0.1","url":"/icon-16.png","zone_plan":"pro","zoneid":"1015438","zonename":"immigrer.com"},{"bytes_bck":2196,"bytes_dlv":2196,"cfbb":0,"flstat":470,"hoststat":200,"isbot":0,"missing_dlv":0,"ownerid":226010,"s404":0,"upstat":0,"zoneid":1015438}]
>
> Passing in a pr parameter downloads only those keys matching the
> prefix specified. fr is a format (json or msgpack) and lm is a limit.
>
> I'm also working with the CLS framework to compute aggregate values
> (for example the average request size) on the OSDs directly.
>
> A further level of abstraction is provided by a Postgres-to-Ceph
> binding that exposes omap values as Postgres hstores. This allows
> Postgres queries like:
>
> select kc_hstore->'uag' as user_agent, count(*) as cn from
> kc_hstore('1346191920', '2012-08-28/OID/1', 'N:') group by user_agent
> order by cn desc limit 10;
>
>                                                       user_agent                                                       |  cn
> -----------------------------------------------------------------------------------------------------------------------+------
>  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11          | 1717
>  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1                                            |  862
>  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)                                                |  837
>  Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11                 |  504
>  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11                 |  332
>  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1                                            |  312
>  Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11                 |  256
>  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)                                                       |  220
>  Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/14.0.1                                                   |  178
>  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 |  172
> (10 rows)
>
> Here, we get the top 10 most common user agents seen for a given time
> range and data shard.
>
> Currently using xfs, as I too have been bitten by btrfs.
>
>
>
> On Mon, Sep 17, 2012 at 8:19 PM, Matt W. Benjamin <matt@linuxbox.com> wrote:
>> Hi
>>
>> Just FYI, on the NFS integration front.  A pnfs files (RFC5661)-capable NFSv4 re-exporter for Ceph has been committed to the Ganesha NFSv4 server development branch.  We're continuing to enhance and elaborate this.  We have had on our (full) plates for a while to return Ceph client library changes.  We've finished pullup and rebasing of these, are doing some final testing of a couple things in preparation to push a branch for review.
>>
>> Regards,
>>
>> Matt
>>
>> ----- "Sage Weil" <sage@inktank.com> wrote:
>>
>>> On Mon, 17 Sep 2012, Tren Blackburn wrote:
>>> > On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
>>> > Wiessner <f.wiessner@smart-weblications.de> wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > i use ceph to provide storage via rbd for our virtualization
>>> cluster delivering
>>> > > KVM based high availability Virtual Machines to my customers. I
>>> also use it
>>> > > as rbd device with ocfs2 on top of it for a 4 node webserver
>>> cluster as shared
>>> > > storage - i do this, because unfortunatelly cephfs is not ready
>>> yet ;)
>>> > >
>>> > Hi Florian;
>>> >
>>> > When you say "cephfs is not ready yet", what parts about it are not
>>> > ready? There are vague rumblings about that in general, but I'd
>>> love
>>> > to see specific issues. I understand multiple *active* mds's are
>>> not
>>> > supported, but what other issues are you aware of?
>>>
>>> Inktank is not yet supporting it because we do not have the QA in
>>> place
>>> and general hardening that will make us feel comfortable recommending
>>> it
>>> for customers.  That said, it works pretty well for most workloads.
>>> In
>>> particular, if you stay away from the snapshots and multi-mds, you
>>> should
>>> be quite stable.
>>>
>>> The engineering team here is about to do a bit of a pivot and refocus
>>> on
>>> the file system now that the object store and RBD are in pretty good
>>> shape.  That will mean both core fs/mds stability and features as well
>>> as
>>> integration efforts (NFS/CIFS/Hadoop).
>>>
>>> 'Ready' is in the eye of the beholder.  There are a few people using
>>> the
>>> fs successfully in production, but not too many.
>>>
>>> sage
>>>
>>>
>>>  >
>>> > And if there's a page documenting this already, I apologize...and
>>> > would appreciate a link :)
>>> >
>>> > t.
>>
>> --
>> Matt Benjamin
>> The Linux Box
>> 206 South Fifth Ave. Suite 150
>> Ann Arbor, MI  48104
>>
>> http://linuxbox.com
>>
>> tel. 734-761-4689
>> fax. 734-769-8938
>> cel. 734-216-5309

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-17 23:57 Nick Couchman
@ 2012-09-18  6:35 ` John Axel Eriksson
  0 siblings, 0 replies; 29+ messages in thread
From: John Axel Eriksson @ 2012-09-18  6:35 UTC (permalink / raw)
  To: Nick Couchman; +Cc: mark.nelson, ross, ceph-devel

Well I've used Btrfs on and off for two years now I think - in less
critical situations though (at home, on testing equipment at work and
on easily rebuildable systems). I've been bitten several times before
so I know there've been serious problems with it.
With kernel 3.5 I had a pretty good feeling about it, and I ran it for
some time without any issues at all. I based my decision to try it on
that, and on the fact that we store to AWS S3 as well and can actually
switch over to that as the primary store if we need to, with much
worse throughput (though throughput isn't extremely important to us).
So far it's been holding up pretty well, though I guess we're neither
the most write- nor read-intensive users out there.

For development we use VMs with Ceph on our laptops - those are
running Btrfs on Kernel 3.2 and we've had issues on them, so I think
something good happened to Btrfs on 3.4/3.5.

I'll keep you posted on how it works out for us on 3.5.

On Tue, Sep 18, 2012 at 1:57 AM, Nick Couchman <Nick.Couchman@seakr.com> wrote:
> John,
> I'd be really interested to hear how Btrfs goes over time.  I tried it out a few kernel versions ago and regretted it - lost some data after using it.  Hopefully the stability is better than it was before, and inline compression is always great!
>
> -Nick
>
>>>> John Axel Eriksson  09/17/12 5:26 PM >>>
> Our use of Ceph started pretty recently (this summer). We only use
> rados together with the radosgw. We moved from another distributed
> storage solution that had failed us more than once and we lost data.
> Since the old system had an http interface (not S3 compatible though)
> we looked around for another similar system. In the end we chose Ceph
> since it had been in development for quite some time, had been
> incorporated in the kernel (well, the client for the fs that is) and
> recently got a company behind it. Ceph felt pretty solid, even though
> it's still early days I guess.
>
> We obviously liked the fact that it has an S3 compatible interface,
> especially since we started backing up data to Amazon S3 some time ago
> - having the same interface simplified our client code tremendously.
> We don't actually need extreme throughput (yet anyway :-) but we do
> need replication. We're quite happy with the performance so far since
> it's better than our old system.
> We store medical data for archival and conversion from and to
> different formats. Since we (after previous failures in the old
> storage system) store everything in Amazon S3 as well, we made a bet
> on Kernel 3.5 and Btrfs with compression for some quite dramatic space
> savings - the data we store often compresses really well. So far we
> haven't regretted that choice, but we've only been running it in
> production for about two months while slowly phasing out the old
> storage system.
>
> John
>
> On Tue, Sep 18, 2012 at 12:53 AM, Mark Nelson  wrote:
>> Hi Nick,
>>
>> All I have to say, is that is totally awesome and scary at the same time. :)
>>
>> Glad to hear that it recovers well when people shut their desktops off!
>>
>> Mark
>>
>>
>> On 09/17/2012 05:47 PM, Nick Couchman wrote:
>>>
>>> My use of Ceph is probably pretty unique in some of the aspects of
>>> where/how I'm using it.  I run an IT department for a medium-sized
>>> engineering firm.  One of my goals is to try to make the best possible use
>>> of the hardware we're deploying to users' desktops.  Often times users
>>> cannot get by with a thin client and a VM somewhere, they actually need
>>> decent hardware on the desktop.  However, when the hardware isn't being
>>> used, it's nice to be able to have access to some of the free disk space,
>>> I/O bandwidth, memory, and CPU cycles available on the hardware.  So, Ceph
>>> is part of an overall strategy for making use of the hardware.  I'm guessing
>>> most folks run it on racked servers in datacenters, but I'm distributing it
>>> across desktops.
>>>
>>> I've started by rolling out Linux to the desktop bare metal rather than
>>> Windows.  I run openSuSE 12.1, probably moving to 12.2 here in the
>>> near-future (I have Ceph packages available and built for openSuSE 11.4,
>>> 12.1, and 12.2 on my OBS project).  I run the Xen kernel on this hardware so
>>> that I can run VMs on top of it for various purposes.  For folks who need
>>> Windows, I use Windows-based VMs on Xen.  For the types who are comfortable
>>> with switching between Linux and Windows, I use a Windows VM and then
>>> rdesktop to connect from the Linux desktop/window manager.  For the types
>>> who are only comfortable in Windows, I use VGA and PCI pass-through in Xen
>>> to pass the video card and the USB controllers to the Windows guest, making
>>> the Linux base install transparent to the end-user.
>>>
>>> To make use of free CPU cycles, in addition to VMs, I use the latest
>>> freely-available version of the software formerly known as the Sun Grid
>>> Engine to make these desktop systems part of the batching system that allows
>>> engineers to run HPC jobs.  They mount various filesystems from our NFS
>>> servers and jobs can execute on these systems on evenings and weekends.
>>>
>>> Ceph is a pretty recent addition to these configurations.  I wanted to
>>> find an easy way to make use of the free disk space on these systems, but in
>>> a useful way that aggregates it all together.  After looking at several
>>> distributed filesystems, Ceph came up as the one with the feature sets that
>>> made the most sense for me.  So, I've spent a bunch of time building
>>> packages, testing out Ceph, and have finally rolled it out on these two
>>> dozen Linux desktops, aggregating 100GB from each desktop's 250GB drive into
>>> a single pool that adds up to roughly 2.2TB of raw storage.  I currently do
>>> 3 replications for all of my pools in Ceph to try to protect against a
>>> desktop machine going down, getting shut down, etc., which does happen from
>>> time-to-time.  So far this has worked out pretty well, and Ceph seems to
>>> recover pretty well from these failures, moving blocks to different systems
>>> when necessary, then re-doing that when the systems come back online.
>>>
>>> My next steps for this setup, including Ceph, really get into more of a
>>> private cloud infrastructure using desktop commodity hardware.  I'd like to
>>> be able to install something like Openstack or the XAPI/XCP software on
>>> these systems and centrally manage the aggregated storage along with memory
>>> and CPU with a tool like that.  This would give me the ability to deploy
>>> these inexpensive systems across the organization, but make sure they're
>>> used to their best capacity, and it also allows for great flexibility when
>>> users move from machine to machine, or VMs need to move from place to place.
>>> I do keep a lot of my critical infrastructure in my datacenter on more
>>> traditional compute systems - a SAN, XenServer, fileservers/NAS with
>>> NFS/CIFS, etc. - but this is a good way for me to prove out the usefulness
>>> and reliability of systems like Ceph and other cloud-computing concepts and
>>> then take those and apply them to increasingly complex and critical needs in
>>> my organization.
>>>
>>> For Ceph improvements that would help me out, the ability to support POSIX
>>> and NFSv4 ACLs would be a fantastic addition.  We use these types of
>>> permissions on our main filesystems to control access better than the
>>> traditional UGO-style permissions, and I already miss it while using Ceph.
>>> Also, I know the concept of deduplication has been discussed, and this, too,
>>> would be great.  I was actually wondering about the feasibility of
>>> implementing post-processing deduplication on Ceph, first, rather than
>>> inline deduplication - obviously this increases disk space requirements
>>> since there has to be enough to store the duplicated data, but still seems
>>> to beat no deduplication at all.  Not a huge requirement at this point, but
>>> playing with FSs that support deduplication makes me want it everywhere :-).
>>>
>>> -Nick
>>>
>>>>>> On 2012/09/17 at 16:14, Ross Turk  wrote:
>>>>
>>>> Hi, all!
>>>>
>>>> One of the most important parts of Inktank's mission is to spread the
>>>> word about Ceph. We want everyone to know what it is and how to use
>>>> it.
>>>>
>>>> In order to tell a better story to potential new users, I'm trying to
>>>> get a sense for today's deployments. We've spent the last few months
>>>> talking to folks around the world, but I'm sure there are a few great
>>>> stories we haven't heard yet!
>>>>
>>>> If you've got a spare five minutes, I would love to hear what you're
>>>> up to. What kind of projects are you working on, and in what stage?
>>>> What is your workload? Are you using Ceph alongside other
>>>> technologies? How has your experience been?
>>>>
>>>> This is also a good opportunity for me to introduce myself to those I
>>>> haven't met yet! Feel free to copy the list if you think others would
>>>> be interested (and you don't mind sharing).
>>>>
>>>> Cheers,
>>>> Ross
>>>>
>>>> --
>>>> Ross Turk
>>>> Ceph Community Guy
>>>>
>>>> "Any sufficiently advanced technology is indistinguishable from magic."
>>>> -- Arthur C. Clarke
>>>
>>>
>>>
>>>
>>> --------
>>> This e-mail may contain confidential and privileged material for the sole
>>> use of the intended recipient.  If this email is not intended for you, or
>>> you are not responsible for the delivery of this message to the intended
>>> recipient, please note that this message may contain SEAKR Engineering
>>> (SEAKR) Privileged/Proprietary Information.  In such a case, you are
>>> strictly prohibited from downloading, photocopying, distributing or
>>> otherwise using this message, its contents or attachments in any way.  If
>>> you have received this message in error, please notify us immediately by
>>> replying to this e-mail and delete the message from your mailbox.
>>> Information contained in this message that does not relate to the business
>>> of SEAKR is neither endorsed by nor attributable to SEAKR.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-17 23:26     ` John Axel Eriksson
@ 2012-09-18  7:47       ` Plaetinck, Dieter
  2012-09-18 14:34         ` John Axel Eriksson
  0 siblings, 1 reply; 29+ messages in thread
From: Plaetinck, Dieter @ 2012-09-18  7:47 UTC (permalink / raw)
  To: John Axel Eriksson; +Cc: ceph-devel

On Tue, 18 Sep 2012 01:26:03 +0200
John Axel Eriksson <john@insane.se> wrote:

> another distributed
> storage solution that had failed us more than once and we lost data.
> Since the old system had an http interface (not S3 compatible though)

can you say a bit more about this? failure stories are very interesting and useful.

Dieter

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  2:32     ` Sage Weil
@ 2012-09-18 11:48       ` Smart Weblications GmbH - Florian Wiessner
  2012-09-18 16:20         ` Sage Weil
  2012-09-18 16:35       ` Tren Blackburn
  1 sibling, 1 reply; 29+ messages in thread
From: Smart Weblications GmbH - Florian Wiessner @ 2012-09-18 11:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Tren Blackburn, Ross Turk, ceph-devel

Am 18.09.2012 04:32, schrieb Sage Weil:
> On Mon, 17 Sep 2012, Tren Blackburn wrote:
>> On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
>> Wiessner <f.wiessner@smart-weblications.de> wrote:
>>>
>>> Hi,
>>>
>>> i use ceph to provide storage via rbd for our virtualization cluster delivering
>>> KVM based high availability Virtual Machines to my customers. I also use it
>>> as rbd device with ocfs2 on top of it for a 4 node webserver cluster as shared
>>> storage - i do this, because unfortunatelly cephfs is not ready yet ;)
>>>
>> Hi Florian;
>>
>> When you say "cephfs is not ready yet", what parts about it are not
>> ready? There are vague rumblings about that in general, but I'd love
>> to see specific issues. I understand multiple *active* mds's are not
>> supported, but what other issues are you aware of?
> 
> Inktank is not yet supporting it because we do not have the QA in place 
> and general hardening that will make us feel comfortable recommending it 
> for customers.  That said, it works pretty well for most workloads.  In 
> particular, if you stay away from the snapshots and multi-mds, you should 
> be quite stable.
> 
> The engineering team here is about to do a bit of a pivot and refocus on 
> the file system now that the object store and RBD are in pretty good 
> shape.  That will mean both core fs/mds stability and features as well as 
> integration efforts (NFS/CIFS/Hadoop).
> 
> 'Ready' is in the eye of the beholder.  There are a few people using the 
> fs successfully in production, but not too many.
> 

I tried it using multiple mds, because without multiple mds there is no
redundancy and the single mds will be a SPOF. I noticed things like empty
directories which could not be deleted: it said "directory not empty", but the
directory was empty. I also noticed kernel panics on 3.2 kernels using the
kernel ceph client, and crashes with ceph-fuse. It is somewhat unstable, so i
always had to reboot a node after a while of usage for various reasons
(ceph-fuse crashed and messed up fuse, kernel panic using kernel ceph client,
unable to delete files/dirs, no fsck for fixing things).

The last time i tried, i was simply untarring a kernel tree in the mountpoint
of a newly created cephfs, and after 10 minutes there were errors like being
unable to delete dirs etc. Since i do not know how to reset/reformat only the
cephfs part (without touching rbd!), i stopped testing for now. That last time
i lost data - the data was not important and i had backups, but i am feeling
uncomfortable now with using cephfs...

I also have not tried btrfs with ceph again since 11/2011, because of losing
data after reboots: when btrfs died, the filesystem was unmountable and there
was no fsck, so i could only reformat and wait for ceph to rebuild. After a
power failure, no btrfs partitions survived and i lost all test data :/

So i think the first thing to be done to cephfs would be to integrate some sort
of fsck and the ability to format only cephfs without losing other rbd
images/data or rados data...

-- 

Mit freundlichen Grüßen,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

fon.: +49 9282 9638 200
fax.: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0,99 EUR/Min*
http://www.smart-weblications.de

--
Sitz der Gesellschaft: Naila
Geschäftsführer: Florian Wiessner
HRB-Nr.: HRB 3840 Amtsgericht Hof
*aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  7:47       ` Plaetinck, Dieter
@ 2012-09-18 14:34         ` John Axel Eriksson
  2012-09-18 14:51           ` Plaetinck, Dieter
  2012-09-18 16:20           ` Xiaopong Tran
  0 siblings, 2 replies; 29+ messages in thread
From: John Axel Eriksson @ 2012-09-18 14:34 UTC (permalink / raw)
  To: Plaetinck, Dieter; +Cc: ceph-devel

I actually opted to not specifically mention the product we had
problems with since there have been lots of changes and fixes to it,
which we unfortunately were unable to make use of (you'll see why
later). But I guess it's interesting enough to go into a little more
detail, so... before moving to Ceph we were using the Riak Distributed
Database from Basho - http://riak.basho.com.

First I have to say that Riak is actually pretty awesome in many ways
- not least operations-wise. Compared to Ceph it's a lot easier
to get up and running and add storage as you go... basically just one
command to add a node to the cluster, and you only need the address of
any other existing node for this. With Riak, every node is the same,
so there is no SPOF by default (i.e. no MDS, no MON - just nodes).

As you might have thought already "Distributed Database isn't exactly
the same as Distributed Storage" so why did we use it? Well, there is
an add-on to Riak called Luwak, also created and supported by Basho,
that is touted as "Large Object Support" where you can store as large
objects as you want. I think our main problem was with using this
add-on (as I said created and supported by Basho). An object in
"standard" riak k/v is limited to... I think around 40 MB, or at least
you shouldn't store larger objects than that because it means
"trouble". Anyway, we went with Luwak which seemed to be a perfect
solution for the type of storage we do.

We ran with Luwak for almost two years and usually it served us pretty
well. Unfortunately there were bugs and hidden problems which, IMO,
Basho should have been more open about. One issue is that Riak is
based on a repair mechanism called "read-repair" - that pretty much
tells you how it works: data will only be repaired on a read. Now that
is a problem in itself when you archive data, which we do (i.e. we are
not reading it very often, or at all).
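
To make the read-repair problem concrete, here is a minimal sketch in
Python. It is not Riak's code or API: the Replica class, the
(version, value) tuples and the function names are all made up for
illustration. It only shows the general idea that a damaged copy gets
fixed as a side effect of reading the key, so a key nobody reads is
never fixed.

```python
class Replica:
    """One node's local copy of the data (hypothetical, not a Riak type)."""

    def __init__(self):
        self.store = {}  # key -> (version, value)


def write(replicas, key, version, value):
    """Store the same versioned value on every replica."""
    for replica in replicas:
        replica.store[key] = (version, value)


def read_with_repair(replicas, key):
    """Return the newest value and copy it to stale or missing replicas."""
    found = [r.store[key] for r in replicas if key in r.store]
    if not found:
        raise KeyError(key)
    newest = max(found, key=lambda entry: entry[0])
    for replica in replicas:
        if replica.store.get(key) != newest:
            # The repair happens here, and only for the key being read.
            replica.store[key] = newest
    return newest[1]


replicas = [Replica(), Replica(), Replica()]
write(replicas, "hot", 1, "read often")
write(replicas, "archive", 1, "never read")

# Simulate one node losing both keys.
del replicas[0].store["hot"]
del replicas[0].store["archive"]

read_with_repair(replicas, "hot")
print("hot" in replicas[0].store)      # repaired by the read
print("archive" in replicas[0].store)  # still missing: nobody read it
```

In an archive most keys behave like "archive" above, so lost copies
can pile up unnoticed until the day the data is actually needed.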

With Luwak (the large-object add-on), data is split into many keys and
values and stored in the "normal" riak k/v store... unfortunately
read-repair in this scenario doesn't seem to work at all and if
something was missing - Riak had a tendency to crash HARD, sometimes
managing to take the whole machine with it. There were also strange
issues where one crashing node seemed to affect its neighbors so that
they also crashed... a domino effect which makes "distributed" a
little too "distributed". This didn't always happen but it did happen
several times in our case. The logs were often pretty hard to
understand and more often than not left us completely in the dark
about what was going on.

We also discovered that deleting data in Luwak doesn't actually DO
anything... sure the key is gone but data is still on disk - seemingly
orphaned, so deleting was more or less a noop. This was nowhere to be
found in the docs.
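
As a rough illustration of how a delete can turn into a no-op in a
chunked store, here is a small Python sketch. It is an assumption
about the general pattern, not Luwak's actual implementation: the
"file:"/"block:" key layout, the block size and all function names are
hypothetical. A large value is split into blocks stored under their
own keys plus one manifest key, and a delete that only removes the
manifest leaves every block on disk.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose, so a short value spans several blocks


def put_large(kv, name, data):
    """Split data into blocks, store each under its hash, then a manifest."""
    block_keys = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        key = "block:" + hashlib.sha1(block).hexdigest()
        kv[key] = block
        block_keys.append(key)
    kv["file:" + name] = block_keys


def get_large(kv, name):
    """Reassemble a value by following its manifest."""
    return b"".join(kv[key] for key in kv["file:" + name])


def naive_delete(kv, name):
    """Remove only the manifest key; the blocks are left behind."""
    del kv["file:" + name]


def orphaned_blocks(kv):
    """List block keys that no manifest refers to any more."""
    referenced = {
        block_key
        for key, value in kv.items()
        if key.startswith("file:")
        for block_key in value
    }
    return sorted(
        key for key in kv
        if key.startswith("block:") and key not in referenced
    )


kv = {}
put_large(kv, "scan", b"abcdefghij")
naive_delete(kv, "scan")
print(len(orphaned_blocks(kv)))  # blocks still stored after the delete
```

Reclaiming the space in a design like this needs a separate garbage
collection pass over the blocks, which is the kind of thing you would
want spelled out in the docs.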

Finally, I think on the 3rd of June this year, we requested paid support
from Basho to help us in our last crash-and-burn situation, and that's
when we, among other things, were told that DELETEing only appears to
work. We were also told that Luwak was originally created to
store email and not really the types of things we store (i.e. files).
This information wasn't available anywhere - Luwak simply had the
wrong "table of contents" associated with it. All this was quite a
turn-off for us. To Basho's credit they really did help us fix our
cluster and they're really nice, friendly and helpful guys.

Actually I think the last straw was when Luwak was suddenly - out of
nowhere really - discontinued around the beginning of this year,
probably because of the bugs and hidden problems that I think may have
come from a less than stellar implementation of large-object support
from the start... so by then we were on something completely
unsupported. We couldn't switch to something else immediately of
course but we started looking around for something else at that time.
That's when I found Ceph among other more or less distributed systems,
where the others were:

Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
XtreemFS         http://www.xtreemfs.org
HDFS             http://hadoop.apache.org/hdfs/
GlusterFS        http://www.gluster.org
PomegranateFS    https://github.com/macan/Pomegranate/wiki
moosefs          http://www.moosefs.org
Openstack Swift  http://docs.openstack.org/developer/swift/
MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
LS4              http://ls4.sourceforge.net/

After trying most of these I decided to look closer at a few of them:
MooseFS, HDFS, XtreemFS and Ceph - the others were either not really
suited for our use case or just too complicated to set up and keep
running (IMO). For a short while I dabbled in writing my own storage
system using zeromq for communication but it's just not what our
company does - so I gave that up pretty quickly :-). In the end I
chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally but was
better in every other aspect and definitely a good fit. The Rados
Gateway (S3 compat) was really a big thing for us as well.

As I started out saying: there have been many improvements to Riak,
not least to the large-object support... but that large-object
support is not built on Luwak but is a completely new thing, and it's
not open source or free. It's called Riak CS (CS for Cloud Storage)
and has an S3 compatible interface, and it seems to be pretty
good. We had many discussions internally about whether Riak CS was the
right move for us, but in the end we decided on Ceph since we couldn't
justify the cost of Riak CS.

To sum it up: we made, in retrospect, a bad choice - not because Riak
itself doesn't work or isn't any good for the things it's good at (it
really is!) but because the add-on Luwak was misrepresented and not a
good fit for us.

I really have high hopes for Ceph and I think it has a bright future
in our company and in general. Riak CS would probably have been a very
good fit as well if it wasn't for the cost involved.

So there you have it - not just failure scenarios but bad decisions,
misrepresentation of features and somewhat sparse documentation. By
the way, Ceph has improved its docs a lot but they could still use
some work.

-John

On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <dieter@vimeo.com> wrote:
> On Tue, 18 Sep 2012 01:26:03 +0200
> John Axel Eriksson <john@insane.se> wrote:
>
>> another distributed
>> storage solution that had failed us more than once and we lost data.
>> Since the old system had an http interface (not S3 compatible though)
>
> can you say a bit more about this? failure stories are very interesting and useful.
>
> Dieter

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18 14:34         ` John Axel Eriksson
@ 2012-09-18 14:51           ` Plaetinck, Dieter
  2012-09-18 14:56             ` Mark Nelson
  2012-09-18 16:20           ` Xiaopong Tran
  1 sibling, 1 reply; 29+ messages in thread
From: Plaetinck, Dieter @ 2012-09-18 14:51 UTC (permalink / raw)
  To: John Axel Eriksson; +Cc: ceph-devel

thanks a lot for the detailed writeup, I found it quite useful.
the list of contestants is similar to the list I made when researching (and I also had luwak);
while I also think ceph is very promising and probably deserves to dominate in the future,
I'm focusing on openstack swift for now. FWIW

Dieter

On Tue, 18 Sep 2012 16:34:23 +0200
John Axel Eriksson <john@insane.se> wrote:

> I actually opted to not specifically mention the product we had
> problems with since there have been lots of changes and fixes to it,
> which we unfortunately were unable to make use of(you'll know why
> later). But I guess it's interesting enough to go into a little more
> detail so... before moving to Ceph we were using the Riak Distributed
> Database from Basho - http://riak.basho.com.
> 
> First I have to say that Riak is actually pretty awesome in many ways
> - not in the least operations wise. Compared to Ceph it's alot easier
> to get up and running and add storage as you go... basically just one
> command to add a node to the cluster and you only need the address of
> any other existing node for this. With Riak, every node is the same,
> so there is no SPOF by default (eg. no MDS, no MON - just nodes).
> 
> As you might have thought already "Distributed Database isn't exactly
> the same as Distributed Storage" so why did we use it? Well, there is
> an add-on to Riak called Luwak, also created and supported by Basho,
> that is touted as "Large Object Support" where you can store as large
> objects as you want. I think our main problem was with using this
> add-on (as I said created and supported by Basho). An object in
> "standard" riak k/v is limited to... I think around 40 MB, or at least
> you shouldn't store larger objects than that because it means
> "trouble". Anyway, we went with Luwak which seemed to be a perfect
> solution for the type of storage we do.
> 
> We ran with Luwak for almost two years and usually it served us pretty
> well. Unfortunately there were bugs and hidden problems which i.m.o
> Basho should have been more open about. One issue is that Riak is
> based on a repair mechanism called "read-repair" - that pretty much
> tells you how it works, data will only be repaired on a read. Now that
> is a problem in itself when you archive data which we do (eg. not
> reading it very often or at all).
> 
> With Luwak(the large-object add-on), data is split into many keys and
> values and stored in the "normal" riak k/v store... unfortunately
> read-repair in this scenario doesn't seem to work at all and if
> something was missing - Riak had a tendency to crash HARD, sometimes
> managing to take the whole machine with it. There were also strange
> issues where one crashing node seemed to affect it's neighbors so that
> they also crashed... a domino effect which makes "distributed" a
> little too "distributed". This didn't always happen but it did happen
> several times in our case. The logs were often pretty hard to
> understand and more often than not left us completely in the dark
> about what was going on.
> 
> We also discovered that deleting data in Luwak doesn't actually DO
> anything... sure the key is gone but data is still on disk - seemingly
> orphaned, so deleting was more or less a noop. This was nowhere to be
> found in the docs.
> 
> Finally, I think 3rd of June this year, we requested paid support from
> Basho to help us in our last crash-and-burn situation and that's when
> we, among other things, were told that DELETEing only appears to
> work. We were also told that Luwak was originally created to
> store email and not really the types of things we store (e.g. files).
> This information wasn't available anywhere - Luwak simply had the
> wrong "table of contents" associated with it. All this was quite a
> turn-off for us. To Basho's credit, they really did help us fix our
> cluster and they're really nice, friendly and helpful guys.
> 
> Actually I think the last straw was when Luwak was suddenly - out of
> nowhere really - discontinued around the beginning of this year,
> probably because of the bugs and hidden problems that I think may have
> come from a less than stellar implementation of large-object support
> from the start... so by then we were on something completely
> unsupported. We couldn't switch to something else immediately of
> course but we started looking around for something else at that time.
> That's when I found Ceph among other more or less distributed systems,
> where the others were:
> 
> Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
> XtreemFS         http://www.xtreemfs.org
> HDFS             http://hadoop.apache.org/hdfs/
> GlusterFS        http://www.gluster.org
> PomegranateFS    https://github.com/macan/Pomegranate/wiki
> moosefs          http://www.moosefs.org
> Openstack Swift  http://docs.openstack.org/developer/swift/
> MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
> LS4              http://ls4.sourceforge.net/
> 
> After trying most of these I decided to look closer at a few of them,
> MooseFS, HDFS, XtreemFS and Ceph - the others were either not really
> suited for our use case or just too complicated to set up and keep
> running (i.m.o). For a short while I dabbled in writing my own storage
> system using zeromq for communication but it's just not what our
> company does - so I gave that up pretty quickly :-). In the end I
> chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally but was
> better in every other aspect and definitely a good fit. The RADOS
> Gateway (S3 compat) was really a big thing for us as well.
> 
> As I started out saying: there have been many improvements to Riak, not
> least to the large-object support... but that large-object
> support is not built on Luwak but a completely new thing and it's not
> open source or free. It's called Riak CS (CS for Cluster Storage I
> guess) and has an S3 compatible interface and it seems to be pretty
> good. We had many discussions internally if Riak CS was the right move
> for us but in the end we decided on Ceph since we couldn't justify the
> cost of Riak CS.
> 
> To sum it up: we made, in retrospect, a bad choice - not because Riak
> itself doesn't work or isn't any good for the things it's good at (it
> really is!) but because the add-on Luwak was misrepresented and not a
> good fit for us.
> 
> I really have high hopes for Ceph and I think it has a bright future
> in our company and in general. Riak CS would probably have been a very
> good fit as well if it wasn't for the cost involved.
> 
> So there you have it - not just failure scenarios but bad decisions,
> misrepresentation of features and somewhat sparse documentation. By the
> way, Ceph has improved its docs a lot but still could use some
> work.
> 
> -John
> 
> 
> On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <dieter@vimeo.com> wrote:
> > On Tue, 18 Sep 2012 01:26:03 +0200
> > John Axel Eriksson <john@insane.se> wrote:
> >
> >> another distributed
> >> storage solution that had failed us more than once and we lost data.
> >> Since the old system had an http interface (not S3 compatible though)
> >
> > can you say a bit more about this? failure stories are very interesting and useful.
> >
> > Dieter
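
As an aside on the "read-repair" behaviour described in the quoted message
above (replicas are only reconciled when a key is actually read): the idea
can be sketched roughly as below. This is a minimal illustration and not
Riak's actual implementation. The Replica class, the integer version
numbers and the read_with_repair function are all assumptions made up for
the example; a real system would use something like vector clocks rather
than a simple counter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# A stored value is (version, payload); a higher version means newer data.
# Integer versions are a simplification assumed for this sketch.
Versioned = Tuple[int, bytes]


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for one storage node."""
    store: Dict[str, Versioned] = field(default_factory=dict)


def read_with_repair(replicas: List[Replica], key: str) -> Optional[bytes]:
    """Read `key` from every replica and overwrite stale or missing copies."""
    copies = [replica.store.get(key) for replica in replicas]
    present = [copy for copy in copies if copy is not None]
    if not present:
        return None  # no replica holds the key, so nothing can be repaired
    newest = max(present, key=lambda copy: copy[0])
    for replica, copy in zip(replicas, copies):
        if copy != newest:
            # The repair is triggered only here, as a side effect of a read.
            replica.store[key] = newest
    return newest[1]
```

Under these assumptions, a key that is written once and never read again is
never passed through read_with_repair, so a lost or stale replica stays that
way. That is the archival concern raised in the message.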


* Re: How are you using Ceph?
  2012-09-18 14:51           ` Plaetinck, Dieter
@ 2012-09-18 14:56             ` Mark Nelson
  2012-09-18 15:19               ` Plaetinck, Dieter
  0 siblings, 1 reply; 29+ messages in thread
From: Mark Nelson @ 2012-09-18 14:56 UTC (permalink / raw)
  To: Plaetinck, Dieter; +Cc: John Axel Eriksson, ceph-devel

Agreed, this was a really interesting writeup!  Thanks John!

Dieter, do you mind if I ask what is compelling for you in choosing 
swift vs the other options you've looked at including Ceph?

Thanks,
Mark

On 09/18/2012 09:51 AM, Plaetinck, Dieter wrote:
> thanks a lot for the detailed writeup, I found it quite useful.
> the list of contestants is similar to the list I made when researching (and I also had luwak);
> while I also think ceph is very promising and probably deserves to dominate in the future,
> I'm focusing on openstack swift for now. FWIW
>
> Dieter


* Re: How are you using Ceph?
  2012-09-18 14:56             ` Mark Nelson
@ 2012-09-18 15:19               ` Plaetinck, Dieter
  2012-09-18 15:27                 ` Mark Nelson
  0 siblings, 1 reply; 29+ messages in thread
From: Plaetinck, Dieter @ 2012-09-18 15:19 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

I don't mind.
Ultimately it came down to ceph vs swift for us.
Nothing is cast in stone yet, but we chose swift for our new not-yet-production cluster, because
swift has been around longer and has more production deployments, and hence a bigger/more experienced community, better documentation (both official as well as unofficial, blogs, tutorials etc), more conferences/techtalks.
It's also a simpler system that reuses more existing technology, which makes it (a bit?) less efficient, but makes it easier to understand. (http protocol vs custom protocol, cluster metadata in sqlite, python which I'm more comfortable with than C, etc).
I would like to implement Ceph (because on paper it's just awesome) but running it involves a certain uncertainty/risk I personally don't want to take yet.

Dieter

On Tue, 18 Sep 2012 09:56:50 -0500
Mark Nelson <mark.nelson@inktank.com> wrote:

> Agreed, this was a really interesting writeup!  Thanks John!
> 
> Dieter, do you mind if I ask what is compelling for you in choosing 
> swift vs the other options you've looked at including Ceph?
> 
> Thanks,
> Mark


* Re: How are you using Ceph?
  2012-09-18 15:19               ` Plaetinck, Dieter
@ 2012-09-18 15:27                 ` Mark Nelson
  2012-09-18 15:46                   ` Plaetinck, Dieter
  0 siblings, 1 reply; 29+ messages in thread
From: Mark Nelson @ 2012-09-18 15:27 UTC (permalink / raw)
  To: Plaetinck, Dieter; +Cc: ceph-devel

Hi Dieter,

It sounds like some of those things will come with time (more 
experienced community, docs, deployments, papers, etc).  Are there other 
things we could be doing that would make Ceph feel less risky for people 
doing similar comparisons?

Thanks,
Mark

On 09/18/2012 10:19 AM, Plaetinck, Dieter wrote:
> I don't mind.
> Ultimately it came down to ceph vs swift for us.
> Nothing is cast in stone yet, but we chose swift for our new not-yet-production cluster, because
> swift has been around longer and has more production deployments, and hence a bigger/more experienced community, better documentation (both official as well as unofficial, blogs, tutorials etc), more conferences/techtalks.
> It's also a simpler system that reuses more existing technology, which makes it (a bit?) less efficient, but makes it easier to understand. (http protocol vs custom protocol, cluster metadata in sqlite, python which I'm more comfortable with than C, etc).
> I would like to implement Ceph (because on paper it's just awesome) but running it involves a certain uncertainty/risk I personally don't want to take yet.
>
> Dieter


* Re: How are you using Ceph?
  2012-09-18 15:27                 ` Mark Nelson
@ 2012-09-18 15:46                   ` Plaetinck, Dieter
  0 siblings, 0 replies; 29+ messages in thread
From: Plaetinck, Dieter @ 2012-09-18 15:46 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Right, it just takes time to grow these things.
Maybe the process could be accelerated by being more out there, but what do I know about marketing.. not much :)

Dieter

On Tue, 18 Sep 2012 10:27:52 -0500
Mark Nelson <mark.nelson@inktank.com> wrote:

> Hi Dieter,
> 
> It sounds like some of those things will come with time (more 
> experienced community, docs, deployments, papers, etc).  Are there other 
> things we could be doing that would make Ceph feel less risky for people 
> doing similar comparisons?
> 
> Thanks,
> Mark
> 
> On 09/18/2012 10:19 AM, Plaetinck, Dieter wrote:
> > I don't mind.
> > Ultimately it came down to ceph vs swift for us.
> > Nothing is cast in stone yet, but we chose swift for our new not-yet-production cluster, because
> > swift has been around longer and has more production deployments, and hence a bigger/more experienced community, better documentation (both official as well as unofficial, blogs, tutorials etc), more conferences/techtalks.
> > It's also a simpler system that reuses more existing technology, which makes it (a bit?) less efficient, but makes it easier to understand. (http protocol vs custom protocol, cluster metadata in sqlite, python which I'm more comfortable with than C, etc).
> > I would like to implement Ceph (because on paper it's just awesome) but running it involves a certain uncertainty/risk I personally don't want to take yet.
> >
> > Dieter
> >
> > On Tue, 18 Sep 2012 09:56:50 -0500
> > Mark Nelson<mark.nelson@inktank.com>  wrote:
> >
> >> Agreed, this was a really interesting writeup!  Thanks John!
> >>
> >> Dieter, do you mind if I ask what is compelling for you in choosing
> >> swift vs the other options you've looked at including Ceph?
> >>
> >> Thanks,
> >> Mark
> >>
> >> On 09/18/2012 09:51 AM, Plaetinck, Dieter wrote:
> >>> thanks a lot for the detailed writeup, I found it quite useful.
> >>> the list of contestants is similar to the list I made when researching (and I also had luwak);
> >>> while I also think ceph is very promising and probably deserves to dominate in the future,
> >>> I'm focusing on openstack swift for now. FWIW
> >>>
> >>> Dieter
> >>>
> >>> On Tue, 18 Sep 2012 16:34:23 +0200
> >>> John Axel Eriksson<john@insane.se>   wrote:
> >>>
> >>>> I actually opted to not specifically mention the product we had
> >>>> problems with since there have been lots of changes and fixes to it,
> >>>> which we unfortunately were unable to make use of(you'll know why
> >>>> later). But I guess it's interesting enough to go into a little more
> >>>> detail so... before moving to Ceph we were using the Riak Distributed
> >>>> Database from Basho - http://riak.basho.com.
> >>>>
> >>>> First I have to say that Riak is actually pretty awesome in many ways
> >>>> - not least operations-wise. Compared to Ceph it's a lot easier
> >>>> to get up and running and add storage as you go... basically just one
> >>>> command to add a node to the cluster and you only need the address of
> >>>> any other existing node for this. With Riak, every node is the same,
> >>>> so there is no SPOF by default (eg. no MDS, no MON - just nodes).
> >>>>
> >>>> As you might have thought already "Distributed Database isn't exactly
> >>>> the same as Distributed Storage" so why did we use it? Well, there is
> >>>> an add-on to Riak called Luwak, also created and supported by Basho,
> >>>> that is touted as "Large Object Support" where you can store as large
> >>>> objects as you want. I think our main problem was with using this
> >>>> add-on (as I said created and supported by Basho). An object in
> >>>> "standard" riak k/v is limited to... I think around 40 MB, or at least
> >>>> you shouldn't store larger objects than that because it means
> >>>> "trouble". Anyway, we went with Luwak which seemed to be a perfect
> >>>> solution for the type of storage we do.
> >>>>
> >>>> We ran with Luwak for almost two years and usually it served us pretty
> >>>> well. Unfortunately there were bugs and hidden problems which i.m.o
> >>>> Basho should have been more open about. One issue is that Riak is
> >>>> based on a repair mechanism called "read-repair" - that pretty much
> >>>> tells you how it works, data will only be repaired on a read. Now that
> >>>> is a problem in itself when you archive data which we do (eg. not
> >>>> reading it very often or at all).
> >>>>
> >>>> With Luwak(the large-object add-on), data is split into many keys and
> >>>> values and stored in the "normal" riak k/v store... unfortunately
> >>>> read-repair in this scenario doesn't seem to work at all and if
> >>>> something was missing - Riak had a tendency to crash HARD, sometimes
> >>>> managing to take the whole machine with it. There were also strange
> >>>> issues where one crashing node seemed to affect its neighbors so that
> >>>> they also crashed... a domino effect which makes "distributed" a
> >>>> little too "distributed". This didn't always happen but it did happen
> >>>> several times in our case. The logs were often pretty hard to
> >>>> understand and more often than not left us completely in the dark
> >>>> about what was going on.
> >>>>
> >>>> We also discovered that deleting data in Luwak doesn't actually DO
> >>>> anything... sure the key is gone but data is still on disk - seemingly
> >>>> orphaned, so deleting was more or less a noop. This was nowhere to be
> >>>> found in the docs.
> >>>>
> >>>> Finally, I think 3rd of June this year, we requested paid support from
> >>>> Basho to help us in our last crash-and-burn situation and that's when
> >>>> we, among other things, were told about the fact that DELETEing just
> >>>> seems to work. We were also told that Luwak was originally created to
> >>>> store email and not really the types of things we store (eg. files).
> >>>> This information wasn't available anywhere - Luwak simply had the
> >>>> wrong "table of contents" associated with it. All this was quite a
> >>>> turn-off for us. To Basho's credit they really did help us fix our
> >>>> cluster and they're really nice, friendly and helpful guys.
> >>>>
> >>>> Actually I think the last straw was when Luwak was suddenly - out of
> >>>> nowhere really - discontinued around the beginning of this year,
> >>>> probably because of the bugs and hidden problems that I think may have
> >>>> come from a less than stellar implementation of large-object support
> >>>> from the start... so by then we were on something completely
> >>>> unsupported. We couldn't switch to something else immediately of
> >>>> course but we started looking around for something else at that time.
> >>>> That's when I found Ceph among other more or less distributed systems,
> >>>> where the others were:
> >>>>
> >>>> Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
> >>>> XtreemFS         http://www.xtreemfs.org
> >>>> HDFS             http://hadoop.apache.org/hdfs/
> >>>> GlusterFS        http://www.gluster.org
> >>>> PomegranateFS    https://github.com/macan/Pomegranate/wiki
> >>>> moosefs          http://www.moosefs.org
> >>>> Openstack Swift  http://docs.openstack.org/developer/swift/
> >>>> MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
> >>>> LS4              http://ls4.sourceforge.net/
> >>>>
> >>>> After trying most of these I decided to look closer at a few of them,
> >>>> MooseFS, HDFS, XtreemFS and Ceph - the others were either not really
> >>>> suited for our use case or just too complicated to setup and keep
> >>>> running (i.m.o). For a short while I dabbled in writing my own storage
> >>>> system using zeromq for communication but it's just not what our
> >>>> company does - so I gave that up pretty quickly :-). In the end I
> >>>> chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally but in
> >>>> every other aspect better and definitely a good fit. The Rados
> >>>> Gateway(S3 compat) was really a big thing for us as well.
> >>>>
> >>>> As I started out saying: there have been many improvements to Riak not
> >>>> in the least to the large-object support... but that large-object
> >>>> support is not built on Luwak but a completely new thing and it's not
> >>>> open source or free. It's called Riak CS(CS for Cluster Storage I
> >>>> guess) and has an S3 compatible interface and it seems to be pretty
> >>>> good. We had many discussions internally if Riak CS was the right move
> >>>> for us but in the end we decided on Ceph since we couldn't justify the
> >>>> cost of Riak CS.
> >>>>
> >>>> To sum it up: we made, in retrospect, a bad choice - not because Riak
> >>>> itself doesn't work or isn't any good for the things it's good at(it
> >>>> really is!) but because the add-on Luwak was misrepresented and not a
> >>>> good fit for us.
> >>>>
> >>>> I really have high hopes for Ceph and I think it has a bright future
> >>>> in our company and in general. Riak CS would probably have been a very
> >>>> good fit as well if it wasn't for the cost involved.
> >>>>
> >>>> So there you have it - not just failure scenarios but bad decisions,
> >>>> misrepresentation of features and somewhat sparse documentation. By
> >>>> the way, Ceph has improved its docs a lot but still could use some
> >>>> work.
> >>>>
> >>>> -John
> >>>>
> >>>>
> >>>> On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter<dieter@vimeo.com>   wrote:
> >>>>> On Tue, 18 Sep 2012 01:26:03 +0200
> >>>>> John Axel Eriksson<john@insane.se>   wrote:
> >>>>>
> >>>>>> another distributed
> >>>>>> storage solution that had failed us more than once and we lost data.
> >>>>>> Since the old system had an http interface (not S3 compatible though)
> >>>>>
> >>>>> can you say a bit more about this? failure stories are very interesting and useful.
> >>>>>
> >>>>> Dieter
> >>>
> >>
> >
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-17 22:14 How are you using Ceph? Ross Turk
  2012-09-17 22:47 ` Nick Couchman
  2012-09-18  0:05 ` Smart Weblications GmbH - Florian Wiessner
@ 2012-09-18 16:01 ` Travis Rhoden
  2 siblings, 0 replies; 29+ messages in thread
From: Travis Rhoden @ 2012-09-18 16:01 UTC (permalink / raw)
  To: Ross Turk, ceph-devel

I am using Ceph mainly for its KVM and OpenStack integration, and
also RBD.  I also needed to provide shared storage to clusters of
nodes, and thus far I haven't needed the highest-possible performance.
 Thus, I create RBDs, format them with ext4, and re-export them with
NFS.  Clients do both NFS v3 and v4 mounts.

I'm using such an NFS mount for my "nova-instances" directory in
OpenStack, which allows me to do live-migration of VMs between compute
nodes.  OpenStack Glance speaks to Ceph directly, and I am using that
as well.

CephFS would be simpler for most of my scenarios, but I've elected to
wait until Inktank is slightly more confident about it.  However,
given comments I've seen on here, it sounds like I should at least
give it a shot -- a SPOF from one MDS is no different than my current
single NFS  server.  =)  For me, Ceph's strong points lie in the
administrative capabilities to create custom pools with custom
replication rules.  And they are easy to change!  Taking down a node
(or OSDs) and watching new objects get copied in order to keep the
required number of copies is very reassuring.

The main competitor when I was exploring options for my project was
GlusterFS.  I played with it for a few months.  It worked for me okay,
but I found it to be slow, and surprisingly, very static.  I pretty
much had to define my volume/replication level ahead of time, and that
was fixed for all time.  I didn't like that I had to define that brick
X on host Y is mirrored to brick A on host B.  I wanted what Ceph does
-- make sure there are two copies of X, and make sure they are not on
the same host (or rack, or row, etc.).
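
The placement rule described above ("two copies of X, never on the same
host") can be illustrated with a toy sketch. This is not Ceph's actual
CRUSH algorithm: it swaps in plain rendezvous (highest-random-weight)
hashing, and the OSD names, host names, and the place() helper are all
hypothetical. It only shows the idea of ranking devices per object and
skipping any device whose failure domain is already in use.

```python
import hashlib

# Hypothetical toy cluster: OSD name -> host (the failure domain).
OSD_HOSTS = {
    "osd.0": "host-a",
    "osd.1": "host-a",
    "osd.2": "host-b",
    "osd.3": "host-b",
    "osd.4": "host-c",
}


def score(obj, osd):
    """Deterministic pseudo-random weight for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{obj}/{osd}".encode()).hexdigest()
    return int(digest, 16)


def place(obj, osd_hosts, copies=2):
    """Pick `copies` OSDs for `obj`, no two of them on the same host."""
    # Rank every OSD for this object, best score first.
    ranked = sorted(osd_hosts, key=lambda osd: score(obj, osd), reverse=True)
    chosen, used_hosts = [], set()
    for osd in ranked:
        host = osd_hosts[osd]
        if host in used_hosts:
            continue  # this failure domain already holds a copy
        chosen.append(osd)
        used_hosts.add(host)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough distinct hosts for the requested copies")


if __name__ == "__main__":
    print(place("object-X", OSD_HOSTS, copies=2))
```

Because the ranking depends only on the object name and the OSD names,
removing one OSD in this sketch leaves the other copy where it was and
only the lost copy gets a new home, which is the behaviour described
above when a node is taken down.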

The ability to grow the cluster and add storage while the cluster was
live was also critical.  Adjusting the crush map and having the
objects take immediate advantage of the new OSDs is great.  I wanted
to keep the HW fairly simple, so in keeping with commodity hardware I
wanted something that was strictly software based -- no hardware RAID.
That has worked well for me.

Things I would like to see:
The RBD advisory locking and fencing (this seems to be close!)
CephFS of course
more docs re: best practices, performance tuning, HW configs, etc.
Some information (whitepaper?) about how Ceph could be used in more of
an HPC environment, in addition to cloud storage.  I feel like I read
somewhere that part of the inspiration for Ceph originally came out of
frustration with Lustre.  I've also had bad Lustre experiences, and
would like to see Ceph compete in that space.

 - Travis

On Mon, Sep 17, 2012 at 6:14 PM, Ross Turk <ross@inktank.com> wrote:
> Hi, all!
>
> One of the most important parts of Inktank's mission is to spread the
> word about Ceph. We want everyone to know what it is and how to use
> it.
>
> In order to tell a better story to potential new users, I'm trying to
> get a sense for today's deployments. We've spent the last few months
> talking to folks around the world, but I'm sure there are a few great
> stories we haven't heard yet!
>
> If you've got a spare five minutes, I would love to hear what you're
> up to. What kind of projects are you working on, and in what stage?
> What is your workload? Are you using Ceph alongside other
> technologies? How has your experience been?
>
> This is also a good opportunity for me to introduce myself to those I
> haven't met yet! Feel free to copy the list if you think others would
> be interested (and you don't mind sharing).
>
> Cheers,
> Ross
>
> --
> Ross Turk
> Ceph Community Guy
>
> "Any sufficiently advanced technology is indistinguishable from magic."
> -- Arthur C. Clarke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18  3:19 ` Matt W. Benjamin
  2012-09-18  5:44   ` Ian Pye
@ 2012-09-18 16:13   ` Sage Weil
  2012-09-18 16:21     ` Matt W. Benjamin
  1 sibling, 1 reply; 29+ messages in thread
From: Sage Weil @ 2012-09-18 16:13 UTC (permalink / raw)
  To: Matt W. Benjamin; +Cc: f wiessner, Ross Turk, ceph-devel, Tren Blackburn

Hi Matt,

On Mon, 17 Sep 2012, Matt W. Benjamin wrote:
> Hi
> 
> Just FYI, on the NFS integration front.  A pnfs files (RFC5661)-capable 
> NFSv4 re-exporter for Ceph has been committed to the Ganesha NFSv4 
> server development branch.  We're continuing to enhance and elaborate 
> this.  We have had on our (full) plates for a while to return Ceph 
> client library changes.  We've finished pullup and rebasing of these, 
> are doing some final testing of a couple things in preparation to push a 
> branch for review.

This is great news!  I'm interested to hear how the Ganesha bits map pNFS 
server instances to OSDs.. is it just matching IP addresses or something?

sage

> 
> Regards,
> 
> Matt
> 
> ----- "Sage Weil" <sage@inktank.com> wrote:
> 
> > On Mon, 17 Sep 2012, Tren Blackburn wrote:
> > > On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
> > > Wiessner <f.wiessner@smart-weblications.de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > i use ceph to provide storage via rbd for our virtualization
> > cluster delivering
> > > > KVM based high availability Virtual Machines to my customers. I
> > also use it
> > > > as rbd device with ocfs2 on top of it for a 4 node webserver
> > cluster as shared
> > > > storage - i do this, because unfortunately cephfs is not ready
> > yet ;)
> > > >
> > > Hi Florian;
> > > 
> > > When you say "cephfs is not ready yet", what parts about it are not
> > > ready? There are vague rumblings about that in general, but I'd
> > love
> > > to see specific issues. I understand multiple *active* mds's are
> > not
> > > supported, but what other issues are you aware of?
> > 
> > Inktank is not yet supporting it because we do not have the QA in
> > place 
> > and general hardening that will make us feel comfortable recommending
> > it 
> > for customers.  That said, it works pretty well for most workloads. 
> > In 
> > particular, if you stay away from the snapshots and multi-mds, you
> > should 
> > be quite stable.
> > 
> > The engineering team here is about to do a bit of a pivot and refocus
> > on 
> > the file system now that the object store and RBD are in pretty good 
> > shape.  That will mean both core fs/mds stability and features as well
> > as 
> > integration efforts (NFS/CIFS/Hadoop).
> > 
> > 'Ready' is in the eye of the beholder.  There are a few people using
> > the 
> > fs successfully in production, but not too many.
> > 
> > sage
> > 
> > 
> >  > 
> > > And if there's a page documenting this already, I apologize...and
> > > would appreciate a link :)
> > > 
> > > t.
> 
> -- 
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI  48104
> 
> http://linuxbox.com
> 
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18 11:48       ` Smart Weblications GmbH - Florian Wiessner
@ 2012-09-18 16:20         ` Sage Weil
  0 siblings, 0 replies; 29+ messages in thread
From: Sage Weil @ 2012-09-18 16:20 UTC (permalink / raw)
  To: Smart Weblications GmbH - Florian Wiessner
  Cc: Tren Blackburn, Ross Turk, ceph-devel

On Tue, 18 Sep 2012, Smart Weblications GmbH - Florian Wiessner wrote:
> Am 18.09.2012 04:32, schrieb Sage Weil:
> > On Mon, 17 Sep 2012, Tren Blackburn wrote:
> >> On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
> >> Wiessner <f.wiessner@smart-weblications.de> wrote:
> >>>
> >>> Hi,
> >>>
> >>> i use ceph to provide storage via rbd for our virtualization cluster delivering
> >>> KVM based high availability Virtual Machines to my customers. I also use it
> >>> as rbd device with ocfs2 on top of it for a 4 node webserver cluster as shared
> >>> storage - i do this, because unfortunately cephfs is not ready yet ;)
> >>>
> >> Hi Florian;
> >>
> >> When you say "cephfs is not ready yet", what parts about it are not
> >> ready? There are vague rumblings about that in general, but I'd love
> >> to see specific issues. I understand multiple *active* mds's are not
> >> supported, but what other issues are you aware of?
> > 
> > Inktank is not yet supporting it because we do not have the QA in place 
> > and general hardening that will make us feel comfortable recommending it 
> > for customers.  That said, it works pretty well for most workloads.  In 
> > particular, if you stay away from the snapshots and multi-mds, you should 
> > be quite stable.
> > 
> > The engineering team here is about to do a bit of a pivot and refocus on 
> > the file system now that the object store and RBD are in pretty good 
> > shape.  That will mean both core fs/mds stability and features as well as 
> > integration efforts (NFS/CIFS/Hadoop).
> > 
> > 'Ready' is in the eye of the beholder.  There are a few people using the 
> > fs successfully in production, but not too many.
> > 
> 
> I tried it using multiple mds, because without multiple mds there is no
> redundancy and the single mds will be SPOF. I noticed things like empty

Just to clarify: by multi-mds I mean multiple *active* ceph-mds daemons.  
By default if you start a bunch of them they are just standby, ready to 
take over in the event of a failure.

> directories which could not be deleted. It said directory not empty, but it was
> empty and could not be deleted. I also noticed kernel panic on 3.2 kernels using
> kernel ceph client, or crashes with ceph-fuse. It is somewhat unstable so that i
> always had to reboot a node after a while of usage for various reasons
> (ceph-fuse crashed and messed up fuse, kernel panic using kernel ceph client,
> unable to delete files/dirs, no fsck for fixing things).
> 
> Last time i tried was simply untarring a kernel tree in a cephfs mountpoint - a newly
> created cephfs - and after 10 minutes there were errors like unable to delete
> dirs etc. Since i do not know how to reset/reformat only the cephfs part
> (without touching rbd!), i stopped testing for now. The last time i tried it i
> lost data - the data was not important and i had backups, but i was feeling
> uncomfortable now with using cephfs...

This is disconcerting.  If it is something you are able to reproduce, 
sharing those steps with us would be very helpful.

> I also have not tried btrfs with ceph again since 11/2011, because of losing
> data after reboots when btrfs dies, the btrfs was unmountable and there was no
> fsck so i only could reformat and wait for ceph to rebuild. After a
> powerfailure, no btrfs partitions survived and i lost all test data :/
> 
> 
> So i think the first thing to be done to cephfs would be to integrate some sort
> of fsck and the ability to format only cephfs without losing other rbd
> images/data or rados data...

FWIW it is already possible to create a new fs without touching other 
pools with the command

	ceph newfs <new metadata pool id> <new data pool id>

where the pool ids are the numeric ids (ceph osd dump | grep ^pool) for 
new, empty rados pools.
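
A minimal sketch of looking up those numeric ids, assuming each pool
line in the dump starts with "pool <id> '<name>'". The sample text, the
pool names fsmeta2/fsdata2, and both helper functions are made up for
illustration; the sketch only builds the command string and does not
run anything against a cluster.

```python
import re

# Illustrative, shortened stand-in for `ceph osd dump` output.
# The pools fsmeta2 and fsdata2 are hypothetical new, empty pools.
SAMPLE_DUMP = """\
epoch 42
pool 0 'data' rep size 2 crush_ruleset 0
pool 1 'metadata' rep size 2 crush_ruleset 1
pool 2 'rbd' rep size 2 crush_ruleset 2
pool 5 'fsmeta2' rep size 2 crush_ruleset 1
pool 6 'fsdata2' rep size 2 crush_ruleset 0
"""

# Same filter as `grep ^pool`, plus capture of the id and the name.
POOL_LINE = re.compile(r"^pool (\d+) '([^']+)'")


def pool_ids(dump_text):
    """Map pool name -> numeric pool id from dump text."""
    ids = {}
    for line in dump_text.splitlines():
        match = POOL_LINE.match(line)
        if match:
            ids[match.group(2)] = int(match.group(1))
    return ids


def newfs_command(dump_text, metadata_pool, data_pool):
    """Build the newfs command line for two named pools."""
    ids = pool_ids(dump_text)
    return f"ceph newfs {ids[metadata_pool]} {ids[data_pool]}"


if __name__ == "__main__":
    print(newfs_command(SAMPLE_DUMP, "fsmeta2", "fsdata2"))
    # prints: ceph newfs 5 6
```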

sage

> 
> 
> 
> -- 
> 
> Mit freundlichen Grüßen,
> 
> Florian Wiessner
> 
> Smart Weblications GmbH
> Martinsberger Str. 1
> D-95119 Naila
> 
> fon.: +49 9282 9638 200
> fax.: +49 9282 9638 205
> 24/7: +49 900 144 000 00 - 0,99 EUR/Min*
> http://www.smart-weblications.de
> 
> --
> Sitz der Gesellschaft: Naila
> Geschäftsführer: Florian Wiessner
> HRB-Nr.: HRB 3840 Amtsgericht Hof
> *aus dem dt. Festnetz, ggf. abweichende Preise aus dem Mobilfunknetz
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18 14:34         ` John Axel Eriksson
  2012-09-18 14:51           ` Plaetinck, Dieter
@ 2012-09-18 16:20           ` Xiaopong Tran
  2012-09-18 17:09             ` John Axel Eriksson
  1 sibling, 1 reply; 29+ messages in thread
From: Xiaopong Tran @ 2012-09-18 16:20 UTC (permalink / raw)
  To: John Axel Eriksson; +Cc: Plaetinck, Dieter, ceph-devel

Excellent write-up. We are exactly in the same mess with
Riak Luwak, a decision that was made before I took over the
project. I thought we were the only ones :)

We are still paying the price for it, as after over
a month of migrating the data from Riak to Ceph, we barely
moved 30% of the data.

When we retrieve a large file from Riak, it sometimes goes
crazy and can bring down the whole cluster of 10 nodes.
We keep on adding memory, and this thing does not seem
to have enough of it. And after one day of usage, it would
slow to a crawl, and we ended up recycling the cluster once
a day (sometimes more).

And as you have said, deleting just doesn't work. And lots
of other issues too.

Then Riak CS was proposed to us. It all looks
great on paper, however, with the experiences of
Riak Luwak, which also looked great on paper, we
wouldn't even dare to consider it.

I can't wait until the day we get rid of it totally.

Best,

Xiaopong


On 09/18/2012 10:34 PM, John Axel Eriksson wrote:
> I actually opted to not specifically mention the product we had
> problems with since there have been lots of changes and fixes to it,
> which we unfortunately were unable to make use of(you'll know why
> later). But I guess it's interesting enough to go into a little more
> detail so... before moving to Ceph we were using the Riak Distributed
> Database from Basho - http://riak.basho.com.
>
> First I have to say that Riak is actually pretty awesome in many ways
> - not least operations-wise. Compared to Ceph it's a lot easier
> to get up and running and add storage as you go... basically just one
> command to add a node to the cluster and you only need the address of
> any other existing node for this. With Riak, every node is the same,
> so there is no SPOF by default (eg. no MDS, no MON - just nodes).
>
> As you might have thought already "Distributed Database isn't exactly
> the same as Distributed Storage" so why did we use it? Well, there is
> an add-on to Riak called Luwak, also created and supported by Basho,
> that is touted as "Large Object Support" where you can store as large
> objects as you want. I think our main problem was with using this
> add-on (as I said created and supported by Basho). An object in
> "standard" riak k/v is limited to... I think around 40 MB, or at least
> you shouldn't store larger objects than that because it means
> "trouble". Anyway, we went with Luwak which seemed to be a perfect
> solution for the type of storage we do.
>
> We ran with Luwak for almost two years and usually it served us pretty
> well. Unfortunately there were bugs and hidden problems which i.m.o
> Basho should have been more open about. One issue is that Riak is
> based on a repair mechanism called "read-repair" - that pretty much
> tells you how it works, data will only be repaired on a read. Now that
> is a problem in itself when you archive data which we do (eg. not
> reading it very often or at all).
>
> With Luwak(the large-object add-on), data is split into many keys and
> values and stored in the "normal" riak k/v store... unfortunately
> read-repair in this scenario doesn't seem to work at all and if
> something was missing - Riak had a tendency to crash HARD, sometimes
> managing to take the whole machine with it. There were also strange
> issues where one crashing node seemed to affect its neighbors so that
> they also crashed... a domino effect which makes "distributed" a
> little too "distributed". This didn't always happen but it did happen
> several times in our case. The logs were often pretty hard to
> understand and more often than not left us completely in the dark
> about what was going on.
>
> We also discovered that deleting data in Luwak doesn't actually DO
> anything... sure the key is gone but data is still on disk - seemingly
> orphaned, so deleting was more or less a noop. This was nowhere to be
> found in the docs.
>
> Finally, I think 3rd of June this year, we requested paid support from
> Basho to help us in our last crash-and-burn situation and that's when
> we, among other things, were told about the fact that DELETEing just
> seems to work. We were also told that Luwak was originally created to
> store email and not really the types of things we store (eg. files).
> This information wasn't available anywhere - Luwak simply had the
> wrong "table of contents" associated with it. All this was quite a
> turn-off for us. To Basho's credit they really did help us fix our
> cluster and they're really nice, friendly and helpful guys.
>
> Actually I think the last straw was when Luwak was suddenly - out of
> nowhere really - discontinued around the beginning of this year,
> probably because of the bugs and hidden problems that I think may have
> come from a less than stellar implementation of large-object support
> from the start... so by then we were on something completely
> unsupported. We couldn't switch to something else immediately of
> course but we started looking around for something else at that time.
> That's when I found Ceph among other more or less distributed systems,
> where the others were:
>
> Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
> XtreemFS         http://www.xtreemfs.org
> HDFS             http://hadoop.apache.org/hdfs/
> GlusterFS        http://www.gluster.org
> PomegranateFS    https://github.com/macan/Pomegranate/wiki
> moosefs          http://www.moosefs.org
> Openstack Swift  http://docs.openstack.org/developer/swift/
> MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
> LS4              http://ls4.sourceforge.net/
>
> After trying most of these I decided to look closer at a few of them,
> MooseFS, HDFS, XtreemFS and Ceph - the others were either not really
> suited for our use case or just too complicated to setup and keep
> running (i.m.o). For a short while I dabbled in writing my own storage
> system using zeromq for communication but it's just not what our
> company does - so I gave that up pretty quickly :-). In the end I
> chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally but in
> every other aspect better and definitely a good fit. The Rados
> Gateway(S3 compat) was really a big thing for us as well.
>
> As I started out saying: there have been many improvements to Riak not
> in the least to the large-object support... but that large-object
> support is not built on Luwak but a completely new thing and it's not
> open source or free. It's called Riak CS(CS for Cluster Storage I
> guess) and has an S3 compatible interface and it seems to be pretty
> good. We had many discussions internally if Riak CS was the right move
> for us but in the end we decided on Ceph since we couldn't justify the
> cost of Riak CS.
>
> To sum it up: we made, in retrospect, a bad choice - not because Riak
> itself doesn't work or isn't any good for the things it's good at(it
> really is!) but because the add-on Luwak was misrepresented and not a
> good fit for us.
>
> I really have high hopes for Ceph and I think it has a bright future
> in our company and in general. Riak CS would probably have been a very
> good fit as well if it wasn't for the cost involved.
>
> So there you have it - not just failure scenarios but bad decisions,
> misrepresentation of features and somewhat sparse documentation. By
> the way, Ceph has improved its docs a lot but still could use some
> work.
>
> -John
>
>
> On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter<dieter@vimeo.com>  wrote:
>> On Tue, 18 Sep 2012 01:26:03 +0200
>> John Axel Eriksson<john@insane.se>  wrote:
>>
>>> another distributed
>>> storage solution that had failed us more than once and we lost data.
>>> Since the old system had an http interface (not S3 compatible though)
>>
>> can you say a bit more about this? failure stories are very interesting and useful.
>>
>> Dieter


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: How are you using Ceph?
  2012-09-18 16:13   ` Sage Weil
@ 2012-09-18 16:21     ` Matt W. Benjamin
  0 siblings, 0 replies; 29+ messages in thread
From: Matt W. Benjamin @ 2012-09-18 16:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: f wiessner, Ross Turk, ceph-devel, Tren Blackburn

Hi Sage,

----- "Sage Weil" <sage@inktank.com> wrote:

> Hi Matt,
> 
> On Mon, 17 Sep 2012, Matt W. Benjamin wrote:
> > Hi
> > 
> > Just FYI, on the NFS integration front.  A pnfs files (RFC5661)-capable
> > NFSv4 re-exporter for Ceph has been committed to the Ganesha NFSv4
> > server development branch.  We're continuing to enhance and elaborate
> > this.  We have had on our (full) plates for a while to return Ceph
> > client library changes.  We've finished pullup and rebasing of these,
> > are doing some final testing of a couple things in preparation to push a
> > branch for review.
> 
> This is great news!  I'm interested to hear how the Ganesha bits map pNFS
> server instances to OSDs.. is it just matching IP addresses or something?

The returned pnfs layouts, together with the set of pseudo devices they mention, express a striping pattern.  The devices indicate the location of each OSD, each of which is running a Ganesha data server.  We're in the process of deepening our integration here, potentially switching to the next draft version of the pnfs object layout.  This will increase the expressiveness of the stripe mappings we can do; as it is now, there is some impedance mismatch.
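
For readers unfamiliar with striped layouts, here is a minimal sketch of how a byte offset could be mapped onto a list of devices. It assumes plain round-robin striping with a fixed stripe unit; the function name and parameters are made up for illustration, and this is not the actual Ganesha or Ceph mapping, which isn't spelled out in this message.

```python
def locate(offset, stripe_unit, num_devices):
    """Map a byte offset to (device_index, stripe_number, offset_in_unit).

    Assumes simple round-robin striping: consecutive stripe units go to
    consecutive devices, wrapping around after the last device.
    """
    unit_index = offset // stripe_unit          # which stripe unit the byte falls in
    device_index = unit_index % num_devices     # round-robin device selection
    stripe_number = unit_index // num_devices   # how many full passes over the devices
    return device_index, stripe_number, offset % stripe_unit
```

With three devices and a 4096-byte stripe unit, for example, the fifth stripe unit (unit index 4) wraps around to the second device on the second pass.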

Regards,

Matt

-- 
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309


* Re: How are you using Ceph?
  2012-09-18  2:32     ` Sage Weil
  2012-09-18 11:48       ` Smart Weblications GmbH - Florian Wiessner
@ 2012-09-18 16:35       ` Tren Blackburn
  2012-09-18 17:00         ` Sage Weil
  1 sibling, 1 reply; 29+ messages in thread
From: Tren Blackburn @ 2012-09-18 16:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: f.wiessner, Ross Turk, ceph-devel

On Mon, Sep 17, 2012 at 7:32 PM, Sage Weil <sage@inktank.com> wrote:
> On Mon, 17 Sep 2012, Tren Blackburn wrote:
>> On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
>> Wiessner <f.wiessner@smart-weblications.de> wrote:
>> >
>> > Hi,
>> >
>> > i use ceph to provide storage via rbd for our virtualization cluster delivering
>> > KVM based high availability Virtual Machines to my customers. I also use it
>> > as rbd device with ocfs2 on top of it for a 4 node webserver cluster as shared
>> > storage - i do this, because unfortunately cephfs is not ready yet ;)
>> >
>> Hi Florian;
>>
>> When you say "cephfs is not ready yet", what parts about it are not
>> ready? There are vague rumblings about that in general, but I'd love
>> to see specific issues. I understand multiple *active* mds's are not
>> supported, but what other issues are you aware of?
>
> Inktank is not yet supporting it because we do not have the QA in place
> and general hardening that will make us feel comfortable recommending it
> for customers.  That said, it works pretty well for most workloads.  In
> particular, if you stay away from the snapshots and multi-mds, you should
> be quite stable.
With regard to the multi-mds, is that multi-active mds? I have 3
mds's built, and only 1 active. The others are in case there's a
failure. Does that scenario work?

>
> The engineering team here is about to do a bit of a pivot and refocus on
> the file system now that the object store and RBD are in pretty good
> shape.  That will mean both core fs/mds stability and features as well as
> integration efforts (NFS/CIFS/Hadoop).
That's awesome news. The file system component is very important to me.

>
> 'Ready' is in the eye of the beholder.  There are a few people using the
> fs successfully in production, but not too many.
I'll keep you up to date as our testing builds out! I'm in the process
of building out a new test cluster.

t.


* Re: How are you using Ceph?
  2012-09-18 16:35       ` Tren Blackburn
@ 2012-09-18 17:00         ` Sage Weil
  0 siblings, 0 replies; 29+ messages in thread
From: Sage Weil @ 2012-09-18 17:00 UTC (permalink / raw)
  To: Tren Blackburn; +Cc: f.wiessner, Ross Turk, ceph-devel

On Tue, 18 Sep 2012, Tren Blackburn wrote:
> On Mon, Sep 17, 2012 at 7:32 PM, Sage Weil <sage@inktank.com> wrote:
> > On Mon, 17 Sep 2012, Tren Blackburn wrote:
> >> On Mon, Sep 17, 2012 at 5:05 PM, Smart Weblications GmbH - Florian
> >> Wiessner <f.wiessner@smart-weblications.de> wrote:
> >> >
> >> > Hi,
> >> >
> >> > i use ceph to provide storage via rbd for our virtualization cluster delivering
> >> > KVM based high availability Virtual Machines to my customers. I also use it
> >> > as rbd device with ocfs2 on top of it for a 4 node webserver cluster as shared
> >> > storage - i do this, because unfortunately cephfs is not ready yet ;)
> >> >
> >> Hi Florian;
> >>
> >> When you say "cephfs is not ready yet", what parts about it are not
> >> ready? There are vague rumblings about that in general, but I'd love
> >> to see specific issues. I understand multiple *active* mds's are not
> >> supported, but what other issues are you aware of?
> >
> > Inktank is not yet supporting it because we do not have the QA in place
> > and general hardening that will make us feel comfortable recommending it
> > for customers.  That said, it works pretty well for most workloads.  In
> > particular, if you stay away from the snapshots and multi-mds, you should
> > be quite stable.
> With regard to the multi-mds, is that multi-active mds? I have 3
> mds's built, and only 1 active. The others are in case there's a
> failure. Does that scenario work?

Correct.  One active and one (or more) standby is the default (and 
recommended) behavior.  You need to explicitly tell the monitor to make 
multiple MDSs active... don't do that (yet!).

> >
> > The engineering team here is about to do a bit of a pivot and refocus on
> > the file system now that the object store and RBD are in pretty good
> > shape.  That will mean both core fs/mds stability and features as well as
> > integration efforts (NFS/CIFS/Hadoop).
> That's awesome news. The file system component is very important to me.
> 
> >
> > 'Ready' is in the eye of the beholder.  There are a few people using the
> > fs successfully in production, but not too many.
> I'll keep you up to date as our testing builds out! I'm in the process
> of building out a new test cluster.

Great!

sage


* Re: How are you using Ceph?
  2012-09-18 16:20           ` Xiaopong Tran
@ 2012-09-18 17:09             ` John Axel Eriksson
  0 siblings, 0 replies; 29+ messages in thread
From: John Axel Eriksson @ 2012-09-18 17:09 UTC (permalink / raw)
  To: Xiaopong Tran; +Cc: Plaetinck, Dieter, ceph-devel

Hey Xiaopong (is that your first or last name by the way? - sorry for
my ignorance),

I feel your pain, believe me :-). We've had many sleepless nights
salvaging data. We've actually completely migrated off Riak/Luwak by
now and are pretty happy about it. As you say, we've watched the
cluster go down in flames too many times to count. Migrating all the
data to Ceph in particular has been a bit of a pain (not because of
Ceph). And yes, really large files can make the cluster go insane - I
think because there might be a missing piece somewhere.

If you didn't know about it (we didn't): if you want/need to do a
read-repair on the keys in Luwak (which might solve your problems),
you have to go about it in a special way. Normally in "standard" riak
k/v you would simply list all your keys and http GET them. Not so in
Luwak. It is possible to do a read-repair of Luwak, but you must
connect to a different endpoint.

If you've got luwak at the default location, e.g.:

http://some-riak-host:8098/luwak

you would need to list the actual keys (the pieces of the key in Luwak)
like this (with keys=true, or perhaps keys=stream):

curl 'http://some-riak-host:8098/riak/luwak_node?keys=true'

Note above that it's not /luwak but /riak/luwak_node.

Then you would have to loop through all those keys and http GET them like so:

curl 'http://some-riak-host:8098/riak/luwak_node/<thekey>'

These keys will look something like
"9c6f84432c1a164a2fdcda917e6b06f1812f4078fb9bff9fc065ef8b70ae7df21184c014c124842a8aab733f572166ef5c6cef69c5dc4e1e73515ceadc82af99"

What you get from the key stream is actually JSON (you probably knew
that though), so you will also need to massage that data to get a
clean, simple text file with one key per line. A suggestion is also to
loop through maybe 10 000 keys at a time, then sleep for a few minutes
(maybe at least 10 mins) so the cluster has time to actually repair and
gossip about only those 10 000 keys. You don't want it to eat all your
memory ;-).

Needless to say, depending on the amount of data in your cluster, the
repair process (i.e. http GETing all the keys) will take a lot of time.
We had it running for several days.
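
The procedure above could be scripted roughly as follows. This is only a sketch under a few assumptions: that the ?keys=true listing is a JSON object with a "keys" array, that the host is the placeholder some-riak-host used above, and that 10 000 keys per batch with a 10 minute pause is a sensible default. The helper names are made up for illustration, and this is not the script we actually ran.

```python
import json
import time
import urllib.parse
import urllib.request

# Placeholder host from the examples above; adjust for your cluster.
BASE = "http://some-riak-host:8098/riak/luwak_node"


def extract_keys(listing_json):
    """Return the key list from a ?keys=true response (assumed shape: {"keys": [...]})."""
    return json.loads(listing_json).get("keys", [])


def batches(keys, size):
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def http_get(url):
    """GET a URL and discard the body; the read itself is what triggers the repair."""
    with urllib.request.urlopen(url) as resp:
        resp.read()


def read_repair(keys, batch_size=10000, pause_seconds=600,
                fetch=http_get, sleep=time.sleep):
    """GET every key in batches, pausing between batches so the cluster can catch up.

    Returns the number of keys fetched. `fetch` and `sleep` are injectable
    so the loop can be dry-run without touching a cluster.
    """
    fetched = 0
    chunks = list(batches(keys, batch_size))
    for i, chunk in enumerate(chunks):
        for key in chunk:
            fetch(BASE + "/" + urllib.parse.quote(key, safe=""))
            fetched += 1
        if i < len(chunks) - 1:  # no need to sleep after the last batch
            sleep(pause_seconds)
    return fetched
```

To use it, save the output of the key listing to a file, pass its contents to extract_keys(), and hand the result to read_repair(). HTTP errors are not caught here, so a failing GET stops the run; whether to skip or retry such keys is left open.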

Hope this helps and good luck with the migration!

-John

On Tue, Sep 18, 2012 at 6:20 PM, Xiaopong Tran <xiaopong.tran@gmail.com> wrote:
> Excellent write-up. We are exactly in the same mess with
> Riak Luwak, a decision that was made before I took over the
> project. I thought we were the only one :)
>
> We are still paying the price for it, as after over
> a month of migrating the data from Riak to Ceph, we barely
> moved 30% of the data.
>
> When we retrieve a large file from Riak, it sometimes goes
> crazy and can bring down the whole cluster of 10 nodes.
> We keep on adding memory, and this thing does not seem
> to have enough of it. And after one day of usage, it would
> slow to a crawl, and we ended up recycling the cluster once
> a day (sometimes more).
>
> And as you have said, deleting just doesn't work. And lots
> of other issues too.
>
> Then Riak CS was proposed to us. It all looks
> great on paper, however, with the experiences of
> Riak Luwak, which also looked great on paper, we
> wouldn't even dare to consider it.
>
> I can't wait until the day we get rid of it totally.
>
> Best,
>
> Xiaopong
>
>
>
> On 09/18/2012 10:34 PM, John Axel Eriksson wrote:
>>
>> I actually opted to not specifically mention the product we had
>> problems with since there have been lots of changes and fixes to it,
>> which we unfortunately were unable to make use of(you'll know why
>> later). But I guess it's interesting enough to go into a little more
>> detail so... before moving to Ceph we were using the Riak Distributed
>> Database from Basho - http://riak.basho.com.
>>
>> First I have to say that Riak is actually pretty awesome in many ways
>> - not least operations wise. Compared to Ceph it's a lot easier
>> to get up and running and add storage as you go... basically just one
>> command to add a node to the cluster and you only need the address of
>> any other existing node for this. With Riak, every node is the same,
>> so there is no SPOF by default (eg. no MDS, no MON - just nodes).
>>
>> As you might have thought already "Distributed Database isn't exactly
>> the same as Distributed Storage" so why did we use it? Well, there is
>> an add-on to Riak called Luwak, also created and supported by Basho,
>> that is touted as "Large Object Support" where you can store as large
>> objects as you want. I think our main problem was with using this
>> add-on (as I said created and supported by Basho). An object in
>> "standard" riak k/v is limited to... I think around 40 MB, or at least
>> you shouldn't store larger objects than that because it means
>> "trouble". Anyway, we went with Luwak which seemed to be a perfect
>> solution for the type of storage we do.
>>
>> We ran with Luwak for almost two years and usually it served us pretty
>> well. Unfortunately there were bugs and hidden problems which i.m.o
>> Basho should have been more open about. One issue is that Riak is
>> based on a repair mechanism called "read-repair" - that pretty much
>> tells you how it works, data will only be repaired on a read. Now that
>> is a problem in itself when you archive data which we do (eg. not
>> reading it very often or at all).
>>
>> With Luwak(the large-object add-on), data is split into many keys and
>> values and stored in the "normal" riak k/v store... unfortunately
>> read-repair in this scenario doesn't seem to work at all and if
>> something was missing - Riak had a tendency to crash HARD, sometimes
>> managing to take the whole machine with it. There were also strange
>> issues where one crashing node seemed to affect its neighbors so that
>> they also crashed... a domino effect which makes "distributed" a
>> little too "distributed". This didn't always happen but it did happen
>> several times in our case. The logs were often pretty hard to
>> understand and more often than not left us completely in the dark
>> about what was going on.
>>
>> We also discovered that deleting data in Luwak doesn't actually DO
>> anything... sure the key is gone but data is still on disk - seemingly
>> orphaned, so deleting was more or less a noop. This was nowhere to be
>> found in the docs.
>>
>> Finally, I think 3rd of June this year, we requested paid support from
>> Basho to help us in our last crash-and-burn situation and that's when
>> we, among other things, were told about the fact that DELETEing only
>> seems to work. We were also told that Luwak was originally created to
>> store email and not really the types of things we store (e.g. files).
>> This information wasn't available anywhere - Luwak simply had the
>> wrong "table of contents" associated with it. All this was quite a
>> turn-off for us. To Basho's credit they really did help us fix our
>> cluster and they're really nice, friendly and helpful guys.
>>
>> Actually I think the last straw was when Luwak was suddenly - out of
>> nowhere really - discontinued around the beginning of this year,
>> probably because of the bugs and hidden problems that I think may have
>> come from a less than stellar implementation of large-object support
>> from the start... so by then we were on something completely
>> unsupported. We couldn't switch to something else immediately of
>> course but we started looking around for something else at that time.
>> That's when I found Ceph among other more or less distributed systems,
>> where the others were:
>>
>> Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
>> XtreemFS         http://www.xtreemfs.org
>> HDFS             http://hadoop.apache.org/hdfs/
>> GlusterFS        http://www.gluster.org
>> PomegranateFS    https://github.com/macan/Pomegranate/wiki
>> moosefs          http://www.moosefs.org
>> Openstack Swift  http://docs.openstack.org/developer/swift/
>> MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
>> LS4              http://ls4.sourceforge.net/
>>
>> After trying most of these I decided to look closer at a few of them,
>> MooseFS, HDFS, XtreemFS and Ceph - the others were either not really
>> suited for our use case or just too complicated to setup and keep
>> running (i.m.o). For a short while I dabbled in writing my own storage
>> system using zeromq for communication but it's just not what our
>> company does - so I gave that up pretty quickly :-). In the end I
>> chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally but in
>> every other aspect better and definitely a good fit. The Rados
>> Gateway(S3 compat) was really a big thing for us as well.
>>
>> As I started out saying: there have been many improvements to Riak,
>> not least to the large-object support... but that large-object
>> support is not built on Luwak but a completely new thing and it's not
>> open source or free. It's called Riak CS(CS for Cluster Storage I
>> guess) and has an S3 compatible interface and it seems to be pretty
>> good. We had many discussions internally if Riak CS was the right move
>> for us but in the end we decided on Ceph since we couldn't justify the
>> cost of Riak CS.
>>
>> To sum it up: we made, in retrospect, a bad choice - not because Riak
>> itself doesn't work or isn't any good for the things it's good at(it
>> really is!) but because the add-on Luwak was misrepresented and not a
>> good fit for us.
>>
>> I really have high hopes for Ceph and I think it has a bright future
>> in our company and in general. Riak CS would probably have been a very
>> good fit as well if it wasn't for the cost involved.
>>
>> So there you have it - not just failure scenarios but bad decisions,
>> misrepresentation of features and somewhat sparse documentation. By
>> the way, Ceph has improved its docs a lot but still could use some
>> work.
>>
>> -John
>>
>>
>> On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter<dieter@vimeo.com>
>> wrote:
>>>
>>> On Tue, 18 Sep 2012 01:26:03 +0200
>>> John Axel Eriksson<john@insane.se>  wrote:
>>>
>>>> another distributed
>>>> storage solution that had failed us more than once and we lost data.
>>>> Since the old system had an http interface (not S3 compatible though)
>>>
>>>
>>> can you say a bit more about this? failure stories are very interesting
>>> and useful.
>>>
>>> Dieter
>>
>
>


end of thread, other threads:[~2012-09-18 17:09 UTC | newest]

Thread overview: 29+ messages
2012-09-17 22:14 How are you using Ceph? Ross Turk
2012-09-17 22:47 ` Nick Couchman
2012-09-17 22:53   ` Mark Nelson
2012-09-17 23:26     ` John Axel Eriksson
2012-09-18  7:47       ` Plaetinck, Dieter
2012-09-18 14:34         ` John Axel Eriksson
2012-09-18 14:51           ` Plaetinck, Dieter
2012-09-18 14:56             ` Mark Nelson
2012-09-18 15:19               ` Plaetinck, Dieter
2012-09-18 15:27                 ` Mark Nelson
2012-09-18 15:46                   ` Plaetinck, Dieter
2012-09-18 16:20           ` Xiaopong Tran
2012-09-18 17:09             ` John Axel Eriksson
2012-09-18  0:05 ` Smart Weblications GmbH - Florian Wiessner
2012-09-18  0:18   ` Tren Blackburn
2012-09-18  2:32     ` Sage Weil
2012-09-18 11:48       ` Smart Weblications GmbH - Florian Wiessner
2012-09-18 16:20         ` Sage Weil
2012-09-18 16:35       ` Tren Blackburn
2012-09-18 17:00         ` Sage Weil
2012-09-18 16:01 ` Travis Rhoden
  -- strict thread matches above, loose matches on Subject: below --
2012-09-17 23:55 Nick Couchman
2012-09-17 23:57 Nick Couchman
2012-09-18  6:35 ` John Axel Eriksson
     [not found] <1784724793.100.1347938272315.JavaMail.root@thunderbeast.private.linuxbox.com>
2012-09-18  3:19 ` Matt W. Benjamin
2012-09-18  5:44   ` Ian Pye
2012-09-18  6:06     ` Yehuda Sadeh
2012-09-18 16:13   ` Sage Weil
2012-09-18 16:21     ` Matt W. Benjamin
