v0.80 Firefly released

* v0.80 Firefly released
@ 2014-05-07  1:05 Sage Weil
       [not found] ` <alpine.DEB.2.00.1405061757540.28165-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
       [not found] ` <BLU436-SMTP195A770E0A729DF4723DD28DF310@phx.gbl>
  0 siblings, 2 replies; 18+ messages in thread
From: Sage Weil @ 2014-05-07  1:05 UTC (permalink / raw)
  To: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ

We did it!  Firefly v0.80 is built and pushed out to the ceph.com 
repositories.

This release will form the basis for our long-term supported release
Firefly, v0.80.x.  The big new features are support for erasure coding
and cache tiering, although a broad range of other features, fixes,
and improvements have been made across the code base.  Highlights include:

* *Erasure coding*: support for a broad range of erasure codes for lower
  storage overhead and better data durability.
* *Cache tiering*: support for creating 'cache pools' that store hot,
  recently accessed objects with automatic demotion of colder data to
  a base tier.  Typically the cache pool is backed by faster storage
  devices like SSDs.
* *Primary affinity*: Ceph now has the ability to skew selection of
  OSDs as the "primary" copy, which allows the read workload to be
  cheaply skewed away from parts of the cluster without migrating any
  data.
* *Key/value OSD backend* (experimental): An alternative storage backend
  for Ceph OSD processes that puts all data in a key/value database like
  leveldb.  This provides better performance for workloads dominated by
  key/value operations (like radosgw bucket indices).
* *Standalone radosgw* (experimental): The radosgw process can now run
  in a standalone mode without an apache (or similar) web server or
  fastcgi.  This simplifies deployment and can improve performance.

We expect to maintain a series of stable releases based on v0.80
Firefly for as much as a year.  In the meantime, development of Ceph
continues with the next release, Giant, which will feature work on the
CephFS distributed file system, more alternative storage backends
(like RocksDB and f2fs), RDMA support, support for pyramid erasure
codes, and additional functionality in the block device (RBD) like
copy-on-read and multisite mirroring.

This release is the culmination of a huge collective effort by about 100 
different contributors.  Thank you everyone who has helped to make this 
possible!

Upgrade Sequencing
------------------

* If your existing cluster is running a version older than v0.67
  Dumpling, please first upgrade to the latest Dumpling release before
  upgrading to v0.80 Firefly.  Please refer to the :ref:`Dumpling upgrade`
  documentation.

* Upgrade daemons in the following order:

    1. Monitors
    2. OSDs
    3. MDSs and/or radosgw

  If the ceph-mds daemon is restarted first, it will wait until all
  OSDs have been upgraded before finishing its startup sequence.  If
  the ceph-mon daemons are not restarted prior to the ceph-osd
  daemons, they will not correctly register their new capabilities
  with the cluster and new features may not be usable until they are
  restarted a second time.

* Upgrade radosgw daemons together.  There is a subtle change in behavior
  for multipart uploads that prevents a multipart request that was initiated
  with a new radosgw from being completed by an old radosgw.

Notable changes since v0.79
---------------------------

* ceph-fuse, libcephfs: fix several caching bugs (Yan, Zheng)
* ceph-fuse: trim inodes in response to mds memory pressure (Yan, Zheng)
* librados: fix inconsistencies in API error values (David Zafman)
* librados: fix watch operations with cache pools (Sage Weil)
* librados: new snap rollback operation (David Zafman)
* mds: fix respawn (John Spray)
* mds: misc bugs (Yan, Zheng)
* mds: misc multi-mds fixes (Yan, Zheng)
* mds: use shared_ptr for requests (Greg Farnum)
* mon: fix peer feature checks (Sage Weil)
* mon: require 'x' mon caps for auth operations (Joao Luis)
* mon: shutdown when removed from mon cluster (Joao Luis)
* msgr: fix locking bug in authentication (Josh Durgin)
* osd: fix bug in journal replay/restart (Sage Weil)
* osd: many many many bug fixes with cache tiering (Samuel Just)
* osd: track omap and hit_set objects in pg stats (Samuel Just)
* osd: warn if agent cannot enable due to invalid (post-split) stats (Sage Weil)
* rados bench: track metadata for multiple runs separately (Guang Yang)
* rgw: fixed subuser modify (Yehuda Sadeh)
* rpm: fix redhat-lsb dependency (Sage Weil, Alfredo Deza)

For the complete release notes, please see:

   http://ceph.com/docs/master/release-notes/#v0-80-firefly

Getting Ceph
------------

* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.80.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy

* Re: v0.80 Firefly released
       [not found] ` <alpine.DEB.2.00.1405061757540.28165-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2014-05-07 15:44   ` Dan van der Ster
       [not found]     ` <536A54C6.4060202-vJEk5272eHo@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Dan van der Ster @ 2014-05-07 15:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ


Hi,

Sage Weil wrote:
> * *Primary affinity*: Ceph now has the ability to skew selection of
>    OSDs as the "primary" copy, which allows the read workload to be
>    cheaply skewed away from parts of the cluster without migrating any
>    data.

Can you please elaborate a bit on this one? I found the blueprint [1] 
but still don't quite understand how it works. Does this only change the 
CRUSH calculation for reads? I.e., do writes still go to the usual primary 
while reads are distributed across the replicas? If so, does this change 
the consistency model in any way?

Cheers, Dan



[1] 
http://wiki.ceph.com/Planning/Blueprints/Firefly/osdmap%3A_primary_role_affinity


* Re: v0.80 Firefly released
       [not found]     ` <536A54C6.4060202-vJEk5272eHo@public.gmane.org>
@ 2014-05-07 15:51       ` Sage Weil
  2014-05-07 15:53       ` Gregory Farnum
  1 sibling, 0 replies; 18+ messages in thread
From: Sage Weil @ 2014-05-07 15:51 UTC (permalink / raw)
  To: Dan van der Ster
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ

On Wed, 7 May 2014, Dan van der Ster wrote:
> Hi,
> 
> Sage Weil wrote:
> 
> * *Primary affinity*: Ceph now has the ability to skew selection of
>   OSDs as the "primary" copy, which allows the read workload to be
>   cheaply skewed away from parts of the cluster without migrating any
>   data.
> 
> 
> Can you please elaborate a bit on this one? I found the blueprint [1] but
> still don't quite understand how it works. Does this only change the crush
> calculation for reads? i.e writes still go to the usual primary, but reads
> are distributed across the replicas? If so, does this change the consistency
> model in any way.

It basically just skews the choice of which replica is the primary.  No 
data has to move, but the read workload and write overhead associated with 
being the primary (driving recovery and forwarding writes) is diverted 
away from the nodes whose 'affinity' is reduced from the default/baseline.

sage

* Re: v0.80 Firefly released
       [not found]     ` <536A54C6.4060202-vJEk5272eHo@public.gmane.org>
  2014-05-07 15:51       ` Sage Weil
@ 2014-05-07 15:53       ` Gregory Farnum
  2014-05-07 18:18         ` [ceph-users] " Mike Dawson
  1 sibling, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2014-05-07 15:53 UTC (permalink / raw)
  To: Dan van der Ster
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ceph-users

On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster
<daniel.vanderster-vJEk5272eHo@public.gmane.org> wrote:
> Hi,
>
>
> Sage Weil wrote:
>
> * *Primary affinity*: Ceph now has the ability to skew selection of
>   OSDs as the "primary" copy, which allows the read workload to be
>   cheaply skewed away from parts of the cluster without migrating any
>   data.
>
>
> Can you please elaborate a bit on this one? I found the blueprint [1] but
> still don't quite understand how it works. Does this only change the crush
> calculation for reads? i.e writes still go to the usual primary, but reads
> are distributed across the replicas? If so, does this change the consistency
> model in any way.

It changes the calculation of who becomes the primary, and that
primary serves both reads and writes. In slightly more depth:
Previously, the primary has always been the first OSD chosen as a
member of the PG.
For erasure coding, we added the ability to specify a primary
independent of the selection ordering. This was part of a broad set of
changes to prevent moving the EC "shards" around between different
members of the PG, and means that the primary might be the second OSD
in the PG, or the fourth.
Once this work existed, we realized that it might be useful in other
cases, because primaries get more of the work for their PG (serving
all reads, coordinating writes).
So we added the ability to specify a "primary affinity", which is like
the CRUSH weights but only impacts whether you become the primary. So
if you have 3 OSDs that each have primary affinity = 1, it will behave
as normal. If two have primary affinity = 0, the remaining OSD will be
the primary. Etc.
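
A minimal sketch of the behaviour described above, not Ceph's actual code: walk the PG members in selection order and let each one accept the primary role with probability equal to its affinity. The function name `pick_primary`, the SHA-256 based draw, and the fallback to the first member when everyone declines are assumptions made for illustration only.

```python
import hashlib


def pick_primary(pg_id, acting, affinity):
    """Return the OSD chosen as primary for one PG (illustrative only).

    acting   -- OSD ids in selection order (the first one is the legacy primary)
    affinity -- dict of OSD id -> primary affinity in [0.0, 1.0]; missing means 1.0
    """
    for osd in acting:
        a = affinity.get(osd, 1.0)
        # Deterministic pseudo-random draw in [0, 1) per (PG, OSD) pair, so that
        # everyone computing from the same map reaches the same answer.
        # The hash choice is an assumption, not what Ceph uses.
        digest = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        draw = int.from_bytes(digest[:8], "big") / 2**64
        if draw < a:
            return osd
    # Every candidate declined (for example all affinities are 0):
    # fall back to the first member so the PG still has a primary.
    return acting[0]


acting = [0, 1, 2]
print(pick_primary("4.1f", acting, {}))                # 0 (all affinities 1: behaves as normal)
print(pick_primary("4.1f", acting, {0: 0.0, 1: 0.0}))  # 2 (the only OSD with non-zero affinity)
```

With a fractional value such as 0.5, an OSD in this sketch keeps the primary role for roughly half of the PGs where it would otherwise have been first.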
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

* Re: [ceph-users] v0.80 Firefly released
  2014-05-07 15:53       ` Gregory Farnum
@ 2014-05-07 18:18         ` Mike Dawson
  2014-05-07 18:30           ` Gregory Farnum
  2014-05-08 12:20           ` Andrey Korolyov
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Dawson @ 2014-05-07 18:18 UTC (permalink / raw)
  To: Gregory Farnum, Dan van der Ster; +Cc: ceph-devel@vger.kernel.org, ceph-users


On 5/7/2014 11:53 AM, Gregory Farnum wrote:
> On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster
> <daniel.vanderster@cern.ch> wrote:
>> Hi,
>>
>>
>> Sage Weil wrote:
>>
>> * *Primary affinity*: Ceph now has the ability to skew selection of
>>    OSDs as the "primary" copy, which allows the read workload to be
>>    cheaply skewed away from parts of the cluster without migrating any
>>    data.
>>
>>
>> Can you please elaborate a bit on this one? I found the blueprint [1] but
>> still don't quite understand how it works. Does this only change the crush
>> calculation for reads? i.e writes still go to the usual primary, but reads
>> are distributed across the replicas? If so, does this change the consistency
>> model in any way.
>
> It changes the calculation of who becomes the primary, and that
> primary serves both reads and writes. In slightly more depth:
> Previously, the primary has always been the first OSD chosen as a
> member of the PG.
> For erasure coding, we added the ability to specify a primary
> independent of the selection ordering. This was part of a broad set of
> changes to prevent moving the EC "shards" around between different
> members of the PG, and means that the primary might be the second OSD
> in the PG, or the fourth.
> Once this work existed, we realized that it might be useful in other
> cases, because primaries get more of the work for their PG (serving
> all reads, coordinating writes).
> So we added the ability to specify a "primary affinity", which is like
> the CRUSH weights but only impacts whether you become the primary. So
> if you have 3 OSDs that each have primary affinity = 1, it will behave
> as normal. If two have primary affinity = 0, the remaining OSD will be
> the primary. Etc.

Is it possible (and/or advisable) to set primary affinity low while 
backfilling / recovering an OSD in an effort to prevent unnecessary slow 
reads that could be directed to less busy replicas? I suppose if the 
cost of setting/unsetting primary affinity is low and clients are 
starved for reads during backfill/recovery from the osd in question, it 
could be a win.

Perhaps the workflow for maintenance on osd.0 would be something like:

- Stop osd.0, do some maintenance on osd.0
- Read primary affinity of osd.0, store it for later
- Set primary affinity on osd.0 to 0
- Start osd.0
- Enjoy a better backfill/recovery experience. RBD clients happier.
- Reset primary affinity on osd.0 to previous value

If the cost of setting primary affinity is low enough, perhaps this 
strategy could be automated by the ceph daemons.

Thanks,
Mike Dawson


* Re: [ceph-users] v0.80 Firefly released
  2014-05-07 18:18         ` [ceph-users] " Mike Dawson
@ 2014-05-07 18:30           ` Gregory Farnum
  2014-05-08 12:20           ` Andrey Korolyov
  1 sibling, 0 replies; 18+ messages in thread
From: Gregory Farnum @ 2014-05-07 18:30 UTC (permalink / raw)
  To: Mike Dawson; +Cc: Dan van der Ster, ceph-devel@vger.kernel.org, ceph-users

On Wed, May 7, 2014 at 11:18 AM, Mike Dawson <mike.dawson@cloudapt.com> wrote:
>
> On 5/7/2014 11:53 AM, Gregory Farnum wrote:
>>
>> On Wed, May 7, 2014 at 8:44 AM, Dan van der Ster
>> <daniel.vanderster@cern.ch> wrote:
>>>
>>> Hi,
>>>
>>>
>>> Sage Weil wrote:
>>>
>>> * *Primary affinity*: Ceph now has the ability to skew selection of
>>>    OSDs as the "primary" copy, which allows the read workload to be
>>>    cheaply skewed away from parts of the cluster without migrating any
>>>    data.
>>>
>>>
>>> Can you please elaborate a bit on this one? I found the blueprint [1] but
>>> still don't quite understand how it works. Does this only change the
>>> crush
>>> calculation for reads? i.e writes still go to the usual primary, but
>>> reads
>>> are distributed across the replicas? If so, does this change the
>>> consistency
>>> model in any way.
>>
>>
>> It changes the calculation of who becomes the primary, and that
>> primary serves both reads and writes. In slightly more depth:
>> Previously, the primary has always been the first OSD chosen as a
>> member of the PG.
>> For erasure coding, we added the ability to specify a primary
>> independent of the selection ordering. This was part of a broad set of
>> changes to prevent moving the EC "shards" around between different
>> members of the PG, and means that the primary might be the second OSD
>> in the PG, or the fourth.
>> Once this work existed, we realized that it might be useful in other
>> cases, because primaries get more of the work for their PG (serving
>> all reads, coordinating writes).
>> So we added the ability to specify a "primary affinity", which is like
>> the CRUSH weights but only impacts whether you become the primary. So
>> if you have 3 OSDs that each have primary affinity = 1, it will behave
>> as normal. If two have primary affinity = 0, the remaining OSD will be
>> the primary. Etc.
>
>
> Is it possible (and/or advisable) to set primary affinity low while
> backfilling / recovering an OSD in an effort to prevent unnecessary slow
> reads that could be directed to less busy replicas?

I have no experimental data and haven't thought about it in the past,
but that sounds like it might be helpful, yeah!
Your clients will need to support this feature, so if you're using
kernel clients you need a very new kernel (I don't remember exactly
which one).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


* Re: [ceph-users] v0.80 Firefly released
  2014-05-07 18:18         ` [ceph-users] " Mike Dawson
  2014-05-07 18:30           ` Gregory Farnum
@ 2014-05-08 12:20           ` Andrey Korolyov
  2014-05-09 21:48             ` Mike Dawson
  1 sibling, 1 reply; 18+ messages in thread
From: Andrey Korolyov @ 2014-05-08 12:20 UTC (permalink / raw)
  To: Mike Dawson
  Cc: Gregory Farnum, Dan van der Ster, ceph-devel@vger.kernel.org,
	ceph-users

Mike, would you mind writing up your experience if you manage to get
this flow working first? I probably won't be able to run any tests on
0.80 until next week, including maintenance combined with primary
relocation, which is one of the most crucial things remaining in
Ceph for production performance.


* Re: [ceph-users] v0.80 Firefly released
  2014-05-08 12:20           ` Andrey Korolyov
@ 2014-05-09 21:48             ` Mike Dawson
       [not found]               ` <536D4D48.5040307-ffsCFlcjuZBWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Dawson @ 2014-05-09 21:48 UTC (permalink / raw)
  To: Andrey Korolyov
  Cc: Gregory Farnum, Dan van der Ster, ceph-devel@vger.kernel.org,
	ceph-users

Andrey,

In initial testing, it looks like it may work rather efficiently.

1) Upgrade all mon, osd, and clients to Firefly. Restart everything so 
no legacy ceph code is running.


2) Add "mon osd allow primary affinity = true" to ceph.conf, distribute 
ceph.conf to nodes.


3) Inject it into the monitors to make it immediately active:

# ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity true'

Ignore the "mon.a: injectargs: failed to parse arguments: true" 
warnings, this appears to be a bug [0].


4) Check to see how many PGs have OSD 0 as their primary:

ceph pg dump | awk '{ print $15 " " $14 " " $1}' | egrep "^0" | wc -l


5) Set primary affinity to zero on osd.0:

# ceph osd primary-affinity osd.0 0

If you didn't set mon_osd_allow_primary_affinity properly above, you'll 
get a helpful error message.


6) Confirm it worked by comparing how many PGs have osd.0 as their primary.

ceph pg dump | awk '{ print $15 }' | egrep "^0" | wc -l

On my small dev cluster, the number goes to 0 in less than 10 seconds.


7) Perform maintenance and watch ceph -w. If you didn't get all your 
clients updated, you'll likely see a bunch of errors in ceph -w like:

2014-05-09 21:12:42.534900 osd.0 [WRN] client.130959 x.x.x.x:0/1015056 
misdirected client.130959.0:619497 pg 4.90eaebe to osd.0 not [6,1,0] in 
e1650/1650

8) After you are done with maintenance, reset the primary affinity:

# ceph osd primary-affinity osd.0 1


I have not scaled up my testing, but it looks like this has the 
potential to work well in preventing unnecessary read starvation in 
certain situations.
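
The set/maintain/reset part of the steps above (5, 7 and 8) could be wrapped so that the affinity is always restored, even if the maintenance step fails. This is only a sketch: the helper names `run_ceph` and `primary_affinity_lowered` are hypothetical, the previous affinity is passed in by the caller rather than read from the cluster, and it assumes `mon osd allow primary affinity` has already been enabled as in steps 2 and 3. The only CLI call used is the `ceph osd primary-affinity` command shown above.

```python
import subprocess
from contextlib import contextmanager


def run_ceph(args):
    """Run one ceph CLI command; raises CalledProcessError if it fails."""
    subprocess.run(["ceph", *args], check=True)


@contextmanager
def primary_affinity_lowered(osd, restore_to=1.0, run=run_ceph):
    """Set primary affinity to 0 for the duration of the block, then restore it.

    restore_to is the value to put back afterwards; the caller must supply it,
    since this sketch does not read the current value from the cluster.
    """
    run(["osd", "primary-affinity", osd, "0"])
    try:
        yield
    finally:
        # Restore the previous value even if the maintenance step raised.
        run(["osd", "primary-affinity", osd, f"{restore_to:g}"])


# Dry run: record the commands instead of executing them.
issued = []
with primary_affinity_lowered("osd.0", run=issued.append):
    pass  # restart the OSD and wait for backfill/recovery here
for args in issued:
    print("ceph", *args)
```

Dropping the `run=issued.append` argument makes the same block call the real `ceph` binary.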


0: http://tracker.ceph.com/issues/8323#note-1


Cheers,
Mike Dawson


* Re: v0.80 Firefly released
       [not found]               ` <536D4D48.5040307-ffsCFlcjuZBWk0Htik3J/w@public.gmane.org>
@ 2014-05-11 12:32                 ` Sergey Malinin
  0 siblings, 0 replies; 18+ messages in thread
From: Sergey Malinin @ 2014-05-11 12:32 UTC (permalink / raw)
  To: Mike Dawson; +Cc: ceph-devel@vger.kernel.org, ceph-users


> 
> # ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity true'
> 
> Ignore the "mon.a: injectargs: failed to parse arguments: true" 
> warnings, this appears to be a bug [0].
> 
> 

It will work this way: 
ceph tell mon.* injectargs -- --mon_osd_allow_primary_affinity=true



* Re: Radosgw - bucket index
       [not found] ` <BLU436-SMTP195A770E0A729DF4723DD28DF310@phx.gbl>
@ 2014-05-16 13:09   ` Sage Weil
  2014-05-16 16:42     ` Yehuda Sadeh
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2014-05-16 13:09 UTC (permalink / raw)
  To: Guang; +Cc: Yehuda Sadeh, haomaiwang, Ceph-devel

Hi Guang,

[I think the problem is that your email is HTML formatted, and vger 
silently drops those.  Make sure your mailer is set to plain text mode.]

On Fri, 16 May 2014, Guang wrote:

>       * *Key/value OSD backend* (experimental): An alternative storage
>         backend for Ceph OSD processes that puts all data in a key/value
>         database like leveldb.  This provides better performance for
>         workloads dominated by key/value operations (like radosgw bucket
>         indices).
> 
> Hi Yehuda and Haomai,
> 
> I managed to set up a K/V store backend and played around with it. As Sage
> mentioned in the release notes, I thought the K/V store could be the solution
> for radosgw's bucket indexing feature, which currently has scaling problems
> [1]. However, after playing around with the K/V store and understanding the
> requirements for bucket indexing, I think that, at least for now, there is
> still a gap in fixing bucket indexing by leveraging the K/V store.
> 
> In my opinion, one requirement (API) for implementing bucket indexing is
> support for an ordered scan (prefix filter), which is not part of the rados
> API. The K/V store does not extend the rados API (it is not supposed to); it
> only changes the underlying object store strategy. So it is not likely to
> help with bucket indexing, unless we keep the original approach of using omap
> to store the bucket index, with each bucket corresponding to one object.

The rados omap API does allow a prefix filter, although it's somewhat 
implicit:

    /**
     * omap_get_keys: keys from the object omap
     *
     * Get up to max_return keys beginning after start_after
     *
     * @param start_after [in] list keys starting after start_after
     * @param max_return [in] list no more than max_return keys
     * @param out_keys [out] place returned values in out_keys on completion
     * @param prval [out] place error code in prval upon completion
     */
    void omap_get_keys(const std::string &start_after,
                       uint64_t max_return,
                       std::set<std::string> *out_keys,
                       int *prval);

Since all keys are sorted alphanumerically, you simply have to set 
start_after == your prefix, and start ignoring the results once you get a 
key that does not begin with your prefix.  This could be improved by having 
an explicit prefix argument that does this server-side, but for now you 
can at least get the right data (plus a bit extra at the end).
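
The client-side prefix scan described above can be sketched as follows. This is a stand-alone simulation, not librados: the Python `omap_get_keys` here is a stand-in over an in-memory sorted list that only mimics the "up to max_return keys after start_after" contract of the call quoted above, and `keys_with_prefix` and the sample key names are hypothetical.

```python
from bisect import bisect_right


def omap_get_keys(sorted_keys, start_after, max_return):
    """Stand-in for the quoted call: up to max_return keys strictly after start_after."""
    i = bisect_right(sorted_keys, start_after)
    return sorted_keys[i:i + max_return]


def keys_with_prefix(sorted_keys, prefix, page_size=2):
    """Collect every key that begins with prefix, one page at a time.

    Note: a key exactly equal to the prefix is not returned, because the
    scan starts strictly after start_after.
    """
    found = []
    start_after = prefix
    while True:
        page = omap_get_keys(sorted_keys, start_after, page_size)
        if not page:
            return found
        for key in page:
            if not key.startswith(prefix):
                # Keys are sorted, so nothing later can match either.
                return found
            found.append(key)
        start_after = page[-1]


keys = sorted(["logs/1", "photos/a.jpg", "photos/b.jpg", "photos/c.jpg", "videos/x"])
print(keys_with_prefix(keys, "photos/"))
# ['photos/a.jpg', 'photos/b.jpg', 'photos/c.jpg']
```

The last page fetched may contain a few keys past the prefix range, which is the "bit extra at the end" that a server-side prefix argument would avoid.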

Is that what you mean by prefix scan, or are you referring to the ability 
to scan for rados objects that begin with a prefix?  If it's the latter, 
you are right: objects are hashed across nodes and there is no sorted 
object name index to allow prefix filtering.  There is a list_objects 
filter option, but it is still O(objects in the pool).

> Did I miss anything obvious here?
> 
> We are very interested in the effort to improve the scalability of the bucket
> index [1], as the blueprint mentions. Here are my thoughts on top of that:
>  1. It would be nice if we could refactor the interface so that it is easy to
> switch to a different underlying storage system for bucket indexing. For
> example, DynamoDB seems to be used for S3's implementation [2], and Swift
> uses sqlite [3] and has a flat namespace for listing purposes (with prefix
> and delimiter).

radosgw is using the omap key/value API for objects, which is more or less 
equivalent to what swift is doing with sqlite.  This data passes straight 
into leveldb on the backend (or whatever other backend you are using).  
Using something like rocksdb in its place is pretty simple and there are
unmerged patches to do that; the user would just need to adjust their 
crush map so that the rgw index pool is mapped to a different set of OSDs 
with the better k/v backend.

>  2. As mentioned in the blueprint, if we go with sharding the bucket index
> object, what is the design choice? Are we going to maintain a B-tree
> structure to keep all keys sorted and shard on demand, for example by
> having a background thread do the sharding when the index reaches a
> certain threshold?

I don't know... I'm sure Yehuda has a more well-formed opinion on this.  I 
suspect something simpler than a B tree (like a single-level hash-based 
fan out) would be sufficient, although you'd pay a bit of a price for 
object enumeration.

sage



> 
> [1] https://wiki.ceph.com/Planning/Sideboard/rgw%3A_bucket_index_scalability
> [2] http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
> [3] https://swiftstack.com/openstack-swift/architecture/
> 
> Thanks,
> Guang
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Radosgw - bucket index
  2014-05-16 13:09   ` Radosgw - bucket index Sage Weil
@ 2014-05-16 16:42     ` Yehuda Sadeh
  2014-05-18  6:25       ` Guang
  0 siblings, 1 reply; 18+ messages in thread
From: Yehuda Sadeh @ 2014-05-16 16:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: Guang, haomaiwang, Ceph-devel

On Fri, May 16, 2014 at 6:09 AM, Sage Weil <sage@inktank.com> wrote:
> Hi Guang,
>
> [I think the problem is that your email is HTML formatted, and vger
> silently drops those.  Make sure your mailer is set to plain text mode.]
>
> On Fri, 16 May 2014, Guang wrote:
>
>>       * *Key/value OSD backend* (experimental): An alternative storage
>>       backend
>>        for Ceph OSD processes that puts all data in a key/value
>>       database like
>>        leveldb.  This provides better performance for workloads
>>       dominated by
>>        key/value operations (like radosgw bucket indices).
>>
>> Hi Yehuda and Haomai,I managed to set up a K/V store backend and played
>> around with it, as Sage mentioned in the release note, I thought K/V store
>> could be the solution for radosgw's bucket indexing feature which currently
>> has scaling problems [1], however, after playing around with K/V store and
>> understanding the requirement for bucket indexing, I think at least for now
>> there is still gap to fix the bucket indexing by leveraging the K/V store.
>>
>> In my opinion, one requirement (API) to implement bucket indexing is to
>> support ordered scan (prefix filter), which is not part of the API of rados,
>> and as K/V store does not extend the rados API (it is not supposed to) but
>> only  change the underlying object store strategy. It is not likely to help
>> for the bucket indexing, except that we use the original way using omap to
>> store bucket indexing and each bucket corresponds to one object.
>
> The rados omap API does allow a prefix filter, although it's somewhat
> implicit:
>
>     /**
>      * omap_get_keys: keys from the object omap
>      *
>      * Get up to max_return keys beginning after start_after
>      *
>      * @param start_after [in] list keys starting after start_after
>      * @parem max_return [in] list no more than max_return keys
>      * @param out_keys [out] place returned values in out_keys on completion
>      * @param prval [out] place error code in prval upon completion
>      */
>     void omap_get_keys(const std::string &start_after,
>                        uint64_t max_return,
>                        std::set<std::string> *out_keys,
>                        int *prval);
>
> Since all keys are sorted alphanumerically, you simply have to set
> start_after == your prefix, and start ignoring the results once you get a
> key that does not contain your prefix.  This could be improved by having
> an explicit prefix argument that does this server-side, but for now at you
> can get the right data (plus a bit a extra at the end).
>
> Is that what you mean by prefix scan, or are you referring to the ability
> to scan for rados objects that begin with a prefix?  If it's the latter,
> you are right: objects are hashed across nodes and there is no sorted
> object name index to allow prefix filtering.  There is a list_objects
> filter option, but it is still O(objects in the pool).
>
>> Did I miss anything obvious here?
>>
>> We are very interested in the effort to improve the scalability of bucket
>> index [1] as the blueprint mentioned, here is my thoughts on top of this:
>>  1. It would be nice we can refactor the interface so that it is easy to
>> switch to a different underlying storage system for bucket indexing, for
>> example, DynamoDB seems like being used for S3's implementation [2], and SWIFT
>> uses sqllite [1] and has a flat namespace for listing purpose (with prefix
>> and delimiter).
>
> radosgw is using the omap key/value API for objects, which is more or less
> equivalent to what swift is doing with sqlite.  This data passes straight
> into leveldb on the backend (or whatever other backend you are using).
> Using something like rocksdb in its place is pretty simple and ther are
> unmerged patches to do that; the user would just need to adjust their
> crush map so that the rgw index pool is mapped to a different set of OSDs
> with the better k/v backend.
>
>>  2. As mentioned in the blueprint, if we go with the approach to do sharding
>> for the bucket index object, what is the design choice? Are we going to
>> maintain a B- tree structure get all keys sorted and sharidng on demand,
>> like having a background thread do the sharding when it reaches a certain
>> threshold?
>
> I don't know... I'm sure Yehuda has a more well-formed opinion on this.  I
> suspect something simpler than a B tree (like a single-level hash-based
> fan out) would be sufficient, although you'd pay a bit of a price for
> object enumeration.
>

My more well-formed opinion is that we need to come up with a good
design. It needs to be flexible enough to be able to grow (and maybe
shrink), and I assume there would be some kind of background operation
that will enable that. I also believe that making it hash based is the
way to go. It looks like the more complicated issue here is how to
handle the transition in which we shard buckets.

Yehuda

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Radosgw - bucket index
  2014-05-16 16:42     ` Yehuda Sadeh
@ 2014-05-18  6:25       ` Guang
  2014-05-18 23:05         ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Guang @ 2014-05-18  6:25 UTC (permalink / raw)
  To: Yehuda Sadeh, Sage Weil; +Cc: haomaiwang, Ceph-devel

Thanks Sage and Yehuda.
On May 17, 2014, at 12:42 AM, Yehuda Sadeh <yehuda@inktank.com> wrote:

> On Fri, May 16, 2014 at 6:09 AM, Sage Weil <sage@inktank.com> wrote:
>> Hi Guang,
>> 
>> [I think the problem is that your email is HTML formatted, and vger
>> silently drops those.  Make sure your mailer is set to plain text mode.]
Yeah, thanks Sage! My Yahoo account failed due to a change on Yahoo!'s side [1], and my Outlook account failed because of the HTML format; I have changed it to plain text.
[1]  http://thehackernews.com/2014/04/yahoos-new-dmarc-policy-destroys-every.html
>> 
>> On Fri, 16 May 2014, Guang wrote:
>> 
>>>      * *Key/value OSD backend* (experimental): An alternative storage
>>>      backend
>>>       for Ceph OSD processes that puts all data in a key/value
>>>      database like
>>>       leveldb.  This provides better performance for workloads
>>>      dominated by
>>>       key/value operations (like radosgw bucket indices).
>>> 
>>> Hi Yehuda and Haomai,I managed to set up a K/V store backend and played
>>> around with it, as Sage mentioned in the release note, I thought K/V store
>>> could be the solution for radosgw's bucket indexing feature which currently
>>> has scaling problems [1], however, after playing around with K/V store and
>>> understanding the requirement for bucket indexing, I think at least for now
>>> there is still gap to fix the bucket indexing by leveraging the K/V store.
>>> 
>>> In my opinion, one requirement (API) to implement bucket indexing is to
>>> support ordered scan (prefix filter), which is not part of the API of rados,
>>> and as K/V store does not extend the rados API (it is not supposed to) but
>>> only  change the underlying object store strategy. It is not likely to help
>>> for the bucket indexing, except that we use the original way using omap to
>>> store bucket indexing and each bucket corresponds to one object.
>> 
>> The rados omap API does allow a prefix filter, although it's somewhat
>> implicit:
>> 
>>    /**
>>     * omap_get_keys: keys from the object omap
>>     *
>>     * Get up to max_return keys beginning after start_after
>>     *
>>     * @param start_after [in] list keys starting after start_after
>>     * @parem max_return [in] list no more than max_return keys
>>     * @param out_keys [out] place returned values in out_keys on completion
>>     * @param prval [out] place error code in prval upon completion
>>     */
>>    void omap_get_keys(const std::string &start_after,
>>                       uint64_t max_return,
>>                       std::set<std::string> *out_keys,
>>                       int *prval);
>> 
>> Since all keys are sorted alphanumerically, you simply have to set
>> start_after == your prefix, and start ignoring the results once you get a
>> key that does not contain your prefix.  This could be improved by having
>> an explicit prefix argument that does this server-side, but for now at you
>> can get the right data (plus a bit a extra at the end).
I think this is the API currently used to implement bucket indexing, and it operates on a per-object basis, which limits scalability: for example, two requests updating the same index object have to be serialized on the OSD side.
>> 
>> Is that what you mean by prefix scan, or are you referring to the ability
>> to scan for rados objects that begin with a prefix?  If it's the latter,
>> you are right: objects are hashed across nodes and there is no sorted
>> object name index to allow prefix filtering.  There is a list_objects
>> filter option, but it is still O(objects in the pool).
By prefix scan, I was referring to a rados object-level API (so that we could leverage the new K/V store to improve scalability: different bucket index entries would refer to different rados objects, which makes the index scalable). As no such API exists, we are not likely to solve the bucket indexing issue simply by leveraging the K/V store backend.
>> 
>>> Did I miss anything obvious here?
>>> 
>>> We are very interested in the effort to improve the scalability of bucket
>>> index [1] as the blueprint mentioned, here is my thoughts on top of this:
>>> 1. It would be nice we can refactor the interface so that it is easy to
>>> switch to a different underlying storage system for bucket indexing, for
>>> example, DynamoDB seems like being used for S3's implementation [2], and SWIFT
>>> uses sqllite [1] and has a flat namespace for listing purpose (with prefix
>>> and delimiter).
>> 
>> radosgw is using the omap key/value API for objects, which is more or less
>> equivalent to what swift is doing with sqlite.  This data passes straight
>> into leveldb on the backend (or whatever other backend you are using).
>> Using something like rocksdb in its place is pretty simple and ther are
>> unmerged patches to do that; the user would just need to adjust their
>> crush map so that the rgw index pool is mapped to a different set of OSDs
>> with the better k/v backend.
Not sure if I am missing anything, but the key difference from SWIFT’s implementation is that they use a table for the bucket index, and it can actually be updated in parallel, which makes it more scalable for writes, though at a certain point the SQL table would suffer performance degradation as well.
>> 
>>> 2. As mentioned in the blueprint, if we go with the approach to do sharding
>>> for the bucket index object, what is the design choice? Are we going to
>>> maintain a B- tree structure get all keys sorted and sharidng on demand,
>>> like having a background thread do the sharding when it reaches a certain
>>> threshold?
>> 
>> I don't know... I'm sure Yehuda has a more well-formed opinion on this.  I
>> suspect something simpler than a B tree (like a single-level hash-based
>> fan out) would be sufficient, although you'd pay a bit of a price for
>> object enumeration.
>> 
> 
> My more well-formed opinion is that we need to come up with a good
> design. It needs to be flexible enough to be able to grow (and maybe
> shrink), and I assume there would be some kind of background operation
> that will enable that. I also believe that making it hash based is the
> way to go. It looks like that the more complicated issue is here is
> how to handle the transition in which we shard buckets.
Yeah, I agree. I think the conflicting goals here are that we want a sorted list (so that it enables prefix scans for listing) and we want to shard from the very beginning (the problem we are facing is that parallel writes updating the same bucket index object need to be serialized).
> Yehuda
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Radosgw - bucket index
  2014-05-18  6:25       ` Guang
@ 2014-05-18 23:05         ` Sage Weil
  2014-05-19  6:18           ` Guang Yang
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2014-05-18 23:05 UTC (permalink / raw)
  To: Guang; +Cc: Yehuda Sadeh, haomaiwang, Ceph-devel

On Sun, 18 May 2014, Guang wrote:
> >> radosgw is using the omap key/value API for objects, which is more or less
> >> equivalent to what swift is doing with sqlite.  This data passes straight
> >> into leveldb on the backend (or whatever other backend you are using).
> >> Using something like rocksdb in its place is pretty simple and ther are
> >> unmerged patches to do that; the user would just need to adjust their
> >> crush map so that the rgw index pool is mapped to a different set of OSDs
> >> with the better k/v backend.
> Not sure if I miss anything, but the key difference with SWIFT's
> implementation is that they are using a table for bucket index and it 
> actually can be updated in parallel which makes more scalable for write, 
> though at certain point the sql table would result in performance 
> degradation as well.

As I understand it the same limitation is present there too: the index is 
in a single sqlite table.

> > My more well-formed opinion is that we need to come up with a good
> > design. It needs to be flexible enough to be able to grow (and maybe
> > shrink), and I assume there would be some kind of background operation
> > that will enable that. I also believe that making it hash based is the
> > way to go. It looks like that the more complicated issue is here is
> > how to handle the transition in which we shard buckets.
> Yeah I agree. I think the conflicting goals here are, we want a sorted 
> list (so that it enable prefix scan for listing purpose) and we want to 
> shard at the very beginning (the problem we are facing is parallel 
> writes updating the same bucket index object will need to be 
> serialized).

Given how infrequent container listings are, pre-sharding containers 
across several objects makes some sense.  Paying the cost of doing 
listings in parallel across N (where N is not too big) is not a big price 
to pay. However, there will always need to be a way to re-shard further 
when containers/buckets get extremely big.  Perhaps a starting point would 
be support for static sharding where the number of shards is specified at 
container/bucket creation time...

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Radosgw - bucket index
  2014-05-18 23:05         ` Sage Weil
@ 2014-05-19  6:18           ` Guang Yang
  2014-05-19  6:47             ` Yehuda Sadeh
  0 siblings, 1 reply; 18+ messages in thread
From: Guang Yang @ 2014-05-19  6:18 UTC (permalink / raw)
  To: Sage Weil, Yehuda Sadeh; +Cc: haomaiwang, Ceph-devel

On May 19, 2014, at 7:05 AM, Sage Weil <sage@inktank.com> wrote:

> On Sun, 18 May 2014, Guang wrote:
>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>> into leveldb on the backend (or whatever other backend you are using).
>>>> Using something like rocksdb in its place is pretty simple and ther are
>>>> unmerged patches to do that; the user would just need to adjust their
>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>> with the better k/v backend.
>> Not sure if I miss anything, but the key difference with SWIFT's
>> implementation is that they are using a table for bucket index and it 
>> actually can be updated in parallel which makes more scalable for write, 
>> though at certain point the sql table would result in performance 
>> degradation as well.
> 
> As I understand it the same limitation is present there too: the index is 
> in a single sqlite table.
> 
>>> My more well-formed opinion is that we need to come up with a good
>>> design. It needs to be flexible enough to be able to grow (and maybe
>>> shrink), and I assume there would be some kind of background operation
>>> that will enable that. I also believe that making it hash based is the
>>> way to go. It looks like that the more complicated issue is here is
>>> how to handle the transition in which we shard buckets.
>> Yeah I agree. I think the conflicting goals here are, we want a sorted 
>> list (so that it enable prefix scan for listing purpose) and we want to 
>> shard at the very beginning (the problem we are facing is parallel 
>> writes updating the same bucket index object will need to be 
>> serialized).
> 
> Given how infrequent container listings are, pre-sharding containers 
> across several objects makes some sense.  Paying the cost of doing 
> listings in parallel across N (where N is not too big) is not a big price 
> to pay. However, there will always need to be a way to re-shard further 
> when containers/buckets get extremely big.  Perhaps a starting point would 
> be support for static sharding where the number of shards is specified at 
> container/bucket creation time…
Considering the scope of the change, I also think this is a good starting point for making bucket index updates more scalable.
Yehuda,
What do you think?
> 
> sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Radosgw - bucket index
  2014-05-19  6:18           ` Guang Yang
@ 2014-05-19  6:47             ` Yehuda Sadeh
  2014-05-30  0:35               ` Guang Yang
       [not found]               ` <0F65B78C-DF9A-40E4-BAAF-7411443DA3B6@outlook.com>
  0 siblings, 2 replies; 18+ messages in thread
From: Yehuda Sadeh @ 2014-05-19  6:47 UTC (permalink / raw)
  To: Guang Yang; +Cc: Sage Weil, Haomai Wang, Ceph-devel

On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@outlook.com> wrote:
> On May 19, 2014, at 7:05 AM, Sage Weil <sage@inktank.com> wrote:
>
>> On Sun, 18 May 2014, Guang wrote:
>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>> Using something like rocksdb in its place is pretty simple and ther are
>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>> with the better k/v backend.
>>> Not sure if I miss anything, but the key difference with SWIFT's
>>> implementation is that they are using a table for bucket index and it
>>> actually can be updated in parallel which makes more scalable for write,
>>> though at certain point the sql table would result in performance
>>> degradation as well.
>>
>> As I understand it the same limitation is present there too: the index is
>> in a single sqlite table.
>>
>>>> My more well-formed opinion is that we need to come up with a good
>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>> shrink), and I assume there would be some kind of background operation
>>>> that will enable that. I also believe that making it hash based is the
>>>> way to go. It looks like that the more complicated issue is here is
>>>> how to handle the transition in which we shard buckets.
>>> Yeah I agree. I think the conflicting goals here are, we want a sorted
>>> list (so that it enable prefix scan for listing purpose) and we want to
>>> shard at the very beginning (the problem we are facing is parallel
>>> writes updating the same bucket index object will need to be
>>> serialized).
>>
>> Given how infrequent container listings are, pre-sharding containers
>> across several objects makes some sense.  Paying the cost of doing
>> listings in parallel across N (where N is not too big) is not a big price
>> to pay. However, there will always need to be a way to re-shard further
>> when containers/buckets get extremely big.  Perhaps a starting point would
>> be support for static sharding where the number of shards is specified at
>> container/bucket creation time…
> Considering the scope of the change, I also think this is a good starting point to make the bucket index updating more scalable.
> Yehuda,
> How do you think?

Sharding it will help with scaling it up to a certain point. As Sage
mentioned, we can start with a static setting as a simpler first
approach, and move to a dynamic approach later on.

Yehuda

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Radosgw - bucket index
  2014-05-19  6:47             ` Yehuda Sadeh
@ 2014-05-30  0:35               ` Guang Yang
       [not found]               ` <0F65B78C-DF9A-40E4-BAAF-7411443DA3B6@outlook.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Guang Yang @ 2014-05-30  0:35 UTC (permalink / raw)
  To: Yehuda Sadeh; +Cc: Sage Weil, Ceph-devel

Hi Yehuda,
I opened an issue here: http://tracker.ceph.com/issues/8473. Please help review and comment.

Thanks,
Guang

On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yehuda@inktank.com> wrote:

> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@outlook.com> wrote:
>> On May 19, 2014, at 7:05 AM, Sage Weil <sage@inktank.com> wrote:
>> 
>>> On Sun, 18 May 2014, Guang wrote:
>>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>> Using something like rocksdb in its place is pretty simple and ther are
>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>> with the better k/v backend.
>>>> Not sure if I miss anything, but the key difference with SWIFT's
>>>> implementation is that they are using a table for bucket index and it
>>>> actually can be updated in parallel which makes more scalable for write,
>>>> though at certain point the sql table would result in performance
>>>> degradation as well.
>>> 
>>> As I understand it the same limitation is present there too: the index is
>>> in a single sqlite table.
>>> 
>>>>> My more well-formed opinion is that we need to come up with a good
>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>> shrink), and I assume there would be some kind of background operation
>>>>> that will enable that. I also believe that making it hash based is the
>>>>> way to go. It looks like that the more complicated issue is here is
>>>>> how to handle the transition in which we shard buckets.
>>>> Yeah I agree. I think the conflicting goals here are, we want a sorted
>>>> list (so that it enable prefix scan for listing purpose) and we want to
>>>> shard at the very beginning (the problem we are facing is parallel
>>>> writes updating the same bucket index object will need to be
>>>> serialized).
>>> 
>>> Given how infrequent container listings are, pre-sharding containers
>>> across several objects makes some sense.  Paying the cost of doing
>>> listings in parallel across N (where N is not too big) is not a big price
>>> to pay. However, there will always need to be a way to re-shard further
>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>> be support for static sharding where the number of shards is specified at
>>> container/bucket creation time…
>> Considering the scope of the change, I also think this is a good starting point to make the bucket index updating more scalable.
>> Yehuda,
>> How do you think?
> 
> Sharding it will help with scaling it up to a certain point. As Sage
> mentioned we can start with a static setting as a first simpler
> approach, and move into a dynamic approach later on.
> 
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
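
The static sharding idea discussed in the quoted thread can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the radosgw implementation: the class name ShardedIndex, its methods, the use of SHA-256 for shard selection, and the in-memory sorted lists standing in for per-shard index objects are all hypothetical. The shard count is fixed at creation time, each write touches only the shard its key hashes to, and a listing scans every shard for the prefix and merges the sorted results.

```python
import hashlib
import heapq
from bisect import bisect_left
from itertools import islice


class ShardedIndex:
    """Toy bucket index with a shard count fixed at creation time (hypothetical)."""

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        # Each shard is a sorted list of keys, standing in for one index object.
        self.shards = [[] for _ in range(num_shards)]

    def shard_of(self, key):
        # Hash-based placement: writes to different shards do not contend.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.shards)

    def put(self, key):
        shard = self.shards[self.shard_of(key)]
        pos = bisect_left(shard, key)
        if pos == len(shard) or shard[pos] != key:
            shard.insert(pos, key)  # keep the shard sorted, skip duplicates

    def _scan(self, shard, prefix):
        # Keys sharing a prefix are contiguous in a sorted shard.
        pos = bisect_left(shard, prefix)
        while pos < len(shard) and shard[pos].startswith(prefix):
            yield shard[pos]
            pos += 1

    def list_keys(self, prefix="", limit=1000):
        # A listing pays the cost of visiting all N shards, then merges
        # the per-shard sorted streams into one globally sorted result.
        merged = heapq.merge(*(self._scan(s, prefix) for s in self.shards))
        return list(islice(merged, limit))
```

In this sketch the listing cost grows with the number of shards while the write path touches a single shard, which is the trade-off described above. Re-sharding an existing index is deliberately left out.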


* Re: Radosgw - bucket index
       [not found]               ` <0F65B78C-DF9A-40E4-BAAF-7411443DA3B6@outlook.com>
@ 2014-06-02 13:37                 ` Guang Yang
       [not found]                 ` <959F3CD3-D69A-4B9C-818C-7C0F241E48EC@outlook.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Guang Yang @ 2014-06-02 13:37 UTC (permalink / raw)
  To: Yehuda Sadeh, Sage Weil; +Cc: Ceph-devel

Hi Yehuda and Sage,
Could you comment on the ticket? I would like to send out a pull request some time this week for you to review, but before that it would be nice to have your comments on the interface and any other concerns you may have. Thanks.

Thanks,
Guang


On May 30, 2014, at 8:35 AM, Guang Yang <yguang11@outlook.com> wrote:

> Hi Yehuda,
> I opened an issue here: http://tracker.ceph.com/issues/8473. Please review and comment.
> 
> Thanks,
> Guang
> 
> On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yehuda@inktank.com> wrote:
> 
>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@outlook.com> wrote:
>>> On May 19, 2014, at 7:05 AM, Sage Weil <sage@inktank.com> wrote:
>>> 
>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>>> Using something like rocksdb in its place is pretty simple and there are
>>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>>> with the better k/v backend.
>>>>> Not sure if I am missing anything, but the key difference from Swift's
>>>>> implementation is that they are using a table for the bucket index and it
>>>>> actually can be updated in parallel, which makes it more scalable for
>>>>> writes, though at a certain point the SQL table would result in
>>>>> performance degradation as well.
>>>> 
>>>> As I understand it the same limitation is present there too: the index is
>>>> in a single sqlite table.
>>>> 
>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>>> shrink), and I assume there would be some kind of background operation
>>>>>> that will enable that. I also believe that making it hash based is the
>>>>>> way to go. It looks like the more complicated issue here is
>>>>>> how to handle the transition in which we shard buckets.
>>>>> Yeah, I agree. I think the conflicting goals here are that we want a
>>>>> sorted list (so that it enables prefix scans for listing) and we want
>>>>> to shard from the very beginning (the problem we are facing is that
>>>>> parallel writes updating the same bucket index object need to be
>>>>> serialized).
>>>> 
>>>> Given how infrequent container listings are, pre-sharding containers
>>>> across several objects makes some sense.  Paying the cost of doing
>>>> listings in parallel across N (where N is not too big) is not a big price
>>>> to pay. However, there will always need to be a way to re-shard further
>>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>>> be support for static sharding where the number of shards is specified at
>>>> container/bucket creation time…
>>> Considering the scope of the change, I also think this is a good starting point for making bucket index updates more scalable.
>>> Yehuda,
>>> What do you think?
>> 
>> Sharding it will help with scaling it up to a certain point. As Sage
>> mentioned we can start with a static setting as a first simpler
>> approach, and move into a dynamic approach later on.
>> 
>> Yehuda
> 


* Re: Radosgw - bucket index
       [not found]                 ` <959F3CD3-D69A-4B9C-818C-7C0F241E48EC@outlook.com>
@ 2014-06-06 12:50                   ` Guang Yang
  0 siblings, 0 replies; 18+ messages in thread
From: Guang Yang @ 2014-06-06 12:50 UTC (permalink / raw)
  To: Yehuda Sadeh; +Cc: Ceph-devel

Hi Yehuda,
Could you take a high-level look at the code change? Here is the pull request: https://github.com/ceph/ceph/pull/1929

If things look good to you, I will continue the effort and make it clearer and more complete by the end of next week.

Thanks,
Guang

On Jun 2, 2014, at 9:37 PM, Guang Yang <yguang11@outlook.com> wrote:

> Hi Yehuda and Sage,
> Could you comment on the ticket? I would like to send out a pull request some time this week for you to review, but before that it would be nice to have your comments on the interface and any other concerns you may have. Thanks.
> 
> Thanks,
> Guang
> 
> 
> On May 30, 2014, at 8:35 AM, Guang Yang <yguang11@outlook.com> wrote:
> 
>> Hi Yehuda,
>> I opened an issue here: http://tracker.ceph.com/issues/8473. Please review and comment.
>> 
>> Thanks,
>> Guang
>> 
>> On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yehuda@inktank.com> wrote:
>> 
>>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@outlook.com> wrote:
>>>> On May 19, 2014, at 7:05 AM, Sage Weil <sage@inktank.com> wrote:
>>>> 
>>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>>>> Using something like rocksdb in its place is pretty simple and there are
>>>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>>>> with the better k/v backend.
>>>>>> Not sure if I am missing anything, but the key difference from Swift's
>>>>>> implementation is that they are using a table for the bucket index and it
>>>>>> actually can be updated in parallel, which makes it more scalable for
>>>>>> writes, though at a certain point the SQL table would result in
>>>>>> performance degradation as well.
>>>>> 
>>>>> As I understand it the same limitation is present there too: the index is
>>>>> in a single sqlite table.
>>>>> 
>>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>>>> shrink), and I assume there would be some kind of background operation
>>>>>>> that will enable that. I also believe that making it hash based is the
>>>>>>> way to go. It looks like the more complicated issue here is
>>>>>>> how to handle the transition in which we shard buckets.
>>>>>> Yeah, I agree. I think the conflicting goals here are that we want a
>>>>>> sorted list (so that it enables prefix scans for listing) and we want
>>>>>> to shard from the very beginning (the problem we are facing is that
>>>>>> parallel writes updating the same bucket index object need to be
>>>>>> serialized).
>>>>> 
>>>>> Given how infrequent container listings are, pre-sharding containers
>>>>> across several objects makes some sense.  Paying the cost of doing
>>>>> listings in parallel across N (where N is not too big) is not a big price
>>>>> to pay. However, there will always need to be a way to re-shard further
>>>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>>>> be support for static sharding where the number of shards is specified at
>>>>> container/bucket creation time…
>>>> Considering the scope of the change, I also think this is a good starting point for making bucket index updates more scalable.
>>>> Yehuda,
>>>> What do you think?
>>> 
>>> Sharding it will help with scaling it up to a certain point. As Sage
>>> mentioned we can start with a static setting as a first simpler
>>> approach, and move into a dynamic approach later on.
>>> 
>>> Yehuda
>> 
> 


end of thread, other threads:[~2014-06-06 12:50 UTC | newest]

Thread overview: 18+ messages
2014-05-07  1:05 v0.80 Firefly released Sage Weil
     [not found] ` <alpine.DEB.2.00.1405061757540.28165-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2014-05-07 15:44   ` Dan van der Ster
     [not found]     ` <536A54C6.4060202-vJEk5272eHo@public.gmane.org>
2014-05-07 15:51       ` Sage Weil
2014-05-07 15:53       ` Gregory Farnum
2014-05-07 18:18         ` [ceph-users] " Mike Dawson
2014-05-07 18:30           ` Gregory Farnum
2014-05-08 12:20           ` Andrey Korolyov
2014-05-09 21:48             ` Mike Dawson
     [not found]               ` <536D4D48.5040307-ffsCFlcjuZBWk0Htik3J/w@public.gmane.org>
2014-05-11 12:32                 ` Sergey Malinin
     [not found] ` <BLU436-SMTP195A770E0A729DF4723DD28DF310@phx.gbl>
2014-05-16 13:09   ` Radosgw - bucket index Sage Weil
2014-05-16 16:42     ` Yehuda Sadeh
2014-05-18  6:25       ` Guang
2014-05-18 23:05         ` Sage Weil
2014-05-19  6:18           ` Guang Yang
2014-05-19  6:47             ` Yehuda Sadeh
2014-05-30  0:35               ` Guang Yang
     [not found]               ` <0F65B78C-DF9A-40E4-BAAF-7411443DA3B6@outlook.com>
2014-06-02 13:37                 ` Guang Yang
     [not found]                 ` <959F3CD3-D69A-4B9C-818C-7C0F241E48EC@outlook.com>
2014-06-06 12:50                   ` Guang Yang
