* An Evaluation of Object Name Hashing
@ 2016-01-12 13:17 Marcel Lauhoff
  2016-01-12 13:38 ` Sage Weil
  0 siblings, 1 reply; 3+ messages in thread
From: Marcel Lauhoff @ 2016-01-12 13:17 UTC
  To: ceph-devel


Hi,

I wrote a Master's Thesis about Ceph and cold storage last year. One of
the things I looked at was modifications to object placement.

Among other things, I looked at what happens to balance (e.g., objects
per OSD) when all objects of a file end up on the same OSD. I also ran
tests with a different hash algorithm (Linux dcache).
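
To illustrate the idea of keeping all objects of a file together, here is
a minimal sketch. It is not the code from the article: the object naming
scheme ("<file id>.<object index>"), the cluster size, and the use of CRC32
modulo the number of OSDs are all assumptions made for the example, and the
modulo step is only a stand-in for the real placement path (placement groups
plus CRUSH).

```python
import zlib
from collections import Counter

NUM_OSDS = 8  # hypothetical cluster size


def place(name: str, prefix_only: bool) -> int:
    """Map an object name to an OSD index via CRC32 modulo NUM_OSDS.

    Stand-in for real placement; it only shows the effect of hashing
    the name prefix instead of the full object name.
    """
    key = name.split(".", 1)[0] if prefix_only else name
    return zlib.crc32(key.encode()) % NUM_OSDS


# Hypothetical naming: 4 files with 16 objects each.
names = [f"file{f}.{i:08x}" for f in range(4) for i in range(16)]

full = Counter(place(n, prefix_only=False) for n in names)
prefix = Counter(place(n, prefix_only=True) for n in names)

# With prefix hashing, every object of a file lands on one OSD.
osds_per_file = {
    f: len({place(f"file{f}.{i:08x}", prefix_only=True) for i in range(16)})
    for f in range(4)
}

print("full-name hashing:", sorted(full.items()))
print("prefix hashing:   ", sorted(prefix.items()))
print("OSDs used per file with prefix hashing:", osds_per_file)
```

With prefix hashing the unit of placement becomes the file rather than the
object, so per-OSD load depends on how file sizes are distributed.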

I wrote an article on my website with the analysis, changes to the
source and how I ran the tests:

  http://irq0.org/articles/ceph/object_name_hashing


~irq0

--
Marcel Lauhoff
lauhoff@uni-mainz.de


* Re: An Evaluation of Object Name Hashing
  2016-01-12 13:17 An Evaluation of Object Name Hashing Marcel Lauhoff
@ 2016-01-12 13:38 ` Sage Weil
  2016-01-26 18:40   ` Marcel Lauhoff
  0 siblings, 1 reply; 3+ messages in thread
From: Sage Weil @ 2016-01-12 13:38 UTC
  To: Marcel Lauhoff; +Cc: ceph-devel

Hi Marcel,

This is great!

On Tue, 12 Jan 2016, Marcel Lauhoff wrote:
> 
> Hi,
> 
> I wrote a Master's Thesis about Ceph and cold storage last year. One of
> the things I looked at was modifications to object placement.
> 
> Among other things, I looked at what happens to balance (e.g., objects
> per OSD) when all objects of a file end up on the same OSD. I also ran
> tests with a different hash algorithm (Linux dcache).
> 
> I wrote an article on my website with the analysis, changes to the
> source and how I ran the tests:
> 
>   http://irq0.org/articles/ceph/object_name_hashing

The interesting thing to me is the error bars for linux prefix (the 
right-most set of bars on the last graph).  Their range is significantly 
wider than rjenkins + prefix (ranging from 2.1TiB to 4.0TiB, vs 2.3-3.7ish 
for the others).  The reason we switched away from the linux dcache hash 
(it was the original choice) is because it is very weak.  I suspect that 
even if you look at the average + standard deviation it hides some of the 
badness; looking at 99th or 99.9th percentile, or simply a plot of the osd 
utilization distribution, will show that there are more low- and high- 
utilization outliers.
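
A minimal sketch of why mean and standard deviation can hide outliers. The
input is synthetic, not data from the article: 98 OSDs at 3.0 TiB plus one
at each of the extremes mentioned above. The helper names and the
nearest-rank percentile definition are assumptions made for the example.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(utilization):
    """Summary statistics for a list of per-OSD utilization values."""
    return {
        "mean": statistics.mean(utilization),
        "stdev": statistics.pstdev(utilization),
        "min": min(utilization),
        "p99": percentile(utilization, 99),
        "max": max(utilization),
    }


# Synthetic example (TiB per OSD), not measured data.
sample = [3.0] * 98 + [2.1, 4.0]
stats = summarize(sample)
for key, value in stats.items():
    print(f"{key:5s} {value:.3f}")
```

The standard deviation stays small here even though the fullest OSD holds
almost twice as much as the emptiest one, which is why the tails (or the
full distribution) are worth plotting.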

The other thing to keep in mind is that beyond a certain size locality 
doesn't buy you that much... the disk seek overhead is no longer 
significant once you've read several megabytes of data.  At the same 
time, concentrating all data in a file (or rbd image) on a single device 
means that a large, busy, hot file can focus a lot of traffic on a single 
OSD.

What might be more useful is the ability to take the data for several 
smaller files that are thought to be related (e.g., in the same directory, 
created at the same time) and try to store them together.  In that case, 
since we know the files are small, the impact on balance would not be 
significant.  On the other hand, what we currently do with (very) small 
files in CephFS is just inline the data in the inode anyway so we already 
get that locality (and more)--the main limitation there being that the max 
inline size is quite small (a KB or two, IIRC).

sage


* Re: An Evaluation of Object Name Hashing
  2016-01-12 13:38 ` Sage Weil
@ 2016-01-26 18:40   ` Marcel Lauhoff
  0 siblings, 0 replies; 3+ messages in thread
From: Marcel Lauhoff @ 2016-01-26 18:40 UTC
  To: ceph-devel


Hi!

Sage Weil <sage@newdream.net> writes:
> On Tue, 12 Jan 2016, Marcel Lauhoff wrote:
>>
>> I wrote an article on my website with the analysis, changes to the
>> source and how I ran the tests:
>>
>>   http://irq0.org/articles/ceph/object_name_hashing
>
> The interesting thing to me is the error bars for linux prefix (the
> right-most set of bars on the last graph).  Their range is significantly
> wider than rjenkins + prefix (ranging from 2.1TiB to 4.0TiB, vs 2.3-3.7ish
> for the others).  The reason we switched away from the linux dcache hash
> (it was the original choice) is because it is very weak.  I suspect that
> even if you look at the average + standard deviation it hides some of the
> badness; looking at 99th or 99.9th percentile, or simply a plot of the osd
> utilization distribution, will show that there are more low- and high-
> utilization outliers.

I reran the tests and included Adler-32, CRC32, MD5 and SHA-1 (MD5 and
SHA-1 truncated to 32 bits). I updated the article.

In summary: Adler-32 does not work. MD5 and SHA-1 are OK. CRC32 is as
good as RJenkins, maybe even slightly better.
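
For readers who want to try a comparison of this kind themselves, here is a
rough sketch using only the Python standard library. It is not the test
setup from the article. The object names, the bucket count, the modulo
placement, and the choice of the first four digest bytes for truncation are
all assumptions. RJenkins is left out because the standard library has no
implementation of it.

```python
import hashlib
import zlib
from collections import Counter

NUM_BUCKETS = 64  # hypothetical number of placement targets


def md5_32(data: bytes) -> int:
    # Assumption: truncate by keeping the first 4 bytes of the digest.
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")


def sha1_32(data: bytes) -> int:
    # Same truncation assumption as md5_32.
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")


HASHES = {
    "adler32": zlib.adler32,
    "crc32": zlib.crc32,
    "md5/32": md5_32,
    "sha1/32": sha1_32,
}


def bucket_counts(hash_fn, names):
    """Count how many names fall into each bucket under hash_fn."""
    counts = Counter(hash_fn(n.encode()) % NUM_BUCKETS for n in names)
    return [counts.get(b, 0) for b in range(NUM_BUCKETS)]


# Synthetic names of the form "<hex id>.<hex index>": 100 files x 100 objects.
names = [f"{inode:x}.{idx:08x}" for inode in range(1000, 1100) for idx in range(100)]

results = {label: bucket_counts(fn, names) for label, fn in HASHES.items()}
for label, counts in results.items():
    print(f"{label:8s} min={min(counts):4d} max={max(counts):4d}")
```

The gap between min and max per hash gives a first impression of balance.
A real evaluation would go through the actual placement code instead of a
plain modulo.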

~marcel

--
Marcel Lauhoff
