long object names

* long object names
@ 2011-04-21  4:42 Sage Weil
  2011-04-21 18:56 ` Tommi Virtanen
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2011-04-21  4:42 UTC (permalink / raw)
  To: ceph-devel

Yehuda and I talked about the lfn branch today and we're not on the same 
page yet about the best way to proceed.  The current code keeps the long 
file name translating independent of the other naming/mangling that 
FileStore does (collection/ prefix, escaping, and sobject_t -> 
<object>_<snapid|head>).  I see that it's nice to do one thing at once, 
but I'm also not sure the long files are useful anywhere else.

Other thoughts:

The escaping may make more sense in the same layer as the long name stuff?

Eventually we'll be prehashing the pg dir contents into subdirs, and that 
translation will have to be done somewhere too.  That will mean possibly 
looking in two locations during the rehashing process, similar to how the 
lfn stuff has to peek at xattrs.  One thing to keep in mind is that the 
hash value will need to be passed down and stored with the file... it's 
usually hash(object name), but not always when the object_locator_t::key 
is set.  Where will this fit in?
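
To make the question concrete, here is a rough sketch of choosing what
gets hashed. placement_hash and its arguments are made-up names for
illustration, and zlib.crc32 is only a stand-in for whatever hash the
OSD really uses:

```python
import zlib

def placement_hash(object_name, locator_key=None):
    """Hash the locator key when one is set, otherwise the object name."""
    source = locator_key if locator_key else object_name
    # crc32 is a placeholder hash, used here for illustration only
    return zlib.crc32(source.encode("utf-8")) & 0xFFFFFFFF

# Two objects that share a locator key hash alike despite different names.
print(placement_hash("foo", "shared") == placement_hash("bar", "shared"))
print(placement_hash("foo") == placement_hash("bar"))
```

Because the hash cannot always be recomputed from the name alone, the
value has to travel down to wherever the filename is rendered.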

We may eventually want to adjust the ObjectStore interface to include 
collection/dir handles so that the full path isn't traversed in kernel for 
every operation (the OSD could maintain an open handle/fd for each pg it 
has open).
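
A minimal sketch of the handle idea, in Python rather than the OSD's
C++: the directory is opened once and later operations are resolved
relative to that descriptor. open_in_collection and the "pg_1.0"
directory name are hypothetical, and this assumes a Linux-like system
where dir_fd is supported:

```python
import os
import tempfile

def open_in_collection(dir_fd, filename, flags=os.O_RDONLY, mode=0o644):
    """Open filename relative to an already-open directory handle."""
    return os.open(filename, flags, mode, dir_fd=dir_fd)

root = tempfile.mkdtemp()
pg_dir = os.path.join(root, "pg_1.0")    # hypothetical pg directory name
os.mkdir(pg_dir)

# Open the pg directory once and keep the handle around.
pg_fd = os.open(pg_dir, os.O_RDONLY | os.O_DIRECTORY)

# Later operations only name the last path component.
fd = open_in_collection(pg_fd, "obj_head", os.O_CREAT | os.O_WRONLY)
os.write(fd, b"data")
os.close(fd)
print(os.listdir(pg_dir))
os.close(pg_fd)
```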

I think the lfn_open/_get type interface below all of the operation 
methods will allow all of those things.  I think it'll be simpler to push 
as much of the filename rendering into that layer as possible, though 
(possibly including the sobject_t mangling).  Having all the 
mangling/rendering done in one place will also make it easy to extend 
without making multiple passes in different layers...

Unfortunately I'm out tomorrow.  Any other opinions?

One other thing: the xattr names are mangled too (user.ceph. prefix).  As 
long as the long name xattr has a different prefix we don't have to worry 
about those getting mixed up.

sage


* Re: long object names
  2011-04-21  4:42 long object names Sage Weil
@ 2011-04-21 18:56 ` Tommi Virtanen
  2011-04-21 19:27   ` Colin McCabe
  0 siblings, 1 reply; 19+ messages in thread
From: Tommi Virtanen @ 2011-04-21 18:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wed, Apr 20, 2011 at 09:42:48PM -0700, Sage Weil wrote:
> I think the lfn_open/_get type interface below all of the operation 
> methods will allow all of those things.  I think it'll be simpler to push 
> as much of the filename rendering into that layer as possible, though 
> (possibly including the sobject_t mangling).  Having all the 
> mangling/rendering done in one place will also make it easy to extend 
> without making multiple passes in different layers...

So apparently, this email confused many. Let's try to clarify, to the
best of my understanding:

- Yehuda's code would currently write this:

  ./rados mkpool kitties
  ./rados create --pool=kitties "longcatisl$(python -c 'print(300*"o")')ng"

  >>> FILENAME_SHORT_LEN=16
  >>> FILENAME_HASH_LEN=3
  >>> FILENAME_COOKIE="cLFN"
  >>> FILENAME_PREFIX_LEN = (FILENAME_SHORT_LEN - FILENAME_HASH_LEN - 1 - (len(FILENAME_COOKIE) - 1) - 1)
  >>> orig = "longcatisl"+300*"o"+"ng"
  >>> storable = orig + "_head"  # plus backslash escaping, but let's ignore that now
  >>> munged = orig[:FILENAME_PREFIX_LEN] + "_" + FILENAME_COOKIE + "_%s_%d" % ("zzz", 42)
  >>> munged
  'longcati_cLFN_zzz_42'

  So it loses the _head suffix, and all such things.

- Sage wanted (as far as I understand) to have the munging be where
  the backslash escaping is, so the end result would look like
  'longcati_cLFN_zzz_42_head'. Note the suffix.

  I hope I got that right.
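
As a sketch of the second variant, reusing the constants from the
session above: munge_keep_suffix is a made-up name, and the hash and
sequence number are passed in as plain arguments rather than computed.
This is an illustration, not code from the lfn branch:

```python
FILENAME_SHORT_LEN = 16
FILENAME_HASH_LEN = 3
FILENAME_COOKIE = "cLFN"
FILENAME_PREFIX_LEN = (FILENAME_SHORT_LEN - FILENAME_HASH_LEN - 1
                       - (len(FILENAME_COOKIE) - 1) - 1)

def munge_keep_suffix(name, suffix, hash_str, seq):
    """Shorten name, then re-append the snapshot suffix in the same pass."""
    short = "%s_%s_%s_%d" % (name[:FILENAME_PREFIX_LEN], FILENAME_COOKIE,
                             hash_str, seq)
    return short + suffix

orig = "longcatisl" + 300 * "o" + "ng"
print(munge_keep_suffix(orig, "_head", "zzz", 42))
# prints longcati_cLFN_zzz_42_head
```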

As for Yehuda's approach, I'm not very happy to see layers upon layers
of rewriting the filenames. It just seems more brittle. So Sage's
version looks nicer to me, can do all the work it needs to do in a
single pass, and lets us see the _head etc suffixes without reading
xattrs, which might be useful for fsck-style things.

As for both, I'm especially not fond of the very limited hashing, and
the loops that keep calling build_filename. I fear collisions and
races. Bumping up the hash size significantly will help the common
case, but I still fear the races.

I also think that Ceph, and especially the RGW bits, needs to be
written to be fairly robust against DoS attacks. Nasty things happen
out there, and having somebody able to trigger a "slow mode" on your
server with fairly cheap operations is bad.

Here's a concrete proposal: split the filename into subdirs if needed,
and map the names 1:1, just to avoid the unpredictability of the above
approach. And to get significantly less code and branching in the fast
path. That is, I think I'd go for something like (Python written in C
style to make it more direct to translate):

	# how much overhead to reserve in filenames to always have
	# prefix/suffix not split by slashes
	LONGEST_PREFIX_SUFFIX_LEN = len("_head")
	SAFE_FILENAME_LEN = 255 - LONGEST_PREFIX_SUFFIX_LEN

	def munge(path):
	    dirprefix = None
	    while len(path) > SAFE_FILENAME_LEN:
	        head = path[:SAFE_FILENAME_LEN]
	        if dirprefix is None:
	            dirprefix = head
	        else:
	            dirprefix = dirprefix + '/' + head
	        path = path[SAFE_FILENAME_LEN:]
	    if dirprefix is None:
	        # name already fits in one component; no subdirs needed
	        return path
	    return dirprefix + '/' + path

and now

>>> munged = munge(orig)
>>> munged
'longcatisloooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo/oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong'
>>> munged + '_head'
'longcatisloooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo/oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooong_head'

And the caller would just mkdir all the leading dirs (ignore EEXIST
errors). They can either be left around, or with a small loop handling
a race between mkdir & rmdir, can be cleaned up either on unlink or in
fsck/scrub/etc.
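
The caller side could look roughly like this. create_munged is a
hypothetical helper, and the sketch leaves out the cleanup loop
entirely:

```python
import os
import tempfile

def create_munged(base_dir, munged, suffix="_head"):
    """Create the leading directories, then the object file itself."""
    path = os.path.join(base_dir, munged + suffix)
    # exist_ok=True is the "ignore EEXIST" part
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.close(fd)
    return path

base = tempfile.mkdtemp()
first = create_munged(base, "a" * 250 + "/" + "b" * 62)
again = create_munged(base, "a" * 250 + "/" + "b" * 62)  # dirs already exist
print(first == again, os.path.isfile(first))
```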

Also, I'm not thrilled to have something this core, *and* being string
manipulation in C, go without unit tests that exercise the corner
cases.
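
For the directory scheme, such tests could start from something like
the following. munge is restated compactly here so the block stands on
its own, unmunge is a made-up inverse, and the whole thing assumes
escaped names never contain '/':

```python
SAFE_FILENAME_LEN = 255 - len("_head")

def munge(name):
    """Split name into SAFE_FILENAME_LEN-sized path components."""
    return "/".join(name[i:i + SAFE_FILENAME_LEN]
                    for i in range(0, len(name), SAFE_FILENAME_LEN))

def unmunge(munged):
    """Inverse of munge, assuming the original name had no '/'."""
    return munged.replace("/", "")

def check_corner_cases():
    # lengths around the chunk boundary, plus a very long name
    for n in (0, 1, 249, 250, 251, 500, 501, 4096):
        name = "x" * n
        munged = munge(name)
        assert unmunge(munged) == name
        assert all(len(part) <= SAFE_FILENAME_LEN
                   for part in munged.split("/"))

check_corner_cases()
print("ok")
```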

Finally, here's some misc notes on the existing code that are probably
obvious, I just wanted to make sure:

- hash is always "zzz"
- xattr user.ceph._lfn conflicts with actual end-user xattrs "_lfn"?
- escaping can lengthen the filename, does it handle that (I guess yes
  because this is a layer after that, but I can't tell without reading
  a lot of code)

-- 
:(){ :|:&};:


* Re: long object names
  2011-04-21 18:56 ` Tommi Virtanen
@ 2011-04-21 19:27   ` Colin McCabe
  2011-04-21 19:32     ` Tommi Virtanen
  0 siblings, 1 reply; 19+ messages in thread
From: Colin McCabe @ 2011-04-21 19:27 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sage Weil, ceph-devel

On Thu, Apr 21, 2011 at 11:56 AM, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
> I also think that Ceph, and especially the RGW bits, needs to be
> written to be fairly robust against DoS attacks. Nasty things happen
> out there, and having somebody able to trigger a "slow mode" on your
> server with fairly cheap operations is bad.

Yeah.

> Here's a concrete proposal: split the filename into subdirs if needed,
> and map the names 1:1, just to avoid the unpredictability of the above
> approach. And to get significantly less code and branching in the fast
> path. That is, I think I'd go for something like (Python written in C
> style to make it more direct to translate):

I like this idea a lot. It does involve extra expense, but only for
long file names. It also avoids object name collisions completely.

One additional idea: can we make the chunking configurable?
If we did a translation like this:
abcdefg -> abc/def/g
123456789 -> 123/456/789

prefix search would become a *lot* more efficient for rgw.
On the other hand, the filesystem layer doesn't care about prefix
search, so it could just configure the chunking to be after 200
characters or something (at which point it's basically a no-op.)
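
Something like this, say, where chunk_name and chunk_len are
illustrative names only and the _head style suffix is ignored:

```python
def chunk_name(name, chunk_len):
    """Split name into chunk_len-sized directory components."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return "/".join(name[i:i + chunk_len]
                    for i in range(0, len(name), chunk_len))

print(chunk_name("abcdefg", 3))      # abc/def/g
print(chunk_name("123456789", 3))    # 123/456/789
print(chunk_name("abcdefg", 200))    # abcdefg (short names are untouched)
```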

cheers,
Colin


* Re: long object names
  2011-04-21 19:27   ` Colin McCabe
@ 2011-04-21 19:32     ` Tommi Virtanen
  2011-04-21 20:03       ` Gregory Farnum
  0 siblings, 1 reply; 19+ messages in thread
From: Tommi Virtanen @ 2011-04-21 19:32 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Sage Weil, ceph-devel

On Thu, Apr 21, 2011 at 12:27:01PM -0700, Colin McCabe wrote:
> I like this idea a lot. It does involve extra expense, but only for
> long file names. It also avoids object name collisions completely.
> 
> One additional idea: can we make the chunking configurable?
> If we did a translation like this:
> abcdefg -> abc/def/g
> 123456789 -> 123/456/789
> 
> prefix search would become a *lot* more efficient for rgw.
> On the other hand, the filesystem layer doesn't care about prefix
> search, so it could just configure the chunking to be after 200
> characters or something (at which point it's basically a no-op.)

The one big downside is that with configurable chunking, you no longer
have an always correct 1:1 mapping between object and file.

You might argue for always (not configurably) chunking at some
smaller, fixed boundary, so on the average you'd need to readdir()
less to serve a prefix search. I think this is what your last sentence
refers to. But that means more overhead with the directories.

The only real answers are available via benchmarks.

-- 
:(){ :|:&};:


* Re: long object names
  2011-04-21 19:32     ` Tommi Virtanen
@ 2011-04-21 20:03       ` Gregory Farnum
  2011-04-21 21:09         ` Colin McCabe
  2011-04-21 22:00         ` Tommi Virtanen
  0 siblings, 2 replies; 19+ messages in thread
From: Gregory Farnum @ 2011-04-21 20:03 UTC (permalink / raw)
  To: ceph-devel

I really don't see how pushing the naming complexity into the local filesystem, where it adds lots of otherwise-useless inodes and dentries, is going to help us.

I like what Yehuda has here for its relative simplicity -- though I think we should just up the hash size enough that we don't need to handle collisions, and leave out the retry looping so as to make it simpler still. Given that relative simplicity, I think it might be nice to push all the name mangling into one flat space so that we can preserve the prefixes and suffixes. That would keep snapshots of one object more identifiable than hashing over the entire name, as the code does right now.
-Greg


* Re: long object names
  2011-04-21 20:03       ` Gregory Farnum
@ 2011-04-21 21:09         ` Colin McCabe
  2011-04-21 21:23           ` Yehuda Sadeh Weinraub
  2011-04-21 22:00         ` Tommi Virtanen
  1 sibling, 1 reply; 19+ messages in thread
From: Colin McCabe @ 2011-04-21 21:09 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On Thu, Apr 21, 2011 at 1:03 PM, Gregory Farnum
<gregory.farnum@dreamhost.com> wrote:
> I really don't see how pushing the naming complexity into the local filesystem,
> where it adds lots of otherwise-useless inodes and dentries, is going to help us.

Here is a quick summary of how TV's proposal would help us.
1. it avoids collisions entirely
2. You don't ever have to do an extra xattr lookup, no matter how short
or long the object name is.

My add-on proposal helps us:
3. get reasonable prefix search performance (with those supposedly
"useless" dentries)

> I like what Yehuda has here for its relative simplicity -- though I think we should just up
> the hash size enough that we don't need to handle collisions,

Personally, I think the xattr proposal is more complex. I guess that
is a matter of taste.

No matter how big your hash table will be, there are still collisions!
That is the nature of hashing. And since the code is open source, it's
pretty easy for an attacker to read the source and then create two
objects whose names collide.
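
For scale, the accidental case can be estimated with the usual
birthday approximation. This is a back-of-the-envelope sketch that
assumes uniformly random hash values; collision_probability is a
made-up helper, and it says nothing about names chosen deliberately
by an attacker:

```python
import math

def collision_probability(num_names, hash_bits):
    """Birthday-bound estimate that any two of num_names share a hash."""
    buckets = 2.0 ** hash_bits
    exponent = -num_names * (num_names - 1) / (2.0 * buckets)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)

for bits in (16, 32, 64, 128):
    print(bits, collision_probability(100000, bits))
```

A wider hash makes the accidental collision vanishingly unlikely, but
the probability is never zero, and an attacker is not picking names at
random.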

So far, the only disadvantage that has been pointed out to TV's scheme
is that it creates extra dentries. But those extra dentries only
affect long object names, not the ones that (for example) the Ceph FS
creates. Also, when long object names occur in S3, they don't tend to
come out of the blue. They come about because the organization has a
sort of directory structure like this:

foocorp/business_data/business_reports/year_2008/input/foo
foocorp/business_data/business_reports/year_2008/input/bar

Of course we "know" that there are no such things as directories in
S3. But people like to structure their object names as if there were.
In cases like that, TV's scheme only incurs the cost of creating the
extra dentries once per long prefix.

Colin


* Re: long object names
  2011-04-21 21:09         ` Colin McCabe
@ 2011-04-21 21:23           ` Yehuda Sadeh Weinraub
  2011-04-21 21:44             ` Colin McCabe
  0 siblings, 1 reply; 19+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-04-21 21:23 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Gregory Farnum, ceph-devel

On Thu, Apr 21, 2011 at 2:09 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> On Thu, Apr 21, 2011 at 1:03 PM, Gregory Farnum
> <gregory.farnum@dreamhost.com> wrote:
>> I really don't see how pushing the naming complexity into the local filesystem,
>> where it adds lots of otherwise-useless inodes and dentries, is going to help us.
>
> Here is a quick summary of how TV's proposal would help us.
> 1. it avoids collisions entirely
> 2. You don't ever have to do an extra xattr lookup, no matter how short
> or long the object name is.

Yeah, but you read more directories. Note that btrfs stores the xattrs
on the directories, so reading those xattrs will have a lower IO
impact than traversing directories recursively.

>
> My add-on proposal helps us:
> 3. get reasonable prefix search performance (with those supposedly
> "useless" dentries)
>
>> I like what Yehuda has here for its relative simplicity -- though I think we should just up
>> the hash size enough that we don't need to handle collisions,
>
> Personally, I think the xattr proposal is more complex. I guess that
> is a matter of taste.
>
> No matter how big your hash table will be, there are still collisions!
> That is the nature of hashing. And since the code is open source, it's
> pretty easy for an attacker to read the source and then create two
> objects whose names collide.

Sure there will be, and the code should handle it. With a good hashing
scheme having a collision will be pretty rare.

>
> So far, the only disadvantage that has been pointed out to TV's scheme
> is that it creates extra dentries. But those extra dentries only
> affect long object names, not the ones that (for example) the Ceph FS
> creates. Also, when long object names occur in S3, they don't tend to
> come out of the blue. They come about because the organization has a
> sort of directory structure like this:
>
> foocorp/business_data/business_reports/year_2008/input/foo
> foocorp/business_data/business_reports/year_2008/input/bar
>
> Of course we "know" that there are no such things as directories in
> S3. But people like to structure their object names as if there were.
> In cases like that, TV's scheme only incurs the cost of creating the
> extra dentries once per long prefix.
>
As I said above, for most cases reading xattrs should be more efficient.


Yehuda


* Re: long object names
  2011-04-21 21:23           ` Yehuda Sadeh Weinraub
@ 2011-04-21 21:44             ` Colin McCabe
  2011-04-21 21:54               ` Yehuda Sadeh Weinraub
  0 siblings, 1 reply; 19+ messages in thread
From: Colin McCabe @ 2011-04-21 21:44 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Gregory Farnum, ceph-devel

On Thu, Apr 21, 2011 at 2:23 PM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Thu, Apr 21, 2011 at 2:09 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>> On Thu, Apr 21, 2011 at 1:03 PM, Gregory Farnum
>> <gregory.farnum@dreamhost.com> wrote:
>>> I really don't see how pushing the naming complexity into the local filesystem,
>>> where it adds lots of otherwise-useless inodes and dentries, is going to help us.
>>
>> Here is a quick summary of how TV's proposal would help us.
>> 1. it avoids collisions entirely
>> 2. You don't ever have to do an extra xattr lookup, no matter how short
>> or long the object name is.
>
> Yeah, but you read more directories. Note that btrfs stores the xattrs
> on the directories, so reading those xattrs will have a lower IO
> impact than traversing directories recursively.

It does seem like btrfs' extended attribute implementation is fairly
efficient. But Linux's dentry cache (dcache) is also pretty efficient.

TV's approach involves fewer syscalls and no loop.

I also wonder how xattr performance is on ext3/4 these days.
I think benchmarks would be needed to really settle this question. I'm
almost tempted to write one...

sincerely,
Colin


* Re: long object names
  2011-04-21 21:44             ` Colin McCabe
@ 2011-04-21 21:54               ` Yehuda Sadeh Weinraub
  2011-04-21 22:01                 ` Colin McCabe
  0 siblings, 1 reply; 19+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-04-21 21:54 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Gregory Farnum, ceph-devel

On Thu, Apr 21, 2011 at 2:44 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> On Thu, Apr 21, 2011 at 2:23 PM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
>> On Thu, Apr 21, 2011 at 2:09 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>>> On Thu, Apr 21, 2011 at 1:03 PM, Gregory Farnum
>>> <gregory.farnum@dreamhost.com> wrote:
>>>> I really don't see how pushing the naming complexity into the local filesystem,
>>>> where it adds lots of otherwise-useless inodes and dentries, is going to help us.
>>>
>>> Here is a quick summary of how TV's proposal would help us.
>>> 1. it avoids collisions entirely
>>> 2. You don't ever have to do an extra xattr lookup, no matter how short
>>> or long the object name is.
>>
>> Yeah, but you read more directories. Note that btrfs stores the xattrs
>> on the directories, so reading those xattrs will have a lower IO
>> impact than traversing directories recursively.
>
> It does seem like btrfs' extended attribute implementation is fairly
> efficient. But Linux's dentry cache (dcache) is also pretty efficient.
>
(resending to list)

It needs to be populated first before being efficient. And it'll be
less efficient now that you populate it with extra entries.

Yehuda


* Re: long object names
  2011-04-21 20:03       ` Gregory Farnum
  2011-04-21 21:09         ` Colin McCabe
@ 2011-04-21 22:00         ` Tommi Virtanen
  2011-04-21 22:23           ` Gregory Farnum
  2011-04-21 22:25           ` Yehuda Sadeh Weinraub
  1 sibling, 2 replies; 19+ messages in thread
From: Tommi Virtanen @ 2011-04-21 22:00 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On Thu, Apr 21, 2011 at 01:03:57PM -0700, Gregory Farnum wrote:
> I like what Yehuda has here for its relative simplicity

It's far from simple.

Let's look at the unlink path:


static int lfn_unlink(const char *pathname)
{
  const char *filename;
  char short_fn[PATH_MAX];
  char short_fn2[PATH_MAX];
  int r, i, exist, err;
  int path_len;
  int is_lfn;

** helper function to split the path to dir and file, figure out a
** short name for this longname, count the length of the directory
** part of the path and other things; loops through the candidates,
** comparing against the xattr
  r = lfn_get(pathname, short_fn, sizeof(short_fn), &filename, &exist, &is_lfn);
  if (r < 0)
    return r;
** if the filename wasn't actually too long, take the easy way out
  if (!is_lfn)
    return unlink(pathname);
  if (!exist) {
    errno = ENOENT;
    return -1;
  }

** actual file unlink here
  err = unlink(short_fn);
  if (err < 0)
    return err;

** and then, rename all the collisions, one by one, because they have
** a sequential number in them!
  path_len = filename - pathname;
  memcpy(short_fn2, pathname, path_len);

** this loop finds the highest sequential number in this hash
** collision bucket, saves it in i
  for (i = r + 1; ; i++) {
    struct stat buf;
    int ret;

    build_filename(&short_fn2[path_len], sizeof(short_fn2) - path_len, filename, i);
    ret = stat(short_fn2, &buf);
    if (ret < 0) {
      if (i == r + 1)
        return 0;

      break;
    }
  }

** and then the highest seq number munged filename gets renamed to
** fill the gap we left behind
  build_filename(&short_fn2[path_len], sizeof(short_fn2) - path_len, filename, i - 1);
  generic_dout(0) << "renaming " << short_fn2 << " -> " << short_fn << dendl;

  if (rename(short_fn2, short_fn) < 0) {
    generic_derr << "ERROR: could not rename " << short_fn2 << " -> " << short_fn << dendl;
    assert(0);
  }

  return 0;
}


Now, imagine a colliding file create between the stat and the rename
-> boom. This is not the only race in there.

The underlying problem is that you're constructing an atomic operation
out of multiple underlying operations, and you're not obsessively
careful about ordering them. Once you get obsessive about ordering
them, the extra directory my scheme creates will seem very cheap.

If you say that's not relevant because of some locking that the OSD
does, then 1) you're building a lot of assumptions on the locking
never changing 2) I can construct similar bugs with a single actor,
with a crash at the wrong moment.
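
To illustrate why the stat-then-rename window matters: rename()
replaces an existing target without complaint, so a file created in
the gap is silently lost, whereas link() refuses. The sketch below
shows only that difference. claim_name is a made-up helper, it is not
what either scheme does today, and link followed by unlink is itself
two steps:

```python
import os
import tempfile

def claim_name(src, dst):
    """Give src the name dst only if dst is unused; True on success."""
    try:
        os.link(src, dst)      # fails with EEXIST instead of overwriting
    except FileExistsError:
        return False
    os.unlink(src)
    return True

d = tempfile.mkdtemp()
a, b = os.path.join(d, "a"), os.path.join(d, "b")
for path, data in ((a, b"A"), (b, b"B")):
    with open(path, "wb") as f:
        f.write(data)

print(claim_name(a, b))    # False: b already exists, nothing is lost
os.rename(a, b)            # rename replaces b without any error
with open(b, "rb") as f:
    print(f.read())        # b'A': the old contents of b are gone
```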

Simple code makes Tv happy. You don't want an unhappy Tv all up in
your codebase.

-- 
:(){ :|:&};:


* Re: long object names
  2011-04-21 21:54               ` Yehuda Sadeh Weinraub
@ 2011-04-21 22:01                 ` Colin McCabe
  2011-04-21 22:58                   ` Zenon Panoussis
  0 siblings, 1 reply; 19+ messages in thread
From: Colin McCabe @ 2011-04-21 22:01 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Gregory Farnum, ceph-devel

On Thu, Apr 21, 2011 at 2:54 PM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Thu, Apr 21, 2011 at 2:44 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>> On Thu, Apr 21, 2011 at 2:23 PM, Yehuda Sadeh Weinraub
>> <yehudasa@gmail.com> wrote:
>>> On Thu, Apr 21, 2011 at 2:09 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>>>> On Thu, Apr 21, 2011 at 1:03 PM, Gregory Farnum
>>>> <gregory.farnum@dreamhost.com> wrote:
>>>>> I really don't see how pushing the naming complexity into the local filesystem,
>>>>> where it adds lots of otherwise-useless inodes and dentries, is going to help us.
>>>>
>>>> Here is a quick summary of how TV's proposal would help us.
>>>> 1. it avoids collisions entirely
>>>> 2. You don't ever have to do an extra xattr lookup, no matter how short
>>>> or long the object name is.
>>>
>>> Yeah, but you read more directories. Note that btrfs stores the xattrs
>>> on the directories, so reading those xattrs will have a lower IO
>>> impact than traversing directories recursively.
>>
>> It does seem like btrfs' extended attribute implementation is fairly
>> efficient. But Linux's dentry cache (dcache) is also pretty efficient.
>>
> (resending to list)
>
> It needs to be populated first before being efficient. And it'll be
> less efficient now that you populate it with extra entries.

That is a good point. However, xattrs also have a cost. It seems like
btrfs sometimes creates an inode for xattrs, and sometimes just
stashes them in the dentry (presumably if there aren't many and
they're small?)

The xattr-scheme always creates an extra xattr per entry. The
directory-based scheme creates extra directories, but not that many,
assuming a lot of objects have names with similar prefixes-- an
assumption that is likely to be true nearly all the time.
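
A quick way to see how much that assumption matters is to count the
directories a chunked layout would need for a set of names.
extra_directories is a hypothetical helper and the name sets are
synthetic, not measurements:

```python
def extra_directories(names, chunk_len):
    """Count the distinct directories a chunked layout needs for names."""
    dirs = set()
    for name in names:
        # Each full chunk before the final component is one directory
        # level, identified here by the name prefix that leads to it.
        end = chunk_len
        while end < len(name):
            dirs.add(name[:end])
            end += chunk_len
    return len(dirs)

shared = ["p" * 250 + "obj%04d" % i for i in range(1000)]
unrelated = ["%04d" % i + "q" * 250 for i in range(1000)]
print(extra_directories(shared, 250), extra_directories(unrelated, 250))
```

With a shared long prefix the 1000 names need a single extra
directory; with unrelated long names they need one each, which is the
same count as one xattr per entry.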

I think both schemes are doable, but I still lean towards the
directory-based one, just because I like fast prefix search.

Colin


* Re: long object names
  2011-04-21 22:00         ` Tommi Virtanen
@ 2011-04-21 22:23           ` Gregory Farnum
  2011-04-21 22:25           ` Yehuda Sadeh Weinraub
  1 sibling, 0 replies; 19+ messages in thread
From: Gregory Farnum @ 2011-04-21 22:23 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Thursday, April 21, 2011 at 3:00 PM, Tommi Virtanen wrote:
> On Thu, Apr 21, 2011 at 01:03:57PM -0700, Gregory Farnum wrote:
> > I like what Yehuda has here for its relative simplicity
> 
> It's far from simple.

I said "relatively simple". In fact I also suggested just ditching the collision handling precisely because of issues like this. Keep in mind that we have 200+ characters to make a hash out of[1], and PGs really shouldn't ever grow big enough for collisions to happen. And if we instead make a folder structure out of long names, that's not exactly going to remove any races.
I understand that Colin likes making folders so as to speed up the prefix searches, but I don't think we should optimize for RGW. If we're going to do that we should (God help us) implement multiple ObjectStore classes and choose the appropriate one to use based on what kind of data the cluster is serving.
I think that you're inflating the cost of doing hashing and an xattr when compared to deep dir lookups, especially in btrfs where we get the xattrs on lookup anyway. I'm also concerned about issues that may crop up when we take a 4k object name and translate it directly into a path of 4k + slashes. At that point we're not going to be able to address it all in one go and will need to pull tricks like moving in and out of directories, which endlessly complicates your simple little loops. :(
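
Rough arithmetic for the 4k case, assuming the 250-character chunks
and "_head" suffix from the proposal and Linux's PATH_MAX of 4096.
munged_path_length is a made-up helper:

```python
SAFE_FILENAME_LEN = 255 - len("_head")
PATH_MAX = 4096    # Linux value, hardcoded for the illustration

def munged_path_length(name_len, suffix_len=len("_head")):
    """Length of the relative path the directory scheme would produce."""
    components = max(1, -(-name_len // SAFE_FILENAME_LEN))  # ceiling division
    return name_len + (components - 1) + suffix_len

length = munged_path_length(4096)
print(length, length > PATH_MAX)
# prints 4117 True
```

That is before the collection directory is prepended, so a single
path-based syscall cannot reach the file.
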
-Greg

[1]: The current code has short hashes precisely because Yehuda wants to test his collision-handling, and it is a work in progress as you can see by the random "fix blah" patches at the end. :) 


* Re: long object names
  2011-04-21 22:00         ` Tommi Virtanen
  2011-04-21 22:23           ` Gregory Farnum
@ 2011-04-21 22:25           ` Yehuda Sadeh Weinraub
  2011-04-21 23:07             ` Tommi Virtanen
  1 sibling, 1 reply; 19+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-04-21 22:25 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Gregory Farnum, ceph-devel

On Thu, Apr 21, 2011 at 3:00 PM, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
> On Thu, Apr 21, 2011 at 01:03:57PM -0700, Gregory Farnum wrote:
>> I like what Yehuda has here for its relative simplicity
>
> It's far from simple.
>
> Let's look at the unlink path:
>
>
> static int lfn_unlink(const char *pathname)
> {
>  const char *filename;
>  char short_fn[PATH_MAX];
>  char short_fn2[PATH_MAX];
>  int r, i, exist, err;
>  int path_len;
>  int is_lfn;
>
> ** helper function to split the path to dir and file, figure out a
> ** short name for this longname, count the length of the directory
> ** part of the path and other things; loops through the candidates,
> ** comparing against the xattr
>  r = lfn_get(pathname, short_fn, sizeof(short_fn), &filename, &exist, &is_lfn);
>  if (r < 0)
>    return r;
> ** if the filename  wasn't actually too long, take the easy way out
>  if (!is_lfn)
>    return unlink(pathname);
>  if (!exist) {
>    errno = ENOENT;
>    return -1;
>  }
>
> ** actual file unlink here
>  err = unlink(short_fn);
>  if (err < 0)
>    return err;
>
> ** and then, rename all the collisions, one by one, because they have
> ** a sequential number in them!
>  path_len = filename - pathname;
>  memcpy(short_fn2, pathname, path_len);
>
> ** this loop finds the highest sequential number in this hash
> ** collision bucket, saves it in i
>  for (i = r + 1; ; i++) {
>    struct stat buf;
>    int ret;
>
>    build_filename(&short_fn2[path_len], sizeof(short_fn2) - path_len, filename, i);
>    ret = stat(short_fn2, &buf);
>    if (ret < 0) {
>      if (i == r + 1)
>        return 0;
>
>      break;
>    }
>  }
>
> ** and then the highest seq number munged filename gets renamed to
> ** fill the gap we left behind
>  build_filename(&short_fn2[path_len], sizeof(short_fn2) - path_len, filename, i - 1);
>  generic_dout(0) << "renaming " << short_fn2 << " -> " << short_fn << dendl;
>
>  if (rename(short_fn2, short_fn) < 0) {
>    generic_derr << "ERROR: could not rename " << short_fn2 << " -> " << short_fn << dendl;
>    assert(0);
>  }
>
>  return 0;
> }

This is a work in progress; proper locking is required and will be added.

>
>
> Now, imagine a colliding file create between the stat and the rename
> -> boom. This is not the only race in there.
>
Yeah, we're well aware of those races. Note that splitting into
subdirectories is racy too. Imagine one thread/process creating an
object while another is removing a similar object with the same
prefix. The first one tries to create a subtree while the other is
trying to remove the same subtree. I've seen these issues before;
they're real.
The chances of hitting these issues with a non-hashed structure are much
greater than the chances of hitting those races when an appropriate
hash algorithm is used (the 'zzz' hash is just a filler).
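
[Editorial sketch, not code from the lfn branch.] The stat-then-rename
window discussed above exists because rename() silently replaces an
existing target. One way to narrow it is to let the kernel do the
existence check: link() fails with EEXIST if the target name is already
taken. The helper name move_noclobber is hypothetical, and the sketch
assumes both names are on the same filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: move src to dst, but never overwrite dst.
 * Returns 0 on success, -errno on failure (-EEXIST if dst exists). */
static int move_noclobber(const char *src, const char *dst)
{
  /* link() checks for dst and creates it in one atomic step */
  if (link(src, dst) < 0)
    return -errno;
  /* dst now names the same inode; drop the old name */
  if (unlink(src) < 0)
    return -errno;
  return 0;
}
```

This only closes the "somebody created the target in between" race. A
crash between link() and unlink() still leaves both names behind, so it
does not answer the crash-consistency concern on its own.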

Yehuda
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: long object names
  2011-04-21 22:01                 ` Colin McCabe
@ 2011-04-21 22:58                   ` Zenon Panoussis
  2011-04-21 23:04                     ` Yehuda Sadeh Weinraub
  0 siblings, 1 reply; 19+ messages in thread
From: Zenon Panoussis @ 2011-04-21 22:58 UTC (permalink / raw)
  To: ceph-devel

>> It needs to be populated first before being efficient. And it'll be
>> less efficient now that you populate it with extra entries.

At the risk of being run out of town covered in tar and feathers, I'll
venture voicing the opinion of an end-user who doesn't know ceph, is not
a developer, and doesn't even understand half of the technicalities of
this discussion.

From my end-user point of view, efficiency is great and very desirable,
but is still secondary. Simplicity of code and the reduction of bugs that
comes with it is great and adds elegance to intelligence, but is still
secondary. The safety of data though, now, that is primary and above
everything else when it comes to a file system. A file system's *only*
purpose is to store and retrieve data. Efficiency and speed are features,
positive qualities that make a file system better, but only as long as
it actually can fulfil its purpose of storing and retrieving data without
losing or corrupting them.

Looking at it this way, the potential of a hash collision is catastrophic
no matter how small it might be. The measure of this problem is not the
objective likelihood that it will occur, but the subjective level of worry
that it might occur. Simply put, even if there's one chance of a hash
collision in 10 billion and I only have a couple of million files, I still
end up being unable to trust the integrity of *any* of them.

One might argue here that no file system in this world offers a 100% file
integrity guarantee. That's absolutely true, but it is and should remain
a shortcoming and not be elevated to an intentional design feature.

Z


* Re: long object names
  2011-04-21 22:58                   ` Zenon Panoussis
@ 2011-04-21 23:04                     ` Yehuda Sadeh Weinraub
  0 siblings, 0 replies; 19+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-04-21 23:04 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

On Thu, Apr 21, 2011 at 3:58 PM, Zenon Panoussis <oracle@provocation.net> wrote:
>
>>> It needs to be populated first before being efficient. And it'll be
>>> less efficient now that you populate it with extra entries.
>
> At the risk of being run out of town covered in tar and feathers, I'll
> venture voicing the opinion of an end-user who doesn't know ceph, is not
> a developer, and doesn't even understand half of the technicalities of
> this discussion.
>
> From my end-user point of view, efficiency is great and very desirable,
> but is still secondary. Simplicity of code and the reduction of bugs that
> comes with it is great and adds elegance to intelligence, but is still
> secondary. The safety of data though, now, that is primary and above
> everything else when it comes to a file system. A file system's *only*
> purpose is to store and retrieve data. Efficiency and speed are features,
> positive qualities that make a file system better, but only as long as
> it actually can fulfil its purpose of storing and retrieving data without
> losing or corrupting them.
>
> Looking at it this way, the potential of a hash collision is catastrophic
> no matter how small it might be. The measure of this problem is not the
> objective likelihood that it will occur, but the subjective level of worry
> that it might occur. Simply put, even if there's one chance of a hash
> collision in 10 billion and I only have a couple of million files, I still
> end up being unable to trust the integrity of *any* of them.
>
> One might argue here that no file system in this world offers a 100% file
> integrity guarantee. That's absolutely true, but it is and should remain
> a shortcoming and not be elevated to an intentional design feature.
>

We fully understand your worry. Note that with the hashing solution a
collision doesn't mean you lose data; it only means that the lookup
has to traverse more objects.
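
[Editorial sketch, not code from the lfn branch.] To illustrate why a
collision costs extra comparisons rather than data: every candidate in a
collision bucket keeps its full object name, and a lookup accepts a
candidate only on an exact match. The array below is an in-memory
stand-in for the bucket; in the scheme discussed here the full name
would be read back from an xattr on each short file name. The function
name bucket_find is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* slots[i] holds the full object name stored with the i-th short file
 * name in one collision bucket; NULL marks the first unused slot.
 * Returns the slot index of long_name, or -1 if it is not present. */
static int bucket_find(const char *const *slots, const char *long_name)
{
  for (int i = 0; slots[i] != NULL; i++) {
    if (strcmp(slots[i], long_name) == 0)
      return i;   /* exact match: this is the object */
  }
  return -1;      /* ran off the end of the bucket: no such object */
}
```

Two names that hash to the same value simply occupy slots 0 and 1 of the
same bucket; neither can be mistaken for the other.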

HTH,
Yehuda


* Re: long object names
  2011-04-21 22:25           ` Yehuda Sadeh Weinraub
@ 2011-04-21 23:07             ` Tommi Virtanen
  2011-04-22 15:44               ` Sage Weil
  0 siblings, 1 reply; 19+ messages in thread
From: Tommi Virtanen @ 2011-04-21 23:07 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Gregory Farnum, ceph-devel

On Thu, Apr 21, 2011 at 03:25:35PM -0700, Yehuda Sadeh Weinraub wrote:
> Yeah, we're well aware of those races. Note that splitting into
> subdirectories is racy too. Imagine one thread/process creating an
> object while another is removing a similar object with the same
> prefix. The first one tries to create a subtree while the other is
> trying to remove the same subtree. I've seen these issues before;
> they're real.

Yup, that's why I said there's a rmdir/mkdir race. You can fix that
two ways:

1. Don't rmdir; there's not going to be that much junk there
   (punting it, but not badly; no harm done, just littering).

2. Make the mkdir & create file case just handle the race; all you
   need is a simple retry loop, there are no problems and the races
   can't cause actual harm.

   And more to the point, this is the only kind of race there is.
   If FileStore needs to support arbitrary rename etc operations,
   they all need this same retry loop, but it's still just the
   same retry loop, and can probably be put in a nice utility function.

   *There are no other kinds of races*, and it seems FileStore doesn't
   really do renames etc anyway.

// try to create a file, using the dynamic dirs trick for long
// filenames. note that this is only needed for file creation; opening
// an existing file needs no mkdir trickery. overwrites pathname,
// returns fd or <0 on errors. pathname is relative to dirfd.
int really_create(int dirfd, char *pathname, int flags, mode_t mode) {
  int ret;

  // if pathname has no slashes there are no directories to create
  if (!strchr(pathname, '/')) {
    // safe to just open
    return openat(dirfd, pathname, flags, mode);
  }

  // go through leading prefixes and mkdir them
 retry:
  char *cursor = pathname;
  while (1) {
    printf("cursor=%p %s\n", cursor, cursor);
    cursor = strchr(cursor, '/');
    if (!cursor)
      break;
    // terminate the string here temporarily, mkdir that
    *cursor = '\0';
    ret = mkdirat(dirfd, pathname, 0755);
    // restore the slash so we don't forget
    *cursor = '/';
    // and nudge us past the slash
    cursor++;
    if (ret < 0) {
      switch (errno) {
      case EEXIST:
	// it already exists; ignore
	break;
      case ENOENT:
	// somebody rmdir'd a parent path; retry from the top
	goto retry;
      default:
	return -errno;
      }
    }
    // loop back to find the next slash and mkdir that
  }

  // leading path is created (unless we lost a race just now); now do
  // the file operation
  ret = openat(dirfd, pathname, flags, mode);
  if (ret<0) {
    switch (errno) {
    case ENOENT:
      // it seems we lost a race at the last second; do mkdirs again
      goto retry;
    default:
      return -errno;
    }
  }
  return ret;
}

-- 
:(){ :|:&};:


* Re: long object names
  2011-04-21 23:07             ` Tommi Virtanen
@ 2011-04-22 15:44               ` Sage Weil
  2011-04-22 16:34                 ` Tommi Virtanen
  2011-04-22 17:36                 ` Colin McCabe
  0 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2011-04-22 15:44 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Yehuda Sadeh Weinraub, Gregory Farnum, ceph-devel

Few things:

- I think the xattr approach is always going to be faster.  xattrs are 
stored adjacent to the inode in the btree, while creating intervening 
directories means a new inode is allocated, seeked to, and loaded, and 
_then_ the directory content is looked up in another part of the btree 
before the final inode is located.  For each level you add two seeks 
(although in the common case, at least, those inodes will be close by).

- xattrs may make it harder to inspect things out of band (you need to
peek at xattrs instead of subdirectories).  OTOH, it's a 1:1 mapping of
dirent to object, while subdirs are not.

- You can't make intervening directories both rare (long) and useful for 
prefix search (short) unless you really think people will be searching on 
100+ character prefixes.

- Hash collisions will be rare for all but our test cases.  If we only 
hash for long filenames (say, 200+ characters) that means someone has to 
find a SHA-256 collision (has anybody??).  And even then they only turn 1 
stat into 2.  Only if someone can generate an arbitrary number of inputs 
that hash to the same value do they get anywhere.  I don't think that's 
something we should worry about.  If someone breaks a crypto hash there 
are much bigger things to worry about.  (Even if we are super paranoid,
then we can just use sha(name + sha(name)).)

- We can easily wrap the non-fast path with a mutex to avoid the races
(because, again, collisions are vanishingly rare except in our test 
cases).

- I'm somewhat attracted to the idea of not escaping / and creating 
intervening directories because that's how people frequently use it.  
It's worth noting though that S3 at least doesn't treat / as anything 
special (you can delimit using anything) so we'd only optimize for the 
common case here.  And it will slow down _everything_else_ besides prefix 
search.  So... bleh.

- Those mkdir helpers may be useful for the prehashing.  Or we can just 
precreate the hash dirs (there'll be a fixed power-of-two number of them).

- For simplicity, I still think the simplest thing will be to push all the 
escaping/mangling into one layer.  One place to audit and unit test.
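
[Editorial sketch, not code from the lfn branch.] The "only hash long
names" idea above could be rendered in a single function along these
lines. Assumptions, all hypothetical: the 200-character cutoff is taken
from the "say, 200+ characters" remark, the 32-character readable prefix
and the prefix_hash layout are invented for illustration, and FNV-1a is
used only so the sketch needs no crypto library. It is NOT
collision-resistant; a real version would use SHA-256 as proposed above.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LFN_THRESHOLD 200   /* names this long or longer get hashed (assumed cutoff) */
#define LFN_PREFIX_LEN 32   /* hypothetical: readable prefix kept from the long name */

/* FNV-1a, 64-bit. Stand-in for a cryptographic hash; not collision-resistant. */
static uint64_t fnv1a64(const char *s)
{
  uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
  for (; *s; s++) {
    h ^= (unsigned char)*s;
    h *= 0x100000001b3ULL;              /* FNV prime */
  }
  return h;
}

/* Render name into out. Returns 0 if copied unchanged, 1 if replaced by
 * prefix + '_' + 16 hex digits of the hash, -1 if out is too small. */
static int render_name(const char *name, char *out, size_t outlen)
{
  int n;

  if (strlen(name) < LFN_THRESHOLD) {
    n = snprintf(out, outlen, "%s", name);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
  }
  n = snprintf(out, outlen, "%.*s_%016llx", LFN_PREFIX_LEN, name,
               (unsigned long long)fnv1a64(name));
  return (n < 0 || (size_t)n >= outlen) ? -1 : 1;
}
```

Keeping this in one function is the point of the last bullet: escaping,
the snapid/head suffix, and the hash could all be applied here and unit
tested in isolation.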

sage







* Re: long object names
  2011-04-22 15:44               ` Sage Weil
@ 2011-04-22 16:34                 ` Tommi Virtanen
  2011-04-22 17:36                 ` Colin McCabe
  1 sibling, 0 replies; 19+ messages in thread
From: Tommi Virtanen @ 2011-04-22 16:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yehuda Sadeh Weinraub, Gregory Farnum, ceph-devel

On Fri, Apr 22, 2011 at 08:44:49AM -0700, Sage Weil wrote:
> - We can easily wrap the non-fast path with a mutex to avoid the races
> (because, again, collisions are vanishingly rare except in our test 
> cases).

How do you guard against crashes, e.g. the create+set_xattr crashing
before set_xattr?

How do you guard against gaps in the sequence number thing? (Perhaps
make that part a random string, and change consumers to listdir
instead of probing 1,2,3...)

How do you convince yourself you've covered all the races?
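
[Editorial sketch, not code from the lfn branch.] One common answer to
the create+set_xattr crash question is to do both steps under a
temporary name and only then rename into place, so a file without its
xattr is never visible under a name that lookups probe. The attribute
name user.ceph.lfn and the function name create_with_lfn are
hypothetical (the thread only says xattr names carry a user.ceph.
prefix), and the caller is assumed to supply a unique tmp_name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

#define LFN_XATTR "user.ceph.lfn"   /* hypothetical attribute name */

/* Create tmp_name, store long_name in its xattr, then rename it to
 * short_name. Returns an open fd, or -errno after removing tmp_name. */
static int create_with_lfn(int dirfd, const char *tmp_name,
                           const char *short_name, const char *long_name)
{
  int r = 0;
  int fd = openat(dirfd, tmp_name, O_CREAT | O_EXCL | O_WRONLY, 0644);
  if (fd < 0)
    return -errno;

  if (fsetxattr(fd, LFN_XATTR, long_name, strlen(long_name), 0) < 0)
    r = -errno;
  else if (fsync(fd) < 0)
    r = -errno;
  else if (renameat(dirfd, tmp_name, dirfd, short_name) < 0)
    r = -errno;

  if (r < 0) {
    unlinkat(dirfd, tmp_name, 0);   /* nothing was published; clean up */
    close(fd);
    return r;
  }
  return fd;
}
```

Two caveats: renameat() replaces an existing short_name, so choosing the
slot still needs the lock or a no-replace primitive; and a crash can
leave a stray temporary file, which is litter rather than corruption.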

> - For simplicity, I still think the simplest thing will be to push all the 
> escaping/mangling into one layer.  One place to audit and unit test.

I think the big functional benefit with that is that you can have the
suffix not be obscured by the hash; FOO_a43fec_n_head not FOO_a43fec_n

-- 
:(){ :|:&};:


* Re: long object names
  2011-04-22 15:44               ` Sage Weil
  2011-04-22 16:34                 ` Tommi Virtanen
@ 2011-04-22 17:36                 ` Colin McCabe
  1 sibling, 0 replies; 19+ messages in thread
From: Colin McCabe @ 2011-04-22 17:36 UTC (permalink / raw)
  To: Sage Weil
  Cc: Tommi Virtanen, Yehuda Sadeh Weinraub, Gregory Farnum, ceph-devel

On Fri, Apr 22, 2011 at 8:44 AM, Sage Weil <sage@newdream.net> wrote:
> Few things:
>
> - I think the xattr approach is always going to be faster.  xattrs are
> stored adjacent to the inode in the btree, while creating intervening
> directories means a new inode is allocated, seeked to, and loaded, and
> _then_ the directory content is looked up in another part of the btree
> before the final inode is located.  For each level you add two seeks
> (although in the common case, at least, those inodes will be close by).

Fair enough.

> - You can't make intervening directories both rare (long) and useful for
> prefix search (short) unless you really think people will be searching on
> 100+ character prefixes.

Earlier I suggested making it configurable, so that we could have it
tuned to a short value on the cluster backing rgw, but a long value
elsewhere.

> - Hash collisions will be rare for all but our test cases.  If we only
> hash for long filenames (say, 200+ characters) that means someone has to
> find a SHA-256 collision (has anybody??).  And even then they only turn 1
> stat into 2.  Only if someone can generate an arbitrary number of inputs
> that hash to the same value do they get anywhere.  I don't think that's
> something we should worry about.  If someone breaks a crypto hash there
> are much bigger things to worry about.  (Even if we are super paranoid,
> then we can just use sha(name + sha(name)).)

A good guide to choosing a crypto hash: http://valerieaurora.org/hash.html

> - We can easily wrap the non-fast path with a mutex to avoid the races
> (because, again, collisions are vanishingly rare except in our test
> cases).

I believe that all these operations are already done under the PG
lock. So there are no race conditions in normal operation. TV is
talking about a case where there has been a crash and we're resuming
from some intermediate state. Based on our earlier discussion, perhaps
this is not a problem on btrfs because of the snapshotting mechanic?

cheers,
Colin


Thread overview: 19+ messages
2011-04-21  4:42 long object names Sage Weil
2011-04-21 18:56 ` Tommi Virtanen
2011-04-21 19:27   ` Colin McCabe
2011-04-21 19:32     ` Tommi Virtanen
2011-04-21 20:03       ` Gregory Farnum
2011-04-21 21:09         ` Colin McCabe
2011-04-21 21:23           ` Yehuda Sadeh Weinraub
2011-04-21 21:44             ` Colin McCabe
2011-04-21 21:54               ` Yehuda Sadeh Weinraub
2011-04-21 22:01                 ` Colin McCabe
2011-04-21 22:58                   ` Zenon Panoussis
2011-04-21 23:04                     ` Yehuda Sadeh Weinraub
2011-04-21 22:00         ` Tommi Virtanen
2011-04-21 22:23           ` Gregory Farnum
2011-04-21 22:25           ` Yehuda Sadeh Weinraub
2011-04-21 23:07             ` Tommi Virtanen
2011-04-22 15:44               ` Sage Weil
2011-04-22 16:34                 ` Tommi Virtanen
2011-04-22 17:36                 ` Colin McCabe
