* [RFC] Plumbing-only support for storing object metadata
@ 2008-08-09 21:07 Jamey Sharp, Josh Triplett
  2008-08-09 21:49 ` Scott Chacon
  2008-08-10 11:09 ` Jan Hudec
  0 siblings, 2 replies; 22+ messages in thread
From: Jamey Sharp, Josh Triplett @ 2008-08-09 21:07 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4624 bytes --]

The attached test illustrates a proposal for minimal plumbing support
usable to store permissions, ownership, and other metadata in git
repositories. This proposal is fully compatible with existing
repositories when the new functionality is not in use. Similar to the
introduction of subprojects, we have not yet specified the porcelain.
We believe that the plumbing will provide sufficient functionality for
many uses, and these uses will help determine the appropriate porcelain.

We would have included an implementation along with the test, but we
need help with a detail of git internals. More on that at the end. We'd
also appreciate feedback on the proposal.

We propose representing objects with metadata using a new "inode"
object. An inode object contains the hash of the real object and the
hash of a "props" (properties) object. A props object contains a set of
name-value pairs. Tree objects can reference inode objects in addition
to the current possibilities of blobs, trees, and subproject commits; we
propose using the currently invalid type 110000 (S_IFREG | S_IFIFO) for
inode objects. We primarily see a use case for inodes referencing blobs
and trees, though as defined they support any object type.

By separating property objects from inodes, objects with the same
properties can share the same property object; we expect, for instance,
that repositories reflecting /etc will have many references to the
"root:root 644" and "root:root 755" properties. Both object types have
a unique representation: equivalent inodes and props objects will have
the same hash.

The exact format of an inode looks like:

<object_type> SP <object_sha1> LF
props SP <props_sha1> LF

A property object looks like a sorted list of one or more of:

<key> SP <value> LF

The same key is allowed to appear more than once, in which case the
lines will be sorted by the bytes of the values. Allowing duplicate
keys will make it easier to retrieve a set of similar properties such
as acls.

This format implies certain constraints on property names and values.
We propose limiting both names and values to printable ASCII
(\x20-\x7E), and disallowing spaces in keys. If some use case requires
property names or values with binary data, that property could use a
printable encoding such as base64.

We believe this proposal provides a sensible approach to storing
metadata in Git repositories; however, we're happy with any reasonable
solution that provides equivalent functionality. Some alternatives we
considered:

- We could allow UTF-8 property names or values, rather than strictly
  ASCII. Our proposal is conservative in this regard, allowing an
  extension to UTF-8 later while remaining compatible with existing
  repositories.

- We could allow arbitrary property names or values, by changing the
  props format to store lengths rather than using delimiters. This
  would not be a compatible change, so it needs to be decided early.

- Tree objects already store mode bits, but we believe that it would
  prove simpler to store complete modes in properties rather than
  adjusting Git internals to preserve arbitrary mode bits in trees.
  Even if new versions of Git preserved the full mode, existing
  versions of Git might silently give incorrect results. Furthermore,
  mode bits other than executability seem of limited value without
  ownership information.

- inode objects could directly store properties, rather than
  referencing a separate props object. This would eliminate one
  indirection needed to access properties. However, it would also
  reduce sharing of data for objects with the same properties.
  Furthermore, we expect that the indirection will have negligible cost
  when accessing objects from packs, given appropriately sorted packs.
  Shared props objects also suggest caching at various layers.

- We could have called them "meta" objects instead of "props", but then
  we couldn't make "mad props" jokes.

We began trying to implement this proposal, but we found this enum
definition in cache.h, which made us think there's only room for one
more kind of object:

enum object_type {
	OBJ_BAD = -1,
	OBJ_NONE = 0,
	OBJ_COMMIT = 1,
	OBJ_TREE = 2,
	OBJ_BLOB = 3,
	OBJ_TAG = 4,
	/* 5 for future expansion */
	OBJ_OFS_DELTA = 6,
	OBJ_REF_DELTA = 7,
	OBJ_ANY,
	OBJ_MAX,
};

Do these object_type values appear in any on-disk structure, or does any
other reason exist why this set of values cannot change? Can we add
additional object types for inodes and props? If not, what would you
recommend instead?

- Jamey Sharp and Josh Triplett

[-- Attachment #2: t1008-inodes.sh --]
[-- Type: text/plain, Size: 2838 bytes --]

#!/bin/sh
#
# Copyright (c) 2008 Josh Triplett and Jamey Sharp
#

test_description="Test inode plumbing"

. ./test-lib.sh

cat > shadow <<EOF
root:*:13943:0:99999:7:::
EOF
shadow_sha1=`git hash-object -t blob -w shadow`

cat > props <<EOF
group shadow
mode 640
owner root
EOF
props_sha1=FIXME

cat > inode <<EOF
blob $shadow_sha1
props $props_sha1
EOF
inode_sha1=FIXME

cat > tree <<EOF
110644 inode $inode_sha1	shadow
EOF
tree_sha1=FIXME

test_expect_success 'hash a props' '
    test $props_sha1 = "`git hash-object -t props -w props`"
'

test_expect_success 'cat-file a props' '
    git cat-file props $props_sha1 | cmp -s - props
'

test_expect_success 'hash an inode' '
    test $inode_sha1 = "`git hash-object -t inode -w inode`"
'

test_expect_success 'cat-file an inode' '
    git cat-file inode $inode_sha1 | cmp -s - inode
'

test_expect_success 'tree with inode' '
    test $tree_sha1 = "`git mktree < tree`"
'

test_expect_success 'ls-tree of tree with inode' '
    git ls-tree $tree_sha1 | cmp -s - tree
'

test_expect_success 'check type with cat-file' '
    test inode = "`git cat-file -t $tree_sha1:shadow`"
'

test_expect_success 'cat-file inode tree:inode' '
    git cat-file inode $tree_sha1:shadow | cmp -s - inode
'

test_expect_success 'cat-file blob tree:inode' '
    git cat-file blob $tree_sha1:shadow | cmp -s - shadow
'

test_expect_success 'cat-file props tree:inode' '
    git cat-file props $tree_sha1:shadow | cmp -s - props
'

test_expect_success 'read-tree' '
    git read-tree $tree_sha1
'

test_expect_success 'ls-files shows no modified files' '
    test -z "`git ls-files -m || echo fail`"
'

test_expect_success 'write-tree' '
    test $tree_sha1 = "`git write-tree`"
'

test_expect_success 'commit-tree' '
    COMMIT=`echo Commit with an inode | git commit-tree $tree_sha1` &&
    git update-ref HEAD $COMMIT
'

cat >shadow <<EOF
root:*:13943:0:99999:7:::
jamey:*:13943:0:99999:7:::
josh:*:13943:0:99999:7:::
EOF
shadow_sha1=FIXME

test_expect_success 'ls-files shows modified file' '
    test "shadow" = "`git ls-files -m`"
'

test_expect_success 'add modified file to index' '
    git add shadow
'

test_expect_success 'commit modification' '
    git commit -m "Modify shadow"
'

test_expect_success 'ls-files shows no modified files' '
    test -z "`git ls-files -m || echo fail`"
'

test_expect_success 'check type with cat-file, after modification' '
    test inode = "`git cat-file -t HEAD:shadow`"
'

cat > inode <<EOF
blob $shadow_sha1
props $props_sha1
EOF
inode_sha1=FIXME

test_expect_success 'cat-file inode HEAD:inode, after modification' '
    git cat-file inode HEAD:shadow | cmp -s - inode
'

test_expect_success 'cat-file blob HEAD:inode, after modification' '
    git cat-file blob HEAD:shadow | cmp -s - shadow
'

test_expect_success 'cat-file props HEAD:inode, after modification' '
    git cat-file props HEAD:shadow | cmp -s - props
'

test_done

^ permalink raw reply	[flat|nested] 22+ messages in thread
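The props format proposed in the message above (one `<key> SP <value> LF` line per property, sorted, printable ASCII only, no spaces in keys) can be sketched in plain shell. This is an illustration under stated assumptions, not the authors' implementation: `canon_props`, `valid_props`, and `obj_sha1` are hypothetical helper names. `obj_sha1` further assumes that a new object type would be hashed the way existing loose objects are (SHA-1 over `<type> SP <size> NUL <content>`), which the RFC does not state, and that `sha1sum` or `shasum` is installed.

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not part of git or of the RFC.

# canon_props: sort "<key> SP <value>" lines from stdin bytewise.
# SP (0x20) sorts below every byte allowed in a key, so a plain
# whole-line sort orders by key first and then by the bytes of the value.
canon_props() {
    LC_ALL=C sort
}

# valid_props: succeed only if every stdin line is a non-empty key of
# printable ASCII without spaces, one SP, then a (possibly empty)
# printable-ASCII value. Whether empty values are legal is not specified.
valid_props() {
    if LC_ALL=C grep -Evq '^[!-~]+ [ -~]*$'; then
        return 1    # at least one line did not match the format
    fi
    return 0
}

# obj_sha1 TYPE FILE: print the SHA-1 of "<TYPE> SP <size> NUL <content>".
# ASSUMPTION: a new type would be hashed the way blobs are today.
obj_sha1() {
    obj_size=$(( $(wc -c < "$2") ))
    {
        printf '%s %s\000' "$1" "$obj_size"
        cat "$2"
    } | if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum -a 1; fi |
        cut -d' ' -f1
}
```

If the hashing assumption holds, the `props_sha1=FIXME` placeholder in the test could be filled in with something like `props_sha1=$(obj_sha1 props props)` once the props file has been run through `canon_props`.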
* Re: [RFC] Plumbing-only support for storing object metadata
  2008-08-09 21:07 [RFC] Plumbing-only support for storing object metadata Jamey Sharp, Josh Triplett
@ 2008-08-09 21:49 ` Scott Chacon
  2008-08-10 3:51   ` Shawn O. Pearce
  2008-08-10 11:09 ` Jan Hudec
  1 sibling, 1 reply; 22+ messages in thread
From: Scott Chacon @ 2008-08-09 21:49 UTC (permalink / raw)
  To: Jamey Sharp, Josh Triplett, git

> We began trying to implement this proposal, but we found this enum
> definition in cache.h, which made us think there's only room for one
> more kind of object:
>
> enum object_type {
>        OBJ_BAD = -1,
>        OBJ_NONE = 0,
>        OBJ_COMMIT = 1,
>        OBJ_TREE = 2,
>        OBJ_BLOB = 3,
>        OBJ_TAG = 4,
>        /* 5 for future expansion */
>        OBJ_OFS_DELTA = 6,
>        OBJ_REF_DELTA = 7,
>        OBJ_ANY,
>        OBJ_MAX,
> };
>
> Do these object_type values appear in any on-disk structure, or does any
> other reason exist why this set of values cannot change?  Can we add
> additional object types for inodes and props?  If not, what would you
> recommend instead?

If I'm not mistaken, these are the values used to identify data in the
header sections of packfile objects. The first four bits are used to
identify the object type, where the first bit is static and the next
three are the object type of the data following the header. Since the
type is encoded using those three bits, 0-7 is the valid range. I
would assume that would be difficult to change, since all the
packfiles depend on that range.

Scott

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata
  2008-08-09 21:49 ` Scott Chacon
@ 2008-08-10 3:51   ` Shawn O. Pearce
  2008-08-10 11:20     ` Stephen R. van den Berg
  0 siblings, 1 reply; 22+ messages in thread
From: Shawn O. Pearce @ 2008-08-10 3:51 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Jamey Sharp, Josh Triplett, git

Scott Chacon <schacon@gmail.com> wrote:
> > We began trying to implement this proposal, but we found this enum
> > definition in cache.h, which made us think there's only room for one
> > more kind of object:
> >
> > enum object_type {
> >        OBJ_BAD = -1,
> >        OBJ_NONE = 0,
> >        OBJ_COMMIT = 1,
> >        OBJ_TREE = 2,
> >        OBJ_BLOB = 3,
> >        OBJ_TAG = 4,
> >        /* 5 for future expansion */
> >        OBJ_OFS_DELTA = 6,
> >        OBJ_REF_DELTA = 7,
> >        OBJ_ANY,
> >        OBJ_MAX,
> > };
> >
> > Do these object_type values appear in any on-disk structure, or does any
> > other reason exist why this set of values cannot change?  Can we add
> > additional object types for inodes and props?  If not, what would you
> > recommend instead?
>
> If I'm not mistaken, these are the values used to identify data in the
> header sections of packfile objects.  The first four bits are used to
> identify the object type, where the first bit is static and the next
> three are the object type of the data following the header.  Since the
> type is encoded using those three bits, 0-7 is the valid range.  I
> would assume that would be difficult to change, since all the
> packfiles depend on that range.

Correct. There is only room in the pack file for 3 bits in the type
field, resulting in types 0-7 as being the only valid range. Only
type 0 and 5 are available for use.

Nico and I have (at least in the past) agreed that type 0 is meant
as an escape indicator. If the type is set to 0 then the real type
code appears in another byte of data which follows the object's
inflated length.

That leaves only type 5 available. Note that because type 5 can be
encoded into a really small space (3 bits) compared to any other
type we may add we really want to use it for something which will
appear _very_frequently_. The OBJ_DICT_TREE encoding we were
talking about doing for pack v4 fits that bill, as nearly any
project (even huge ones like Mozilla or KDE) would probably be using
OBJ_DICT_TREE throughout their pack files, and there is a noticeable
reduction in disk usage (and increased performance due to lower
page faults) as a result.

The proposed "inode" and "props" types sound like they are useful
for only less common cases, and would appear very infrequently
compared to a tree object.

So yea, there really aren't any new type bits available.

But tossing aside the type bit argument, I'm not sure I see the value
in adding limited arbitrary properties to names in a tree. How does
one edit these? How do you inspect them before you get a checkout,
assuming they might actually have an impact on the checkout process?
How the hell do you merge them?

I'm also very concerned about the limited range of values for both
keys and values in a "props" type. Even if we did go down this road
of supporting such a concept at the plumbing layer (and in the storage
model) everywhere else we are 8-bit clean. Commit messages, tag
messages, blob contents, even file names in tree objects. (OK, file
names cannot contain a NUL byte, but whatever, that is their only
limitation.)

The proper encoding for both keys and values should permit any data
to be stored. Doesn't the extended attributes feature in Linux and
FreeBSD both support any data to be attached to an inode in the fs?

Please don't get me wrong. I think this is a _BAD_ idea.

A bad idea that will only clutter up the core object model, and
the core processing code of that object model. Extended attributes
aren't used that much on local filesystems, because they are hard
to work with and suck performance wise. Performance in Git is
a _feature_. It matters. Our clean object model really helps to
make that possible.

--
Shawn.

^ permalink raw reply	[flat|nested] 22+ messages in thread
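The 3-bit limit discussed in the two messages above can be made concrete with a little shell arithmetic. This is a minimal sketch, assuming the layout as I understand the pack format: in the first byte of a packed object's header the top bit flags that more size bytes follow, the next three bits carry the type, and the low four bits begin the size. `pack_type_of_byte` and `pack_type_name` are hypothetical names, not git functions.

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not git code.

# pack_type_of_byte BYTE: print the 3-bit type field (0-7) taken from the
# first header byte of a packed object, given as a decimal number.
pack_type_of_byte() {
    echo $(( ($1 >> 4) & 7 ))
}

# pack_type_name N: map a type code to the names from the enum quoted in
# the thread; 0 and 5 are the two codes described as still unassigned.
pack_type_name() {
    case "$1" in
    1) echo OBJ_COMMIT ;;
    2) echo OBJ_TREE ;;
    3) echo OBJ_BLOB ;;
    4) echo OBJ_TAG ;;
    6) echo OBJ_OFS_DELTA ;;
    7) echo OBJ_REF_DELTA ;;
    0) echo "unassigned (proposed escape indicator)" ;;
    5) echo "unassigned (future expansion)" ;;
    *) echo "out of range: only 3 bits are available" >&2; return 1 ;;
    esac
}
```

Because the field is masked with `7`, there is no value a header byte can hold that would select a ninth type, which is why new inode and props types would have to share code 5 or go through an escape mechanism.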
* Re: [RFC] Plumbing-only support for storing object metadata
  2008-08-10 3:51   ` Shawn O. Pearce
@ 2008-08-10 11:20     ` Stephen R. van den Berg
  2008-08-10 12:16       ` david
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen R. van den Berg @ 2008-08-10 11:20 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Scott Chacon, Jamey Sharp, Josh Triplett, git

Shawn O. Pearce wrote:
>The proper encoding for both keys and values should permit any data
>to be stored.  Doesn't the extended attributes feature in Linux and
>FreeBSD both support any data to be attached to an inode in the fs?

I'd think so yes, so any attempt to store the metadata should support it
as well.
That also would imply that any such metadata storage would have to allow
for arbitrary blobs to be stored under tag-names.
And *that* would imply that anything that implements a kludge like
specifying a flat-file format to encode name/value pairs doesn't scale.

>I think this is a _BAD_ idea.

>A bad idea that will only clutter up the core object model, and
>the core processing code of that object model.  Extended attributes
>aren't used that much on local filesystems, because they are hard
>to work with and suck performance wise.  Performance in Git is
>a _feature_.  It matters.  Our clean object model really helps to
>make that possible.

Quite right.

However, pondering the idea a bit more, I could envision something
similar to the following:

In the git tree the following layout would be used:

plainfile.txt
otherdir/otherplainfile.txt
projects/README
projects/README/_owner
projects/README/_acl
projects/README/_icon
projects/README/_mimetype
projects/something.mpeg
projects/something.mpeg/_icon
projects/something.mpeg/_mimetype
projects/asubdir/thirdplainfile.txt

That would imply that in the tree storage, the only extension would be
that for any given reference to a blob in a tree object, there could be
a reference to a tree object as well. I.e. something like this in the
tree object:

100644 blob f7b7414159b8a7159538fac543b2b19ef531968e    README
000000 tree df6ee415f04d6ccea5dab0de562c2f155583a2c4    README
100644 blob 0a54f8ec13df03cf6bdb5b973acec6d8141c01cc    something.mpeg
000000 tree a421448d765abb7bb979dc1d56621d0fc9b41229    something.mpeg

The extra tree reference for README would actually refer to something like:

100644 blob be3365fdaae0f4ed8c22c4cf38a4b1f88f9069c3    _owner
100644 blob 739e9e8f3d095931084b54cbf7f90d8f64eb0ac6    _acl
100644 blob bc1a868bb50644712966a50150d21199c401d6d5    _icon
100644 blob 6076bde5b3b6b8bed4ec4968d09abdbf015b3b75    _mimetype

Which would contain the extra attributes.

And that would imply that during checkout you can do a rich checkout or a
flat checkout for any files under the projects directory.

A flat checkout results in the following files in the filesystem:

plainfile.txt
otherdir/otherplainfile.txt
projects/README
projects/README.attr/_owner
projects/README.attr/_acl
projects/README.attr/_icon
projects/README.attr/_mimetype
projects/something.mpeg
projects/something.mpeg.attr/_icon
projects/something.mpeg.attr/_mimetype
projects/asubdir/thirdplainfile.txt

A rich checkout results in the following files in the filesystem:

plainfile.txt
otherdir/otherplainfile.txt
projects/README
projects/something.mpeg
projects/asubdir/thirdplainfile.txt
projects/asubdir/fourthplainfile.txt

The rich checkout also applies the extended attributes/metadata to the
filesystem (i.e. it would store all the metadata in the appropriate
places).

The nice thing about this setup is that:
a. There is *no* change whatsoever to existing repositories or
   repository format.
b. It's less filling (i.e. there are no special bits or object types to be
   used).
c. Speed for files without attributes is not affected.
d. It's fully 8-bit-transparent.
e. It scales, even if you have large or many attributes.
f. It uses the natural tree storage abstraction already supported in
   git repositories to store the additional data.
g. It allows reuse of attribute information at many levels.
h. It even allows for a hierarchy of attributes attached to a single
   file (no current filesystem supports that (yet)).
i. The only change in the fast-path of core-git is that it would have to
   know how to skip tree objects referenced in a tree object if a
   same-name blob object is already there. This can even be optimised
   by requiring the attribute-tree to have a very specific (e.g. 0)
   mode to ease detection.
j. Editing and merging the meta-information could be made an almost
   natural operation in the flat-checkout mode (the extension to be used
   to name the attribute subdir should be made configurable).
--
Sincerely,
           Stephen R. van den Berg.

Real programmers don't produce results, they return exit codes.

^ permalink raw reply	[flat|nested] 22+ messages in thread
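The flat-checkout naming in the message above amounts to a simple path transformation, which can be sketched in shell. This is only an illustration under assumptions: `attr_path` and `attr_file_of` are hypothetical helper names, the `.attr` default is taken from the example listing, and the optional suffix argument stands in for the "configurable" extension mentioned in point j. Nested attribute hierarchies (point h) are not handled.

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not part of git.

# attr_path FILE ATTR [SUFFIX]: print where attribute ATTR of FILE would
# live in a flat checkout, e.g. projects/README.attr/_owner.
attr_path() {
    printf '%s%s/%s\n' "$1" "${3:-.attr}" "$2"
}

# attr_file_of PATH [SUFFIX]: inverse mapping. Print the file that an
# attribute path belongs to; fail if PATH is not inside an attribute dir.
attr_file_of() {
    attr_suffix=${2:-.attr}
    case "$1" in
    */*) attr_dir=${1%/*} ;;    # drop the final "/<attribute name>"
    *)   return 1 ;;            # no directory part, cannot be an attribute
    esac
    case "$attr_dir" in
    *"$attr_suffix") printf '%s\n' "${attr_dir%"$attr_suffix"}" ;;
    *) return 1 ;;
    esac
}
```

A different separator, as suggested later in the thread, only changes the suffix argument: `attr_path projects/README _acl @attr` yields `projects/README@attr/_acl`.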
* Re: [RFC] Plumbing-only support for storing object metadata
  2008-08-10 11:20     ` Stephen R. van den Berg
@ 2008-08-10 12:16       ` david
  2008-08-10 14:50         ` Jan Hudec
  0 siblings, 1 reply; 22+ messages in thread
From: david @ 2008-08-10 12:16 UTC (permalink / raw)
  To: Stephen R. van den Berg
  Cc: Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git

On Sun, 10 Aug 2008, Stephen R. van den Berg wrote:

> Shawn O. Pearce wrote:
>> The proper encoding for both keys and values should permit any data
>> to be stored.  Doesn't the extended attributes feature in Linux and
>> FreeBSD both support any data to be attached to an inode in the fs?
>
> I'd think so yes, so any attempt to store the metadata should support it
> as well.
> That also would imply that any such metadata storage would have to allow
> for arbitrary blobs to be stored under tag-names.
> And *that* would imply that anything that implements a kludge like
> specifying a flat-file format to encode name/value pairs doesn't scale.
>
>> I think this is a _BAD_ idea.
>
>> A bad idea that will only clutter up the core object model, and
>> the core processing code of that object model.  Extended attributes
>> aren't used that much on local filesystems, because they are hard
>> to work with and suck performance wise.  Performance in Git is
>> a _feature_.  It matters.  Our clean object model really helps to
>> make that possible.
>
> Quite right.
>
> However, pondering the idea a bit more, I could envision something
> similar to the following:
>
> In the git tree the following layout would be used:
>
> plainfile.txt
> otherdir/otherplainfile.txt
> projects/README
> projects/README/_owner
> projects/README/_acl
> projects/README/_icon
> projects/README/_mimetype
> projects/something.mpeg
> projects/something.mpeg/_icon
> projects/something.mpeg/_mimetype
> projects/asubdir/thirdplainfile.txt
>
> That would imply that in the tree storage, the only extension would be
> that for any given reference to a blob in a tree object, there could be
> a reference to a tree object as well.  I.e. something like this in the
> tree object:
>
> 100644 blob f7b7414159b8a7159538fac543b2b19ef531968e    README
> 000000 tree df6ee415f04d6ccea5dab0de562c2f155583a2c4    README
> 100644 blob 0a54f8ec13df03cf6bdb5b973acec6d8141c01cc    something.mpeg
> 000000 tree a421448d765abb7bb979dc1d56621d0fc9b41229    something.mpeg
>
> The extra tree reference for README would actually refer to something like:
>
> 100644 blob be3365fdaae0f4ed8c22c4cf38a4b1f88f9069c3    _owner
> 100644 blob 739e9e8f3d095931084b54cbf7f90d8f64eb0ac6    _acl
> 100644 blob bc1a868bb50644712966a50150d21199c401d6d5    _icon
> 100644 blob 6076bde5b3b6b8bed4ec4968d09abdbf015b3b75    _mimetype
>
> Which would contain the extra attributes.
>
> And that would imply that during checkout you can do a rich checkout or a
> flat checkout for any files under the projects directory.
>
> A flat checkout results in the following files in the filesystem:
>
> plainfile.txt
> otherdir/otherplainfile.txt
> projects/README
> projects/README.attr/_owner
> projects/README.attr/_acl
> projects/README.attr/_icon
> projects/README.attr/_mimetype
> projects/something.mpeg
> projects/something.mpeg.attr/_icon
> projects/something.mpeg.attr/_mimetype
> projects/asubdir/thirdplainfile.txt
>
> A rich checkout results in the following files in the filesystem:
>
> plainfile.txt
> otherdir/otherplainfile.txt
> projects/README
> projects/something.mpeg
> projects/asubdir/thirdplainfile.txt
> projects/asubdir/fourthplainfile.txt
>
> The rich checkout also applies the extended attributes/metadata to the
> filesystem (i.e. it would store all the metadata in the appropriate
> places).
>
> The nice thing about this setup is that:
> a. There is *no* change whatsoever to existing repositories or
>    repository format.
> b. It's less filling (i.e. there are no special bits or object types to be
>    used).
> c. Speed for files without attributes is not affected.
> d. It's fully 8-bit-transparent.
> e. It scales, even if you have large or many attributes.
> f. It uses the natural tree storage abstraction already supported in
>    git repositories to store the additional data.
> g. It allows reuse of attribute information at many levels.
> h. It even allows for a hierarchy of attributes attached to a single
>    file (no current filesystem supports that (yet)).
> i. The only change in the fast-path of core-git is that it would have to
>    know how to skip tree objects referenced in a tree object if a
>    same-name blob object is already there.  This can even be optimised
>    by requiring the attribute-tree to have a very specific (e.g. 0)
>    mode to ease detection.
> j. Editing and merging the meta-information could be made an almost
>    natural operation in the flat-checkout mode (the extension to be used
>    to name the attribute subdir should be made configurable).

you also need to be able to add something to the attribute tree to
indicate what type of metadata is being stored in it. you could have
*nix perms, windows perms, posix extended attributes, or other things.

I could see this as a great way to deal with editing exif data for
images. when checking in a .jpg, extract the .exif data and store it
separately, when doing a rich checkout combine it back into the .jpg
file. now the large binary blob doesn't change so you don't have to try
and find deltas for it.

all the special case things would be in the helper routines written to do
the 'rich checkin/checkout' of each type. people who don't care about
this don't enable these helpers in the configs and so don't suffer any
overhead (other than item (i) above)

this has the potential to be horribly abused, but it also has the
potential to open up some very interesting possibilities as well.

David Lang

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata
  2008-08-10 12:16       ` david
@ 2008-08-10 14:50         ` Jan Hudec
  2008-08-10 17:57           ` Stephen R. van den Berg
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Hudec @ 2008-08-10 14:50 UTC (permalink / raw)
  To: david
  Cc: Stephen R. van den Berg, Shawn O. Pearce, Scott Chacon,
	Jamey Sharp, Josh Triplett, git

On Sun, Aug 10, 2008 at 05:16:47 -0700, david@lang.hm wrote:
> On Sun, 10 Aug 2008, Stephen R. van den Berg wrote:
>> However, pondering the idea a bit more, I could envision something
>> similar to the following:
>>
>> In the git tree the following layout would be used:
>>
>> plainfile.txt
>> otherdir/otherplainfile.txt
>> projects/README
>> projects/README/_owner
>> projects/README/_acl
>> projects/README/_icon
>> projects/README/_mimetype
>> projects/something.mpeg
>> projects/something.mpeg/_icon
>> projects/something.mpeg/_mimetype
>> projects/asubdir/thirdplainfile.txt
>>
>> That would imply that in the tree storage, the only extension would be
>> that for any given reference to a blob in a tree object, there could be
>> a reference to a tree object as well.  I.e. something like this in the
>> tree object:
>>
>> 100644 blob f7b7414159b8a7159538fac543b2b19ef531968e    README
>> 000000 tree df6ee415f04d6ccea5dab0de562c2f155583a2c4    README
>> 100644 blob 0a54f8ec13df03cf6bdb5b973acec6d8141c01cc    something.mpeg
>> 000000 tree a421448d765abb7bb979dc1d56621d0fc9b41229    something.mpeg
>>
>> The extra tree reference for README would actually refer to something like:
>>
>> 100644 blob be3365fdaae0f4ed8c22c4cf38a4b1f88f9069c3    _owner
>> 100644 blob 739e9e8f3d095931084b54cbf7f90d8f64eb0ac6    _acl
>> 100644 blob bc1a868bb50644712966a50150d21199c401d6d5    _icon
>> 100644 blob 6076bde5b3b6b8bed4ec4968d09abdbf015b3b75    _mimetype
>>
>> Which would contain the extra attributes.

... provided the two entries under the same name wouldn't drive the internal
logic completely mad, I quite like this. Note by the way, that you need to
allow for two trees too, because you may want to store attributes for
directories too. It's no problem to differentiate them by type 04755 vs.
00000 or 11000 or whatever, but it is a problem for index, because that does
not store directory entries, so metadata for a directory would conflict with
regular entries in it. Can be fixed by using different filetype for the
metadata.

>> And that would imply that during checkout you can do a rich checkout or a
>> flat checkout for any files under the projects directory.
>>
>> A flat checkout results in the following files in the filesystem:
>>
>> plainfile.txt
>> otherdir/otherplainfile.txt
>> projects/README
>> projects/README.attr/_owner
>> projects/README.attr/_acl
>> projects/README.attr/_icon
>> projects/README.attr/_mimetype
>> projects/something.mpeg
>> projects/something.mpeg.attr/_icon
>> projects/something.mpeg.attr/_mimetype
>> projects/asubdir/thirdplainfile.txt

Storing like this in index as well would make it even more compatible.

Of course you are reserving the .attr suffix. But it's probably OK to reserve
/something/ for this functionality (when the functionality is needed only).
Maybe it could use some special character (@, #, =, $ or something) to
separate the suffix instead of normal . to decrease the chance to conflict
with other use.

>> A rich checkout results in the following files in the filesystem:
>>
>> plainfile.txt
>> otherdir/otherplainfile.txt
>> projects/README
>> projects/something.mpeg
>> projects/asubdir/thirdplainfile.txt
>> projects/asubdir/fourthplainfile.txt
>>
>> The rich checkout also applies the extended attributes/metadata to the
>> filesystem (i.e. it would store all the metadata in the appropriate
>> places).
>>
>> The nice thing about this setup is that:
>> a. There is *no* change whatsoever to existing repositories or
>>    repository format.

Well, there is a small change -- it needs to support multiple entries with
different type but same name in the tree object (but could be avoided by
using some special reserved suffix). Plus the index functionality needs to be
modified to put the metadata entries in the right places.

Still of course much less invasive than the proposal from OP.

>> b. It's less filling (i.e. there are no special bits or object types to be
>>    used).
>> c. Speed for files without attributes is not affected.
>> d. It's fully 8-bit-transparent.
>> e. It scales, even if you have large or many attributes.
>> f. It uses the natural tree storage abstraction already supported in
>>    git repositories to store the additional data.
>> g. It allows reuse of attribute information at many levels.
>> h. It even allows for a hierarchy of attributes attached to a single
>>    file (no current filesystem supports that (yet)).
>> i. The only change in the fast-path of core-git is that it would have to
>>    know how to skip tree objects referenced in a tree object if a
>>    same-name blob object is already there.  This can even be optimised
>>    by requiring the attribute-tree to have a very specific (e.g. 0)
>>    mode to ease detection.
>> j. Editing and merging the meta-information could be made an almost
>>    natural operation in the flat-checkout mode (the extension to be used
>>    to name the attribute subdir should be made configurable).
>
> you also need to be able to add something to the attribute tree to
> indicate what type of metadata is being stored in it. you could have
> *nix perms, windows perms, posix extended attributes, or other things.

Well, not really. I think the best way to implement the 'rich' checkout is to
use a hook to read/write the metadata. Git-core should just support storing
attributes, but not actually store any of its own, since they are not needed
for its main purpose, which is source code control.

> I could see this as a great way to deal with editing exif data for
> images. when checking in a .jpg, extract the .exif data and store it
> separately, when doing a rich checkout combine it back into the .jpg
> file. now the large binary blob doesn't change so you don't have to try
> and find deltas for it.
>
> all the special case things would be in the helper routines written to do
> the 'rich checkin/checkout' of each type. people who don't care about
> this don't enable these helpers in the configs and so don't suffer any
> overhead (other than item (i) above)
>
> this has the potential to be horribly abused, but it also has the
> potential to open up some very interesting possibilities as well.

I would say your example above belongs in the category of abuses. The binary
differ can deal with exif just OK (it's not compressed IIRC), so all you need
is a custom diff driver for merging -- and that's already supported.
Compressed stuff can be already handled for the differ with clean & smudge.

--
						 Jan 'Bulb' Hudec <bulb@ucw.cz>

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata
  2008-08-10 14:50         ` Jan Hudec
@ 2008-08-10 17:57           ` Stephen R. van den Berg
  2008-08-10 18:11             ` Jan Hudec
  0 siblings, 1 reply; 22+ messages in thread
From: Stephen R. van den Berg @ 2008-08-10 17:57 UTC (permalink / raw)
  To: Jan Hudec
  Cc: david, Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git

Jan Hudec wrote:
>On Sun, Aug 10, 2008 at 05:16:47 -0700, david@lang.hm wrote:
>> On Sun, 10 Aug 2008, Stephen R. van den Berg wrote:
>>> However, pondering the idea a bit more, I could envision something
>>> similar to the following:

>.... provided the two entries under the same name wouldn't drive the internal
>logic completely mad, I quite like this. Note by the way, that you need to
>allow for two trees too, because you may want to store attributes for

Well, in theory yes, but currently git doesn't store directories.
How about extending git-core to allow for storage of directories by
virtue of the following object in a tree:

040000 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    .

I.e. the hash belongs to the empty blob.
Normally you don't (have to) store these directory blobs, but if you
insist on having them, git will create the empty directory on checkout
(i.e. you wouldn't need the dummy file trick anymore to force the
directory to be present).
--
Sincerely,
           Stephen R. van den Berg.

Real programmers don't produce results, they return exit codes.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-10 17:57 ` Stephen R. van den Berg @ 2008-08-10 18:11 ` Jan Hudec 2008-08-10 20:16 ` Stephen R. van den Berg 0 siblings, 1 reply; 22+ messages in thread From: Jan Hudec @ 2008-08-10 18:11 UTC (permalink / raw) To: Stephen R. van den Berg Cc: david, Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git [-- Attachment #1: Type: text/plain, Size: 2040 bytes --] On Sun, Aug 10, 2008 at 19:57:35 +0200, Stephen R. van den Berg wrote: > Jan Hudec wrote: > >On Sun, Aug 10, 2008 at 05:16:47 -0700, david@lang.hm wrote: > >> On Sun, 10 Aug 2008, Stephen R. van den Berg wrote: > >>> However, pondering the idea a bit more, I could envision something > >>> similar to the following: > > >.... provided the two entries under the same name wouldn't drive the internal > >logic completely mad, I quite like this. Note by the way, that you need to > >allow for two trees too, because you may want to store attributes for > > Well, in theory yes, but currently git doesn't store directories. It depends. It does store directories in the tree objects, it just does not do that in index. And we are talking about tree objects, where git does store directories. Besides, that is irrelevant to storing attributes for directories -- the attribute objects are not themselves directories, so git would store them just fine. > How about extending git-core to allow for storage of directories by > virtue of the following object in a tree: > > 040000 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 . > > I.e. the hash belongs to the empty blob. Sorry, but this is insane. If git was to store anything for empty directories, it would be empty tree, not a tree containing empty blob called '.'. 
There was even a prototype patch to do that sent to the list (I believe it was from Linus and was part of an argument along the lines "you could do it like this, so stop talking and finish it if you have good enough reason to want it (which you obviously don't)"). > Normally you don't (have to) store these directory blobs, but if you > insist on having them, git will create the empty directory on checkout > (i.e. you wouldn't need the dummy file trick anymore to force the > directory to be present). No, I don't give a damn about directories themselves. I want to store their attributes, which is a completely different thing. -- Jan 'Bulb' Hudec <bulb@ucw.cz> ^ permalink raw reply [flat|nested] 22+ messages in thread
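The two candidate placeholders in this exchange, the empty blob and the empty tree, both have fixed object names, because an object name is the SHA-1 of a "<type> <size>\0" header followed by the payload. A minimal sketch in Python that reproduces them (the helper name git_object_id is only illustrative):

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 name of an object: sha1("<type> <size>\\0" + payload)."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


# The empty blob quoted in the proposed "040000 blob ... ." entry.
empty_blob = git_object_id("blob", b"")
# The empty tree, which is what an empty directory would naturally map to.
empty_tree = git_object_id("tree", b"")

print(empty_blob)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(empty_tree)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

The first value matches the hash in the quoted tree entry; the second is the object an empty directory would point at if git stored one.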
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-10 18:11 ` Jan Hudec @ 2008-08-10 20:16 ` Stephen R. van den Berg 2008-08-10 22:34 ` Junio C Hamano 2008-08-16 6:21 ` Josh Triplett, Jamey Sharp 0 siblings, 2 replies; 22+ messages in thread From: Stephen R. van den Berg @ 2008-08-10 20:16 UTC (permalink / raw) To: Jan Hudec Cc: david, Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git Jan Hudec wrote: >On Sun, Aug 10, 2008 at 19:57:35 +0200, Stephen R. van den Berg wrote: >> Jan Hudec wrote: >If git was to store anything for empty >directories, it would be empty tree, not a tree containing empty blob called >'.'. There was even a prototype patch to do that sent to the list (I believe Ok, sounds reasonable. With respect to the storage inside the tree, using a duplicate name with mode 0 or a name with some kind of rare extension... It should probably be investigated how much of the existing core needs to be touched/changed to support the duplicate name. I agree that using a custom rare extension would allow for almost no change to git-core. -- Sincerely, Stephen R. van den Berg. Real programmers don't produce results, they return exit codes. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-10 20:16 ` Stephen R. van den Berg @ 2008-08-10 22:34 ` Junio C Hamano 2008-08-10 23:10 ` david 2008-08-16 6:21 ` Josh Triplett, Jamey Sharp 1 sibling, 1 reply; 22+ messages in thread From: Junio C Hamano @ 2008-08-10 22:34 UTC (permalink / raw) To: Stephen R. van den Berg Cc: Jan Hudec, david, Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git

"Stephen R. van den Berg" <srb@cuci.nl> writes:

> I agree that using a custom rare extension would allow for almost no
> change to git-core.

And at that point there is no "plumbing" side change necessary. You just have to teach your Porcelain to notice the associated "metainfo" files and deal with them.

For merging such "metainfo", you would need to do your "flattish/unrich" checkout anyway, so it might be that an easier approach for such a Porcelain might be:

 * Define a specific leading path, say ".attrs", as the hierarchy to store the attributes information. Attributes to a file README and t/Makefile will be stored in .attrs/README and .attrs/t/Makefile. They are probably just plain text files, so that you can do your merges and parsing easily, but with this counterproposal the only requirement is they are simple plain blobs. The plumbing layer does not care what payload they carry.

 * When you want to "git setattr $path", the Porcelain mucks with ".attr/$path". Probably checkout codepath would give you a hook that lets you reflect what ".attr/$path" records to "$path", and checkin (i.e. not commit but update-index) codepath would have another hook to let you grab attributes for "$path" and update ".attr/$path".

 * Merging and handling updates to ".attrs/" hierarchy are done the usual way we handle blobs. Your Porcelain would then take the result and do whatever changes to ACL or xattrs to the corresponding path, perhaps from a hook after merge.

So it will most likely boil down to a "Porcelain only" convention that different Porcelains would agree on.
My reaction for the initial proposal was very similar to the one given by Shawn. I do not see much point on having plumbing side support (yet). ^ permalink raw reply [flat|nested] 22+ messages in thread
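The shadow-hierarchy convention described above amounts to a path mapping plus a payload format that only the Porcelain understands. A rough sketch in Python, assuming a ".attrs" root and a one-"key value"-pair-per-line payload; the payload format and both function names are assumptions made for illustration, since the message only requires that the files be plain blobs:

```python
from pathlib import PurePosixPath

ATTR_ROOT = ".attrs"  # example leading path; any agreed-upon name would do


def attr_path(path: str) -> str:
    """Map a tracked path to the blob that would carry its attributes."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        raise ValueError(f"not a repository-relative path: {path!r}")
    return str(PurePosixPath(ATTR_ROOT) / p)


def parse_attrs(text: str) -> dict:
    """Parse a hypothetical 'key value' per line payload into a dict."""
    attrs = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, _, value = line.partition(" ")
        attrs[key] = value
    return attrs


print(attr_path("README"))      # .attrs/README
print(attr_path("t/Makefile"))  # .attrs/t/Makefile
print(parse_attrs("owner root:root\nmode 644\n"))
```

Because the payload is line-oriented text, ordinary blob merging applies to it unchanged, which is the point of the counterproposal.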
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-10 22:34 ` Junio C Hamano @ 2008-08-10 23:10 ` david 2008-08-11 10:11 ` Stephen R. van den Berg 0 siblings, 1 reply; 22+ messages in thread From: david @ 2008-08-10 23:10 UTC (permalink / raw) To: Junio C Hamano Cc: Stephen R. van den Berg, Jan Hudec, Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git On Sun, 10 Aug 2008, Junio C Hamano wrote: >> I agree that using a custom rare extension would allow for almost no >> change to git-core. > > And at that point there is no "plumbing" side change necessary. You just > have to teach your Porcelain to notice the associated "metainfo" files and > deal with them. > > For merging such "metainfo", you would need to do your "flattish/unrich" > checkout anyway, so it might be that an easier approach for such a > Porcelain might be: > > * Define a specific leading path, say ".attrs" the hierarchy to store the > attributes information. Attributes to a file README and t/Makefile > will be stored in .attrs/README and .attrs/t/Makefile. They are > probably just plain text file you can do your merges and parsing easily > but with this counterproposal the only requirement is they are simple > plain blobs. The plumbing layer does not care what payload they carry. > > * When you want to "git setattr $path", the Porcelain mucks with > ".attr/$path". Probably checkout codepath would give you a hook that > lets you reflect what ".attr/$path" records to "$path", and checkin > (i.e. not commit but update-index) codepath would have another hook to > let you grab attributes for "$path" and update ".attr/$path". > > * Merging and handling updates to ".attrs/" hierarchy are done the usual > way we handle blobs. Your Porcelain would then take the result and do > whatever changes to ACL or xattrs to the corresponding path, perhaps > from a hook after merge. 
>
> So it will most likely boil down to a "Porcelain only" convention that
> different Porcelains would agree on.
>
> My reaction for the initial proposal was very similar to the one given by
> Shawn. I do not see much point on having plumbing side support (yet).

a few items

convenience

1. tying the attributes to the file more directly will make it much easier to deal with them along with the file in the non-rich checkout (it's much easier to say README* than README .attr/README*)

consistency

2. putting hooks into the plumbing that can call external programs for the rich checkin/checkout will let all porcelains make use of the features without having to modify all of them independently.

safety

3. when doing checkins/checkouts of individual files you need to be sure that you deal with the correct attributes at the same time (or else that the person is explicitly requesting only a piece of it) with the attributes closely associated with the file this is much easier to do (this is another aspect of the convenience in #1 above)

4. if the configuration of what helper to use changes from one revision to another the plumbing (which is already looking at the tree object for both revisions) is in a better position to detect and alert than the porcelains

David Lang

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-10 23:10 ` david @ 2008-08-11 10:11 ` Stephen R. van den Berg 0 siblings, 0 replies; 22+ messages in thread From: Stephen R. van den Berg @ 2008-08-11 10:11 UTC (permalink / raw) To: david Cc: Junio C Hamano, Jan Hudec, Shawn O. Pearce, Scott Chacon, Jamey Sharp, Josh Triplett, git david@lang.hm wrote: >On Sun, 10 Aug 2008, Junio C Hamano wrote: >>* Define a specific leading path, say ".attrs" the hierarchy to store the >> attributes information. Attributes to a file README and t/Makefile >1. tying the attributes to the file more directly will make it much >easier to deal with them along with the file in the non-rich checkout >(it's much easier to say README* than README .attr/README*) I have to agree that from a practical standpoint for the user, having the file and the attribute tree right next to each other in the tree is a lot easier to manage. So even though setting up a shadow attribute tree is cleaner because it doesn't need some kind of magic extension, it tends to clutter the management in the flat-file checkout case. -- Sincerely, Stephen R. van den Berg. "Beware: In C++, your friends can see your privates!" ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-10 20:16 ` Stephen R. van den Berg 2008-08-10 22:34 ` Junio C Hamano @ 2008-08-16 6:21 ` Josh Triplett, Jamey Sharp 2008-08-16 7:56 ` david ` (2 more replies) 1 sibling, 3 replies; 22+ messages in thread From: Josh Triplett, Jamey Sharp @ 2008-08-16 6:21 UTC (permalink / raw) To: Jan Hudec, Shawn O. Pearce, Stephen R. van den Berg, Junio C Hamano, david, Scott Chacon Cc: git We want to reply to a few of the common points raised in this thread first, and then we have a few point-by-point replies later in this mail. In particular, we see two common questions: whether Git should include support for metadata such as permissions and ownership at all, and how Git should store this metadata if so. We agree entirely with Jan Hudec's first point: On Sun, Aug 10, 2008 at 01:09:25PM +0200, Jan Hudec wrote: > I am glad you came up with this, as I think this is the only reasonable way > to support things like etckeeper. The metastore and similar solutions are > a kludge and fall apart in so many cases. Metastore, etckeeper, and other existing "hook-based" solutions, which attempt to handle permissions separately, have several fundamental problems. They do not integrate well with the normal Git workflow, they often have race conditions that can lead to security problems, and they store working-copy permissions separately from the filesystem permissions where they can potentially become out of sync. We want to emphasize that we really don't have a preference amongst the various reasonable proposals for storing object metadata or for presenting that metadata in porcelain. We're happy that our proposal stimulated discussion on the topic and that we now understand relevant Git internals much better. We made our proposal and test case to demonstrate that we're willing to design and implement a solution, not just complain that Git does not support permissions. 
Among the proposals mentioned in this thread, we see some common requirements: - All of the proposals suggest referencing the properties from the tree containing the object they apply to, rather than creating an extra object to store both hashes together. We originally thought that having a single object reference in the tree would make it easier to iterate over the tree, construct each object, and apply its permissions. However, several of the proposals address that in other ways. - Several proposals suggest storing the metadata as a tree object, rather than a custom "props" object. This makes a lot of sense. It allows Git to use existing logic for parsing, reachability checking, merging, and checkouts. On the other hand, we want to optimize for the common cases such as POSIX permissions and ownership rather than the unusual cases like extended attributes, so it might make sense to store all the metadata for a particular object as a single blob. - Several responses expressed concerns about merges and conflicts. We propose implementing support for this in plumbing the same way Git does for everything else: put entries into the index with stages marked. This works whether metadata storage uses a tree or a blob. Porcelains can choose how to resolve these merges and present conflicts to the user for resolution. - Several proposals suggest using a magic suffix or special mode to distinguish object file entries from their metadata entries. Either of these approaches seem fine. In the case of a suffix, we think it makes the most sense to use '/' or "//" in this suffix; any other suffix would potentially conflict with legitimate filenames. "//" has the advantage of working unambiguously in the index as well. Either way, any porcelain on top of this could choose a different naming scheme for non-"rich" checkouts, or check out the properties as a separate top-level directory as Junio proposed. 
- Several people complained about our initial proposal of printable ASCII for property names and values. We used this approach solely because it seemed like a reasonable starting place. Length-prefixed binary would work fine and provide 8-bit cleanness, as would the proposals that store properties as trees of blobs. On Sun, Aug 10, 2008 at 01:09:25PM +0200, Jan Hudec wrote: > Advantages (+), disadvantages (-) and possible (*) extensions of 1: > > + It should be possible to get to something useful with very little changes > to git. Basically all it needs to be useful for things like etckeeper is > to: > . Make sure both clean and smudge filter always get filehandle to the > disk file in question (I am /not/ suggesting path as the file may be > written in a staging area and moved into place later). > . Pass the blob id currently in index to the clean filter, so it can > maintain the data if they are not representable in this particular > checkout (eg. when checking out such repo on windows). Note, that this > would also be useful for ignoring insignificant changes, eg. when > a in some config file order is not important and the tool using it > randomly changes that order when changing that file. It might prove possible to implement a reasonable and secure interface for permissions on top of Git without standardizing the plumbing and storage formats, true. With enough specialized hooks, some of the existing problems with solutions like etckeeper and metastore go away. However, we feel that most of these solutions will have to deal with the same problems, such as storage and merging, and the solutions will end up re-solving problems already handled by Git plumbing. Those who do not understand Git's solutions are doomed to re-invent them poorly. :) On Sat, Aug 09, 2008 at 08:51:01PM -0700, Shawn O. Pearce wrote: > Nico and I have (at least in the past) agreed that type 0 is meant > as an escape indicator. 
If the type is set to 0 then the real type > code appears in another byte of data which follows the object's > inflated length. > > That leaves only type 5 available. [...] > So yea, there really aren't any new type bits available. If consensus opinion was that new object types were a reasonable way to solve this problem, then it sounds as if there's plenty of room to create new types using this escape mechanism. As a result we found your subsequent comments a bit confusing since they seem to say only one more new object type can exist. > But tossing aside the type bit argument, I'm not sure I see the > value in adding limited arbitrary properties to names in a tree. > How does one edit these? How do you inspect them before you get > a checkout, assuming they might actually have an impact on the > checkout process? How the hell do you merge them? Several of those questions depend on the porcelain. The plumbing would provide support for adding these properties to the index, committing them, viewing them, and doing merges in the index. The porcelain would handle friendly editing, application to the working tree, and friendly merges. > A bad idea that will only clutter up the core object model, and > the core processing code of that object model. Extended attributes > aren't used that much on local filesystems, because they are hard > to work with and suck performance wise. Performance in Git is > a _feature_. It matters. Our clean object model really helps to > make that possible. If you mean that our proposal seems too general, like extended attributes, then we can't argue with that. :) We would have no problem with a solution that only supported the standard POSIX info found in "stat" (permissions, ownership, times). We just felt that such a specific proposal would not go over well; if consensus points toward a more specialized solution, that works fine for us too. 
We actually proposed the simple name/value storage for props objects because we primarily cared about the case of small values like permissions, not large values like arbitrary xattrs. On Sun, Aug 10, 2008 at 03:34:37PM -0700, Junio C Hamano wrote: > For merging such "metainfo", you would need to do your "flattish/unrich" > checkout anyway, Why not just put entries into the index for each stage as merging currently does? You could then compare the metadata in the index with the filesystem metadata in the "rich" checkout, and resolve the conflict by adding the desired metadata to the index as stage 0 as usual. You would just need some sort of interface like "git add --metadata file" to add the metadata for file to the index. Alternatively, you could have some simple wrappers to directly edit the metadata in the index, much like the existing "git update-index --chmod" does for the execute bit. > * Define a specific leading path, say ".attrs" the hierarchy to store the > attributes information. Attributes to a file README and t/Makefile > will be stored in .attrs/README and .attrs/t/Makefile. They are > probably just plain text file you can do your merges and parsing easily > but with this counterproposal the only requirement is they are simple > plain blobs. The plumbing layer does not care what payload they carry. Using a top-level tree to store all of the permissions makes sub-trees not stand alone; the tree sha1 of a subdirectory doesn't give you enough information to recreate the metadata for that subdirectory. > * When you want to "git setattr $path", the Porcelain mucks with > ".attr/$path". Probably checkout codepath would give you a hook that > lets you reflect what ".attr/$path" records to "$path", and checkin > (i.e. not commit but update-index) codepath would have another hook to > let you grab attributes for "$path" and update ".attr/$path". This hook would need to provide a way to process these updates before the blob or tree contents get put into place. 
For example, if you check out /etc/shadow, you need to apply the non-world-readable permissions *before* you write out the contents. > So it will most likely boild down to a "Porcelain only" convention that > different Porcelains would agree on. > > My reaction for the initial proposal was very similar to the one given by > Shawn. I do not see much point on having plumbing side support (yet). We agree in principle that a sufficiently rich set of hooks might make it possible to implement metadata outside of the Git plumbing. However, in practice the set of hooks necessary for complete integration seems quite large. Furthermore, implementing these hooks efficiently seems difficult. We also don't want to force people to use a non-Git porcelain just to get support for permissions. Finally, we think that along with a common storage format, these porcelains will all have a common set of problems to solve, and it seems better to solve them once correctly in Git using code that mostly already exists. - Josh Triplett and Jamey Sharp ^ permalink raw reply [flat|nested] 22+ messages in thread
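The suggestion to handle metadata conflicts through index stages follows the usual three-way pattern: stage 1 holds the common ancestor, stage 2 "ours", stage 3 "theirs". A hedged sketch in Python of that rule applied to one path's metadata, assuming (purely for illustration) that the metadata is a flat key/value mapping; merge_metadata is a hypothetical helper, not an existing git interface:

```python
def merge_metadata(base: dict, ours: dict, theirs: dict):
    """Three-way merge of per-path metadata.

    Returns (merged, conflicts). Each conflict records the three stage
    values the way unmerged index entries would: 1=base, 2=ours, 3=theirs.
    A value of None means the key is absent on that side.
    """
    merged, conflicts = {}, {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both sides agree (including both deleting)
            result = o
        elif o == b:      # only theirs changed
            result = t
        elif t == b:      # only ours changed
            result = o
        else:             # both changed, differently: leave it unmerged
            conflicts[key] = {1: b, 2: o, 3: t}
            continue
        if result is not None:
            merged[key] = result
    return merged, conflicts


base = {"mode": "644", "owner": "root"}
ours = {"mode": "600", "owner": "root"}
theirs = {"mode": "644", "owner": "www-data"}
print(merge_metadata(base, ours, theirs))
# ({'mode': '600', 'owner': 'www-data'}, {})
```

A Porcelain would then present anything left in the conflicts mapping to the user and record the resolution as stage 0.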
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-16 6:21 ` Josh Triplett, Jamey Sharp @ 2008-08-16 7:56 ` david 2008-08-16 9:55 ` Junio C Hamano 2008-08-18 6:12 ` Shawn O. Pearce 2 siblings, 0 replies; 22+ messages in thread From: david @ 2008-08-16 7:56 UTC (permalink / raw) To: Josh Triplett, Jamey Sharp Cc: Jan Hudec, Shawn O. Pearce, Stephen R. van den Berg, Junio C Hamano, Scott Chacon, git On Fri, 15 Aug 2008, Josh Triplett wrote: > > - Several proposals suggest storing the metadata as a tree object, > rather than a custom "props" object. This makes a lot of sense. It > allows Git to use existing logic for parsing, reachability > checking, merging, and checkouts. On the other hand, we want to > optimize for the common cases such as POSIX permissions and ownership > rather than the unusual cases like extended attributes, so it might > make sense to store all the metadata for a particular object as a > single blob. ahh, but if the 'tree object' that you are storing is named file.attr and contains just the posix permissions and ownership, there are a very small number of different permutations that you will see on any one system (let alone in any one repository), as such the duplicates will all hash to the same value and be combined in storage. your rich checkout porcelains can cache these into a lookup table and gain performance basically equivalent to defining a custom object. in fact, I'd be willing to bet that even when extended attributes are in use (say SELinux tags) the number of different tree objects that would be used would still be pretty small. > On Sun, Aug 10, 2008 at 03:34:37PM -0700, Junio C Hamano wrote: >> For merging such "metainfo", you would need to do your "flattish/unrich" >> checkout anyway, > > Why not just put entries into the index for each stage as merging > currently does?
You could then compare the metadata in the index with > the filesystem metadata in the "rich" checkout, and resolve the conflict > by adding the desired metadata to the index as stage 0 as usual. You > would just need some sort of interface like "git add --metadata file" to > add the metadata for file to the index. Alternatively, you could have > some simple wrappers to directly edit the metadata in the index, much > like the existing "git update-index --chmod" does for the execute bit. because the tools to work directly on the index are very limited. yes they can be left in the index, but then the index-manipulation tools need to understand every type of metadata. if it's able to be presented in the "flattish/unrich" mode it will work anywhere, even on operating systems that can't run your 'rich' tools David Lang ^ permalink raw reply [flat|nested] 22+ messages in thread
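The storage argument in this reply rests on content addressing: identical attribute payloads hash to the same object name, so they are stored once no matter how many files share them. A small sketch in Python, assuming the attributes are kept as plain blobs with an invented "key value" payload (the paths and payloads below are made up for the example):

```python
import hashlib


def blob_id(payload: bytes) -> str:
    """Object name of a blob: sha1("blob <size>\\0" + payload)."""
    return hashlib.sha1(b"blob %d\0" % len(payload) + payload).hexdigest()


# Hypothetical per-file attribute payloads for a repository tracking /etc.
attrs_by_path = {
    "passwd": b"owner root:root\nmode 644\n",
    "hosts": b"owner root:root\nmode 644\n",
    "fstab": b"owner root:root\nmode 644\n",
    "cron.daily/logrotate": b"owner root:root\nmode 755\n",
}

store = {}       # object name -> payload (each payload stored once)
references = {}  # path -> object name (what a tree entry would hold)
for path, payload in attrs_by_path.items():
    oid = blob_id(payload)
    store.setdefault(oid, payload)
    references[path] = oid

print(len(references), "paths share", len(store), "attribute objects")
# 4 paths share 2 attribute objects
```

The store dictionary doubles as the lookup table mentioned above: a rich checkout only has to parse each distinct payload once.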
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-16 6:21 ` Josh Triplett, Jamey Sharp 2008-08-16 7:56 ` david @ 2008-08-16 9:55 ` Junio C Hamano 2008-08-16 15:07 ` Jan Hudec 2008-08-18 6:12 ` Shawn O. Pearce 2 siblings, 1 reply; 22+ messages in thread From: Junio C Hamano @ 2008-08-16 9:55 UTC (permalink / raw) To: Josh Triplett Cc: Jamey Sharp, Jan Hudec, Shawn O. Pearce, Stephen R. van den Berg, david, Scott Chacon, git Josh Triplett <josh@freedesktop.org>, Jamey Sharp <jamey@minilop.net> writes: > This hook would need to provide a way to process these updates before > the blob or tree contents get put into place. For example, if you check > out /etc/shadow, you need to apply the non-world-readable permissions > *before* you write out the contents. I think such atomicity or "checkout race problem" is irrelevant. I'd like to make a comment on this point, even though at the moment (especially before the real release), I am not very interested in where this "proposal" is going. You mention that you would resolve attribute conflicts just the same way you would resolve contents conflicts, which in turn means that you would check out a half-merged state with conflict markers to the working tree, fix up the filesystem entity (both contents and presumably its attributes like perm bits, ownership, xa and whatnot), and mark the path resolved. Even without talking about attributes conflicts, what's your position on the time-window during which the contents of /etc/shadow and /etc/password have conflict markers in them? Luckily, the markers do not have sufficient number of colons, and that would protect your system from attempts to break into it with a phoney username '=======' with an empty password ;-), but I think you get the idea. Anything that has to be in some consistent state that cannot see conflicted state in the middle should not be merged in-place [*1*], [*2*]. So please simplify your requirements and at least drop atomicity argument. 
I am _not_ fundamentally opposed to somebody who wants to use git or any other SCM as a cooler representation of snapshots than a sequence of tarballs. I however would be unhappy if your design and implementation becomes more complicated than otherwise only because you try to deal with the atomicity issue. IOW, if your solution would become much simpler once you pare down the atomicity requirement, then I'd reject the more complex variant with atomicity in any second, even though I might still find the simpler variant that does not care about atomicity worth considering. [Footnotes] *1* That is why people often frown upon "using SCM to track changes of a live system in-place", and suggest tracking source material in SCM, and build material to deploy from the source and install into the final destination (not limited to /etc but more often so for e.g. web server assets) as a better practice. *2* Also you should realize your "/etc/shadow must be non-world-readable from the beginning" is a very application specific wish. What if the attribute you are trying to enforce is "this path must always be world-readable"? Are you going to limit this "attribute enhancements" to what you can specify at creat(2) time only? How would you handle "this path must be owned by user 'www-data' (assuming root drives git)", which would be done by creat(2) followed by chown(2)? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC] Plumbing-only support for storing object metadata 2008-08-16 9:55 ` Junio C Hamano @ 2008-08-16 15:07 ` Jan Hudec 0 siblings, 0 replies; 22+ messages in thread From: Jan Hudec @ 2008-08-16 15:07 UTC (permalink / raw) To: Junio C Hamano Cc: Josh Triplett, Jamey Sharp, Shawn O. Pearce, Stephen R. van den Berg, david, Scott Chacon, git On Sat, Aug 16, 2008 at 02:55:51 -0700, Junio C Hamano wrote: > Josh Triplett <josh@freedesktop.org>, Jamey Sharp <jamey@minilop.net> > writes: > > > This hook would need to provide a way to process these updates before > > the blob or tree contents get put into place. For example, if you check > > out /etc/shadow, you need to apply the non-world-readable permissions > > *before* you write out the contents. > > I think such atomicity or "checkout race problem" is irrelevant. > > I'd like to make a comment on this point, even though at the moment > (especially before the real release), I am not very interested in where > this "proposal" is going. > > You mention that you would resolve attribute conflicts just the same way > you would resolve contents conflicts, which in turn means that you would > check out a half-merged state with conflict markers to the working tree, > fix up the filesystem entity (both contents and presumably its attributes > like perm bits, ownership, xa and whatnot), and mark the path resolved. > Even without talking about attributes conflicts, what's your position on > the time-window during which the contents of /etc/shadow and /etc/password > have conflict markers in them? Well, there are situations where conflicts can happen and situations where they can't. So I think the solution is "don't merge in the live directory" (applicable to other uses of version control in other kind of live copies too). 
> Luckily, the markers do not have sufficient number of colons, and that > would protect your system from attempts to break into it with a phoney > username '=======' with an empty password ;-), but I think you get the > idea. Anything that has to be in some consistent state that cannot see > conflicted state in the middle should not be merged in-place [*1*], [*2*]. > > So please simplify your requirements and at least drop atomicity argument. The atomicity requirement is real for some applications, like etckeeper. It should be restated in terms of moving the content to the work tree rather than before writing it out -- the content can be written out to a staging area, attributes applied and then moved into the tree. IIUC git already uses a staging area during checkout, no? > I am _not_ fundamentally opposed to somebody who wants to use git or any > other SCM as a cooler representation of snapshots than a sequence of > tarballs. I however would be unhappy if your design and implementation > becomes more complicated than otherwise only because you try to deal with > the atomicity issue. IOW, if your solution would become much simpler once > you pare down the atomicity requirement, then I'd reject the more complex > variant with atomicity in any second, even though I might still find the > simpler variant that does not care about atomicity worth considering. I don't think the atomicity requirement should make anything more complicated. It is only a matter of running the hook applying the attributes -- I think git should not define meaning of the attributes -- at the right point during the checkout process. > [Footnotes] > > *1* That is why people often frown upon "using SCM to track changes of a > live system in-place", and suggest tracking source material in SCM, and > build material to deploy from the source and install into the final > destination (not limited to /etc but more often so for e.g. web server > assets) as a better practice.
Yes, unless you need to track the changes done in the live directory by other software, which is the case for /etc. It is also the case for ikiwiki-based web sites. You still need to avoid merging in the live tree to avoid breaking it, but git always allows you to create a separate staging tree for such tasks. > *2* Also you should realize your "/etc/shadow must be non-world-readable > from the beginning" is a very application specific wish. What if the > attribute you are trying to enforce is "this path must always be > world-readable"? Are you going to limit this "attribute enhancements" to > what you can specify at creat(2) time only? How would you handle "this > path must be owned by user 'www-data' (assuming root drives git)", which > would be done by creat(2) followed by chown(2)? Yes, that does not make sense. But if you restate the requirement that the attributes must be applied when the file becomes accessible in the work tree, then it makes sense and is easily doable by writing the file to a temporary location -- which is sufficiently protected if it is inside .git -- and moving it into the tree as the last step. (The data is available inside .git/objects and .git/packs, so they are only as well protected as the .git dir itself is, so no restrictions as long as the file is inside .git). Best regards, Jan -- Jan 'Bulb' Hudec <bulb@ucw.cz> ^ permalink raw reply [flat|nested] 22+ messages in thread
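The restated requirement (write to a protected temporary location, apply the attributes, then move the file into the work tree as the last step) can be sketched in a few lines of Python. This is an illustration of the ordering only, not of how git's checkout is implemented; checkout_file and its parameters are hypothetical, and the staging directory is assumed to live on the same filesystem as the destination so that the final rename is atomic. It relies on os.fchmod, so it is Unix-only:

```python
import os
import tempfile


def checkout_file(dest: str, data: bytes, mode: int, staging_dir: str) -> None:
    """Write data under staging_dir, apply mode, then rename onto dest."""
    # mkstemp creates the file readable and writable by the owner only,
    # so the contents are never exposed with looser permissions.
    fd, tmp = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # Apply the recorded attributes before the file is visible.
            # Ownership (os.fchown) would go here when running as root.
            os.fchmod(f.fileno(), mode)
        # Last step: the path appears in the work tree fully formed.
        os.replace(tmp, dest)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A caller would use something like checkout_file("shadow", contents, 0o600, ".git/staging"), where the staging path is again only an example. The same ordering also covers Junio's creat(2)-then-chown(2) case, because the ownership change happens before the rename.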
* Re: [RFC] Plumbing-only support for storing object metadata
From: Shawn O. Pearce @ 2008-08-18 6:12 UTC
To: Josh Triplett, Jamey Sharp
Cc: Jan Hudec, Stephen R. van den Berg, Junio C Hamano, david, Scott Chacon, git

Josh Triplett <josh@freedesktop.org>, Jamey Sharp <jamey@minilop.net> wrote:
> On Sat, Aug 09, 2008 at 08:51:01PM -0700, Shawn O. Pearce wrote:
> > Nico and I have (at least in the past) agreed that type 0 is meant
> > as an escape indicator. If the type is set to 0 then the real type
> > code appears in another byte of data which follows the object's
> > inflated length.
> >
> > That leaves only type 5 available.
> [...]
> > So yea, there really aren't any new type bits available.
>
> If consensus opinion was that new object types were a reasonable way to
> solve this problem, then it sounds as if there's plenty of room to
> create new types using this escape mechanism.

Yes, but we'd hate to see the majority of the encodings within a pack
using the escape mechanism. So a lot of my argument here was just trying
to point out that type bits aren't free, and we need to make sure the
limited ones available are applied to the majority of the pack contents.

Adding a new type bit is a lot more than just adding it to the pack
data field. Look at the amount of code that needed to be changed to
support gitlink in trees, and that was "reusing" the OBJ_COMMIT type.
Anytime you start poking at the core object enumeration code with
new cases there's a lot of corners that are affected.

-- 
Shawn.
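To make the "type bits" point concrete, here is a minimal sketch of decoding the header of a pack entry: three bits of the first byte carry the type and the remaining bits start a variable-length size. The function and table names are illustrative, not git's own identifiers, and the escape mechanism for type 0 discussed above is deliberately not modelled because its encoding is not specified in this thread.

```python
# Type codes used in pack entries; 0 and 5 are the ones not assigned.
PACK_TYPES = {
    1: "commit",
    2: "tree",
    3: "blob",
    4: "tag",
    6: "ofs_delta",
    7: "ref_delta",
}


def parse_pack_entry_header(data):
    """Return (type_code, inflated_size, header_length) for one entry."""
    byte = data[0]
    type_code = (byte >> 4) & 0x07   # bits 4-6 of the first byte
    size = byte & 0x0F               # low 4 bits start the size
    shift = 4
    pos = 1
    while byte & 0x80:               # high bit set: another size byte follows
        byte = data[pos]
        size |= (byte & 0x7F) << shift
        shift += 7
        pos += 1
    return type_code, size, pos


def unassigned_type_codes():
    """Codes in the 3-bit field that have no object type assigned."""
    return [code for code in range(8) if code not in PACK_TYPES]
```

With only three bits there are eight codes in total, which is why the thread treats the remaining ones as a scarce resource.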
* Re: [RFC] Plumbing-only support for storing object metadata
From: Derek Fawcus @ 2008-08-18 23:06 UTC
To: git

On Sun, Aug 17, 2008 at 11:12:36PM -0700, Shawn O. Pearce wrote:
> Adding a new type bit is a lot more than just adding it to the pack
> data field. Look at the amount of code that needed to be changed to
> support gitlink in trees, and that was "reusing" the OBJ_COMMIT type.
> Anytime you start poking at the core object enumeration code with
> new cases there's a lot of corners that are affected.

Actually, I'd been thinking of how to attach metadata - but more from
the perspective of attaching it to commits, rather than individual
blobs or trees.

At the moment, my workaround is simply to add well known lines to
the end of the commit comments, the downside being that it makes
the comments a bit ugly, and one needs to know the protocol for
parsing them.

My other hacky thought was that the tag object could be overloaded for
this purpose. It is already sort of an indirect object, but seems
to be limited to appearing at the edge of the graph.

If we could, say, have:

    commit -> tag -> tree

then arbitrary data could be stored in the tag; similarly this
could be extended for when a tree or blob object is expected
(I'm not sure about the blob case).

I guess there'd have to be some rule - like only one indirect
object allowed to be inserted (otherwise it's awkward to check
for loops), and there would need to be some custom merge rules.

DF
* Re: [RFC] Plumbing-only support for storing object metadata
From: Shawn O. Pearce @ 2008-08-18 23:18 UTC
To: Derek Fawcus
Cc: git

Derek Fawcus <dfawcus@cisco.com> wrote:
> On Sun, Aug 17, 2008 at 11:12:36PM -0700, Shawn O. Pearce wrote:
> > Adding a new type bit is a lot more than just adding it to the pack
> > data field. Look at the amount of code that needed to be changed to
> > support gitlink in trees, and that was "reusing" the OBJ_COMMIT type.
> > Anytime you start poking at the core object enumeration code with
> > new cases there's a lot of corners that are affected.
>
> Actually, I'd been thinking of how to attach metadata - but more from
> the perspective of attaching it to commits, rather than individual
> blobs or trees.
>
> At the moment, my workaround is simply to add well known lines to
> the end of the commit comments,

We've talked about adding additional header lines to the commit
after the "committer" or "encoding" line but before the first blank
line that ends the headers and starts the message. Most of the code
will skip over an unknown header at this position, as we went through
that pain when we added the "encoding" header to the commit format.

However, once you start putting headers into there one has to
actually understand what they mean. And it gets really ugly if
your tool thinks "fixed XXX\n" means something different from what
my tool thinks "fixed YYY\n" means and I use my tool against a
clone of your repository. In other words there is no concept of
"header namespaces".

Thus far I don't think anyone has really tried adding more headers
here because nobody has come up with a concrete example of how it
is useful.
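The header layout described here (header lines, then a blank line, then the message) can be illustrated with a small parser that tolerates headers it does not recognise. This is a sketch for illustration only: `parse_commit`, `unknown_headers` and the `KNOWN_HEADERS` set are hypothetical names, and the "fixed" header is just the made-up example from the paragraph above, not something git defines.

```python
# Headers mentioned in this thread or standard in commit objects.
KNOWN_HEADERS = {"tree", "parent", "author", "committer", "encoding"}


def parse_commit(raw):
    """Split raw commit text into ([(key, value), ...], message).

    The headers end at the first blank line; everything after it is
    the commit message.
    """
    head, _, message = raw.partition("\n\n")
    headers = []
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        headers.append((key, value))
    return headers, message


def unknown_headers(headers):
    """Headers a tool limited to KNOWN_HEADERS would have to skip."""
    return [(k, v) for k, v in headers if k not in KNOWN_HEADERS]
```

A reader built this way can skip an unfamiliar header safely, but, as noted above, it still has no way to tell what the header means or whose tool wrote it.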
> I guess there'd have to be some rule - like only one indirect
> object allowed to be inserted (otherwise its awkward to check
> for loops), and there would need to be some custom merge rules.

Loops aren't possible. If you can create a loop you have a very real
and very valid attack against SHA-1. You will probably be able to
use that in some way that profits you better than a loop within some
random Git repository.

You may also want to look into the "notes" idea floated on the list
in the past. It allowed attaching trees (IIRC) to any commit, and
finding that later on in O(1) time during say git-log. This can be
useful to attach a build report or a test report to a commit hours
after it was created.

-- 
Shawn.
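The reason loops would amount to a SHA-1 attack follows from how object names are computed: the name is the SHA-1 of a short header plus the object body, and the body contains the names of the objects it refers to. A minimal sketch (the function name `git_object_id` is illustrative):

```python
import hashlib


def git_object_id(obj_type, body):
    """SHA-1 over "<type> <size>\\0" followed by the object body."""
    header = "{} {}".format(obj_type, len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# An object that refers to another one embeds that object's id in its
# own body, so its id can only be computed once the target's id is known.
target_id = git_object_id("blob", b"")
referrer_body = "object {}\ntype blob\n".format(target_id).encode("ascii")
referrer_id = git_object_id("tag", referrer_body)
```

For a loop, the target's body would in turn have to contain `referrer_id`, which is not known until the target's own id has been fixed. Constructing such a pair means finding inputs that hash to values chosen in advance.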
* Re: [RFC] Plumbing-only support for storing object metadata
From: Marcus Griep @ 2008-08-18 23:23 UTC
To: Derek Fawcus
Cc: Git Mailing List

Derek Fawcus wrote:
> My other hacky thought was that tag object could be overloaded for
> this purpose. It is already sort of an indirect object, but seems
> to be limited to appearing at the edge of the graph.
>
> If we could say have:
>
> commit -> tag -> tree
>
> then arbitrary data could be stored in the tag, similarly this
> could be extended for when a tree or blob object is expected
> (I'm not sure about the blob case).

I was under the impression that tags were references to commit objects,
and they to tree objects:

    tag -> commit -> tree

Also, wouldn't this require large numbers of tags, or the ability to
multi-target tags?

-- 
Marcus Griep
GPG Key ID: 0x5E968152
——
http://www.boohaunt.net
את.ψο´
* Re: [RFC] Plumbing-only support for storing object metadata
From: Shawn O. Pearce @ 2008-08-18 23:28 UTC
To: Marcus Griep
Cc: Derek Fawcus, Git Mailing List

Marcus Griep <neoeinstein@gmail.com> wrote:
> I was under the impression that tags were references to commit objects,
> and they to tree objects:
>
> tag -> commit -> tree

No. A tag can reference any object. See for example the
junio-gpg-pub tag in git.git, it references a blob, not a commit.
The linux-2.6.git tree has a tag which references a tree. Tags may
also reference other tags.

> Also, wouldn't this require large numbers tags, or the ability to multi-
> target tags?

Tag objects don't have to have names in the repository's ref space,
but it helps that they do when you are doing git-lost-found.
Having a tag in the database which shouldn't have a ref name in
refs/tags is more than a bit funny.

-- 
Shawn.
* Re: [RFC] Plumbing-only support for storing object metadata
From: Jan Hudec @ 2008-08-10 11:09 UTC
To: Jamey Sharp, Josh Triplett, git

Hello All,

I am glad you came up with this, as I think this is the only reasonable
way to support things like etckeeper. The metastore and similar solutions
are a kludge and fall apart in so many cases.

I am not sure your approach is the right one, though. I tend to agree
with Shawn that it's not. So here are a couple of alternative proposals
(sorry, it's a bit long, as I have several variants with different
drawbacks I would like to discuss).

On Sat, Aug 09, 2008 at 14:07:33 -0700, Jamey Sharp wrote:
> The attached test illustrates a proposal for minimal plumbing support
> usable to store permissions, ownership, and other metadata in git
> repositories. This proposal is fully compatible with existing
> repositories when the new functionality is not in use. Similar to the
> introduction of subprojects, we have not yet specified the porcelain. We
> believe that the plumbing will provide sufficient functionality for many
> uses, and these uses will help determine the appropriate porcelain.

I think the main way to use it would be a hook that would read/write the
attributes to/from the tree. That will do the right thing for storing
permissions, owners and other things represented in the worktree. And
metadata that are neither part of the tree nor directly related to git's
functionality are out of our scope.

> [...]
> We propose representing objects with metadata using a new "inode"
> object. An inode object contains the hash of the real object and the
> hash of a "props" (properties) object. A props object contains a set of
> name-value pairs.
> Tree objects can reference inode objects in addition
> to the current possibilities of blobs, trees, and subproject commits; we
> propose using the currently invalid type 110000 (S_IFREG | S_IFIFO) for
> inode objects. We primarily see a use case for inodes referencing blobs
> and trees, though as defined they support any object type.

I think this is the overly complex -- and also the needlessly
incompatible -- part. By the way, I don't think you need a separate type
for props -- it can be a blob too.

I would suggest investigating the following options:

 1. It would be possible to use clean/smudge filters to encode the
    attributes in the blob itself.

 2. Store the metadata in separate objects, but link them in the parent
    tree directly. In this case, each attribute could probably get its
    own blob, so e.g. for a file foo the tree containing it would have
    entries:

        foo
        foo<sep>owner
        foo<sep>permissions
        ...

    where <sep> would be some separator (more on that below).

Advantages (+), disadvantages (-) and possible extensions (*) of 1:

 + It should be possible to get to something useful with very few
   changes to git. Basically all it needs to be useful for things like
   etckeeper is to:

   . Make sure both the clean and the smudge filter always get a
     filehandle to the disk file in question (I am /not/ suggesting a
     path, as the file may be written in a staging area and moved into
     place later).

   . Pass the blob id currently in the index to the clean filter, so it
     can maintain the data if they are not representable in this
     particular checkout (e.g. when checking out such a repo on Windows).

   Note that this would also be useful for ignoring insignificant
   changes, e.g. when the order of entries in some config file is not
   important and the tool using it randomly changes that order when
   changing that file.

 - It does not support metadata for directories, but could be crossed
   with approach 2 to fix that. Git could special-case entry '.'
   for storing "content" of a directory, which would be wholly created
   by running the clean filter on a directory (I am not sure directory
   handles are portable, but running with that directory as current
   should be). This would not have the problem of approach 2 with the
   entry names for the metadata.

 * Default processing could be added to strip the metadata in smudge and
   re-add them from the index on clean. This would require adding some
   marker to know which blobs need this treatment. I see two ways:

   . Using a different file type for them. There are already two types
     pointing to a blob (S_IFREG and S_IFLNK) and they are treated
     differently on read (clean) / write (smudge) from/to the tree, so a
     third type should be workable.

   . Using an additional format. Currently a blob is encoded as

         "blob" <LF> <content>

     so maybe an extended blob could be encoded as

         "blob extended" <LF> <content>

     without needing a special type for it. But I don't know git
     internals well enough to know how easy, hard or dirty this would be.

Advantages (+), disadvantages (-) and possible extensions (*) of 2:

 + It would work the same way for directories and files, or mostly so.

 + Different metadata would be handled independently, so it would be
   easier to combine support for multiple attributes (not that I can
   imagine any sensible use beyond access lists (owner, permission,
   posix acl)).

 + Checking out without the hooks could easily create special metadata
   files, providing an easy way to work with the attributes where they
   are not supported by the underlying filesystem.

 - It would require reserving some names for the metadata entries. I see
   basically three ways to name the attributes:

   . Reserving some character for the separator, e.g. @ or # or something
     like that. So with file foo, there would be entries:

         foo
         foo@owner
         foo@permissions

     This has the following pros and cons:

     + Minimal changes to the index <-> tree logic (remember, the index
       is flat and has no directory entries, so the tree writer must
       decide to which tree each entry goes).
     + Trivially supports checking the metadata entries out as special
       files on a filesystem without metadata support.

     - The character is reserved in trees that need the feature (the
       trees that don't need it don't need to care).

     Note that the metadata entries could have mode either S_IFREG or a
     new one. I am inclined to say S_IFREG -- we have the special name
     to distinguish them.

   . Using something that does not exist in a normalized path, i.e.
     either "//" or "/./". So with file foo, there would be entries:

         foo
         foo//owner
         foo//permissions

     This has the following pros and cons:

     + Does not reserve any characters. Every filename is permitted even
       when the feature is used.

     - Harder on the index <-> tree logic, as it would have to avoid
       treating such strings as directory separators.

     - Such files could not be checked out, though they could still be
       manipulated using cat-file and update-index.

     The metadata entries could again have mode either S_IFREG or a new
     one. A new mode would be sensible if it made things easier on the
     index <-> tree logic (it's easier to check 3 bits than to search a
     string for a substring).

   . Leave the suffix for metadata entries to the hooks. This would be a
     middle road between the above two:

     + Reserves as little as possible, while not complicating the index
       <-> tree logic.

     + Remains easy to check out as special files where you can't run
       the hooks, though this would require some special-casing similar
       to symlinks on Windows.

     - Would require a new mode for these entries, so we know they are
       created and consumed by the hooks rather than directly
       read/written to the tree.

Best regards,
Jan

-- 
Jan 'Bulb' Hudec <bulb@ucw.cz>
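As a rough illustration of the entry-naming variants for approach 2, the sketch below groups a flat list of tree entry names into files and their attached attribute entries, given a separator. It is only a model of the naming scheme: `group_entries` is a hypothetical helper, and it assumes the separator never occurs in ordinary file names, which is exactly the reservation discussed above.

```python
def group_entries(names, sep="@"):
    """Map each path to the list of attribute names attached to it.

    Entries of the form <path><sep><attribute> are treated as metadata
    for <path>; all other entries are ordinary files.
    """
    grouped = {}
    for name in sorted(names):
        path, found, attribute = name.partition(sep)
        attributes = grouped.setdefault(path, [])
        if found:
            attributes.append(attribute)
    return grouped


# Reserved-character variant.
by_char = group_entries(["foo", "foo@owner", "foo@permissions", "bar"])

# Variant using a string that cannot occur in a normalized path.
by_slashes = group_entries(["foo", "foo//owner", "foo//permissions"], sep="//")
```

The same function covers both variants because they differ only in the separator; the third variant, which leaves the suffix to the hooks, would need the entry mode rather than the name to tell metadata entries apart.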
end of thread (~2008-08-18 23:29 UTC)

Thread overview: 22+ messages

2008-08-09 21:07 [RFC] Plumbing-only support for storing object metadata Jamey Sharp, Josh Triplett
2008-08-09 21:49 ` Scott Chacon
2008-08-10  3:51 ` Shawn O. Pearce
2008-08-10 11:20 ` Stephen R. van den Berg
2008-08-10 12:16 ` david
2008-08-10 14:50 ` Jan Hudec
2008-08-10 17:57 ` Stephen R. van den Berg
2008-08-10 18:11 ` Jan Hudec
2008-08-10 20:16 ` Stephen R. van den Berg
2008-08-10 22:34 ` Junio C Hamano
2008-08-10 23:10 ` david
2008-08-11 10:11 ` Stephen R. van den Berg
2008-08-16  6:21 ` Josh Triplett, Jamey Sharp
2008-08-16  7:56 ` david
2008-08-16  9:55 ` Junio C Hamano
2008-08-16 15:07 ` Jan Hudec
2008-08-18  6:12 ` Shawn O. Pearce
2008-08-18 23:06 ` Derek Fawcus
2008-08-18 23:18 ` Shawn O. Pearce
2008-08-18 23:23 ` Marcus Griep
2008-08-18 23:28 ` Shawn O. Pearce
2008-08-10 11:09 ` Jan Hudec