Revisiting metadata storage

* Revisiting metadata storage
@ 2011-11-14  0:07 Richard Hartmann
  2011-11-24  1:10 ` Jonathan Nieder
       [not found] ` <87sjkx8gll.fsf@an-dro.info.enstb.org>
  0 siblings, 2 replies; 7+ messages in thread
From: Richard Hartmann @ 2011-11-14  0:07 UTC (permalink / raw)
  To: Git List

Hi all,

every few months, a thread seems to pop up regarding metadata storage,
be it owner, mtime, xattr or what have you.

As of today there is (ttbomk) still no sane way to carry & restore
metadata across different repositories.

metastore[1] is closest to a working solution, but its storage format
is binary and merge-unfriendly, and it does not store mtime, which is a
deal-breaker for me.
I tried to extend & fix metastore, but failed. Others looked at it,
deemed it possible and lost interest and/or their development box.

I know from Joey Hess [2] that the GitTogether 2011 saw some
discussion about metadata, but I was unable to find any follow-up to
this issue.

To make a long story short: Does anyone have a working solution,
today? If not, is anyone working on one? If not, is anyone interested
in working on one? And is there any follow-up to the GitTogether
discussion?

The feature set of any solution should probably include save, display,
diff, and apply on a per-metadata and per-file basis.
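
As a rough illustration of that feature set, here is a minimal sketch.
It is not metastore's format or any existing tool: the function names
meta_save and meta_apply and the "mtime mode path" line format are
made up for this example, and it assumes GNU stat and GNU touch.

```shell
#!/bin/sh
# Hypothetical sketch: NOT metastore's format or any existing tool.
# Stores one "mtime mode path" line per file as plain text, so the record
# can be diffed and merged. Assumes GNU stat/touch and paths without newlines.
set -e

meta_save() {    # usage: meta_save DIR > FILE
    (cd "$1" && find . -type f ! -path './.git/*' \
        -exec stat -c '%Y %a %n' {} + | LC_ALL=C sort -k3)
}

meta_apply() {   # usage: meta_apply DIR < FILE
    (cd "$1" && while read -r mtime mode path; do
        [ -e "$path" ] || continue        # skip files that no longer exist
        chmod "$mode" "$path"
        touch -d "@$mtime" "$path"
    done)
}

demo=$(mktemp -d)
echo hello > "$demo/a.txt"
chmod 640 "$demo/a.txt"
touch -d @1300000000 "$demo/a.txt"        # arbitrary old timestamp

saved=$(meta_save "$demo")                # save
echo "$saved" > "$demo.meta"
cat "$demo.meta"                          # display

chmod 600 "$demo/a.txt"                   # simulate a fresh checkout that
touch "$demo/a.txt"                       # lost mode and mtime
meta_save "$demo" | diff "$demo.meta" - || true   # diff (exits 1 on changes)

meta_apply "$demo" < "$demo.meta"         # apply
restored=$(meta_save "$demo")
```

Because the record is sorted, line-oriented text, it could be committed
next to the files and merged like any other file; owner and xattr
columns are left out to keep the sketch short.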

Thanks,
Richard

[1] http://david.hardeman.nu/software.php
[2] http://kitenet.net/~joey/blog/entry/GitTogether2011/

* Re: Revisiting metadata storage
  2011-11-14  0:07 Revisiting metadata storage Richard Hartmann
@ 2011-11-24  1:10 ` Jonathan Nieder
       [not found] ` <87sjkx8gll.fsf@an-dro.info.enstb.org>
  1 sibling, 0 replies; 7+ messages in thread
From: Jonathan Nieder @ 2011-11-24  1:10 UTC (permalink / raw)
  To: Richard Hartmann; +Cc: Git List, Joey Hess, David Barr

Richard Hartmann wrote:

> To make a long story short: Does anyone have a working solution,
> today?

Sure.  etckeeper handles metadata such as owner and permissions
reasonably well.

* Re: Revisiting metadata storage
       [not found] ` <87sjkx8gll.fsf@an-dro.info.enstb.org>
@ 2011-12-14 17:59   ` Richard Hartmann
  2011-12-15 21:40     ` Hilco Wijbenga
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Hartmann @ 2011-12-14 17:59 UTC (permalink / raw)
  To: Ronan Keryell; +Cc: Git List

On Tue, Dec 6, 2011 at 23:45, Ronan Keryell
<Ronan.Keryell@hpc-project.com> wrote:

> At least I'm interested and began to dig into it but I do not have a lot
> of time to work on it...

If we can agree on Perl, I can try to help. I don't think I speak
enough Python to be of use with that.

Other people who have an interest in this: Please pipe up so we can
hammer out a rough consensus & roadmap.

Richard

* Re: Revisiting metadata storage
  2011-12-14 17:59   ` Richard Hartmann
@ 2011-12-15 21:40     ` Hilco Wijbenga
  2011-12-16  7:52       ` Jonathan Nieder
  0 siblings, 1 reply; 7+ messages in thread
From: Hilco Wijbenga @ 2011-12-15 21:40 UTC (permalink / raw)
  To: Richard Hartmann; +Cc: Ronan Keryell, Git List

On 14 December 2011 09:59, Richard Hartmann
<richih.mailinglist@gmail.com> wrote:
> On Tue, Dec 6, 2011 at 23:45, Ronan Keryell
> <Ronan.Keryell@hpc-project.com> wrote:
>
>> At least I'm interested and began to dig into it but I do not have a lot
>> of time to work on it...
>
> If we can agree on Perl, I can try to help. I don't think I speak
> enough Python to be of use with that.
>
> Other people who have an interest in this: Please pipe up so we can
> hammer out a rough consensus & roadmap.

I'd love to have better support for metadata (specifically
timestamps). I don't care whether it's Perl, Python, Bash, or C. I
don't think I'll be much help coding but I'd like to try.

In all honesty though, I plan to rewrite our build to use file digests
instead of timestamps. Right now every rebase means a full (and almost
completely unnecessary) rebuild. Luckily I'm using the wonderful
git-new-workdir so there is no pain when switching branches.

Once the rewrite is complete (in one or two months) Git's relentless
timestamp changes should no longer affect us as much anymore. I would
still like to get a better grip on metadata though. Git should be able
to not touch files that have not changed. But whether that's feasible
or even in scope of what you have in mind... :-)

Cheers,
Hilco

* Re: Revisiting metadata storage
  2011-12-15 21:40     ` Hilco Wijbenga
@ 2011-12-16  7:52       ` Jonathan Nieder
  2011-12-16 18:55         ` Hilco Wijbenga
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Nieder @ 2011-12-16  7:52 UTC (permalink / raw)
  To: Hilco Wijbenga; +Cc: Richard Hartmann, Ronan Keryell, Git List

Hilco Wijbenga wrote:

>                        Right now every rebase means a full (and almost
> completely unnecessary) rebuild.

It sounds like what you are suffering from is that "git rebase" uses
the worktree as its workspace instead of doing all that work
in-memory, right?

If I were in your situation, I would do the following:

 1. Don't rebase so often.  When wanting to take advantage of features
    from a new upstream version, use "git merge" to pull it in.  Only
    rebase when it is time to make the history presentable for other
    people.

    This way, "git log --first-parent" will give easy access to
    the intermediate versions you have hacked on and tested recently.

 2. When history gets ugly and you want to rebase to make the series
    easier to make sense of, use a separate workdir:

	$ git branch tmp; # make a copy to rebase

	$ cd ..
	$ git new-workdir repo rebase-scratch tmp
	$ cd rebase-scratch
	$ git rebase -i origin/master
	...
	$ cd ..
	$ rm -fr rebase-scratch

	$ cd repo
	$ git diff HEAD tmp;	# Does the rebased version look better?
	$ git reset --keep tmp;	# Yes.  Use it.
	$ git branch -d tmp

 3. Once the rebased history looks reasonably good, be sure to rebase
    one final time and test each commit before submitting for other
    people's use.
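
On newer git versions (2.5 and later), the built-in "git worktree"
command can stand in for the contrib git-new-workdir script in step 2.
A minimal sketch on a throwaway repository follows. The branch name
"upstream" (standing in for origin/master) and the file names are made
up for the demo, and GNU stat/touch are assumed.

```shell
#!/bin/sh
# Sketch: rebase in a scratch worktree so unchanged files in the main
# checkout keep their timestamps. Assumes git >= 2.5 and GNU stat/touch.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)
git init -q "$top/repo"
cd "$top/repo"
git symbolic-ref HEAD refs/heads/master
echo base > base.txt;     git add base.txt;   git commit -q -m base
git branch upstream                       # stands in for origin/master
echo mine > mine.txt;     git add mine.txt;   git commit -q -m "local change"
git checkout -q upstream
echo theirs > theirs.txt; git add theirs.txt; git commit -q -m "upstream change"
git checkout -q master

# Give mine.txt a recognizable old mtime so any rewrite would show up.
touch -d @1300000000 mine.txt
git update-index -q --refresh
before=$(stat -c %Y mine.txt)

git branch tmp                            # make a copy to rebase
git worktree add ../rebase-scratch tmp >/dev/null 2>&1
(cd ../rebase-scratch && git rebase -q upstream)
rm -rf ../rebase-scratch
git worktree prune                        # forget the deleted worktree

git reset -q --keep tmp                   # adopt the rebased history
git branch -q -d tmp

after=$(stat -c %Y mine.txt)
subjects=$(git log --format=%s | tr '\n' ',')
```

Here "git reset --keep" only writes the files that differ between the
old and the rebased tip, so mine.txt keeps its old mtime while
theirs.txt appears.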

Hope that helps,
Jonathan

* Re: Revisiting metadata storage
  2011-12-16  7:52       ` Jonathan Nieder
@ 2011-12-16 18:55         ` Hilco Wijbenga
  2011-12-17  0:48           ` Jonathan Nieder
  0 siblings, 1 reply; 7+ messages in thread
From: Hilco Wijbenga @ 2011-12-16 18:55 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Richard Hartmann, Ronan Keryell, Git List

On 15 December 2011 23:52, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Hilco Wijbenga wrote:
>
>>                        Right now every rebase means a full (and almost
>> completely unnecessary) rebuild.
>
> It sounds like what you are suffering from is that "git rebase" uses
> the worktree as its workspace instead of doing all that work
> in-memory, right?

Yes, I guess the problem is that it uses the worktree as its workspace.

(I know others disagree but to me it's a bug that Git touches files
that it doesn't actually change.)

> If I were in your situation, I would do the following:
>
>  1. Don't rebase so often.  When wanting to take advantage of features
>    from a new upstream version, use "git merge" to pull it in.  Only
>    rebase when it is time to make the history presentable for other
>    people.

I usually rebase in the morning to get an up-to-date tree. Is that
considered too often? Perhaps it's my Subversion background but I'm
not comfortable diverging too much. Is that too paranoid? :-)

So IIUC, I can do "git rebase master" even after multiple "git merge master"s?

>    This way, "git log --first-parent" will give easy access to
>    the intermediate versions you have hacked on and tested recently.

Why is "git log --first-parent" important? I read "git help log" on
first-parent but that didn't really tell me much. Google was not very
helpful either.

>  2. When history gets ugly and you want to rebase to make the series
>    easier to make sense of, use a separate workdir:
>
>        $ git branch tmp; # make a copy to rebase

This is in my merged branch, right?

>
>        $ cd ..
>        $ git new-workdir repo rebase-scratch tmp
>        $ cd rebase-scratch
>        $ git rebase -i origin/master
>        ...
>        $ cd ..
>        $ rm -fr rebase-scratch
>
>        $ cd repo
>        $ git diff HEAD tmp;    # Does the rebased version look better?
>        $ git reset --keep tmp; # Yes.  Use it.
>        $ git branch -d tmp

Interesting. If I run the rebase after the merge, rebase appears to do
much less work. I.e. it appears to only touch files that have actually
changed. Is that true?

>  3. Once the rebased history looks reasonably good, be sure to rebase
>    one final time and test each commit before submitting for other
>    people's use.
>
> Hope that helps,

Yes, thanks for pointing out yet more useful Git options. There seems
no end to them. :-)

Cheers,
Hilco

* Re: Revisiting metadata storage
  2011-12-16 18:55         ` Hilco Wijbenga
@ 2011-12-17  0:48           ` Jonathan Nieder
  0 siblings, 0 replies; 7+ messages in thread
From: Jonathan Nieder @ 2011-12-17  0:48 UTC (permalink / raw)
  To: Hilco Wijbenga
  Cc: Richard Hartmann, Ronan Keryell, Git List, Martin von Zweigbergk

(+cc: Martin, who has been doing excellent work on "git rebase", just
 in case he's curious)
Hi again,

Hilco Wijbenga wrote:

> Yes, I guess the problem is that it uses the worktree as its workspace.

That's a comfort.  Thanks for explaining.

> (I know others disagree but to me it's a bug that Git touches files
> that it doesn't actually change.)

No, I somewhat agree.  If a command is touching more files than it needs
to, then that is likely to be a bug, or at least an opportunity for
improvement.

Where we might disagree is in how many files "git rebase" needs to
touch.  So let's consider your use case.

[...]
> I usually rebase in the morning to get an up-to-date tree. Is that
> considered too often? Perhaps it's my Subversion background but I'm
> not comfortable diverging too much. Is that too paranoid? :-)
>
> So IIUC, I can do "git rebase master" even after multiple "git merge master"s?

The second question is easy --- the answer is "yes".

Your first question is more a matter of opinion.  I will just say a
little about "git rebase", to help you decide for yourself.

The original and still-primary purpose of "git rebase" is to refresh and
clean up a short patch series that is going to be submitted by email
to some project, by making the series apply to a newer basis version.
You can imitate what it does fairly simply by hand:

	# on master
	git checkout -b master-rebased new-upstream
	git cherry-pick HEAD..master; # [*]
	git branch -M master

That is, we check out the new basis version and apply any "local"
changes on top of it one at a time, using human help to resolve
conflicts as necessary.

This procedure has the nice property that it is dead simple.  It also
is easily tweaked to produce an "interactive" variant that reorders
the patches or runs other commands in between applying patches (for
example, you can ask git to run the test suite after each commit when
rebasing by adding "exec make test" after each "pick" line in the
editor shown by "git rebase -i").  And in the end you have a nice
patch series that applies without fuzz and doesn't require people
reading your patches to think about the older code base they were
originally written against.
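
For illustration, newer versions of git can add those "exec" lines for
you via the --exec (-x) option of "git rebase -i". A minimal sketch on
a throwaway repository: the logging command below is only a stand-in
for "make test", and the file names are made up.

```shell
#!/bin/sh
# Sketch: run a command after every replayed commit with "git rebase -i -x".
# The exec command here just logs the commit subject; in real use it
# would be something like "make test".
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m base
for n in 1 2 3; do
    echo "$n" > "f$n"
    git add "f$n"
    git commit -q -m "patch $n"
done

log="$repo/.git/exec.log"                 # kept inside .git so the tree stays clean
# GIT_SEQUENCE_EDITOR=true accepts the generated todo list unchanged.
GIT_SEQUENCE_EDITOR=true \
    git rebase -q -i -x "git log -1 --format=%s >> '$log'" HEAD~3

runs=$(grep -c . "$log")
first=$(head -n 1 "$log")
```

If the exec command fails, the rebase stops at that commit so it can be
fixed before continuing.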

However, rebasing has a few downsides.

The most important one is that each time you rebase, you are making
new, untested commits.  When you rebase the 300-patch series that
you have been debugging in collaboration with other people, you
can no longer say "these changes have been in use and being tested
for a few months now; chances are we have already ironed out most
of the obvious bugs".  The usual heuristic that patches towards the
beginning of the series most likely work better and are less likely to
have introduced that new crash that makes your program not work at all
no longer applies, since when you rebase, you can easily introduce
a mistake in conflict resolution.  All "local" code is new code.

In the history

  A --- o --- o --- B [upstream]
   \
    P --- Q --- R [master]

after rebasing

  A --- o --- o --- B --- P' --- Q' --- R' [master]

it is even tempting for people to not test the intermediate commits P'
and Q' before publishing their work, resulting in a history where
intermediate commits involved in telling the story do not even build.
So building and testing old versions to track down a change in
behavior (e.g., with "git bisect") becomes hard.  The history is not
actual history.

That is easy to mitigate by only rebasing your small, _private_ patch
series that is not part of meaningful history.  When asking others to
incorporate the changes into permanent history, the contributor
hopefully carefully checks over them for sanity and checks each
intermediate version before they can be applied.  And history on a
large, public scale is still stable.

For similar reasons, rebasing can make life difficult for people
trying to write patches based on your patches.  The section RECOVERING
FROM UPSTREAM REBASE in the git-rebase(1) manual page has more on
that.

If you want to incorporate changes into your branch and preserve the
history of well-tested commits (for example, if you are the upstream
maintainer, pulling in changes from other people), a command to do
this is git-merge(1).  It does not have to rewind or rewrite anything;
it just uses a 3-way merge algorithm to apply the new changes and
writes a commit indicating it has done so and with pointers to the two
parent commits so history consumers can see the full story.
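
To make the "two parent commits" point concrete, here is a small
sketch on a throwaway repository (branch and file names are made up
for the demo):

```shell
#!/bin/sh
# Sketch: a merge records a new commit with two parents and leaves the
# existing commits on the branch untouched.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
echo a > a.txt; git add a.txt; git commit -q -m A
git branch topic
echo b > b.txt; git add b.txt; git commit -q -m "upstream work"
git checkout -q topic
echo p > p.txt; git add p.txt; git commit -q -m "topic work"

topic_before=$(git rev-parse HEAD)
git merge -q -m "Merge master into topic" master

set -- $(git log -1 --format=%P HEAD)     # parent hashes of the merge commit
nparents=$#
first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)
master_tip=$(git rev-parse master)
```

The first parent is the old tip of the branch that was merged into,
which is what "git log --first-parent" follows.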

Another consideration.

When using Subversion and working against the trunk, I find myself
using "svn update" every day and right before committing.  Otherwise, I
may be forced to deal with a painful conflict resolution, or worse,
commit a change to one file that uses an API that has been removed in
another file.

However, when using git, I do not find myself needing to do that.
Instead, most work pertaining to a particular goal happens on a branch
specific to that topic, I pick one version to develop against, and I
mostly stick to it.  This way, I am not distracted by irrelevant
breakage or other changes introduced in areas orthogonal to my topic.

"But how do you make sure your changes work with the current
codebase?" you might ask.  Here:

	# on branch "topic"

	# switch to working on an anonymous branch, or rather no
	# branch at all
	git checkout --detach
	# grab latest changes to test against
	git merge origin/master
	# test!
	make test

The "git merge" step does not present me with the same conflicts it
did yesterday thanks to the "git rerere" facility (since I have
rerere.enabled set to true).  If my topic needs some nontrivial
reconciliation with the wider changes in the project (if there is an
API change, say), I might use "git merge" when on branch topic (i.e.,
_not_ detached) to record the resolution and use the commit message to
describe what happened.  Or I might just rebase.
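
A minimal sketch of that rerere behavior on a throwaway repository,
with the file name and contents made up for the demo: the same
conflict is hit twice, and the second time git reuses the recorded
resolution.

```shell
#!/bin/sh
# Sketch: with rerere.enabled, a conflict resolved once is resolved
# automatically when the same test merge is repeated later.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config rerere.enabled true            # add --global to enable everywhere

echo base > file.txt;          git add file.txt; git commit -q -m base
git branch topic
echo "from master" > file.txt; git commit -q -a -m "master change"
git checkout -q topic
echo "from topic" > file.txt;  git commit -q -a -m "topic change"

# First test merge: conflicts, so resolve by hand and commit.
git checkout -q --detach
git merge -q master || true               # expected to stop with a conflict
echo resolved > file.txt
git add file.txt
git commit -q -m "test merge"             # rerere records the resolution here

# Throw the test merge away and repeat it another day.
git checkout -q topic
git checkout -q --detach
git merge -q master || true               # still stops, but file.txt is fixed up
second=$(cat file.txt)
```

The second merge still stops and has to be concluded with "git add"
and "git commit" (unless rerere.autoupdate is set), but the file
content no longer needs hand editing.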

Because of the nature of patches applied by mail, before sending the
patches out, I either rebase one last time or very loudly mention
which old version of the codebase the patch series applies to.
Usually the former.  But this step would not be necessary if asking
people to pull from me using a protocol that transfers the actual
objects.

Hope that helps,
Jonathan

[*] Actually, "git rebase" is a little smarter than that, in that it
notices and skips patches that have already been applied in
new-upstream.  A better imitation would be to use

	git cherry-pick --cherry-pick --right-only HEAD...master

Even better is to look at git-rebase.sh to see what it actually does.
:)
