* rebase parents, or tracking upstream but removing non-distributable bits
From: Alexandre Oliva @ 2010-12-30 17:54 UTC (permalink / raw)
To: git
Say the git repository of a project I use (with changes) on another
project I work on contains portions that I oughtn't distribute. Say,
portions that are illegal, immoral or too risky in my jurisdiction:
patented stuff that lawyers say I should not distribute in any way,
unauthorized or otherwise copyright-infringing bits, text or pictures
that are offensive or even illegal to publish, i.e., stuff that I must
not be caught distributing and that, ideally, I could arrange to not
even possess.
If you guessed that my primary reason to want this is the non-Free
Software in the Linux git repository, you got it right :-) Anyhow,
regardless of your opinion as to my stance in this matter, I hope you'll
agree that the scenarios above are relevant and desirable. Heck, even a
business that decides to remove all traces of a feature that was
planned for a certain release, but that is pushed back to a later
release, could benefit from this.
Note that simply reverting/removing these bits from the head of a branch
wouldn't be enough: since the repository carries the entire history,
pushing the head of the branch to my public repository would amount to
publishing the bits I must not publish.
I need to be able to maintain and publish a modified repository, that
filters out the unwanted portions, but still be able to pull changes
from the upstream repository. Desirable, but not strictly necessary, is
the possibility of letting upstream pull my improvements, without
bringing in the changes I made to remove the bits I'm not supposed to
distribute.
Given this problem statement, I started looking for solutions that
didn't require modifying git.
I first looked into rewriting history, removing the unwanted bits and
replaying subsequent changes, but quickly discarded it, for it would
make my local repository incompatible with upstream both ways: I
wouldn't be able to pull from it; upstream wouldn't be able to pull to
it; third parties would run into ugly situations trying to carry patches
from either one to the other.
Now, it looks like I might be able to pull from upstream if I maintain
manually a graft file that named each upstream commit as an additional
parent of the corresponding local rebase commit that brought it into my
rewritten tree. Workable, maybe, but this wouldn't help third parties
that used my public repository.
Besides, I'm concerned that pushing from the local repository (with the
graft file) to the public repository would end up publishing the changes
I'm not supposed to distribute, because they'd be taken as parents of
the local commits.
Are there any other ways to support the desired features with git as-is?
AFAICT, there isn't, so I've been thinking of how to introduce this. I
suppose the simplest way to accomplish this is to introduce the notion
of a “weak parent”: one that is taken into account for purposes of
checking whether a commit is present in a branch being merged- or
rebased into, but that is not transmitted over pushes, and that is not
retained over purges, and not complained about when missing.
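
To make the proposed semantics concrete, here is a toy Python model. It is not git code: `Commit`, `weak_parents`, `objects_to_push` and `is_merged` are hypothetical names invented for this sketch. It assumes that a weak parent counts when asking whether a commit is already contained in a branch, that it is never followed when collecting what to push, and that a weak parent missing from the local store is silently tolerated.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """Toy commit: an id plus ordinary and (hypothetical) weak parents."""
    cid: str
    parents: list = field(default_factory=list)
    weak_parents: list = field(default_factory=list)


def reachable(repo, tip, follow_weak):
    """Return the ids reachable from tip; weak edges are followed on request."""
    seen, stack = set(), [tip]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        commit = repo.get(cid)
        if commit is None:
            continue  # a missing weak parent is tolerated, not an error
        stack.extend(commit.parents)
        if follow_weak:
            stack.extend(commit.weak_parents)
    return seen


def objects_to_push(repo, tip):
    """Commits that would be transmitted: weak parents are never followed."""
    return reachable(repo, tip, follow_weak=False)


def is_merged(repo, branch_tip, cid):
    """Containment check for merge/rebase: weak parents do count here."""
    return cid in reachable(repo, branch_tip, follow_weak=True)


# u1, u2: upstream commits; c1, c2: their cleaned-up local counterparts.
repo = {
    "u1": Commit("u1"),
    "u2": Commit("u2", parents=["u1"]),
    "c1": Commit("c1", weak_parents=["u1"]),
    "c2": Commit("c2", parents=["c1"], weak_parents=["u2"]),
}
print(sorted(objects_to_push(repo, "c2")))
print(is_merged(repo, "c2", "u2"))
```

In this model the push set for `c2` contains only the cleaned-up commits, while the containment check still recognizes `u2` as already integrated.
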
I'm under the impression that this could not just work, but also make
rebasing in general (especially the hard case) far less problematic, for
git would be able to relate a rebased commit with an original commit.
Now, assuming I'm correct in this assessment, there are two questions
that remain:
- how to represent this?
I thought of changing the commit object format so as to somehow mark the
weak parents, say, with an additional character on the same line:
  parent f00ba5... W
an alternate header:
  wparent f00ba5...
or even an additional line:
  parent f00ba5...
  ...
  weak f00ba5...
For some backward compatibility, it looks like only the last form would
as much as stand a chance of being properly parsed, if the weak notes
are added at the end of the object.
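
As an illustration of the third form only, the sketch below parses the header part of a commit's text and collects a separate `weak` line while leaving the `parent` lines untouched. The `weak` header is the hypothetical one from the proposal, `parse_commit_headers` is a name made up here, and the ids are shortened placeholders. Whether existing git versions would accept such an extra line is not something this sketch verifies.

```python
def parse_commit_headers(raw):
    """Split a commit's text into parents, weak parents, other headers, message."""
    header_text, _, message = raw.partition("\n\n")
    parents, weak, other = [], [], []
    for line in header_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            parents.append(value)
        elif key == "weak":  # hypothetical header, not part of git
            weak.append(value)
        else:
            other.append((key, value))  # tree, author, committer, ...
    return parents, weak, other, message


raw = (
    "tree 1111\n"
    "parent aaaa\n"
    "author A <a@example.org> 0 +0000\n"
    "committer A <a@example.org> 0 +0000\n"
    "weak f00ba5\n"
    "\n"
    "subject line\n"
)
parents, weak, other, message = parse_commit_headers(raw)
print(parents, weak)
```

A reader that only looks at `parent` lines sees the same parent list with or without the extra line, which is the property the third form relies on.
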
Another possibility is to create another kind of object, that named an
original and rebased commit and that, like a tag object, would be
(optionally?) transmitted when the (rebased) commit it named was
transmitted. This could be more interesting, in that it might enable
all traces of a rebase to be eventually removed. A (named?) object that
names multiple such pairs of commits might make even more sense to this
end.
Am I on the right track? Any thoughts, preferences, suggestions,
concerns, recommendations, advice, pointers or gotchas to watch out for
before I start implementing any of these possibilities?
I realize that, although this option could make “git pull --rebase” work
to track upstream in the rebased branch, and would enable me to publish
the repository with the rebased branch without the pieces I shouldn't
distribute, I'm not sure this would enable upstream to easily integrate
my changes. Or would it?
Thanks in advance,
I'm not subscribed, but I'm going to look for replies in the archives.
That said, I'd appreciate it if you'd explicitly copy me on any follow-ups.
(Mail-Followup-To: set accordingly)
Last but not least: Happy GNU Year! :-)
--
Alexandre Oliva, freedom fighter http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/ FSF Latin America board member
Free Software Evangelist Red Hat Brazil Compiler Engineer
* Re: rebase parents, or tracking upstream but removing non-distributable bits
From: Jonathan Nieder @ 2010-12-30 20:58 UTC (permalink / raw)
To: Alexandre Oliva; +Cc: git
Alexandre Oliva wrote:
> Now, it looks like I might be able to pull from upstream if I maintain
> manually a graft file that named each upstream commit as an additional
> parent of the corresponding local rebase commit that brought it into my
> rewritten tree. Workable, maybe, but this wouldn't help third parties
> that used my public repository.
Have you looked into "git replace"?
* Re: rebase parents, or tracking upstream but removing non-distributable bits
From: Alexandre Oliva @ 2010-12-30 22:32 UTC (permalink / raw)
To: Jonathan Nieder; +Cc: git
On Dec 30, 2010, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Alexandre Oliva wrote:
>> Now, it looks like I might be able to pull from upstream if I maintain
>> manually a graft file that named each upstream commit as an additional
>> parent of the corresponding local rebase commit that brought it into my
>> rewritten tree. Workable, maybe, but this wouldn't help third parties
>> that used my public repository.
> Have you looked into "git replace"?
As far as I could tell, it solves a complementary problem. IIUC, it
would enable me to replace objects (say files, trees or commits) in my
local repository so as to remove objectionable stuff, but when I pushed
a branch out of it, it would go out with the very stuff I'm not supposed
to publish. This is because AFAICT replace objects are not sent over
the wire.
Even if they were, I still don't think it would be appropriate to use
them, for I'm speaking of really different trees. Publishing a commit
replacement would, for anyone who had both my public repository and my
upstream, affect not just the branches I published, but also those in
upstream, which would be surprising and undesirable.
Finally, it wouldn't be a complete solution. Consider, for example, an
objectionable file or tree from an early commit, that I replaced with
something I can live with. A later commit that changed that tree, or
any of those files, would AFAICT *silently* override my replacement,
requiring constant monitoring and new replacements for every such
change.
With the rewrite/rebase model I have in mind, changes to modified files
would conflict, prompting an immediate fix, without any risk of
publishing modified versions of unwanted files. (Of course, in my
particular case I'd still have to monitor for newly-introduced
objectionable stuff, but that's to be expected.)
Did I make any mistakes in my analysis of the “replace” feature? It
would be lovely if I could use it, but, in a way, it appears to be the
dual of what I need: I need to fix a problem in what I provide to
others, while replace would fix the problem in what I see myself.
Anyhow, thanks for the pointer, appreciated!
* Re: rebase parents, or tracking upstream but removing non-distributable bits
From: Yann Dirson @ 2010-12-30 22:52 UTC (permalink / raw)
To: Alexandre Oliva, git
On Thu, Dec 30, 2010 at 03:54:29PM -0200, Alexandre Oliva wrote:
> Given this problem statement, I started looking for solutions that
> didn't require modifying git.
This is a problem I have already thought about a bit. Although
I do not have a proper solution either, let me share those ideas.
> I first looked into rewriting history, removing the unwanted bits and
> replaying subsequent changes, but quickly discarded it, for it would
> make my local repository incompatible with upstream both ways: I
> wouldn't be able to pull from it; upstream wouldn't be able to pull to
> it; third parties would run into ugly situations trying to carry patches
> from either one to the other.
>
> Now, it looks like I might be able to pull from upstream if I maintain
> manually a graft file that named each upstream commit as an additional
> parent of the corresponding local rebase commit that brought it into my
> rewritten tree. Workable, maybe, but this wouldn't help third parties
> that used my public repository.
As a side note: I fear grafts won't scale very nicely performance-wise
if you graft every commit.
My use case was similar to yours ("how could Debian distribute a
git tree of the kernel?"), but my focus was on distributing
well-defined snapshots. I ended up with the idea of grafting only
the set of successive revisions that get published, without
necessarily mirroring all upstream revisions in the filtered
tree. Such grafts represent points in history where we would merge
from upstream; without the grafts, the filtered tree would look
like successive imports of well-defined revisions (which notably makes
it realistic to adhere to the "all commits have been tested" paradigm).
> Besides, I'm concerned that pushing from the local repository (with the
> graft file) to the public repository would end up publishing the changes
> I'm not supposed to distribute, because they'd be taken as parents of
> the local commits.
Grafts cannot be pushed/pulled. For this, as Jonathan suggests in
another reply, "git replace" looks like a better choice, esp. in that
by default the replacement commits don't get pulled: you need to
explicitly request fetching refs/replace/commits (note: you may need
to handle merges there when needed). OTOH, those replacement commits
would pull the whole upstream history, so they cannot be part of the
filtered repository: it must come from an unfiltered kernel repo, that
may be a real problem if you cannot redistribute some upstream parts.
OTOH, a grafts file can be distributed out-of-band, and would only
pull the problematic contents when put in place, so it might indeed be
more adequate.
> Are there any other ways to support the desired features with git as-is?
> AFAICT, there isn't, so I've been thinking of how to introduce this. I
> suppose the simplest way to accomplish this is to introduce the notion
> of a “weak parent”: one that is taken into account for purposes of
> checking whether a commit is present in a branch being merged- or
> rebased into, but that is not transmitted over pushes, and that is not
> retained over purges, and not complained about when missing.
That sounds like heavy surgery with too many implications I can think
of.
When it comes to "modifying commit metadata", "git notes" comes to
mind. You cannot currently add or change commit metadata from a note,
but maybe that would be a better direction to dig into, so that parents
could be added when a specific notes namespace is activated? But then
we would get back to the problem I mentioned for refs/replace/.
> I'm under the impression that this could not just work, but also make
> rebasing in general (especially the hard case) far less problematic, for
> git would be able to relate a rebased commit with an original commit.
I suppose that by "hard case" you mean forking off a branch that gets
rebased later ? Note that "git pull" seems to be able to cope with
this using reflogs already (although I did not test that feature). A
less volatile place than reflogs could again be notes, without a
need for core changes.
> I'm not sure this would enable upstream to easily integrate
> my changes. Or would it?
This problem suggests a more generic one: how to "merge back" most
changes from a branch while still not merging some specific changes ?
It would also help when a maint branch gets some short-term
workarounds that we don't want in the master branch, but the very idea
has a serious flaw: it implies that the "merge back" commit contains
also the commits we don't want (here, the "filtering commits"). So I
guess cherry-pick will be the way here. Anyway, I doubt Linus would
like the idea of merging from such a filtered repo; sending patches is
probably preferred.
Best regards,
--
Yann
* Re: rebase parents, or tracking upstream but removing non-distributable bits
From: Alexandre Erwin Ittner @ 2010-12-30 22:58 UTC (permalink / raw)
To: git
Alexandre Oliva <lxoliva@fsfla.org> wrote
> I need to be able to maintain and publish a modified repository, that
> filters out the unwanted portions, but still be able to pull changes
> from the upstream repository.
Have you tried something with "git filter-branch"? I have never tried
something like this but I think it is possible to automate a process to
(1) pull the changes from the origin into a complete clone, (2) branch
from the HEAD and run "filter-branch" with a customized script to create
a temporary sanitized branch, (3) merge this temporary branch into a
complete sanitized branch, (4) record all the branchpoints, sparing the
next iteration from running through all the history again, and, (5) push
the sanitized branch somewhere.
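
The incremental part of this idea, step (4), can be sketched as a toy Python model. This is not `git filter-branch` itself: `rewrite_incrementally`, the in-memory commit tuples and the `firmware.bin` path are all assumptions made for illustration. The point is only that a persisted mapping from original ids to sanitized ids lets a later run skip history that was already processed.

```python
import hashlib
import json


def rewrite_incrementally(history, mapping, blocked_paths):
    """Rewrite commits not yet in `mapping`; return the newly rewritten ones.

    history: list of (cid, parent_ids, tree) in parents-first order, where
             tree is a dict of path -> content.
    mapping: dict of original id -> sanitized id, updated in place so that
             a later run can skip everything already processed.
    """
    rewritten = {}
    for cid, parent_ids, tree in history:
        if cid in mapping:
            continue  # handled by an earlier run
        clean_tree = {p: c for p, c in tree.items() if p not in blocked_paths}
        clean_parents = [mapping[p] for p in parent_ids]
        # The new id depends on the cleaned content and the cleaned parents,
        # so every sanitized commit gets an id different from the original.
        payload = json.dumps([clean_parents, clean_tree], sort_keys=True)
        new_id = hashlib.sha1(payload.encode()).hexdigest()
        mapping[cid] = new_id
        rewritten[new_id] = (clean_parents, clean_tree)
    return rewritten


blocked = {"firmware.bin"}
history = [
    ("u1", [], {"README": "hi", "firmware.bin": "blob"}),
    ("u2", ["u1"], {"README": "hi v2", "firmware.bin": "blob"}),
]
mapping = {}
first = rewrite_incrementally(history, mapping, blocked)

# Upstream moves on; only the new commit is rewritten on the second run.
history.append(("u3", ["u2"], {"README": "hi v3", "firmware.bin": "blob2"}))
second = rewrite_incrementally(history, mapping, blocked)
print(len(first), len(second))
```

The changed ids in the model mirror the drawback discussed next: a sanitized branch never shares commit ids with its upstream.
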
Of course, this approach creates a complete nightmare with the
integration, testing, and code attribution: it changes the SHA1s,
invalidates signed tags, forces the use of patches instead of pull
requests to the upstream contributions, may taint the validity of the
commit messages and authorship, etc. Publishing edited commits seems a
delicate subject for me -- your "filter-branch" script should mark their
commit messages as such.
Regards,
--
Alexandre Erwin Ittner - alexandre@ittner.com.br
OpenPGP pubkey 0x0041A1FB @ http://pgp.mit.edu
* Re: rebase parents, or tracking upstream but removing non-distributable bits
From: Jakub Narebski @ 2010-12-30 23:14 UTC (permalink / raw)
To: Alexandre Oliva; +Cc: Jonathan Nieder, git
Alexandre Oliva <lxoliva@fsfla.org> writes:
> On Dec 30, 2010, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> > Alexandre Oliva wrote:
> >> Now, it looks like I might be able to pull from upstream if I maintain
> >> manually a graft file that named each upstream commit as an additional
> >> parent of the corresponding local rebase commit that brought it into my
> >> rewritten tree. Workable, maybe, but this wouldn't help third parties
> >> that used my public repository.
>
> > Have you looked into "git replace"?
>
> As far as I could tell, it solves a complementary problem. IIUC, it
> would enable me to replace objects (say files, trees or commits) in my
> local repository so as to remove objectionable stuff, but when I pushed
> a branch out of it, it would go out with the very stuff I'm not supposed
> to publish. This is because AFAICT replace objects are not sent over
> the wire.
They are not sent by default, but they (refs/replace/*) can be sent like
any other ref.
>
> Even if they were, I still don't think it would be appropriate to use
> them, for I'm speaking of really different trees. Publishing a commit
> replacement would, for anyone who had both my public repository and my
> upstream, affect not just the branches I published, but also those in
> upstream, which would be surprising and undesirable.
[...]
I guess what Jonathan had in mind was something like this:
* you have two branches, 'clean' and 'contaminated'
* you want to merge 'contaminated' into 'clean', but you don't
  want people to see the history of 'contaminated'
* in your private repository you merge 'contaminated' into 'clean'
  (with --no-ff, just in case), save the merge commit, then rewrite
  the top commit to be an ordinary commit, not a merge commit; it would
  bring [redacted] changes but not history
* you replace merge-turned-ordinary commit with a proper merge
  commit
* you don't distribute replacement refs to the public repository
Though I think that a better solution would be a feature-branch-based
workflow. Each feature is developed in a separate feature branch. If
a given feature is suitable for 'clean', you merge it into both 'clean'
and 'contaminated'. If it is not, you merge it only into
'contaminated'.
Hopefully that helps you develop a workflow.
--
Jakub Narebski
Poland
ShadeHawk on #git
* Re: rebase parents, or tracking upstream but removing non-distributable bits
From: Alexandre Oliva @ 2011-01-05 11:44 UTC (permalink / raw)
To: Jakub Narebski, Yann Dirson; +Cc: Jonathan Nieder, git
On Dec 30, 2010, Jakub Narebski <jnareb@gmail.com> wrote:
> They are not sent by default, but they (refs/replace/*) can be sent like
> any other ref.
Oh, doh, I was modeling them after grafts, but indeed the replace refs,
unlike grafts, can be sent out. Which doesn't really help, since they'd
be sent out in addition to the objectionable stuff.
Unless the idea is to replace the other way round, i.e., instead of
cleaned-up commit replacing contaminated commit, mark the contaminated
commit as replacing the cleaned-up one. I haven't explored this
possibility, for it didn't seem to make much sense at first.
> * you replace merge-turned-ordinary commit with a proper merge
> commit
Aah... and this would presumably enable further merges onto my local
tree, but I'd publish commits that lost history and relationship with
their upstream commits.
I'm aiming at something better than this, something more like the result
of filter-branch, but with improvements for git pull/merge that (i) use
some ref/original mapping (that provides nearly equivalent info to that
of the weak parent idea I proposed before) to tell where we are, what we
have and what needs rewriting, and (ii) perform rewriting of each
brought in commit, keeping local history isomorphic to that of upstream,
and updating the remapping. Ideally, (iii) have means for merge to use
the remapping backwards, so that one could merge from the cleaned-up
branch to the contaminated branch, or even to publish the remapping as
equivalences rather than unidirectional mappings. Perhaps storing them
as trees (or some other format) rather than as long lists of refs would
make them more efficient to deal with, especially after packing.
More details about what we're after in the thread containing:
http://www.mail-archive.com/gnu-linux-libre@nongnu.org/msg00903.html
As for the rewriting itself (which I regard as a solved problem, it's
compatibility between rewritten branches that I'm trying to address), I'm
thinking of making manual changes to the trees whose commits introduced
undesirable content, taking note of the contaminated and clean objects,
and then writing a script to remap with git filter-branch the contents
of the index for each commit, replacing contaminated with clean file, or
removing fully-contaminated file.
> Though I think that a better solution would be a feature-branch-based
> workflow.
We are not in a position to influence how upstream does their
development, and I suppose this would be the case in many (but not all)
of the situations I described as motivators.
On Dec 30, 2010, Yann Dirson <ydirson@free.fr> wrote:
>> I'm under the impression that this could not just work, but also make
>> rebasing in general (especially the hard case) far less problematic, for
>> git would be able to relate a rebased commit with an original commit.
> I suppose that by "hard case" you mean forking off a branch that gets
> rebased later ?
I meant the case described as “hard case” in the git-rebase man page:
http://www.kernel.org/pub/software/scm/git/docs/git-rebase.html
Hard case: The changes are not the same.
This happens if the subsystem rebase had conflicts, or used
--interactive to omit, edit, squash, or fixup commits; or if the
upstream used one of commit --amend, reset, or filter-branch.
> This problem suggests a more generic one: how to "merge back" most
> changes from a branch while still not merging some specific changes ?
Thanks for the suggestion. That made me think that, more than a
parent/child relationship, the original and rewritten commits should be
perceived as siblings as far as merges are concerned, when a
correspondence/equivalence table is given. Hopefully this wouldn't be
too much of a change to merge and rebase.
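
A minimal Python sketch of that idea follows. It assumes the equivalence table is simply a list of (original, rewritten) id pairs; `commits_to_bring` and the ids are hypothetical, and nothing here reflects how git's merge machinery works. The table is used symmetrically, so the same function answers the question in both directions.

```python
def commits_to_bring(source_ids, target_ids, equivalences):
    """List source commits that have no counterpart in the target branch.

    equivalences: iterable of (original_id, rewritten_id) pairs, treated as
    symmetric so the table can be consulted in either direction.
    """
    counterpart = {}
    for original, rewritten in equivalences:
        counterpart[original] = rewritten
        counterpart[rewritten] = original
    present = set(target_ids)
    return [
        cid for cid in source_ids
        if cid not in present and counterpart.get(cid) not in present
    ]


upstream = ["u1", "u2", "u3"]   # contaminated history
cleaned = ["c1", "c2", "x1"]    # rewritten history plus one local change
table = [("u1", "c1"), ("u2", "c2")]

print(commits_to_bring(upstream, cleaned, table))
print(commits_to_bring(cleaned, upstream, table))
```

With this table, only `u3` still needs rewriting into the cleaned branch, and only the local change `x1` would be offered back to upstream.
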
Am I making sense? Does this seem generally useful, say, for someone
trying to participate in the development of unencumbered portions of
a (patent|copyright|contractually|restriction)-encumbered project?