* Maintaining historical data in a git repo
From: Yuval Adam @ 2012-03-30 13:34 UTC
To: git

As part of a public project to open-source the Israeli law code, we
are looking into ways to represent such data in a git repository.

The main challenge is to represent historical data _in a semantically
correct way_ within a git repository, while keeping the ability to
change our record of what happened in the past.

For example, we might have revisions B and C of a certain legal
document, commit them to the repo, and at a later time want to add
revision A at the proper place in the git commit tree (probably by
rebasing or replacing).

Allowing decentralization and updates is a major requirement.

We're trying to map out the pros and cons of the different options for
maintaining such a repo.

Has anyone ever attempted something like this? Are there any projects
that build on the git plumbing and provide wrapper APIs to handle
historic data?

We really could use any reference or advice we can get on this subject.

Thanks,
--
Yuval Adam
http://y3xz.com
* Re: Maintaining historical data in a git repo 2012-03-30 13:34 Maintaining historical data in a git repo Yuval Adam @ 2012-03-30 15:10 ` Seth Robertson 2012-03-30 15:55 ` Yuval Adam 2012-04-03 9:25 ` Maintaining historical data in a git repo Andreas Stricker 1 sibling, 1 reply; 13+ messages in thread From: Seth Robertson @ 2012-03-30 15:10 UTC (permalink / raw) To: Yuval Adam; +Cc: git In message <CA+P+rLeyEcZPudhLWavB74CiDAqpn+iNkk4F8=NK_yGaJPMmyA@mail.gmail.com>, Yuval Adam writes: As part of a public project to open-source the Israeli law code, we are looking into ways of represent such data in a git repository. This is extremely cool. I wish others were forward thinking enough to do this. The main challenge is to represent historical data _in a semantically correct way_ within a git repository, while having the ability to change data that has occurred in the past. Revision control shouldn't be used to change the past (even if git allows this with sufficient amounts of pain/warning to all users). What it is extremely good at is preserving the past and tracking the changes that are made. For example, we might have revisions B and C of a certain legal document, commit to repo, and at a later time want to add revision A to the proper place in the git commit tree (probably with rebasing or replacing). There is no problem doing this. I'll make up a mythical workflow which might be realistic. Someone proposes a bill, so a branch for the proposal is created. In many of the laws I am familiar with, there is the text of the law and then the text says "Amend V.5.12.A.b to add '25: or to commit a nasal offense (as defined in V.5.12.A) with a shoe'". So the branch might contain the text of the proposed law and then actually go through to the document V.5.12.A.b and add the new data to the appropriate file (in an ideal world that might be an automatic process, but laws are rarely so precise). 
The proposed law changes and the bill text changes would be committed onto the branch. As the bill goes through committee people make changes, adding things, removing things, etc. Each change is a commit. One example change might be a new change saying "remove the change made 2 days ago" or "make the current version the version from 10 days ago". Both of those specific changes would ideally be positive changes. You would not actually be deleting the change made two days ago or removing all changes made between 10 days ago and now, you would be making a new commit to remove the effects of the unwanted changes. When the negotiations are over and assuming the bill gets all three readings (each reading could be a "tag" to document exactly what was read) and voted into a law, you would then merge the bill branch into the "law" branch which represents the actual legally active laws. This could be done as a "squash" merge which hides all of the committee negotiations or it could be done as a normal merge which allows the history of the negotiations to be visible, or, depending on the visibility of the committee negotiations, you could even do a combination of the two. And yes, git supports more complex processes automatically, like each Knesset member making their own proposed changes and the committee chair merging the appropriate version in if it was approved and the others being either discarded or archived for history but not incorporated. Allowing decentralization and updates is a major requirement. git is extremely good at this. We're trying to map out the various pros and cons of the different options of maintaining such a repo. Ideally the data being represented would be structured, textual, and somewhat line oriented, plain text/UTF-8 files (no matter the word direction) like this email are ideal. 
Committing binary Office documents (Word, OpenXML, ODF, etc) is not ideal, since under most circumstances/without a lot of work you are not going to get good differences so that you can easily see the history of the law. You can write custom binary drivers to extract this difference information from these binary documents, but that is the "lot of work" I was talking about. You additionally might want to have separate repositories for separate groups of laws to prevent repositories from getting unwieldy. There are tools which let you group repositories together. Has anyone ever attempted something like this? Many people use git to track living documents. Perhaps not law per se, but I don't particularly see why that would matter. Are there any projects that build on the git plumbing which provide wrapper APIs to handle historic data? Are you talking about "get rid of that change, it was bad" and "restore this version of the document as the good one" or "how do I import 64 years of law into git"? Git provides native tools to handle both. We really could use any reference or advice we can get on this subject. I'll point you at http://progit.org/book/ as a general reference about git and http://sethrobertson.github.com/GitBestPractices/ as a reference about best practices. -Seth Robertson ^ permalink raw reply [flat|nested] 13+ messages in thread
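The bill workflow described in the message above can be sketched with ordinary git commands. This is only an illustration under assumptions: the branch names (law, bill-42), the file name statute.txt, the tag names and the identity are all made up, and a real legislative process would have many more steps.

```shell
#!/bin/sh
# Sketch of the bill workflow: every name below is hypothetical.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/law   # call the initial branch "law"

# The "law" branch holds the legally active text.
printf 'Section 1: original text\n' > statute.txt
git add statute.txt
git commit -q -m "Law as currently in force"

# A proposed bill lives on its own branch.
git checkout -q -b bill-42
printf 'Section 2: proposed addition\n' >> statute.txt
git commit -q -am "Bill 42: add section 2"
git tag first-reading                  # document exactly what was read

printf 'Section 3: committee amendment\n' >> statute.txt
git commit -q -am "Committee: add section 3"

# "Remove the change made 2 days ago" is a new commit, not a deletion.
git revert --no-edit HEAD >/dev/null
git tag third-reading

# Once the bill passes, merge it into "law". --no-ff keeps the committee
# history visible; "git merge --squash" would hide it instead.
git checkout -q law
git merge -q --no-ff -m "Enact bill 42" bill-42

git log --oneline --graph
```

The revert leaves both the committee amendment and its removal in the history, which is the "positive change" idea from the message.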
* Re: Maintaining historical data in a git repo
From: Yuval Adam @ 2012-03-30 15:55 UTC
To: git

On Fri, Mar 30, 2012 at 6:10 PM, Seth Robertson <in-gitvger@baka.org> wrote:
>
> Revision control shouldn't be used to change the past (even if git
> allows this with sufficient amounts of pain/warning to all users).
> What it is extremely good at is preserving the past and tracking the
> changes that are made.

This is exactly what we _do_ want to do.
Our use case for this is like so:
"ok, this is how the law is today, and we're not quite sure how it got
to this point"
But then some X time later:
"so we found out that clauses (1), (e) and (X) were changed on March
30, 1957, and we want to know this for future reference"

So, yes, we do need a way of knowing (blaming?) what happened in the
past and how the law was shaped over time.

Is this something that is definitively complicated with git?

--
Yuval Adam
http://y3xz.com
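For the "blaming" part of the question, here is a minimal sketch of how recorded history can be queried once it is in a repository. The file name law.txt, the clause texts and the commit messages are invented for the example.

```shell
#!/bin/sh
# Sketch: asking git which commit shaped a clause. All content is made up.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q

printf 'Clause 1: taxes are due in April\nClause 2: appeals within 30 days\n' > law.txt
git add law.txt
git commit -q -m "Law as first recorded"

printf 'Clause 1: taxes are due in April\nClause 2: appeals within 45 days\n' > law.txt
git commit -q -am "Amendment: extend appeal period to 45 days"

# Which commit last touched each line of the current text?
git blame -s law.txt

# Which commits added or removed the string "45 days"?
changed_by=$(git log --format=%s -S'45 days' -- law.txt)
echo "$changed_by"
```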
* Re: Maintaining historical data in a git repo 2012-03-30 15:55 ` Yuval Adam @ 2012-03-30 16:18 ` Seth Robertson 2012-03-30 16:32 ` Jakub Narebski 2012-03-30 16:52 ` Junio C Hamano 2012-04-03 19:45 ` Maintaining historical outlines Markus Elfring 2 siblings, 1 reply; 13+ messages in thread From: Seth Robertson @ 2012-03-30 16:18 UTC (permalink / raw) To: Yuval Adam; +Cc: git In message <CA+P+rLcWT0SZQjW2LtFXXCDRwjMp8daJ2hVup=7cnsRGbKw7xw@mail.gmail.com>, Yuval Adam writes: On Fri, Mar 30, 2012 at 6:10 PM, Seth Robertson <in-gitvger@baka.org> wrote: > Revision control shouldn't be used to change the past (even if git > allows this with sufficient amounts of pain/warning to all users). > What it is extremely good at is preserving the past and tracking the > changes that are made. This is exactly what we _do_ want to do. Is this something that is definitively complicated with git? Ah, I understand now. I imagine others will chime in as well, but this should not be too complex with git. You can easily go back into history and change it. The problem comes in when you have shared your repository with other people. In general, rewriting public history is a bad idea because git cannot tell the difference between someone adding to history for good reasons (expanding on known history) and bad reasons (retroactively rewriting the law to add a loophole). You can absolutely do it, but then you have to "force push" your changes to the master server to override the history (assuming that is allowed, and it typically is not by default) and then everyone else would have to do special things (`git pull --rebase` in the simple case, rebuilding branches and tags in more complex cases) to get the new history. Clearly for something like the law and the probable complex workflow it will have, this isn't a good method. What I would probably suggest is having either a historical branch or a historical repository which is allowed and expected to be rewritten. 
The changes would then be confined to places where active
"development" would not be occurring, and the process to recover from
the retroactive changes could be automated. The "git replace" command
and the grafts file (.git/info/grafts; the latter might be deprecated)
could be used to join the two histories together, so that the seam is
transparent to those who need a consistent view from now back to the
beginning.

With a separate repo, the normal users who only care about the recent
changes and the current state never have to do anything special or
worry about the history changes, but it should work in either case.

					-Seth Robertson
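To see why rewriting shared history forces everyone to recover, it may help to look at what a rewrite does underneath. The sketch below inserts a revision A before existing revisions B and C using plumbing commands rather than rebase or filter-branch; that choice, the branch name law and the file name law.txt are assumptions made for the illustration. Because a commit ID covers the IDs of its parents, B and C necessarily get new IDs.

```shell
#!/bin/sh
# Sketch: insert revision A underneath B and C. All names are hypothetical.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/law

printf 'revision B\n' > law.txt
git add law.txt
git commit -q -m "Revision B"
printf 'revision C\n' > law.txt
git commit -q -am "Revision C"
old_tip=$(git rev-parse law)

# Store revision A as a blob and a one-file tree, then wrap it in a
# parentless commit.
blob_a=$(printf 'revision A\n' | git hash-object -w --stdin)
tree_a=$(printf '100644 blob %s\tlaw.txt\n' "$blob_a" | git mktree)
commit_a=$(git commit-tree -m "Revision A" "$tree_a")

# Re-create B and C with their original trees but new parents.
# (Author and date are taken from the environment here, not copied.)
tree_b=$(git rev-parse 'law~1^{tree}')
tree_c=$(git rev-parse 'law^{tree}')
commit_b=$(git commit-tree -p "$commit_a" -m "Revision B" "$tree_b")
commit_c=$(git commit-tree -p "$commit_b" -m "Revision C" "$tree_c")

# Point the branch at the rewritten history.
git update-ref refs/heads/law "$commit_c"
new_tip=$(git rev-parse law)

git log --format='%h %s' law
echo "old tip: $old_tip"
echo "new tip: $new_tip"
```

The old tip is no longer an ancestor of the branch, so publishing the result needs a forced push and every clone has to move to the new history.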
* Re: Maintaining historical data in a git repo 2012-03-30 16:18 ` Seth Robertson @ 2012-03-30 16:32 ` Jakub Narebski 0 siblings, 0 replies; 13+ messages in thread From: Jakub Narebski @ 2012-03-30 16:32 UTC (permalink / raw) To: Seth Robertson; +Cc: Yuval Adam, git Seth Robertson <in-gitvger@baka.org> writes: > In message <CA+P+rLcWT0SZQjW2LtFXXCDRwjMp8daJ2hVup=7cnsRGbKw7xw@mail.gmail.com>, Yuval Adam writes: > > On Fri, Mar 30, 2012 at 6:10 PM, Seth Robertson <in-gitvger@baka.org> wrote: > > Revision control shouldn't be used to change the past (even if git > > allows this with sufficient amounts of pain/warning to all users). > > What it is extremely good at is preserving the past and tracking the > > changes that are made. > > This is exactly what we _do_ want to do. > > Is this something that is definitively complicated with git? > > Ah, I understand now. I imagine others will chime in as well, but > this should not be too complex with git. You can easily go back into > history and change it. The problem comes in when you have shared your > repository with other people. > > In general, rewriting public history is a bad idea because git cannot > tell the difference between someone adding to history for good reasons > (expanding on known history) and bad reasons (retroactively rewriting > the law to add a loophole). > > You can absolutely do it, For example using `git filter-branch`, or grafts mechanism plus said git-filter-branch, or interactive rebase for changes closer to current version, or `git commit --amend` for latest version (latest commit). > but then you have to "force push" your > changes to the master server to override the history (assuming that is > allowed, and it typically is not by default) and then everyone else > would have to do special things (`git pull --rebase` in the simple > case, rebuilding branches and tags in more complex cases) to get the > new history. 
> Clearly for something like the law and the probable
> complex workflow it will have, this isn't a good method.

Well, if nobody is basing their work on this repository, and it is
meant as a read-only source of information, that doesn't matter much.

>
> What I would probably suggest is having either a historical branch or
> a historical repository which is allowed and expected to be rewritten.
[...]

Yet another solution would be to fix mistakes using the `git replace`
mechanism. It doesn't so much rewrite history as paste fixes on top of
it; this of course requires setting up sharing of those replacements
(fixes). See the git-replace(1) manpage for more information.

--
Jakub Narebski
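A small sketch of the "paste on a fix" idea with git replace: a commit whose message carries the wrong year is swapped for a corrected copy, while every commit ID reachable from the branch stays the same. The branch name, messages and file contents are invented, and the remote name origin in the comments is an assumption.

```shell
#!/bin/sh
# Sketch: fix a mistake in an old commit with "git replace".
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/law

printf 'Clause 1\n' > law.txt
git add law.txt
git commit -q -m "Original statute"
printf 'Clause 2\n' >> law.txt
git commit -q -am "Amendment enacted 1958-03-30"   # wrong year
bad=$(git rev-parse HEAD)
printf 'Clause 3\n' >> law.txt
git commit -q -am "Later amendment"
tip_before=$(git rev-parse law)

# Build a corrected commit: same tree, same parent, fixed message.
parent=$(git rev-parse "$bad^")
tree=$(git rev-parse "$bad^{tree}")
good=$(git commit-tree -p "$parent" -m "Amendment enacted 1957-03-30" "$tree")

# Tell git to use the corrected object wherever the bad one is read.
git replace "$bad" "$good"

git log --format=%s law                        # shows the corrected message
git --no-replace-objects log --format=%s law   # shows the original record

# Replacements live under refs/replace/ and are not transferred by
# default; with a remote called "origin" they could be shared with:
#   git push origin 'refs/replace/*'
#   git fetch origin 'refs/replace/*:refs/replace/*'
```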
* Re: Maintaining historical data in a git repo
From: Junio C Hamano @ 2012-03-30 16:52 UTC
To: Yuval Adam; +Cc: git

Yuval Adam <yuv.adm@gmail.com> writes:

> Is this something that is definitively complicated with git?

That's not really an "is it complicated with git" question, I would
have to say. With any version control system, you build history
starting from one point and going _forward_, never inserting past
events as you dig back.

Surely, you could fake it by rewriting history, but I do not think an
SCM is particularly geared towards such a use case *while* you are
still investigating the history of the law and recording your findings.

I do agree that once such a discovered history is *complete*, it would
be nice to record it in chronological order in an SCM with a powerful
history digging capability, though.
* Re: Maintaining historical data in a git repo 2012-03-30 16:52 ` Junio C Hamano @ 2012-03-30 20:39 ` Yuval Adam 2012-03-30 22:29 ` david ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Yuval Adam @ 2012-03-30 20:39 UTC (permalink / raw) To: git On Fri, Mar 30, 2012 at 7:52 PM, Junio C Hamano <gitster@pobox.com> wrote: > That's not really "is it complicated with git" question, I would have to > say. Any version control system you would build history starting from one > point going _forward_, never inserting past event as you dig back. That is true. It is very clear to us that an SCM is optimized for the prevalent use case, which is tracking code (well, mostly code) as it is written. Naturally this always starts at some point in time and progresses into the future. However, we perceive git as a very powerful tool, that can fit beautifully with the way legislation works today. The challenge for us - should we choose to accept it ;) - is to build a set of wrapper tools that allow us to use git in such a way, while enabling us to build up past history. Yes, this is not the usual use case, but we're highly motivated on making this work. We believe this could also be an interesting experience for the git community in seeing how the git plumbing can be used for other cases, even if they veer off on some weird tangent. We'll definitely be back with more questions and updates, as we progress. Thanks, everyone, for your responses and feedback! -- Yuval Adam http://y3xz.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Maintaining historical data in a git repo
From: david @ 2012-03-30 22:29 UTC
To: Yuval Adam; +Cc: git

On Fri, 30 Mar 2012, Yuval Adam wrote:

> On Fri, Mar 30, 2012 at 7:52 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> That's not really "is it complicated with git" question, I would have to
>> say. Any version control system you would build history starting from one
>> point going _forward_, never inserting past event as you dig back.
>
> That is true.
> It is very clear to us that an SCM is optimized for the prevalent use
> case, which is tracking code (well, mostly code) as it is written.
> Naturally this always starts at some point in time and progresses into
> the future.
>
> However, we perceive git as a very powerful tool, that can fit
> beautifully with the way legislation works today.
> The challenge for us - should we choose to accept it ;) - is to build
> a set of wrapper tools that allow us to use git in such a way, while
> enabling us to build up past history.
>
> Yes, this is not the usual use case, but we're highly motivated on
> making this work.
> We believe this could also be an interesting experience for the git
> community in seeing how the git plumbing can be used for other cases,
> even if they veer off on some weird tangent.
>
> We'll definitely be back with more questions and updates, as we progress.
> Thanks, everyone, for your responses and feedback!

You may want to take a hint from how the linux repository works.

When git was created, the then-current version was committed as the
base and development went on from there.

Later on the linux historical repository was created (and re-created
over time as other versions were found). The git grafts mechanism can
be used to join the 'current' repository to the 'historical' repository
so that they can be treated as one.

I strongly suspect that something along these lines is what you need.

David Lang
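Joining a 'current' repository to a 'historical' one can be sketched as follows. This uses `git replace --graft`, which is one way to express a graft in recent git versions; whether it matches the setup used for the linux history is not something this sketch claims. Repository names, the branch name law and all contents are made up.

```shell
#!/bin/sh
# Sketch: treat a separate historical repository as the past of the
# current one. All names are hypothetical.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

work=$(mktemp -d)

# The historical repository: may be rebuilt as older versions turn up.
git init -q "$work/historical"
cd "$work/historical"
git symbolic-ref HEAD refs/heads/law
printf 'text of 1948\n' > law.txt
git add law.txt
git commit -q -m "Law as enacted"
printf 'text of 1957\n' > law.txt
git commit -q -am "First known amendment"

# The current repository: starts from the text known at import time.
git init -q "$work/current"
cd "$work/current"
git symbolic-ref HEAD refs/heads/law
printf 'text of 2012\n' > law.txt
git add law.txt
git commit -q -m "Law as of import"
printf 'text of 2012, amended\n' > law.txt
git commit -q -am "New amendment"

# Fetch the old history and graft it underneath the root commit.
root=$(git rev-list --max-parents=0 law)
git fetch -q ../historical law:refs/heads/historical
git replace --graft "$root" historical

git log --format=%s law                          # four commits, joined
git --no-replace-objects log --format=%s law     # two commits, as stored
```

Users who never fetch the historical branch or the replacement ref keep seeing only the short history.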
* Re: Maintaining historical data in a git repo 2012-03-30 20:39 ` Yuval Adam 2012-03-30 22:29 ` david @ 2012-03-31 1:04 ` Mark Lodato 2012-04-01 4:14 ` Holding, Lawrence 2012-04-02 11:38 ` Ævar Arnfjörð Bjarmason 2 siblings, 1 reply; 13+ messages in thread From: Mark Lodato @ 2012-03-31 1:04 UTC (permalink / raw) To: Yuval Adam; +Cc: git, Junio C Hamano, Jakub Narebski, Seth Robertson On Fri, Mar 30, 2012 at 4:39 PM, Yuval Adam <yuv.adm@gmail.com> wrote: > However, we perceive git as a very powerful tool, that can fit > beautifully with the way legislation works today. > The challenge for us - should we choose to accept it ;) - is to build > a set of wrapper tools that allow us to use git in such a way, while > enabling us to build up past history. If you're willing to put some time into either writing new tools or doing complicated work by hand, you could use git to keep track of the history's history. Have two branches: a real "master" branch and a "meta" branch to keep track of master's history. The former is what end users would see: the most accurate history of the code to date. The latter is what "developers" would use to rebuild the master branch with new information (say, adding A before B and C). To do this, you could try the following: Use normal git commands on the master branch, but every time you change master (say, commit or rebase), also make a special commit on the meta branch with the first parent being a reference to the new value of master. Use the remaining parents as "normal" references to previous meta commits, and use an empty tree. Now, the meta branch contains a complete history of the history, though viewing it will be extremely ugly unless you develop a custom tool to deal with its special form. 
Optionally, on the server, you could set up an update hook to disallow
updates of the master branch and disallow non-fast-forward updates of
the meta branch, and a post-receive hook to set the master branch to
point to the first parent of the meta branch each time the meta branch
is updated.

One caveat is that you must be careful about merges on the meta
branch, since git's default strategy will automatically do the wrong
thing. You could write your own merge strategy to handle this. (Sadly
there does not appear to be a way to use this strategy automatically
on a per-branch basis.) Another workaround would be to use something
that is unmergeable in the tree of the meta commit, rather than an
empty tree - say, a single file with the commit ID of the master
branch - which would prevent the default strategy from trivially and
incorrectly merging.

Using such a system would be awkward by hand but not terribly
difficult to automate. You could create a "git-meta-commit" command
to create a meta commit for the current branch. You might find
contrib/examples/git-merge.sh useful as a guide for how to do this.
If you'd like more details, please ask.

It would be nice if you could write a hook that automatically creates
a meta commit every time master's reflog is updated, but this does not
seem possible at the moment.
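A rough sketch of what such a "git-meta-commit" helper might look like, written as a shell function. It follows the description above (first parent is the new master, second parent is the previous meta commit, and the tree holds a single file with master's commit ID), but the function name, the file name master-id and the message format are assumptions, not an existing tool.

```shell
#!/bin/sh
# Sketch of a hypothetical "meta commit" helper for recording the
# history of master's history.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

meta_commit() {
    master_id=$(git rev-parse --verify refs/heads/master)
    # Tree with one file naming master's commit, so that two meta
    # commits never merge trivially.
    blob=$(printf '%s\n' "$master_id" | git hash-object -w --stdin)
    tree=$(printf '100644 blob %s\tmaster-id\n' "$blob" | git mktree)
    if prev=$(git rev-parse --verify -q refs/heads/meta); then
        new=$(git commit-tree -p "$master_id" -p "$prev" \
                  -m "meta: master is now $master_id" "$tree")
    else
        new=$(git commit-tree -p "$master_id" \
                  -m "meta: master is now $master_id" "$tree")
    fi
    git update-ref refs/heads/meta "$new"
}

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master

printf 'revision B\n' > law.txt
git add law.txt
git commit -q -m "Revision B"
meta_commit
old_master=$(git rev-parse master)

# Rewrite master (here a simple amend), then record the new state.
git commit -q --amend -m "Revision B (1957 text)"
meta_commit

git log --format='%h %s' meta
```

After the second call, `meta^1` is the rewritten master and `meta^2^1` is the master that was replaced, so nothing is lost when master is rebuilt.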
* RE: Maintaining historical data in a git repo
From: Holding, Lawrence @ 2012-04-01 4:14 UTC
To: Mark Lodato, Yuval Adam
Cc: git, Junio C Hamano, Jakub Narebski, Seth Robertson

> Mark Lodato wrote:
> On Fri, Mar 30, 2012 at 4:39 PM, Yuval Adam <yuv.adm@gmail.com> wrote:
> > However, we perceive git as a very powerful tool, that can fit
> > beautifully with the way legislation works today.
> > The challenge for us - should we choose to accept it ;) - is to build
> > a set of wrapper tools that allow us to use git in such a way, while
> > enabling us to build up past history.
>
> If you're willing to put some time into either writing new tools or
> doing complicated work by hand, you could use git to keep track of the
> history's history. Have two branches: a real "master" branch and a
> "meta" branch to keep track of master's history. The former is what
> end users would see: the most accurate history of the code to date.
> The latter is what "developers" would use to rebuild the master branch
> with new information (say, adding A before B and C).
>

Why not just skip the master branch altogether?

Create a branch named for today's date and commit to it the history of
the law as it is understood today. When historic changes are
discovered, create a branch (named for the date of the discovery)
starting at the point where they fit into the record, commit the new
version, then cherry-pick the remainder of the history from that point
on top of it. You end up with two parallel historic records: one
showing what you thought the history of the document was up until last
Wednesday, and a new branch showing what we know now.

Having the same version of the document in multiple branches has no
storage penalty in git.
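The dated-branch scheme could look roughly like this. The branch names (as-of-2012-03-30, as-of-2012-04-01), file names and amendments are invented, and the two amendments touch different files so that the cherry-pick applies cleanly; overlapping changes would need manual conflict resolution.

```shell
#!/bin/sh
# Sketch: one branch per "state of knowledge", rebuilt by cherry-pick.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/as-of-2012-03-30

# History as understood on 2012-03-30.
printf 'clause 1, original\n' > clause1.txt
printf 'clause 2, original\n' > clause2.txt
git add clause1.txt clause2.txt
git commit -q -m "Law as first recorded"
base=$(git rev-parse HEAD)
printf 'clause 2, amended 1960\n' > clause2.txt
git commit -q -am "1960 amendment to clause 2"

# On 2012-04-01 an earlier amendment is discovered. Branch off at the
# point where it fits, commit it, then replay the rest on top.
git checkout -q -b as-of-2012-04-01 "$base"
printf 'clause 1, amended 1957\n' > clause1.txt
git commit -q -am "1957 amendment to clause 1"
git cherry-pick "$base..as-of-2012-03-30" >/dev/null

git log --format=%s as-of-2012-03-30
git log --format=%s as-of-2012-04-01
```

Both branches stay in the repository, and since neither is rewritten, no forced push is needed.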
* Re: Maintaining historical data in a git repo
From: Ævar Arnfjörð Bjarmason @ 2012-04-02 11:38 UTC
To: Yuval Adam; +Cc: git

On Fri, Mar 30, 2012 at 22:39, Yuval Adam <yuv.adm@gmail.com> wrote:
> However, we perceive git as a very powerful tool, that can fit
> beautifully with the way legislation works today.
> The challenge for us - should we choose to accept it ;) - is to build
> a set of wrapper tools that allow us to use git in such a way, while
> enabling us to build up past history.

You can always solve this by having two repositories. One is the
canonical Git repository with your laws, using some text-based format
to describe when each change was added.

You'd never rewrite the history of this repository, since it would
represent the history of your project to give a commit timeline to the
law; its commit log would not attempt to reflect changes in the law.

You could then have tools to export another Git history from that
original repository. That one would be constantly rewritten, and
nobody would base changes on it.

You could also make the two one and the same, but you don't have to.
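One possible shape for the two-repository idea is sketched below. The "text-based format" is assumed here to be one file per revision, named by its effective date; that layout, the directory names canonical and export, and all dates and contents are invented for the example. The export repository is thrown away and rebuilt on every run.

```shell
#!/bin/sh
# Sketch: a canonical repository that is never rewritten, plus a
# disposable export whose commit order follows the law's own timeline.
set -eu
export GIT_AUTHOR_NAME="Clerk" GIT_AUTHOR_EMAIL="clerk@example.org"
export GIT_COMMITTER_NAME="Clerk" GIT_COMMITTER_EMAIL="clerk@example.org"

work=$(mktemp -d)
cd "$work"

# Canonical repository: revisions are committed in discovery order.
git init -q canonical
mkdir canonical/revisions
printf 'text as amended in 1992\n' > canonical/revisions/1992-03-15.txt
git -C canonical add revisions
git -C canonical commit -q -m "Record the 1992 text"
printf 'text as enacted in 1975\n' > canonical/revisions/1975-06-01.txt
git -C canonical add revisions
git -C canonical commit -q -m "Found the 1975 text"

# Export: rebuild from scratch, one commit per revision, in date order
# (the glob sorts ISO dates chronologically).
rm -rf export
git init -q export
for f in canonical/revisions/*.txt; do
    day=$(basename "$f" .txt)
    cp "$f" export/law.txt
    git -C export add law.txt
    GIT_AUTHOR_DATE="$day 12:00:00 +0000" \
    GIT_COMMITTER_DATE="$day 12:00:00 +0000" \
        git -C export commit -q -m "Law as of $day"
done

git -C export log --format='%ad %s' --date=short
```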
* Re: Maintaining historical outlines 2012-03-30 15:55 ` Yuval Adam 2012-03-30 16:18 ` Seth Robertson 2012-03-30 16:52 ` Junio C Hamano @ 2012-04-03 19:45 ` Markus Elfring 2 siblings, 0 replies; 13+ messages in thread From: Markus Elfring @ 2012-04-03 19:45 UTC (permalink / raw) To: Yuval Adam; +Cc: git > Our use case for this is like so: > "ok, this is how the law is today, and we're not quite sure how it got > to this point" > But then some X time later: > "so we found out that clauses (1), (e) and (X) were changed on March > 30, 1957, and we want to know this for future reference" I imagine that technical challenges come from a different view for your use case. Content management systems can eventually show differences for line-oriented text files easily. But I guess that you are also interested in the maintenance of higher level semantic data structures that are usually contained in outlines. How would you like to build relationships between commit logs and changes to items like chapters, sections, paragraphs and sentences? Do you need to combine several information sources to generate a document query and result representation you desire? Regards, Markus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Maintaining historical data in a git repo
From: Andreas Stricker @ 2012-04-03 9:25 UTC
To: Yuval Adam; +Cc: git

On 30.03.12 15:34, Yuval Adam wrote:
> As part of a public project to open-source the Israeli law code, we
> are looking into ways of represent such data in a git repository.

I remember a discussion about the US Constitution [1] a while ago.
There are a few projects still available on GitHub [2,3]. Maybe this
helps as a starting point.

Regards, Andy

[1] http://thread.gmane.org/gmane.comp.version-control.git/152433
[2] https://github.com/jcsalomon/constitution
[3] https://github.com/zorz/InternetFreedomAmendment
end of thread, newest message ~2012-04-03 19:47 UTC

Thread overview: 13+ messages
2012-03-30 13:34 Maintaining historical data in a git repo Yuval Adam
2012-03-30 15:10 ` Seth Robertson
2012-03-30 15:55   ` Yuval Adam
2012-03-30 16:18     ` Seth Robertson
2012-03-30 16:32       ` Jakub Narebski
2012-03-30 16:52     ` Junio C Hamano
2012-03-30 20:39       ` Yuval Adam
2012-03-30 22:29         ` david
2012-03-31  1:04         ` Mark Lodato
2012-04-01  4:14           ` Holding, Lawrence
2012-04-02 11:38         ` Ævar Arnfjörð Bjarmason
2012-04-03 19:45     ` Maintaining historical outlines Markus Elfring
2012-04-03  9:25 ` Maintaining historical data in a git repo Andreas Stricker