* Announcing 3 git docs: Best Practices, fixing mistakes, post-production editing
From: Seth Robertson @ 2012-02-28 13:04 UTC
To: git

I would like to announce three git documents I have written which others (primarily on #git) have thought to be very useful, and so I would like to share them with the wider community.

Commit Often, Perfect Later, Publish Once: Git Best Practices
----------------------------------------------------------------------
http://sethrobertson.github.com/GitBestPractices

This first document covers a variety of topics, providing references and recommendations for using git. These best practices have been built up through decades of professional software management and development, years of git usage, and countless hours helping people on #git.

Table of Contents:

  Do read about git
  On Sausage Making
  Do commit early and often
  Do keep up to date
  Don't panic
  Do periodic maintenance
  Do backups
  Do enforce Standards
  Don't change published history
  Do use useful tools
  Do choose a workflow
  Do integrate with external tools
  Do divide work into repositories
  Miscellaneous "Do"s
  Do make useful commit messages
  Miscellaneous "Don't"s

On undoing, fixing, or removing commits or mistakes in git
----------------------------------------------------------------------
http://sethrobertson.github.com/GitFixUm

This next document covers the process of recovering from mistakes made while using git. It is a choose-your-own-adventure(1) style document which asks a series of questions to try and understand exactly what you did and what you want to do. Currently it provides twenty different solutions to various problems I have seen people have.
This was primarily developed to stop answering the same questions over and over again in #git, and worse, providing the wrong answers when questioners either failed to provide critical information or totally misunderstood what was going on.

Post-Production Editing using Git
----------------------------------------------------------------------
http://sethrobertson.github.com/GitPostProduction

This most recent document covers the topic of how to use git to make your commits appear to the outside world as if they were made perfectly. Doing so is something which is required by some projects, is recommended in gitworkflows(7) and the best practices document (On Sausage Making), and is a major feature of git. However, I have not found good documentation on exactly how to use git to accomplish this. The git-rebase man page is quite extensive, but also fairly confusing to the uninitiated.

I would appreciate comments, suggestions, or contributions for all three documents.

-Seth Robertson

(1) Not affiliated with Chooseco, LLC's "Choose Your Own Adventure". Good books, but a little light on the details of recovering from git merge errors.
* Re: Announcing 3 git docs: Best Practices, fixing mistakes, post-production editing
From: Jeff King @ 2012-02-28 22:52 UTC
To: Seth Robertson; +Cc: git

On Tue, Feb 28, 2012 at 08:04:30AM -0500, Seth Robertson wrote:

> Commit Often, Perfect Later, Publish Once: Git Best Practices
> ----------------------------------------------------------------------
> http://sethrobertson.github.com/GitBestPractices

I have only read the first of the three so far, but it looks very nice. I did notice a few things which were worth commenting on (I'll quote directly from the page in question below).

> [section Don't Panic, subsection Lost and Found]
>
> Dangling Commit
>
> These are the most likely candidates for finding lost data. A dangling
> commit is a commit no longer reachable by any branch or tag. This can
> happen due to resets and rebases and are normal. git show SHA will let
> you inspect them.

Resets and rebases record the commits in the reflog (at the very least in the HEAD reflog), and should generally not be the cause of dangling commits (the objects should usually expire in the same "git gc" that expires the reflog entries). I suspect a more common cause is deleting branches, which leaves no reflog (the commits may be in the HEAD reflog if they were ever checked out, though).

It's somewhat minor; the overall advice ("do not worry about dangling commits") holds. But it might be worth pointing out that the method for recovering an accidentally deleted branch is usually:

  1. look in the HEAD reflog

  2. if you can't find it there, try dangling commits

> [section Do make useful commit messages]

This talks about formatting, but not about content.
I have long wanted to write a nice essay on what should go into a good commit message, but when I've tried it ends up very specific to the project, the type of commit, and the individual change. I wonder if anybody knows of something good you could link to.

> [section On Sausage Making]
>
> Some people like to hide the sausage making, or in other words pretend to
> the outside world that their commits sprung full-formed in utter
> perfection into their git repository. Certain large public projects
> demand this, others demand smushing all work into one large commit, and
> still others do not care.
>
> A good reason to hide the sausage making is if you feel you may be
> cherry-picking commits a lot (though this too is often a sign of bad
> workflow). Having one or a small number of commits to pick is much
> easier than having to find one commit here, one there, and half of this
> other one. The latter approach makes your problem much much harder and
> typically will lead to merge conflicts when the donor branch is finally
> merged in.
>
> Another good reason is to ensure each commit compiles and/or passes
> regression tests, and represents a different easily understood concept
> (important for archeology). The former allows git-bisect to chose any
> commit and have a good chance of that commit doing something useful, and
> the latter allows for easy change review, understanding, and
> cherry-picking.

This is a nice overview of the motivation, but I think it misses one of the main reasons we clean up patches in git.git: code review.

When you publish patches, you have several audiences. One of those audiences is the end user who will compile release v1.5. They only care about the end result of the patches. Another is people bisecting, who care that everything intermediate builds and is reasonable; you can satisfy that by checking each commit against a test suite.
Another is people reading the logs later to find out what happened; it's OK for them to see that a bug was in the initial version, and then fixed 5 minutes later.

But yet another audience is reviewers who will read your changes and decide whether they should be applied, rejected, or re-worked. For those people, it is much harder to review a series that introduces a bug in patch 1, but fixes it in patch 5. The reviewer may also notice the bug, take time thinking about and writing an analysis, and then get frustrated to find that their work was wasted when they get to patch 5. The alternative is that they stop thinking about individual patches and consider the whole series (e.g., when they see patch 1 has a bug, stop reading and look through the other patches for a fix). But that makes review much harder, because you have to think about a much larger series of changes.

By cleaning up patches into single, logical changes that build on one another, and which don't individually regress (i.e., they are always moving towards some desirable common endpoint), the author is writing a chronological story not of what happened, but what _should_ happen, with the intent that the audience (i.e., reviewers) are convinced that the change is the right thing to do.

I really liked your movie analogy. Patch series are really just documentaries about a change, arranged for greatest impact on the viewer. :)

> [Do periodic maintenance]
>
> Compact your repo (git gc --aggressive)
>
> This will removed outdated dangling objects (after the two+ week grace
> period). It will also compress any loose objects git has added since
> your last gc. git will run gc automatically after certain commands, but
> doing a manual --aggressive will save space and speed git operations.

Most people shouldn't be using "--aggressive".
Unless you have an existing pack that is poorly packed (e.g., because you did a fast-import that did not do much delta searching), you are not going to see much benefit, and it will take a lot longer.

Basically the three levels of "gc" are:

  1. git gc --auto; if there are too many loose objects, they will all go into a new incremental pack. If there are already too many packs, all of the existing packs will get re-packed together. If we are making an incremental pack, this is by far the fastest, because the speed is independent of the existing history. If we pack everything together, it should be more or less the same as (2) below.

  2. git gc; this packs everything into a single pack. It does not use high window and depth parameters, but more importantly, it reuses existing deltas. That makes the delta compression phase _much_ faster, and it often makes the writing phase faster (because for older objects, we are primarily streaming them right out of the existing pack). On a big repo, though, it does do a lot of I/O, because it has to rewrite the whole pack.

  3. git gc --aggressive; this is often way slower than the above because we throw out all of the existing deltas and recompute them from scratch. The higher window parameter means it will spend a bit more time computing, and it may end up with a smaller pack.

In practical applications, I would expect (2) to achieve similar results to (3). If that isn't the case, then I think we should be tuning up the default window and depth parameters for non-aggressive "git gc" a bit.

> [section Miscellaneous "don't"s]
>
> create very large repositories (when possible)
>
> Git can be slow in the face of large repositories. There are
> git-config options that can help. pack.threads=1 pack.deltaCacheSize=1
> pack.windowMemory=512m core.packedGitWindowSize=16m
> core.packedGitLimit=128m. Other likely ones exist.

It might help to qualify "big" here. To some people, 10,000 files, 50,000 commits, and a 200M packfile is big.
But that's a fraction of linux-2.6, which most people use. I think big here is probably getting into 100K-200K files (where the time to stat() files becomes noticeable); commits are probably not relevant (because git is usually good at only looking at recent bits of history for most operations); and packfiles above 1G or so start to get cumbersome (mostly because of the I/O on a full repack; but then you should consider marking a pack as .keep).

But those numbers are just pulled out of a hat based on the last few years. Your OS, your hardware, and your expectations make a huge difference in what seems reasonable.

Your config recommendations seem mostly related to relieving memory pressure for packing (at the expense of making the pack a lot slower). Dropping --aggressive from your gc might help a lot with that, too. It might be worth noting that you should only start twiddling these options if you are running out of memory during a repack. They will not affect git performance for day-to-day commands.

I don't think you should need to adjust core.packedGitWindowSize or core.packedGitLimit at all. Those files are mmap'd, so it is up to the OS to be reasonable about faulting in or releasing the memory. The main motivation of pack windows is not memory _usage_, but rather getting a large contiguous chunk of the address space. mmap-ing a 4G packfile on a 32-bit system just doesn't work. But the defaults are set to reasonable values for each architecture.

-Peff
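The recovery order described in the message above (HEAD reflog first, dangling commits second) can be tried out in a throwaway repository. The following is only a sketch under stated assumptions: the temporary repository, the branch name `topic`, and the commit messages are made up for illustration, and `git fsck --no-reflogs` is used here merely to imitate what fsck would report once the reflog entries have expired.

```shell
#!/bin/sh
# Sketch: recover an accidentally deleted branch.
# Everything here (temp repo, branch "topic", messages) is hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.invalid"

echo base > base.txt
git add base.txt
git commit -q -m "base commit"
start_branch=$(git symbolic-ref --short HEAD)

git checkout -q -b topic
echo work > topic.txt
git add topic.txt
git commit -q -m "work on topic"

git checkout -q "$start_branch"
git branch -q -D topic            # the "accident": the branch reflog is gone

# 1. Look in the HEAD reflog. The commit was checked out, so HEAD recorded it.
from_reflog=$(git reflog | grep 'commit: work on topic' | head -n 1 | cut -d' ' -f1)

# 2. Fallback: dangling commits. --no-reflogs makes fsck ignore reflog
#    entries, imitating the situation after they have expired.
from_fsck=$(git fsck --no-reflogs 2>/dev/null |
    awk '$1 == "dangling" && $2 == "commit" {print $3}' | head -n 1)

# Either name identifies the same commit; re-create the branch from it.
git branch topic "$from_reflog"
echo "reflog: $from_reflog"
echo "fsck:   $from_fsck"
echo "topic:  $(git rev-parse topic)"
```

In a real repository the search string for step 1 would be whatever you remember about the lost work; scanning the plain `git reflog` output by eye is often enough.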
* Re: Announcing 3 git docs: Best Practices, fixing mistakes, post-production editing
From: Seth Robertson @ 2012-03-04 19:20 UTC
To: Jeff King; +Cc: git

First, I'd like to thank you for your comments. They certainly improved the document and made me think and experiment.

In message <20120228225205.GA23804@sigill.intra.peff.net>, Jeff King writes:

    On Tue, Feb 28, 2012 at 08:04:30AM -0500, Seth Robertson wrote:
    > [section Don't Panic, subsection Lost and Found]
    >
    > Dangling Commit
    >
    > These are the most likely candidates for finding lost data. A dangling
    > commit is a commit no longer reachable by any branch or tag. This can
    > happen due to resets and rebases and are normal. git show SHA will let
    > you inspect them.

    Resets and rebases record the commits in the reflog (at the very least in the HEAD reflog), and should generally not be the cause of dangling commits (the objects should usually expire in the same "git gc" that expires the reflog entries). I suspect a more common cause is deleting branches, which leaves no reflog (the commits may be in the HEAD reflog if they were ever checked out, though).

I get them all the time and I never delete branches.

    It's somewhat minor; the overall advice ("do not worry about dangling commits") holds. But it might be worth pointing out that the method for recovering an accidentally deleted branch is usually:

      1. look in the HEAD reflog

      2. if you can't find it there, try dangling commits

My understanding is that if a commit gets packed, it sticks around for a few weeks longer than the reflog since the clock gets reset when it gets evicted from a pack.

    > [section Do make useful commit messages]

    This talks about formatting, but not about content.
    I have long wanted to write a nice essay on what should go into a good commit message, but when I've tried it ends up very specific to the project, the type of commit, and the individual change. I wonder if anybody knows of something good you could link to.

I'd certainly like to see such a thing. I did touch on the subject further when I started talking about integration with bug tracking systems.

    > [section On Sausage Making]
    >
    > Some people like to hide the sausage making, or in other words pretend to
    > the outside world that their commits sprung full-formed in utter
    > perfection into their git repository. Certain large public projects
    > demand this, others demand smushing all work into one large commit, and
    > still others do not care.
    >
    > A good reason to hide the sausage making is if you feel you may be
    > cherry-picking commits a lot (though this too is often a sign of bad
    > workflow). Having one or a small number of commits to pick is much
    > easier than having to find one commit here, one there, and half of this
    > other one. The latter approach makes your problem much much harder and
    > typically will lead to merge conflicts when the donor branch is finally
    > merged in.
    >
    > Another good reason is to ensure each commit compiles and/or passes
    > regression tests, and represents a different easily understood concept
    > (important for archeology). The former allows git-bisect to chose any
    > commit and have a good chance of that commit doing something useful, and
    > the latter allows for easy change review, understanding, and
    > cherry-picking.

    This is a nice overview of the motivation, but I think it misses one of the main reasons we clean up patches in git.git: code review.

Well, I said "change review" instead of "code review". I added the word "code" specifically, and I'll add some wording on why a clean history matters for code review. I already touched on people who wanted to bisect.
    By cleaning up patches into single, logical changes that build on one another, and which don't individually regress (i.e., they are always moving towards some desirable common endpoint), the author is writing a chronological story not of what happened, but what _should_ happen, with the intent that the audience (i.e., reviewers) are convinced that the change is the right thing to do.

I'll add this paragraph as well.

    > [Do periodic maintenance]
    >
    > Compact your repo (git gc --aggressive)
    >
    > This will removed outdated dangling objects (after the two+ week grace
    > period). It will also compress any loose objects git has added since
    > your last gc. git will run gc automatically after certain commands, but
    > doing a manual --aggressive will save space and speed git operations.

    Most people shouldn't be using "--aggressive".

I'll add `git gc` as an intermediate stage and take wording from the manual to run `git gc --aggressive` every few hundred changesets. I suppose it all depends on your definition of the period in periodic maintenance.

    > [section Miscellaneous "don't"s]
    >
    > create very large repositories (when possible)
    >
    > Git can be slow in the face of large repositories. There are
    > git-config options that can help. pack.threads=1 pack.deltaCacheSize=1
    > pack.windowMemory=512m core.packedGitWindowSize=16m
    > core.packedGitLimit=128m. Other likely ones exist.

    It might help to qualify "big" here. ... I think big here is probably getting into 100K-200K files (where the time to stat() files becomes noticeable); commits are probably not relevant (because git is usually good at only looking at recent bits of history for most operations); and packfiles above 1G or so start to get cumbersome (mostly because of the I/O on a full repack; but then you should consider marking a pack as .keep).

    But those numbers are just pulled out of a hat based on the last few years. Your OS, your hardware, and your expectations make a huge difference in what seems reasonable.
That was why I didn't mention any specific limits. However, since you were kind enough to provide some, I will include them. I will also add that my suggested configuration values are only needed if you are experiencing memory pressure on packing.

    Your config recommendations seem mostly related to relieving memory pressure for packing (at the expense of making the pack a lot slower).

Very true, that was the problem I was running into. I will specifically make that comment. I'll make a wild recommendation about sizing these variables, which I'd certainly accept corrections to or advice on. Specifically the next sentence:

----------------------------------------------------------------------
My gut tells me that sizing ("deltaCacheSize" + "windowMemory" + min("core.bigFileThreshold[512m]", TheSizeOfTheLargestObject)) * "threads" to be around *half* the amount of RAM you can dedicate to running `git gc` will optimize your packing experience, but I will be the first to admit that I made up that formula based on a very few samples and it could be drastically wrong.
----------------------------------------------------------------------

    I don't think you should need to adjust core.packedGitWindowSize or core.packedGitLimit at all.

Well, certainly git takes up a ton (specifically double or just over 1GB additional) more RAM during gc with them unset, and caused some limited swapping of other processes (but no thrashing). However, the real question is, did it take more time? It did, but the amount of added time was about 3% and thus probably well under my test accuracy.

-Seth Robertson
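The arithmetic in the sizing guess above can be written out as a small calculation. This is a sketch only: the function name `pack_memory_estimate_mib` and the sample numbers are invented for illustration, all sizes are assumed to be in MiB, and the formula itself is the admittedly unverified rule of thumb from the message, not anything git computes.

```shell
#!/bin/sh
# Sketch of the rule of thumb quoted above:
#   (deltaCacheSize + windowMemory + min(bigFileThreshold, largest object)) * threads
# All arguments are sizes in MiB except the thread count. Hypothetical helper.
pack_memory_estimate_mib() {
    delta_cache=$1
    window_memory=$2
    big_file_threshold=$3
    largest_object=$4
    threads=$5

    # min(bigFileThreshold, size of the largest object)
    if [ "$largest_object" -lt "$big_file_threshold" ]; then
        per_object=$largest_object
    else
        per_object=$big_file_threshold
    fi

    echo $(( (delta_cache + window_memory + per_object) * threads ))
}

# Made-up example: 256 MiB delta cache, 512 MiB window memory, 512 MiB
# big-file threshold, a 100 MiB largest object, and 2 threads.
estimate=$(pack_memory_estimate_mib 256 512 512 100 2)
echo "estimated packing footprint: ${estimate} MiB"

# The guess says this should be about half of the RAM set aside for gc.
echo "RAM to dedicate to git gc:   $(( estimate * 2 )) MiB"
```

Read the other way around, the same numbers tell you how far to lower the pack settings when the RAM budget is fixed.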
* Re: Announcing 3 git docs: Best Practices, fixing mistakes, post-production editing
From: Junio C Hamano @ 2012-02-29 1:00 UTC
To: Seth Robertson; +Cc: git

Just a few I noticed that are dubious to be in a document that is meant to describe "best practices".

"Do commit early and often"
---------------------------

 * "Personally ... history of this repository!". That looks somewhat out of place when you are trying to document "best practices".

"Don't panic"
-------------

 * As we never "auto-stash", anything that is on stash is by definition what the user deliberately placed, just like a commit on a branch that the user may have forgotten. So it is strange to count it as one of the three places that "lost" commit may be hiding. If you make it four and add "a branch you might have forgotten" to the mix, it would make a bit more sense, though.

 * The example command line for gitk passes --all and also everything from "log -g" output, which should be OK for toy history, but wouldn't be such a good idea when you can expect tons of data from "log -g".

   Doesn't "gitk" itself accept -g these days?

 * Lost and found

   Why "git ls-tree -r"? Doesn't "git show" work equally well?

   Also, the name of the hash we happen to use to produce the "object name" is "SHA-1", so either of these two are fine, but do not say "SHA" (throughout the document).

"On Sausage Making"
-------------------

 * The description of "downside" shows a bias against efforts to strive for useful history, and also shows ignorance of the true motivation behind such discipline. It is _not_ blame or ego.
   It is all about leaving a history others can later use to understand _why_ the code became the way it is now, to make it less likely for others to break it.

   If I were writing this, I would either remove that one paragraph altogether, or tone it down dramatically. There is a short-term downside that you would be spending time on perfecting the history instead of advancing the tip of the branch, especially when you know the tree at the tip of the perfected history will be identical to the tip of the messy history you currently have. If you plan to leave the project in a month or so and will never look back, that is totally wasted effort as maintaining the result will be other people's problem. But if you are planning to be involved in the project for a longer haul, the time and effort is worth spending to make less-than-useful history into useful one.

"Do keep up to date"
--------------------

 * You explained in "Do choose a workflow" section that different workflows suit different projects. It would read better to rephrase this paragraph in which you are admitting that not everybody agrees with your "pull --rebase". Instead of saying "but they should agree with me", it would be more useful to say in what workflow the workflow elements such as "pull --rebase" you advocate in this section are suited (you do not have to say in what other workflow they are inappropriate).

I stopped reading at this point, but will look at the rest some other day. Thanks for a fun reading.
* Re: Announcing 3 git docs: Best Practices, fixing mistakes, post-production editing
From: Seth Robertson @ 2012-03-04 19:20 UTC
To: Junio C Hamano; +Cc: git

In message <7v399uxxkq.fsf@alter.siamese.dyndns.org>, Junio C Hamano writes:

    Just a few I noticed that are dubious to be in a document that is meant to describe "best practices".

Thanks for the comments. I will incorporate most of them and certainly thought hard about all of them.

    "Don't panic"
    -------------

     * As we never "auto-stash", anything that is on stash is by definition what the user deliberately placed, just like a commit on a branch that the user may have forgotten. So it is strange to count it as one of the three places that "lost" commit may be hiding. If you make it four and add "a branch you might have forgotten" to the mix, it would make a bit more sense, though.

I do. That was the next bullet "misplaced". I also expand on this a bit during the second document about finding and fixing mistakes.

     * The example command line for gitk passes --all and also everything from "log -g" output, which should be OK for toy history, but wouldn't be such a good idea when you can expect tons of data from "log -g".

My reasoning is that the live/referenced history provides context. Seeing a series of commits going back in time is nice and all, but knowing that at some point it branched from some particular still-referenced branch allows you to concentrate only on the commits that were "lost" (abandoned/replaced/etc), lets you have a better idea on whether those commits are relevant, and perhaps you will even see similar commits nearby on a still referenced branch.

Yes, for projects with dozens of simultaneously active branches it may cause information overload.
Ideally there would be an easy way to only have gitk show relevant branches without a lot of work. Right now, the only way I can think of is to find the --contains of the first referenced parent of the unreferenced commits and then pick the closest named branch to display using some algorithm.

I'll also suggest using `git log -Sfoo -g` in addition to my current alternate suggestion of looking at the reflog directly.

Anyway, someone managing dozens of branches should know what --all does and that they can remove it.

       Doesn't "gitk" itself accept -g these days?

My gitk (1.7.9.2) accepts -g but doesn't show the reflog.

     * Lost and found

       Why "git ls-tree -r"? Doesn't "git show" work equally well?

I find the added information of ls-tree more useful since you can more easily examine the contents/blobs of the tree.

git show    | git ls-tree -r
------------|--------------------------------------------------------------
tree 51e4   |
            |
A           | 100644 blob e900b1c81c65dc52463027be827c1418fc7ff505    A
asdf/       | 100644 blob 8b137891791fe96927ad78e64b0aad7bded08bdc    asdf/a
x           | 100644 blob e900b1c81c65dc52463027be827c1418fc7ff505    x

    "On Sausage Making"
    -------------------

     * The description of "downside" shows a bias against efforts to strive for useful history, and also shows ignorance of the true motivation behind such discipline. It is _not_ blame or ego.

       It is all about leaving a history others can later use to understand _why_ the code became the way it is now, to make it less likely for others to break it.

I have included that last sentence in the argument for creating a perfected history. I personally believe that there are many contexts in which a perfected history is critical, but I also feel there are many cases where it is entirely overkill, which is why I talk about both sides of the issue.
But I think it important enough that I made it one of the three things I mention in the title of the document (perfect later) *and* I wrote the third document describing how someone might actually go about the process.

    "Do keep up to date"
    --------------------

     * You explained in "Do choose a workflow" section that different workflows suit different projects. ... it would be more useful to say in what workflow the workflow elements such as "pull --rebase" you advocate in this section are suited (you do not have to say in what other workflow they are inappropriate).

In the pull --rebase section, I spend one short paragraph talking about why I think it is a good idea and four providing arguments against it. In my opinion, rebase should always be used when it is possible, and I did specifically mark it as my opinion and note that people disagree with me. I think I did about as well as I can presenting the negative side, but if you have more specific arguments against rebase, I'll be happy to include them. Perhaps it will even change my stance about using rebase.

-Seth Robertson
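The `git show` versus `git ls-tree -r` comparison discussed in the message above can be reproduced in a scratch repository. This is a minimal sketch: the temporary repository and the file contents are made up (only the paths A, asdf/a, and x follow the table in the message), so the object names printed will differ from the ones quoted there.

```shell
#!/bin/sh
# Sketch: compare "git show" and "git ls-tree -r" on a tree object.
# The temp repo and file contents are hypothetical.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example User"
git config user.email "user@example.invalid"

mkdir asdf
echo same > A
echo other > asdf/a
echo same > x
git add A asdf/a x
git commit -q -m "sample tree"

tree=$(git rev-parse 'HEAD^{tree}')

# git show on a tree prints a "tree <name>" header and the top-level
# entry names only; the subdirectory appears as "asdf/".
shown=$(git show "$tree")

# git ls-tree -r recurses into subdirectories and prints mode, type,
# object name, and path for every blob.
listed=$(git ls-tree -r "$tree")

printf '%s\n\n%s\n' "$shown" "$listed"

# With the object name from ls-tree, a single blob can be inspected directly.
blob=$(printf '%s\n' "$listed" | awk '$4 == "asdf/a" {print $3}')
content=$(git cat-file -p "$blob")
echo "asdf/a contains: $content"
```

The same two commands work on a dangling tree reported by `git fsck`; only the way the object name is obtained changes.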
* Re: Announcing 3 git docs: Best Practices, fixing mistakes, post-production editing
From: Junio C Hamano @ 2012-03-04 23:26 UTC
To: Seth Robertson; +Cc: git

Seth Robertson <in-gitvger@baka.org> writes:

> In message <7v399uxxkq.fsf@alter.siamese.dyndns.org>, Junio C Hamano writes:
>     Just a few I noticed that are dubious to be in a document that is meant to
>     describe "best practices".
> ...
>     "Don't panic"
>     -------------
>
>      * As we never "auto-stash", anything that is on stash is by definition
>        what the user deliberately placed, just like a commit on a branch that
>        the user may have forgotten. So it is strange to count it as one of the
>        three places that "lost" commit may be hiding. If you make it four and
>        add "a branch you might have forgotten" to the mix, it would make a bit
>        more sense, though.
>
> I do.

You don't. You say "There are THREE places where "lost" changes can be hiding" and list these three things, not four.

>     "Do keep up to date"
>     --------------------
>
>      * You explained in "Do choose a workflow" section that different workflows
>        suit different projects. ... it
>        would be more useful to say in what workflow the workflow elements
>        such as "pull --rebase" you advocate in this section are suited (you do
>        not have to say in what other workflow they are inappropriate).
>
> In the pull --rebase section, I spend one short paragraph talking
> about why I think it is a good idea and four providing arguments
> against it. In my opinion,...

I do not know if you have updated the version seen on the web since the review comments, but I was merely suggesting that saying "what I recommend here may not be desirable for some workflows" without spelling out what these workflows are would be less helpful to readers than being more explicit, i.e. "these suggestions are good for this and that workflows".
This section by nature of what is discussed is bound to be incomplete and will not be "universal truth", as no "universal truth" exists. Letting the users know upfront for what kind of workflows these are good suggestions will help them to decide if the recommendations are applicable to them.