* git repack command on larger pack file
From: Sivakumar Selvam @ 2015-10-26  5:57 UTC
To: git

Hi,

   I ran git repack on a single large repository, abc.git, whose pack
file is 34 GB. It usually takes 20-25 minutes on my server to complete
the repacking. During repacking I noticed that disk usage was high, so
I thought of splitting the pack file into 4 GB chunks. I used the
following command to do the repacking:

    git repack -A -b -d -q --depth=50 --window=10 abc.git

I then added --max-pack-size=4g to the above command and ran it again
to split the pack file:

    git repack -A -b -d -q --depth=50 --window=10 --max-pack-size=4g abc.git

When it finished, I found 12 pack files of 4 GB each, 48 GB in total.
Now my disk usage has increased by 14 GB. I ran it again to check the
performance, but the size stayed at 48 GB and the repacking took 35
minutes longer. Why this issue?

If we split a larger pack file, repacking takes more time and the pack
files use more disk space. Any thoughts on why this happens?

Thanks,
Sivakumar Selvam.

* Re: git repack command on larger pack file
From: Junio C Hamano @ 2015-10-26  6:41 UTC
To: Sivakumar Selvam; +Cc: git

Sivakumar Selvam <gerritcode@gmail.com> writes:

>    I ran git repack on a single large repository, abc.git, whose pack
> file is 34 GB. It usually takes 20-25 minutes on my server to complete
> the repacking. During repacking I noticed that disk usage was high, so
> I thought of splitting the pack file into 4 GB chunks. I used the
> following command to do the repacking:
>     git repack -A -b -d -q --depth=50 --window=10 abc.git
>
> I then added --max-pack-size=4g to the above command and ran it again
> to split the pack file:
>     git repack -A -b -d -q --depth=50 --window=10 --max-pack-size=4g abc.git
>
> When it finished, I found 12 pack files of 4 GB each, 48 GB in total.
> Now my disk usage has increased by 14 GB. I ran it again to check the
> performance, but the size stayed at 48 GB and the repacking took 35
> minutes longer. Why this issue?

Hmmm, what is "this issue"?  I do not see anything surprising.

If you have N objects and run repack with window=10, you would
(roughly speaking, without taking the various optimizations we have and
bootstrap conditions into account) check each of these N objects
against 10 other objects to find a good delta base, no matter how big
your max pack-size is set.  And that takes the bulk of the time in the
repack process.

Also it has to write more data to disk (see below), it has to find a
good place to split, it has to adjust bookkeeping data at the pack
boundary, in general it has to do more, not less, to produce split
packs.  It would be surprising if it took less time.

Each pack by definition has to be self-sufficient; every delta in the
pack must have its base object in the same pack.

Now, imagine that an object (call it X) would have been expressed as
a delta derived from another object (call it Y) if you were
producing a single pack, and imagine that the pack has grown to be
4 GB big just before you write object X out.  The current pack
(which contains the base object Y already) needs to be closed and
then a new pack is opened.  Imagine how you would write X now into
that new pack.

You have to discard the deltified representation of X (which by
definition is much smaller, because it is an instruction to
reconstitute X given an object Y whose contents are very similar to
X) and write the base representation of X to the pack, because X can
no longer be expressed as a delta derived from Y.

That is why you would need to write more.

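The point about deltas losing their base at a pack boundary can be put
into rough numbers with a toy model. The sketch below is only an
illustration and is not how pack-objects decides anything: the function
name split_pack_total is made up, and it assumes that every object after
the first is a delta against the previous one, that all full objects
have the same size, and that all deltas have the same size.

```shell
#!/bin/sh
# Toy model (hypothetical, not git's real algorithm): every object after
# the first is stored as a delta against the previous object. When adding
# the next delta would push the current pack past the limit, the pack is
# closed, a new one is opened, and the object is written in full because
# its base stayed behind in the closed pack.
#
# usage:  split_pack_total COUNT FULL_SIZE DELTA_SIZE MAX_PACK_SIZE
# prints: "<total bytes written> <number of packs>"
split_pack_total() {
    count=$1 full=$2 delta=$3 limit=$4
    total=0 packs=1 current=0 i=0
    while [ "$i" -lt "$count" ]; do
        if [ "$i" -eq 0 ]; then
            size=$full                  # the very first object has no base
        elif [ $((current + delta)) -gt "$limit" ]; then
            packs=$((packs + 1))        # close the pack, open a new one
            current=0
            size=$full                  # base is in the old pack: no delta
        else
            size=$delta
        fi
        current=$((current + size))
        total=$((total + size))
        i=$((i + 1))
    done
    echo "$total $packs"
}
```

With ten 100-byte objects and 10-byte deltas, "split_pack_total 10 100 10
1000000" models the single-pack case and prints "190 1", while
"split_pack_total 10 100 10 130" prints "370 3": every new pack restarts
with a full object, so the split packs add up to more bytes.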
* Re: git repack command on larger pack file
From: Junio C Hamano @ 2015-10-26  7:11 UTC
To: Sivakumar Selvam; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Sivakumar Selvam <gerritcode@gmail.com> writes:
>
>> ... So
>> I thought of splitting the pack file into 4 GB chunks.
> ...
> Hmmm, what is "this issue"?  I do not see anything surprising.

While the explanation might have been enlightening, the knowledge
conveyed by the explanation by itself would not be of much practical
use, and enlightenment without practical use is never fun.

So let's do another tangent that may be more useful.

In many repositories, older parts of the history often hold the bulk
of objects that do not change, and it is wasteful to repack them
over and over.  If your project is at around v40.0 today, and it was
at around v36.0 6 months ago, for example, you may want to pack
everything that happened before v36.0 into a single pack just once,
pack them really well, and have your "repack" not touch that old
part of the history.

    $ git rev-list --objects v36.0 |
      git pack-objects --window=200 --depth=128 pack

would produce such a pack [*1*]

The standard output from the above pipeline will give you a 40-hex
string (e.g. 51c472761b4690a331c02c90ec364e47cca1b3ac, call it
$HEX), and in the current directory you will find two files,
pack-$HEX.pack and pack-$HEX.idx.

You can then do this:

    $ echo "v36.0 with W/D 200/128" >pack-$HEX.keep
    $ mv pack-$HEX.* .git/objects/pack/.
    $ git repack -a -d

A pack that has an accompanying .keep file is exempt from
repacking, so once you do this, your future "git repack" will only
repack objects that are not in the kept packs.


[Footnote]

*1* I won't say 200/128 gives you a good pack; you would need to
experiment.  In general, larger depth will result in smaller pack
but it will result in bigger overhead while you use the repository
every day.  Larger window will spend a lot of cycles while packing,
but will result in a smaller pack.

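The recipe above can be wrapped in a small function so that $HEX does
not have to be copied by hand. This is a minimal sketch and not part of
the original message: the function name make_keep_pack and its arguments
are made up, and as a simplification it asks pack-objects to write
straight into the repository's objects/pack directory instead of writing
to the current directory and moving the files afterwards. The git
commands themselves are the ones used in the message.

```shell
#!/bin/sh
# Hypothetical helper: pack everything reachable from REVISION into one
# pack, mark it with a .keep file, then repack the rest.
# Run it from inside the repository.
#
# usage: make_keep_pack REVISION [WINDOW] [DEPTH]
make_keep_pack() {
    rev=$1 window=${2:-200} depth=${3:-128}

    # Refuse to continue if REVISION does not name a commit.
    git rev-parse --verify -q "$rev^{commit}" >/dev/null || return 1
    packdir=$(git rev-parse --git-dir)/objects/pack || return 1

    # pack-objects writes <base>-$hex.pack and <base>-$hex.idx and prints
    # $hex on its standard output; here <base> is "$packdir/pack".
    hex=$(git rev-list --objects "$rev" |
          git pack-objects -q --window="$window" --depth="$depth" \
              "$packdir/pack") || return 1
    [ -n "$hex" ] || return 1

    # The .keep file exempts this pack from future repacks; its content
    # is only a note for humans.
    echo "$rev with W/D $window/$depth" >"$packdir/pack-$hex.keep"

    # Repack everything that is not in a kept pack.
    git repack -a -d -q || return 1
    echo "$hex"
}
```

For example, "make_keep_pack v36.0" would follow the message's 200/128
settings, and "make_keep_pack v36.0 50 50" would try a cheaper packing;
which values give a good pack still needs experimenting, as the footnote
says.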
* Re: git repack command on larger pack file
From: Sivakumar Selvam @ 2015-10-27  2:04 UTC
To: git

Junio C Hamano <gitster <at> pobox.com> writes:

>
> Junio C Hamano <gitster <at> pobox.com> writes:
>
> > Sivakumar Selvam <gerritcode <at> gmail.com> writes:
> >
> >> ... So
> >> I thought of splitting the pack file into 4 GB chunks.
> > ...
> > Hmmm, what is "this issue"?  I do not see anything surprising.
>
> While the explanation might have been enlightening, the knowledge
> conveyed by the explanation by itself would not be of much practical
> use, and enlightenment without practical use is never fun.
>
> So let's do another tangent that may be more useful.
>
> In many repositories, older parts of the history often hold the bulk
> of objects that do not change, and it is wasteful to repack them
> over and over.  If your project is at around v40.0 today, and it was
> at around v36.0 6 months ago, for example, you may want to pack
> everything that happened before v36.0 into a single pack just once,
> pack them really well, and have your "repack" not touch that old
> part of the history.
>
>     $ git rev-list --objects v36.0 |
>       git pack-objects --window=200 --depth=128 pack
>
> would produce such a pack [*1*]
>
> The standard output from the above pipeline will give you a 40-hex
> string (e.g. 51c472761b4690a331c02c90ec364e47cca1b3ac, call it
> $HEX), and in the current directory you will find two files,
> pack-$HEX.pack and pack-$HEX.idx.
>
> You can then do this:
>
>     $ echo "v36.0 with W/D 200/128" >pack-$HEX.keep
>     $ mv pack-$HEX.* .git/objects/pack/.
>     $ git repack -a -d
>
> A pack that has an accompanying .keep file is exempt from
> repacking, so once you do this, your future "git repack" will only
> repack objects that are not in the kept packs.
>
> [Footnote]
>
> *1* I won't say 200/128 gives you a good pack; you would need to
> experiment.  In general, larger depth will result in smaller pack
> but it will result in bigger overhead while you use the repository
> every day.  Larger window will spend a lot of cycles while packing,
> but will result in a smaller pack.
>

Hi Junio,

   When the repack finished, I found 12 pack files of 4 GB each, 48 GB
in total. I then ran the same git repack command again, removing only
the --max-pack-size= parameter, and the resulting single pack file is
66 GB.

    git repack -A -b -d -q --depth=50 --window=10 abc.git

Now the total size of abc.git has become 66 GB. Initially it was
34 GB; after using --max-pack-size=4g it became 48 GB. When we removed
the --max-pack-size=4g parameter and tried to create a single pack file
again, it became 66 GB.

It looks like once we run git repack with multiple pack files, we
can't get back to the original size.

Thanks,
Sivakumar Selvam.

* Re: git repack command on larger pack file
From: Jeff King @ 2015-10-27 23:44 UTC
To: Sivakumar Selvam; +Cc: git

On Tue, Oct 27, 2015 at 02:04:23AM +0000, Sivakumar Selvam wrote:

>    When the repack finished, I found 12 pack files of 4 GB each, 48 GB
> in total. I then ran the same git repack command again, removing only
> the --max-pack-size= parameter, and the resulting single pack file is
> 66 GB.
>
>     git repack -A -b -d -q --depth=50 --window=10 abc.git
>
> Now the total size of abc.git has become 66 GB. Initially it was
> 34 GB; after using --max-pack-size=4g it became 48 GB. When we removed
> the --max-pack-size=4g parameter and tried to create a single pack file
> again, it became 66 GB.
>
> It looks like once we run git repack with multiple pack files, we
> can't get back to the original size.

Git tries to take some shortcuts when repacking: if two objects are in
the same pack but not deltas, it will not consider making deltas out of
them. The logic is we would already have tried that while making the
original pack. But of course when you are doing weird things with the
packing parameters, that is not always a good assumption.

When doing experiments like this, add "-f" to your repack command-line
to avoid reusing deltas. The result should be much smaller (at the
expense of more CPU time to do the repack).

I'd also recommend increasing "--window" if you can afford the extra CPU
during the repack. It can often produce smaller packs. And it has less
cost than you might think (e.g., window=20 is not twice as expensive as
window=10, because the work to access the objects is cached).

You can also increase --depth, but I have never found it to be
particularly helpful for decreasing size[1].

-Peff

[1] This is all theory, and I don't know how well git actually finds
    such deltas, but it is probably better to have a dense tree of
    deltas rather than long chains. If you have a chain of N objects
    and want to add object N+1 to it, you are probably not much worse
    off to base it on object N-1, creating a "fork" at N. The resulting
    objects should be less expensive to access for subsequent operations
    (as any time you want the Nth object, you have to resolve all parts
    of the chain, so shorter chains are better, and the delta cache
    is more likely to get a hit on that N-1 object).

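The advice about "-f" can be tried with a small before/after
measurement. The helper names below (pack_size_kib, repack_fresh) and
the default window and depth of 50 are assumptions chosen for
illustration, not values recommended in the thread; "git count-objects
-v" and "git repack -f" are the real commands involved.

```shell
#!/bin/sh
# Hypothetical helpers: report the on-disk pack size before and after a
# repack that recomputes deltas instead of reusing existing ones.
# Run them from inside the repository.

pack_size_kib() {
    # "git count-objects -v" prints a "size-pack: <KiB>" line.
    git count-objects -v | sed -n 's/^size-pack: //p'
}

# usage: repack_fresh [WINDOW] [DEPTH]
repack_fresh() {
    window=${1:-50} depth=${2:-50}
    before=$(pack_size_kib)
    # -f makes repack pass --no-reuse-delta to pack-objects, so deltas
    # are searched for again; this costs more CPU time.
    git repack -a -d -f -q --window="$window" --depth="$depth" || return 1
    after=$(pack_size_kib)
    echo "size-pack before: $before KiB, after: $after KiB"
}
```

Running "repack_fresh 10 50" and then "repack_fresh 20 50" on a copy of
the repository is one way to see what a larger window buys for a given
history before committing to it on the server.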
* Re: git repack command on larger pack file
From: Junio C Hamano @ 2015-10-28  6:23 UTC
To: Jeff King; +Cc: Sivakumar Selvam, git

Jeff King <peff@peff.net> writes:

> Git tries to take some shortcuts when repacking: if two objects are in
> the same pack but not deltas, it will not consider making deltas out of
> them. The logic is we would already have tried that while making the
> original pack. But of course when you are doing weird things with the
> packing parameters, that is not always a good assumption.

Yup, that is

    http://thread.gmane.org/gmane.comp.version-control.git/16223/focus=16267

> [1] This is all theory, and I don't know how well git actually finds
>     such deltas, but it is probably better to have a dense tree of
>     deltas rather than long chains. If you have a chain of N objects
>     and want to add object N+1 to it, you are probably not much worse
>     off to base it on object N-1, creating a "fork" at N.

Yes, your guess is perfectly correct here, and indeed we did
extensive work along that line in 2006/2007.  For an example, see

    http://thread.gmane.org/gmane.comp.version-control.git/51949/focus=52003

The histogram "verify-pack -v" produces was in fact done primarily
in order to make it easy to check the distribution of delta depth.

Thanks.

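The histogram mentioned above can be pulled out of each pack with a
short loop. This is a sketch under assumptions: the function name
delta_depth_histogram is made up, and the grep pattern relies on the
summary lines that "git verify-pack -v" prints in the C locale ("non
delta: N objects" and "chain length = D: N objects").

```shell
#!/bin/sh
# Hypothetical helper: print the delta chain length histogram of every
# pack in the current repository.
#
# usage: delta_depth_histogram
delta_depth_histogram() {
    packdir=$(git rev-parse --git-dir)/objects/pack || return 1
    for idx in "$packdir"/pack-*.idx; do
        [ -f "$idx" ] || continue       # the glob matched nothing
        echo "== $idx"
        # verify-pack -v lists every object, then the histogram lines;
        # LC_ALL=C keeps those lines untranslated so grep can match them.
        LC_ALL=C git verify-pack -v "$idx" |
            grep -E '^(non delta|chain length)'
    done
}
```

Comparing the output before and after a repack with a different --depth
shows how the chain lengths are distributed in each case.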
* Re: git repack command on larger pack file
From: Junio C Hamano @ 2015-10-28  6:47 UTC
To: Jeff King; +Cc: Sivakumar Selvam, git

Junio C Hamano <gitster@pobox.com> writes:

>> [1] This is all theory, and I don't know how well git actually finds
>>     such deltas, but it is probably better to have a dense tree of
>>     deltas rather than long chains. If you have a chain of N objects
>>     and want to add object N+1 to it, you are probably not much worse
>>     off to base it on object N-1, creating a "fork" at N.
>
> Yes, your guess is perfectly correct here, and indeed we did
> extensive work along that line in 2006/2007.  For an example, see
>
>     http://thread.gmane.org/gmane.comp.version-control.git/51949/focus=52003

And here is another, which is probably one of the most important
threads on pack-objects, before the bitmap was introduced:

    http://thread.gmane.org/gmane.comp.version-control.git/20056/focus=20134

* Re: git repack command on larger pack file
From: Philip Oakley @ 2015-10-27  8:52 UTC
To: Junio C Hamano, Sivakumar Selvam; +Cc: git

From: "Junio C Hamano" <gitster@pobox.com>
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Sivakumar Selvam <gerritcode@gmail.com> writes:
>>
>>> ... So
>>> I thought of splitting the pack file into 4 GB chunks.
>> ...
>> Hmmm, what is "this issue"?  I do not see anything surprising.
>
> While the explanation might have been enlightening, the knowledge
> conveyed by the explanation by itself would not be of much practical
> use, and enlightenment without practical use is never fun.
>
> So let's do another tangent that may be more useful.
>
> In many repositories, older parts of the history often hold the bulk
> of objects that do not change, and it is wasteful to repack them
> over and over.  If your project is at around v40.0 today, and it was
> at around v36.0 6 months ago, for example, you may want to pack
> everything that happened before v36.0 into a single pack just once,
> pack them really well, and have your "repack" not touch that old
> part of the history.
>
>     $ git rev-list --objects v36.0 |
>       git pack-objects --window=200 --depth=128 pack
>
> would produce such a pack [*1*]
>
> The standard output from the above pipeline will give you a 40-hex
> string (e.g. 51c472761b4690a331c02c90ec364e47cca1b3ac, call it
> $HEX), and in the current directory you will find two files,
> pack-$HEX.pack and pack-$HEX.idx.
>
> You can then do this:
>
>     $ echo "v36.0 with W/D 200/128" >pack-$HEX.keep
>     $ mv pack-$HEX.* .git/objects/pack/.
>     $ git repack -a -d
>
> A pack that has an accompanying .keep file is exempt from
> repacking, so once you do this, your future "git repack" will only
> repack objects that are not in the kept packs.
>

I had a quick look at the man pages and couldn't find an explanation
(such as this one) that explains the purpose of such .keep packs,
highlights their use, and shows how to create them.

Could this form the basis of a short section on .keep packs? (Or did I
miss something?)

>
>
> [Footnote]
>
> *1* I won't say 200/128 gives you a good pack; you would need to
> experiment.  In general, larger depth will result in smaller pack
> but it will result in bigger overhead while you use the repository
> every day.  Larger window will spend a lot of cycles while packing,
> but will result in a smaller pack.
>
--
Philip

* Re: git repack command on larger pack file
From: Jeff King @ 2015-10-27 23:47 UTC
To: Junio C Hamano; +Cc: Sivakumar Selvam, git

On Sun, Oct 25, 2015 at 11:41:23PM -0700, Junio C Hamano wrote:

> Also it has to write more data to disk (see below), it has to find a
> good place to split, it has to adjust bookkeeping data at the pack
> boundary, in general it has to do more, not less, to produce split
> packs.  It would be surprising if it took less time.

This may go without saying, but the main cost in the write is that we
have to zlib deflate the output.

I don't have any numbers at hand, but when I've benchmarked serving
fetches, it is often a balance game between CPU time spent on a more
aggressive delta search and CPU time that goes into deflating the
results of the search. Spending more CPU on the former may yield more
and smaller deltas which pay for themselves in time spent on the latter.

There's definitely a balance point, and it varies from repo to repo, and
even within repos from fetch to fetch. I wish I had better heuristics to
report, but it's an ongoing thing I'm exploring. :)

-Peff
