* large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-04  9:20 UTC
  To: git

Hi folks,

when adding files which are larger than available physical memory,
git performs very slowly. Perhaps it has to do with git's mmap()ing
the whole file. Is there any way to do it w/o mmap (hoping that
might perform a bit better)?

cu
--
----------------------------------------------------------------------
 Enrico Weigelt, metux IT service -- http://www.metux.de/
 phone: +49 36207 519931   email: weigelt@metux.de
 mobile: +49 151 27565287  icq: 210169427  skype: nekrad666
----------------------------------------------------------------------
 Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
----------------------------------------------------------------------

* Re: large files and low memory
  From: Shawn Pearce
  Date: 2010-10-04 18:05 UTC
  To: weigelt, git

On Mon, Oct 4, 2010 at 2:20 AM, Enrico Weigelt <weigelt@metux.de> wrote:
>
> when adding files which are larger than available physical memory,
> git performs very slowly. Perhaps it has to do with git's mmap()ing
> the whole file. Is there any way to do it w/o mmap (hoping that
> might perform a bit better)?

The mmap() isn't the problem. It's the allocation of a buffer that is
larger than the file in order to hold the result of deflating the file
before it gets written to disk. When the file is bigger than physical
memory, the kernel has to page in parts of the file as well as swap in
and out parts of that allocated buffer to hold the deflated file.

This is a known area in Git where big files aren't handled well.

--
Shawn.

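For illustration, the single-buffer pattern being described looks roughly
like this (a sketch assuming plain zlib, not git's actual code; the helper
name is made up, and real files bigger than zlib's uInt counters would need
more care). The point is the output buffer sized via deflateBound(), i.e.
at least as large as the input:

  #include <stdlib.h>
  #include <zlib.h>

  /* Sketch: deflate an entire (possibly mmap'd) file in one pass into one
   * malloc'd buffer.  For an N-byte file this needs roughly another N bytes
   * of output buffer on top of the N mapped bytes -- the allocation that
   * starts thrashing once N exceeds physical memory. */
  static void *deflate_whole(void *src, unsigned long len, unsigned long *out_len)
  {
      z_stream s = {0};
      unsigned long bound;
      void *out;

      deflateInit(&s, Z_DEFAULT_COMPRESSION);
      bound = deflateBound(&s, len);      /* >= len for incompressible data */
      out = malloc(bound);

      s.next_in = src;
      s.avail_in = len;
      s.next_out = out;
      s.avail_out = bound;
      deflate(&s, Z_FINISH);              /* one pass over the whole file */

      *out_len = s.total_out;
      deflateEnd(&s);
      return out;                         /* caller writes it out, then frees */
  }
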
* Re: large files and low memory
  From: Joshua Jensen
  Date: 2010-10-04 18:24 UTC
  To: Shawn Pearce
  Cc: weigelt, git

----- Original Message -----
From: Shawn Pearce
Date: 10/4/2010 12:05 PM
> On Mon, Oct 4, 2010 at 2:20 AM, Enrico Weigelt <weigelt@metux.de> wrote:
>> when adding files which are larger than available physical memory,
>> git performs very slowly. Perhaps it has to do with git's mmap()ing
>> the whole file. Is there any way to do it w/o mmap (hoping that
>> might perform a bit better)?
> The mmap() isn't the problem. It's the allocation of a buffer that is
> larger than the file in order to hold the result of deflating the file
> before it gets written to disk. When the file is bigger than physical
> memory, the kernel has to page in parts of the file as well as swap in
> and out parts of that allocated buffer to hold the deflated file.
>
> This is a known area in Git where big files aren't handled well.

As a curiosity, I've always done streaming decompression with zlib using
minimal buffer sizes (64k, perhaps). I'm sure there is a good reason why
Git doesn't do this (delta application?). Do you know what it is?

Josh

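The kind of streaming loop being alluded to, as a rough sketch (plain zlib,
a made-up helper, not git code): inflating between two stdio streams with
only a pair of fixed 64 KiB buffers, no matter how large the object is.

  #include <stdio.h>
  #include <zlib.h>

  #define CHUNK (64 * 1024)

  /* Sketch: stream-inflate from 'in' to 'out' using two fixed buffers. */
  static int inflate_stream(FILE *in, FILE *out)
  {
      unsigned char ibuf[CHUNK], obuf[CHUNK];
      z_stream s = {0};
      int ret = Z_OK;

      if (inflateInit(&s) != Z_OK)
          return -1;

      while (ret != Z_STREAM_END) {
          s.avail_in = fread(ibuf, 1, CHUNK, in);
          if (ferror(in) || s.avail_in == 0)
              break;                      /* read error or truncated input */
          s.next_in = ibuf;

          do {                            /* drain everything zlib produced */
              s.next_out = obuf;
              s.avail_out = CHUNK;
              ret = inflate(&s, Z_NO_FLUSH);
              if (ret != Z_OK && ret != Z_STREAM_END)
                  goto done;              /* corrupt stream or out of memory */
              fwrite(obuf, 1, CHUNK - s.avail_out, out);
          } while (s.avail_out == 0 && ret != Z_STREAM_END);
      }
  done:
      inflateEnd(&s);
      return ret == Z_STREAM_END ? 0 : -1;
  }
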
* Re: large files and low memory
  From: Shawn Pearce
  Date: 2010-10-04 18:57 UTC
  To: Joshua Jensen
  Cc: weigelt, git

On Mon, Oct 4, 2010 at 11:24 AM, Joshua Jensen <jjensen@workspacewhiz.com> wrote:
>> On Mon, Oct 4, 2010 at 2:20 AM, Enrico Weigelt <weigelt@metux.de> wrote:
>>>
>>> when adding files which are larger than available physical memory,
>>> git performs very slowly.
>>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk. ...
>> This is a known area in Git where big files aren't handled well.
>
> As a curiosity, I've always done streaming decompression with zlib using
> minimal buffer sizes (64k, perhaps). I'm sure there is a good reason why
> Git doesn't do this (delta application?). Do you know what it is?

Laziness. Git originally assumed it would only be used for smaller
source files written by humans. It's easier to write the code as a
single malloc'd buffer than to stream it. We'd like to fix it, but
it's harder than it sounds. Today we copy the file into a buffer
before we deflate and compute the SHA-1, as this prevents us from
getting into a consistency error when the file is modified between
these two stages.

--
Shawn.

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  0:59 UTC
  To: git

* Shawn Pearce <spearce@spearce.org> wrote:

> Laziness. Git originally assumed it would only be used for smaller
> source files written by humans. It's easier to write the code as a
> single malloc'd buffer than to stream it. We'd like to fix it, but
> it's harder than it sounds. Today we copy the file into a buffer
> before we deflate and compute the SHA-1, as this prevents us from
> getting into a consistency error when the file is modified between
> these two stages.

hmm, perhaps copy it to a temporary file if it's too large?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  7:41 UTC
  To: git

* Enrico Weigelt <weigelt@metux.de> wrote:

<snip>

Found another possible bottleneck: git-commit seems to scan through
a lot of files. Shouldn't it just create a commit object from the
current index and update the head?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Matthieu Moy
  Date: 2010-10-05  8:01 UTC
  To: git

Enrico Weigelt <weigelt@metux.de> writes:

> * Enrico Weigelt <weigelt@metux.de> wrote:
>
> <snip>
>
> Found another possible bottleneck: git-commit seems to scan through
> a lot of files. Shouldn't it just create a commit object from the
> current index and update the head?

git commit will show what's being committed (the output of "git commit
--dry-run") in your editor, hence it needs to compute that.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  8:17 UTC
  To: git

* Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:

> git commit will show what's being committed (the output of "git commit
> --dry-run") in your editor, hence it needs to compute that.

hmm, is there any way to get around this?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Alex Riesen
  Date: 2010-10-05 11:29 UTC
  To: weigelt, git

On Tue, Oct 5, 2010 at 10:17, Enrico Weigelt <weigelt@metux.de> wrote:
> * Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
>
>> git commit will show what's being committed (the output of "git commit
>> --dry-run") in your editor, hence it needs to compute that.
>
> hmm, is there any way to get around this?

Try "git commit -q -uno". This should skip creation of the summary in
the commit message and the lookup for untracked files. This will
somewhat speed things up.

* Re: large files and low memory
  From: Matthieu Moy
  Date: 2010-10-05 11:38 UTC
  To: Alex Riesen
  Cc: weigelt, git

Alex Riesen <raa.lkml@gmail.com> writes:

> On Tue, Oct 5, 2010 at 10:17, Enrico Weigelt <weigelt@metux.de> wrote:
>> * Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
>>
>>> git commit will show what's being committed (the output of "git commit
>>> --dry-run") in your editor, hence it needs to compute that.
>>
>> hmm, is there any way to get around this?
>
> Try "git commit -q -uno". This should skip creation of the summary in
> the commit message and the lookup for untracked files.

To avoid including the summary, the option would be --no-status (-q
makes commit less verbose on stdout, not in COMMIT_EDITMSG).

But

  strace -fe lstat64 git commit -uno --no-status -q

still shows an lstat64 for each tracked file in my working tree (even
when using -m to avoid launching the editor). I don't know if this is
intended, or just that nobody cared enough to optimize this.

--
Matthieu Moy
http://www-verimag.imag.fr/~moy/

* Re: large files and low memory
  From: Nguyen Thai Ngoc Duy
  Date: 2010-10-05 11:55 UTC
  To: Matthieu Moy
  Cc: Alex Riesen, weigelt, git, Junio C Hamano

On Tue, Oct 5, 2010 at 6:38 PM, Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
> Alex Riesen <raa.lkml@gmail.com> writes:
>
>> On Tue, Oct 5, 2010 at 10:17, Enrico Weigelt <weigelt@metux.de> wrote:
>>> * Matthieu Moy <Matthieu.Moy@grenoble-inp.fr> wrote:
>>>
>>>> git commit will show what's being committed (the output of "git commit
>>>> --dry-run") in your editor, hence it needs to compute that.
>>>
>>> hmm, is there any way to get around this?
>>
>> Try "git commit -q -uno". This should skip creation of the summary in
>> the commit message and the lookup for untracked files.
>
> To avoid including the summary, the option would be --no-status (-q
> makes commit less verbose on stdout, not in COMMIT_EDITMSG).
>
> But
>
>   strace -fe lstat64 git commit -uno --no-status -q
>
> still shows an lstat64 for each tracked file in my working tree (even
> when using -m to avoid launching the editor). I don't know if this is
> intended, or just that nobody cared enough to optimize this.

I assume you do git-commit with no index modification at all. The
index refresh part dates back to 2888605 (builtin-commit: fix
partial-commit support - 2007-11-18). The commit message does not tell
why the index refresh is needed (for the summary stat maybe?). If so,
then we can skip refreshing if -q is given.
--
Duy

* Re: large files and low memory
  From: Junio C Hamano
  Date: 2010-10-05 16:42 UTC
  To: Nguyen Thai Ngoc Duy
  Cc: Matthieu Moy, Alex Riesen, weigelt, git, Junio C Hamano

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> I assume you do git-commit with no index modification at all. The
> index refresh part dates back to 2888605 (builtin-commit: fix
> partial-commit support - 2007-11-18). The commit message does not tell
> why the index refresh is needed (for the summary stat maybe?). If so,
> then we can skip refreshing if -q is given.

Most likely to give a clean index to the post-commit hook.

* Re: large files and low memory
  From: Nguyen Thai Ngoc Duy
  Date: 2010-10-05 10:13 UTC
  To: weigelt, git

On Tue, Oct 5, 2010 at 2:41 PM, Enrico Weigelt <weigelt@metux.de> wrote:
> * Enrico Weigelt <weigelt@metux.de> wrote:
>
> <snip>
>
> Found another possible bottleneck: git-commit seems to scan through
> a lot of files. Shouldn't it just create a commit object from the
> current index and update the head?

You mean a lot of stat()? There is no way to avoid that unless you set
assume-unchanged bits. Or you could use
write-tree/commit-tree/update-ref directly.
--
Duy

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 19:12 UTC
  To: Nguyen Thai Ngoc Duy
  Cc: weigelt, git

On Tue, 5 Oct 2010, Nguyen Thai Ngoc Duy wrote:

> On Tue, Oct 5, 2010 at 2:41 PM, Enrico Weigelt <weigelt@metux.de> wrote:
>> * Enrico Weigelt <weigelt@metux.de> wrote:
>>
>> <snip>
>>
>> Found another possible bottleneck: git-commit seems to scan through
>> a lot of files. Shouldn't it just create a commit object from the
>> current index and update the head?
>
> You mean a lot of stat()? There is no way to avoid that unless you set
> assume-unchanged bits. Or you could use
> write-tree/commit-tree/update-ref directly.

Avoiding memory exhaustion is also going to help a lot, as the stat()
information will remain cached instead of requiring disk access. Just a
guess given $subject.

Nicolas

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-04 18:58 UTC
  To: Shawn Pearce
  Cc: weigelt, git

Shawn Pearce wrote:

> The mmap() isn't the problem. It's the allocation of a buffer that is
> larger than the file in order to hold the result of deflating the file
> before it gets written to disk.

Wasn't this already fixed, at least in some cases?

commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
Author: Nicolas Pitre <nico@fluxnic.net>
Date:   Sat Feb 20 23:27:31 2010 -0500

    sha1_file: don't malloc the whole compressed result when writing out objects

    There is no real advantage to malloc the whole output buffer and
    deflate the data in a single pass when writing loose objects. That is
    like only 1% faster while using more memory, especially with large
    files where memory usage is far more. It is best to deflate and write
    the data out in small chunks reusing the same memory instead.

    For example, using 'git add' on a few large files averaging 40 MB ...

    Before:
    21.45user 1.10system 0:22.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+828040outputs (0major+142640minor)pagefaults 0swaps

    After:
    21.50user 1.25system 0:22.76elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+828040outputs (0major+104408minor)pagefaults 0swaps

    While the runtime stayed relatively the same, the number of minor page
    faults went down significantly.

    Signed-off-by: Nicolas Pitre <nico@fluxnic.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

* Re: large files and low memory
  From: Shawn Pearce
  Date: 2010-10-04 19:11 UTC
  To: Jonathan Nieder
  Cc: weigelt, git

On Mon, Oct 4, 2010 at 11:58 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk.
>
> Wasn't this already fixed, at least in some cases?
>
> commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
> Author: Nicolas Pitre <nico@fluxnic.net>
> Date:   Sat Feb 20 23:27:31 2010 -0500
>
>     sha1_file: don't malloc the whole compressed result when writing out objects

This change only removes the deflate copy. But due to the SHA-1
consistency issue I alluded to earlier, I think we're still making a
full copy of the file in memory before we SHA-1 it or deflate it. So
Nico halved the memory usage, but we're still using 1x the size of the
file rather than ~2x.

--
Shawn.

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-04 19:16 UTC
  To: Shawn Pearce
  Cc: weigelt, git

Shawn Pearce wrote:

> This change only removes the deflate copy. But due to the SHA-1
> consistency issue I alluded to earlier, I think we're still making a
> full copy of the file in memory before we SHA-1 it or deflate it.

Hmm, I _think_ we still use mmap for that (which is why 748af44c needs
to compare the sha1 before and after).

But

 1) a one-pass calculation would presumably be a little (5%?) faster
 2) if there are smudge/clean filters or autocrlf involved, the
    cleaned-up file is backed by swap and this all becomes moot.

* Re: large files and low memory
  From: Nguyen Thai Ngoc Duy
  Date: 2010-10-05 10:59 UTC
  To: Jonathan Nieder
  Cc: Shawn Pearce, weigelt, git

On Tue, Oct 5, 2010 at 2:16 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>
>> This change only removes the deflate copy. But due to the SHA-1
>> consistency issue I alluded to earlier, I think we're still making a
>> full copy of the file in memory before we SHA-1 it or deflate it.
>
> Hmm, I _think_ we still use mmap for that (which is why 748af44c needs
> to compare the sha1 before and after).

I just tried valgrind massif on a 200MB file with master. It used
~270kb of heap. I haven't tested, but I believe git-checkout will keep
the whole inflated copy in memory, so git-add alone does not help much.
--
Duy

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 20:17 UTC
  To: Jonathan Nieder
  Cc: Shawn Pearce, weigelt, git

On Mon, 4 Oct 2010, Jonathan Nieder wrote:

> Shawn Pearce wrote:
>
>> This change only removes the deflate copy. But due to the SHA-1
>> consistency issue I alluded to earlier, I think we're still making a
>> full copy of the file in memory before we SHA-1 it or deflate it.
>
> Hmm, I _think_ we still use mmap for that (which is why 748af44c needs
> to compare the sha1 before and after).
>
> But
>
>  1) a one-pass calculation would presumably be a little (5%?) faster

You can't do a one-pass calculation. The first one is required to
compute the SHA1 of the file being added, and if that corresponds to an
object that we already have, then the operation stops right there as
there is actually nothing to do. The second pass is to deflate the
data and recompute the SHA1 to make sure what we deflated and wrote out
is still the same data.

In the case of big files, what we need to do is to stream the file data
in, compute the SHA1 and deflate it, in order to stream it out into a
temporary file, then rename it according to the final SHA1. This would
allow Git to work with big files, but of course it won't be possible to
know whether the object corresponding to the file is already known until
all the work has been done, possibly just to throw it away. But
normally big files are the minority.

Nicolas

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-05 20:34 UTC
  To: Nicolas Pitre
  Cc: Shawn Pearce, weigelt, git

Nicolas Pitre wrote:

> You can't do a one-pass calculation. The first one is required to
> compute the SHA1 of the file being added, and if that corresponds to an
> object that we already have, then the operation stops right there as
> there is actually nothing to do.

Ah. Thanks for the reminder.

> In the case of big files, what we need to do is to stream the file data
> in, compute the SHA1 and deflate it, in order to stream it out into a
> temporary file, then rename it according to the final SHA1. This would
> allow Git to work with big files, but of course it won't be possible to
> know whether the object corresponding to the file is already known until
> all the work has been done, possibly just to throw it away.

To make sure I understand correctly: are you suggesting that for big
files we should skip the first pass?

I suppose that makes sense: for small files, using a patch application
tool to reach a postimage that matches an existing object is something
git historically needed to expect, but for typical big files:

 - once you've computed the SHA1, you've already invested a noticeable
   amount of time.
 - emailing patches around is difficult, making "git am" etc. less
   important.
 - hopefully git or zlib can notice when files are uncompressible,
   making the deflate not cost so much in that case.

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 21:11 UTC
  To: Jonathan Nieder
  Cc: Shawn Pearce, weigelt, git

On Tue, 5 Oct 2010, Jonathan Nieder wrote:

> Nicolas Pitre wrote:
>
>> You can't do a one-pass calculation. The first one is required to
>> compute the SHA1 of the file being added, and if that corresponds to an
>> object that we already have, then the operation stops right there as
>> there is actually nothing to do.
>
> Ah. Thanks for the reminder.
>
>> In the case of big files, what we need to do is to stream the file data
>> in, compute the SHA1 and deflate it, in order to stream it out into a
>> temporary file, then rename it according to the final SHA1. This would
>> allow Git to work with big files, but of course it won't be possible to
>> know whether the object corresponding to the file is already known until
>> all the work has been done, possibly just to throw it away.
>
> To make sure I understand correctly: are you suggesting that for big
> files we should skip the first pass?

For big files we need a totally separate code path that processes the
file data in small chunks at 'git add' time, using a loop containing
read()+SHA1sum()+deflate()+write(). Then, if the SHA1 matches an
existing object we delete the temporary output file, otherwise we
rename it as a valid object. No CRLF, no smudge filters, no diff, no
deltas, just plain storage of huge objects, based on the value of the
core.bigFileThreshold config option. Same thing on the checkout path:
a simple loop to read()+inflate()+write() in small chunks. That's the
only sane way to kinda support big files with Git.

> I suppose that makes sense: for small files, using a patch application
> tool to reach a postimage that matches an existing object is something
> git historically needed to expect, but for typical big files:
>
>  - once you've computed the SHA1, you've already invested a noticeable
>    amount of time.
>  - emailing patches around is difficult, making "git am" etc. less
>    important.
>  - hopefully git or zlib can notice when files are uncompressible,
>    making the deflate not cost so much in that case.

Emailing is out of the question. We're talking file sizes in the
hundreds of megabytes and above here. So yes, simply computing the
SHA1 is a significant cost, given that you are going to trash your page
cache in the process already, so better to pay the price of deflating
it at the same time even if it turns out to be unnecessary.

Nicolas

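The loop being proposed would look roughly like this (a rough sketch only,
assuming OpenSSL's SHA1 functions and plain zlib; the helper and its
parameters are made up for illustration, this is not git's code): read the
file in fixed-size chunks, feed each chunk to both the SHA-1 and the
deflater, stream the deflated bytes into a temporary file, and only learn
the object name at the very end.

  #include <stdio.h>
  #include <unistd.h>
  #include <zlib.h>
  #include <openssl/sha.h>

  #define CHUNK (128 * 1024)

  /* Sketch of a streaming add path: constant memory, one read pass.
   * The caller renames the temp file into the object store if the
   * resulting SHA-1 names a new object, or deletes it otherwise. */
  static int stream_blob(int in_fd, off_t size, int tmp_fd,
                         unsigned char sha1[SHA_DIGEST_LENGTH])
  {
      unsigned char ibuf[CHUNK], obuf[CHUNK];
      char hdr[32];
      int hdrlen, ret;
      ssize_t n;
      SHA_CTX ctx;
      z_stream z = {0};

      /* Loose objects hash and store a "blob <size>\0" header first. */
      hdrlen = snprintf(hdr, sizeof(hdr), "blob %lu", (unsigned long)size) + 1;

      SHA1_Init(&ctx);
      deflateInit(&z, Z_DEFAULT_COMPRESSION);

      SHA1_Update(&ctx, hdr, hdrlen);
      z.next_in = (unsigned char *)hdr;
      z.avail_in = hdrlen;

      for (;;) {
          if (!z.avail_in) {              /* previous chunk fully consumed */
              n = read(in_fd, ibuf, CHUNK);
              if (n < 0)
                  return -1;
              if (!n)
                  break;                  /* end of input, go flush */
              SHA1_Update(&ctx, ibuf, n);
              z.next_in = ibuf;
              z.avail_in = n;
          }
          z.next_out = obuf;
          z.avail_out = CHUNK;
          deflate(&z, Z_NO_FLUSH);
          if (write(tmp_fd, obuf, CHUNK - z.avail_out) < 0)
              return -1;
      }

      do {                                /* flush whatever zlib still holds */
          z.next_out = obuf;
          z.avail_out = CHUNK;
          ret = deflate(&z, Z_FINISH);
          if (write(tmp_fd, obuf, CHUNK - z.avail_out) < 0)
              return -1;
      } while (ret != Z_STREAM_END);

      deflateEnd(&z);
      SHA1_Final(sha1, &ctx);
      return 0;
  }
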
* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  0:57 UTC
  To: git

* Jonathan Nieder <jrnieder@gmail.com> wrote:
> Shawn Pearce wrote:
>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk.
>
> Wasn't this already fixed, at least in some cases?
>
> commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
> Author: Nicolas Pitre <nico@fluxnic.net>
> Date:   Sat Feb 20 23:27:31 2010 -0500

I guess I'll have to do an update.

But: the latest tag (1.7.3.1) doesn't build:

    CC read-cache.o
read-cache.c: In function `fill_stat_cache_info':
read-cache.c:73: structure has no member named `st_ctim'
read-cache.c:74: structure has no member named `st_mtim'
read-cache.c: In function `read_index_from':
read-cache.c:1334: structure has no member named `st_mtim'
read-cache.c: In function `write_index':
read-cache.c:1614: structure has no member named `st_mtim'
make: *** [read-cache.o] Fehler 1

Is my libc too old?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Ævar Arnfjörð Bjarmason
  Date: 2010-10-05  1:07 UTC
  To: weigelt, git

On Tue, Oct 5, 2010 at 00:57, Enrico Weigelt <weigelt@metux.de> wrote:
> * Jonathan Nieder <jrnieder@gmail.com> wrote:
>> Shawn Pearce wrote:
>>
>>> The mmap() isn't the problem. It's the allocation of a buffer that is
>>> larger than the file in order to hold the result of deflating the file
>>> before it gets written to disk.
>>
>> Wasn't this already fixed, at least in some cases?
>>
>> commit 9892bebafe0865d8f4f3f18d60a1cfa2d1447cd7 (tags/v1.7.0.2~11^2~1)
>> Author: Nicolas Pitre <nico@fluxnic.net>
>> Date:   Sat Feb 20 23:27:31 2010 -0500
>
> I guess I'll have to do an update.
>
> But: the latest tag (1.7.3.1) doesn't build:
>
>     CC read-cache.o
> read-cache.c: In function `fill_stat_cache_info':
> read-cache.c:73: structure has no member named `st_ctim'
> read-cache.c:74: structure has no member named `st_mtim'
> read-cache.c: In function `read_index_from':
> read-cache.c:1334: structure has no member named `st_mtim'
> read-cache.c: In function `write_index':
> read-cache.c:1614: structure has no member named `st_mtim'
> make: *** [read-cache.o] Fehler 1
>
> Is my libc too old?

Those lines are accessing members called st_ctime, i.e. with an "e" at
the end, but your errors just report "st_ctim". What gives?

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-05  1:10 UTC
  To: git
  Cc: Enrico Weigelt

Enrico Weigelt wrote:

>     CC read-cache.o
> read-cache.c: In function `fill_stat_cache_info':
> read-cache.c:73: structure has no member named `st_ctim'
> read-cache.c:74: structure has no member named `st_mtim'
> read-cache.c: In function `read_index_from':
> read-cache.c:1334: structure has no member named `st_mtim'
> read-cache.c: In function `write_index':
> read-cache.c:1614: structure has no member named `st_mtim'
> make: *** [read-cache.o] Fehler 1
>
> Is my libc too old?

What platform are you on? You probably need USE_ST_TIMESPEC; if so,
please send a makefile patch so the next person trying it doesn't need
to worry about it.

Also, please don't destroy the cc lists; it makes it hard for people
with some mail setups (e.g., mine) to notice when you've replied to
them.

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  7:35 UTC
  To: git

* Jonathan Nieder <jrnieder@gmail.com> wrote:

> What platform are you on?

GNU/Linux. glibc-2.25

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Jonathan Nieder
  Date: 2010-10-05 13:47 UTC
  To: git
  Cc: Enrico Weigelt, Ævar Arnfjörð Bjarmason

Enrico Weigelt wrote:
> * Jonathan Nieder <jrnieder@gmail.com> wrote:
>> What platform are you on?
>
> GNU/Linux. glibc-2.25

Hmm, I've heard of glib 2.25 but never glibc 2.25. :)

  $ /lib/libc.so.6 | head -1
  GNU C Library (Debian EGLIBC 2.11.2-6) stable release version 2.11.2, by Roland McGrath et al.

* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05  0:50 UTC
  To: git

* Shawn Pearce <spearce@spearce.org> wrote:

> The mmap() isn't the problem. It's the allocation of a buffer that is
> larger than the file in order to hold the result of deflating the file
> before it gets written to disk. When the file is bigger than physical
> memory, the kernel has to page in parts of the file as well as swap in
> and out parts of that allocated buffer to hold the deflated file.

What are the access patterns of these memory areas? Perhaps madvise()
could help?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/

* Re: large files and low memory
  From: Nicolas Pitre
  Date: 2010-10-05 19:06 UTC
  To: Enrico Weigelt
  Cc: git

On Tue, 5 Oct 2010, Enrico Weigelt wrote:

> * Shawn Pearce <spearce@spearce.org> wrote:
>
>> The mmap() isn't the problem. It's the allocation of a buffer that is
>> larger than the file in order to hold the result of deflating the file
>> before it gets written to disk. When the file is bigger than physical
>> memory, the kernel has to page in parts of the file as well as swap in
>> and out parts of that allocated buffer to hold the deflated file.
>
> What are the access patterns of these memory areas?

Perfectly linear.

> Perhaps madvise() could help?

Perhaps.

Nicolas

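For reference, the kind of hint being discussed looks roughly like this
(an illustrative sketch with a made-up helper, not a patch; where such a
call would actually belong in git is exactly the open question here):
telling the kernel a mapping will be read strictly sequentially, so pages
behind the read cursor become cheap eviction candidates.

  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Sketch: map a file read-only and hint a linear access pattern so the
   * kernel can read ahead aggressively and drop already-consumed pages
   * instead of letting them crowd out other memory. */
  static const void *map_sequential(const char *path, size_t *len)
  {
      struct stat st;
      void *map;
      int fd = open(path, O_RDONLY);

      if (fd < 0)
          return NULL;
      if (fstat(fd, &st) < 0) {
          close(fd);
          return NULL;
      }

      map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      close(fd);                        /* mapping stays valid after close */
      if (map == MAP_FAILED)
          return NULL;

      madvise(map, st.st_size, MADV_SEQUENTIAL);
      *len = st.st_size;
      return map;
  }
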
* Re: large files and low memory
  From: Enrico Weigelt
  Date: 2010-10-05 22:51 UTC
  To: git

* Nicolas Pitre <nico@fluxnic.net> wrote:

>>> The mmap() isn't the problem. It's the allocation of a buffer that is
>>> larger than the file in order to hold the result of deflating the file
>>> before it gets written to disk. When the file is bigger than physical
>>> memory, the kernel has to page in parts of the file as well as swap in
>>> and out parts of that allocated buffer to hold the deflated file.
>>
>> What are the access patterns of these memory areas?
>
> Perfectly linear.

In this case, I wonder why my machine goes into thrashing so easily
(P3 w/ 256MB RAM). Seems the mmu/paging code doesn't recognize that the
previously-used pages can be kicked off quickly ;-o Perhaps I should
talk to the kernel folks.

>> Perhaps madvise() could help?
>
> Perhaps.

hmm, so we should try it ;-p  Where'd be the right place to add it?

cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/
