* git-diff opens too many files?
@ 2006-11-20 10:12 Nguyen Thai Ngoc Duy
2006-11-20 15:20 ` Johannes Schindelin
2006-11-20 17:00 ` Linus Torvalds
0 siblings, 2 replies; 7+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2006-11-20 10:12 UTC (permalink / raw)
To: git
I got this error in a quite big (in files) repository:
error: open("vnexpress.net/Suc-khoe/2001/04/3B9AF976"): Too many open
files in system
fatal: cannot hash vnexpress.net/Suc-khoe/2001/04/3B9AF976
The repository contained about 67.000 files and probably all were modified.
git version 1.4.4.rc1.g2bba
--
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: git-diff opens too many files? 2006-11-20 10:12 git-diff opens too many files? Nguyen Thai Ngoc Duy @ 2006-11-20 15:20 ` Johannes Schindelin 2006-11-20 15:32 ` Nguyen Thai Ngoc Duy 2006-11-20 17:00 ` Linus Torvalds 1 sibling, 1 reply; 7+ messages in thread From: Johannes Schindelin @ 2006-11-20 15:20 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git Hi, On Mon, 20 Nov 2006, Nguyen Thai Ngoc Duy wrote: > I got this error in a quite big (in files) repository: > error: open("vnexpress.net/Suc-khoe/2001/04/3B9AF976"): Too many open > files in system > fatal: cannot hash vnexpress.net/Suc-khoe/2001/04/3B9AF976 > > The repository contained about 67.000 files and probably all were modified. > git version 1.4.4.rc1.g2bba What was the command line? "git diff" really is a wrapper around different programs... Ciao, Dscho ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git-diff opens too many files? 2006-11-20 15:20 ` Johannes Schindelin @ 2006-11-20 15:32 ` Nguyen Thai Ngoc Duy 2006-11-20 15:48 ` Johannes Schindelin 0 siblings, 1 reply; 7+ messages in thread From: Nguyen Thai Ngoc Duy @ 2006-11-20 15:32 UTC (permalink / raw) To: Johannes Schindelin; +Cc: git On 11/20/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Hi, > > On Mon, 20 Nov 2006, Nguyen Thai Ngoc Duy wrote: > > > I got this error in a quite big (in files) repository: > > error: open("vnexpress.net/Suc-khoe/2001/04/3B9AF976"): Too many open > > files in system > > fatal: cannot hash vnexpress.net/Suc-khoe/2001/04/3B9AF976 > > > > The repository contained about 67.000 files and probably all were modified. > > git version 1.4.4.rc1.g2bba > > What was the command line? "git diff" really is a wrapper around different > programs... "git diff" from working directory with no argument. The real command is git diff |LANG=C grep '^-'|LANG=C grep -v '^--'|LANG=C sort|LANG=C uniq -c (I squeezed blank lines, so I wanted to check that it just did that, nothing else) > Ciao, > Dscho > > -- ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git-diff opens too many files? 2006-11-20 15:32 ` Nguyen Thai Ngoc Duy @ 2006-11-20 15:48 ` Johannes Schindelin 0 siblings, 0 replies; 7+ messages in thread From: Johannes Schindelin @ 2006-11-20 15:48 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: git Hi, On Mon, 20 Nov 2006, Nguyen Thai Ngoc Duy wrote: > git diff |LANG=C grep '^-'|LANG=C grep -v '^--'|LANG=C sort|LANG=C uniq -c I looked in the code, but nothing jumps into my eyes. There is a place where files are mmap()ed, but AFAICT they are unmmap()ed just after diffing. Could you run it with strace and find unclosed open calls? Ciao, Dscho ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git-diff opens too many files? 2006-11-20 10:12 git-diff opens too many files? Nguyen Thai Ngoc Duy 2006-11-20 15:20 ` Johannes Schindelin @ 2006-11-20 17:00 ` Linus Torvalds 2006-11-20 19:51 ` Junio C Hamano 2006-11-20 21:02 ` Junio C Hamano 1 sibling, 2 replies; 7+ messages in thread From: Linus Torvalds @ 2006-11-20 17:00 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy, Junio C Hamano; +Cc: Git Mailing List On Mon, 20 Nov 2006, Nguyen Thai Ngoc Duy wrote: > > I got this error in a quite big (in files) repository: > error: open("vnexpress.net/Suc-khoe/2001/04/3B9AF976"): Too many open > files in system Ok, "too many open files in system" is ENFILE - you haven't run out of file descriptors in _one_ process, but you've exceeded the total number of file descriptors in the whole system. Which is not because we forget to close() something, but because we're keeping file descriptors busy another way. > fatal: cannot hash vnexpress.net/Suc-khoe/2001/04/3B9AF976 Hmm. We keep files mmap'ed in "git diff" for possibly too long. What happens is that we mmap a file that we want to diff when we start the whole thing, and keep it mapped over the whole diff session, because we're potentially going to need to compare it against other files (ie rename detection etc). And then we unmap it only at the end (in "diff_flush()" -> "diff_free_filepair()" -> "diff_free_filespec_data()"). And that's normally great, and means that we don't need to worry about the file data (we map it once, and can keep it in memory), but yeah, if you have thousands of files changed, you'll have thousands of mappings. And each one will have a pointer to a "struct file" inside the kernel. What OS/distro is this? Normally, you shouldn't have that low a limit on number of files open, but we do end up potentially opening thousands. For example, under Linux, you can do this: # in one terminal window, do: while : ; do cat /proc/sys/fs/file-nr ; sleep 1; done # in another one: cd linux-repo git ls-files '*.c' | xargs touch git diff and if it looks anything like mine, it could be: 2464 0 349662 2464 0 349662 2464 0 349662 * 5920 0 349662 ** 7616 0 349662 *** 9024 0 349662 **** 10944 0 349662 2464 0 349662 2464 0 349662 (see how the numnber of active files grows by thousands). Anyway, there's two possible solutions: - simply make sure that you can have that many open files. If it's a Linux system, just increase the value of the file /proc/sys/fs/file-max, and you're done. Of course, if you're not the admin of the box, you may need to ask somebody else to do it for you.. - we could try to make git not keep them mmap'ed for the whole time. Junio? This is your speciality, I'm not sure how painful it would be to unmap and remap on demand.. (or switch it to some kind of "keep the last <n> mmaps active" kind of thing to avoid having thousands and thousands of mmaps active). One simple thing that might be worth it is to simply _not_ use mmap() at all for small files. If a file is less than 1kB, it might be better to do a malloc() and a read() - partly because it avoids having tons of file descriptors, but partly because it's also more efficient from a virtual memory usage perspective (not that you're probably very likely to ever really hit that problem in practice). Nguyen - that "use malloc+read" thing might be a quick workaround, but only if you have tons of _small_ files (and if you can't easily just increase file-max). ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git-diff opens too many files? 2006-11-20 17:00 ` Linus Torvalds @ 2006-11-20 19:51 ` Junio C Hamano 2006-11-20 21:02 ` Junio C Hamano 1 sibling, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2006-11-20 19:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Nguyen Thai Ngoc Duy Linus Torvalds <torvalds@osdl.org> writes: > Anyway, there's two possible solutions: > > - simply make sure that you can have that many open files. > > If it's a Linux system, just increase the value of the file > /proc/sys/fs/file-max, and you're done. Of course, if you're not the > admin of the box, you may need to ask somebody else to do it for you.. > > - we could try to make git not keep them mmap'ed for the whole time. > > Junio? This is your speciality, I'm not sure how painful it would be to > unmap and remap on demand.. (or switch it to some kind of "keep the last > <n> mmaps active" kind of thing to avoid having thousands and thousands of > mmaps active). 60,000 files 1kB each is 60MB which is a peanuts these days, but 10kB each would be already nontrivial burden on 32-bit (20% under 3+1 split), so even if we do the "read in small files instead of mapping" we would need diff_unpopulate_filespec() calls. I think after diffcore_rename() runs, the data in filespec is used only once during final textual diff generation. We would use once more before diff generation if diffcore_pickaxe() is in use. These codepaths begin with diff_populate_filespec(), so if we unpopulate them after diffcore_rename() runs nobody would notice (other than performance degradation and strace showing us reading the same thing twice). The diffcore_rename() matrix code expects all filespecs involved can be populated at the same time, but it should not be too hard to change it to keep one dst and all src candidates populated but others dropped if space gets tight. I need to look at the code for the details. But Nguyen's command line did not have -M; I think the filespecs are populated only during the text generation in that case, so the above would not help him while it would be a worthwhile change. Because there is _no_ processing that comes after textual diff generation that looks at the data, I think diff_flush_patch() after calling run_diff() can unpopulate the data from the filepair *p before returning without harming performance. I think diff_flush_stat() and diff_flush_checkdiff() would have the same issue, though. Ideally these should be able to do their processing while the main textual diff holds the data in memory for its processing but that is currently not the way the code is structured. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: git-diff opens too many files? 2006-11-20 17:00 ` Linus Torvalds 2006-11-20 19:51 ` Junio C Hamano @ 2006-11-20 21:02 ` Junio C Hamano 1 sibling, 0 replies; 7+ messages in thread From: Junio C Hamano @ 2006-11-20 21:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: git, Nguyen Thai Ngoc Duy Linus Torvalds <torvalds@osdl.org> writes: > Junio? This is your speciality, I'm not sure how painful it would be to > unmap and remap on demand.. (or switch it to some kind of "keep the last > <n> mmaps active" kind of thing to avoid having thousands and thousands of > mmaps active). > > One simple thing that might be worth it is to simply _not_ use mmap() at > all for small files. If a file is less than 1kB, it might be better to do > a malloc() and a read() - partly because it avoids having tons of file > descriptors, but partly because it's also more efficient from a virtual > memory usage perspective (not that you're probably very likely to ever > really hit that problem in practice). > > Nguyen - that "use malloc+read" thing might be a quick workaround, but > only if you have tons of _small_ files (and if you can't easily just > increase file-max). So here is a lunch-time hack to get Nguyen unstuck. -- >8 -- [PATCH] diff.c: avoid mmap() of small files in populate_filespec() This would hopefully behave better in VM usage. It is not a real fix if you are dealing with truly huge diff, in which case we would have to LRU out the data in memory for filespecs that are not used in immediate future. Signed-off-by: Junio C Hamano <junkio@cox.net> --- diff --git a/diff.c b/diff.c index 3315378..af50b6f 100644 --- a/diff.c +++ b/diff.c @@ -1294,11 +1294,22 @@ int diff_populate_filespec(struct diff_f fd = open(s->path, O_RDONLY); if (fd < 0) goto err_empty; - s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0); + if (s->size < MINIMUM_MMAP) { + s->data = xmalloc(s->size); + s->should_free = 1; + if (xread(fd, s->data, s->size) != s->size) { + free(s->data); + goto err_empty; + } + } + else { + s->data = mmap(NULL, s->size, PROT_READ, MAP_PRIVATE, + fd, 0); + s->should_munmap = 1; + } close(fd); if (s->data == MAP_FAILED) goto err_empty; - s->should_munmap = 1; } else { char type[20]; diff --git a/diffcore.h b/diffcore.h index 2249bc2..f129aa0 100644 --- a/diffcore.h +++ b/diffcore.h @@ -21,6 +21,7 @@ #define DEFAULT_MERGE_SCORE 36000 /* maximum for break-merge to happen 60%) */ #define MINIMUM_BREAK_SIZE 400 /* do not break a file smaller than this */ +#define MINIMUM_MMAP 4096 /* do not mmap a file smaller than this */ struct diff_filespec { unsigned char sha1[20]; ^ permalink raw reply related [flat|nested] 7+ messages in thread
end of thread, other threads:[~2006-11-20 21:02 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-11-20 10:12 git-diff opens too many files? Nguyen Thai Ngoc Duy 2006-11-20 15:20 ` Johannes Schindelin 2006-11-20 15:32 ` Nguyen Thai Ngoc Duy 2006-11-20 15:48 ` Johannes Schindelin 2006-11-20 17:00 ` Linus Torvalds 2006-11-20 19:51 ` Junio C Hamano 2006-11-20 21:02 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox