* win2k/cygwin cannot handle even moderately sized packs
@ 2006-11-07 11:02 Alex Riesen
2006-11-07 12:17 ` Noel Grandin
2006-11-13 12:45 ` Johannes Schindelin
0 siblings, 2 replies; 21+ messages in thread
From: Alex Riesen @ 2006-11-07 11:02 UTC (permalink / raw)
To: git
For me, it fails even on a 332MB pack:
$ git reset --hard 61bb7fcb
fatal: packfile .git/objects/pack/pack-ad37...pack cannot be mapped.
Instrumenting the code reveals that it fails at 348876870 bytes.
Strangely enough, a cygwin program which just reads that pack into
memory many times without freeing it gets up to 1395507480 bytes (1330MB).
If I replace the malloc (cygwin) with native VirtualAlloc(MEM_COMMIT)
it reports that "Not enough storage is available to process this command",
which is just ENOMEM, I think.
This is a 2GB machine with almost 1.3GB free, so I'm a bit confused
about what could go wrong here (besides what already is wrong).
Any ideas?
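A minimal sketch of such a probe (not the exact program used; the path
and the reporting are placeholders):

    #include <stdio.h>
    #include <stdlib.h>

    /* Repeatedly read a pack file into freshly malloc'd buffers without
     * freeing them, and report how far the process gets before malloc
     * or fread fails.  The path is a placeholder. */
    int main(void)
    {
        const char *path = "pack-to-test.pack";
        size_t total = 0;
        for (;;) {
            FILE *f = fopen(path, "rb");
            if (!f)
                return 1;
            fseek(f, 0, SEEK_END);
            long size = ftell(f);
            rewind(f);
            char *buf = malloc(size);
            if (!buf || fread(buf, 1, size, f) != (size_t)size) {
                printf("stopped after %lu bytes\n", (unsigned long)total);
                fclose(f);
                return 0;
            }
            fclose(f);
            total += size;
            /* deliberately never free(buf) */
        }
    }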
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 11:02 win2k/cygwin cannot handle even moderately sized packs Alex Riesen
@ 2006-11-07 12:17 ` Noel Grandin
2006-11-07 13:55 ` Alex Riesen
2006-11-13 12:45 ` Johannes Schindelin
1 sibling, 1 reply; 21+ messages in thread
From: Noel Grandin @ 2006-11-07 12:17 UTC (permalink / raw)
To: Alex Riesen; +Cc: git
Looking at
http://msdn2.microsoft.com/en-us/library/ms810627.aspx
it looks like
(a) Windows provides 2GB of address space to play with, and
(b) VirtualAlloc is constrained to allocating contiguous ranges of
memory within that 2GB address space.
So the problem is probably memory fragmentation.
You might have more joy if you allocated one HUGE chunk immediately on
startup to use for the pack, and then kept re-using that chunk.
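A sketch of that approach (sizes, protection flags and the reuse
strategy are illustrative only; this is not what git actually does):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Reserve one large contiguous range up front, before the
         * address space gets fragmented, then commit only what the
         * current pack needs and reuse the same range later. */
        SIZE_T reserve = 512u * 1024 * 1024;  /* 512 MB of address space */
        SIZE_T need    = 348876870;           /* the size that failed above */

        void *base = VirtualAlloc(NULL, reserve, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) {
            fprintf(stderr, "reserve failed: %lu\n", (unsigned long)GetLastError());
            return 1;
        }
        if (!VirtualAlloc(base, need, MEM_COMMIT, PAGE_READWRITE)) {
            fprintf(stderr, "commit failed: %lu\n", (unsigned long)GetLastError());
            return 1;
        }
        /* ... read the pack into 'base'; decommit/recommit for later packs ... */
        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }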
Alex Riesen wrote:
> For me, it fails even on 332Mb pack:
>
> $ git reset --hard 61bb7fcb
> fatal: packfile .git/objects/pack/pack-ad37...pack cannot be mapped.
>
> Instrumenting the code reveals that it fails on 348876870 bytes.
> Strangely enough, a cygwin program which just reads that pack
> many times without freeing the mem goes up to 1395507480 (1330Mb).
>
> If I replace the malloc (cygwin) with native VirtualAlloc(MEM_COMMIT)
> it reports that "Not enough storage is available to process this
> command",
> which is just ENOMEM, I think.
>
> This is a 2Gb machine, with almost 1.3Gb free, so I'm a bit confused,
> what could here go wrong (besides what already is wrong).
>
> Any ideas?
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 12:17 ` Noel Grandin
@ 2006-11-07 13:55 ` Alex Riesen
2006-11-07 15:50 ` Jakub Narebski
2006-11-08 19:22 ` Christopher Faylor
0 siblings, 2 replies; 21+ messages in thread
From: Alex Riesen @ 2006-11-07 13:55 UTC (permalink / raw)
To: Noel Grandin; +Cc: git
> So the problem is probably memory fragmentation.
probably.
> You might have more joy if you allocated one HUGE chunk immediately on
> startup to use for the pack, and then kept re-using that chunk.
Well, it is not _one_ chunk. The windows/cygwin abomin...combination
may take issue with this: it seems to copy the complete address space
at fork, which even for such small packs as I have here easily takes
the system down (yes, I tried it).
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 13:55 ` Alex Riesen
@ 2006-11-07 15:50 ` Jakub Narebski
2006-11-07 17:28 ` Alex Riesen
2006-11-08 19:22 ` Christopher Faylor
1 sibling, 1 reply; 21+ messages in thread
From: Jakub Narebski @ 2006-11-07 15:50 UTC (permalink / raw)
To: git
Alex Riesen wrote:
>> So the problem is probably memory fragmentation.
>
> probably.
>
>> You might have more joy if you allocated one HUGE chunk immediately on
>> startup to use for the pack, and then kept re-using that chunk.
>
> Well, it is not _one_ chunk. The windows/cygwin abomin...combination
> may take an issue with this: it seem to copy complete address space
> at fork, which even for such a small packs I have here takes system
> down lightly (yes, I tried it).
Perhaps planned mmapping only parts of packs would help there.
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 15:50 ` Jakub Narebski
@ 2006-11-07 17:28 ` Alex Riesen
2006-11-07 17:48 ` Shawn Pearce
0 siblings, 1 reply; 21+ messages in thread
From: Alex Riesen @ 2006-11-07 17:28 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git
> Perhaps planned mmapping only parts of packs would help there.
BTW, can I find this code somewhere to try it out?
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 17:28 ` Alex Riesen
@ 2006-11-07 17:48 ` Shawn Pearce
2006-11-07 18:13 ` Alex Riesen
0 siblings, 1 reply; 21+ messages in thread
From: Shawn Pearce @ 2006-11-07 17:48 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, git
Alex Riesen <raa.lkml@gmail.com> wrote:
> >Perhaps planned mmapping only parts of packs would help there.
>
> BTW, can I find this code somewhere to try it out?
The patches are on the mailing list archives somewhere around
Sept. 5th timeframe from me; as I recall we dropped them as they
didn't apply on top of Junio's 64 bit index changes (which were
reverted out of next anyway).
I was going to push my branch to a public mirror but it turns
out I deleted that branch. :-(
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 17:48 ` Shawn Pearce
@ 2006-11-07 18:13 ` Alex Riesen
2006-11-07 18:18 ` Shawn Pearce
0 siblings, 1 reply; 21+ messages in thread
From: Alex Riesen @ 2006-11-07 18:13 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jakub Narebski, Junio C Hamano, git
> The patches are on the mailing list archives somewhere around
> Sept. 5th timeframe from me; as I recall we dropped them as they
> didn't apply on top of Junio's 64 bit index changes (which were
> reverted out of next anyway).
I seem to be unable to find them. Does anyone still have the
patches/branch, please? Junio, you did sound interested?
(God, I wish I had paid attention then...)
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 18:13 ` Alex Riesen
@ 2006-11-07 18:18 ` Shawn Pearce
2006-11-07 18:26 ` Shawn Pearce
0 siblings, 1 reply; 21+ messages in thread
From: Shawn Pearce @ 2006-11-07 18:18 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, Junio C Hamano, git
Alex Riesen <raa.lkml@gmail.com> wrote:
> >The patches are on the mailing list archives somewhere around
> >Sept. 5th timeframe from me; as I recall we dropped them as they
> >didn't apply on top of Junio's 64 bit index changes (which were
> >reverted out of next anyway).
>
> I seem to be unable to find them. Does anyone still has the
> patches/branch please? Junio, you did sound interested?
> (God, I wish I have paid attention then...)
Junio definitely wants to implement at least something similar
to what I did; I just happened to do it at the wrong time. :-)
I keep saying I'll get back around to rewriting the patch (as the
affected regions of git have been heavily modified recently by Nico
and thus it doesn't apply) but I keep not finding the time.
Maybe that's because I was working on git-gui...
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 18:18 ` Shawn Pearce
@ 2006-11-07 18:26 ` Shawn Pearce
2006-11-07 18:56 ` Shawn Pearce
2006-11-07 19:27 ` Alex Riesen
0 siblings, 2 replies; 21+ messages in thread
From: Shawn Pearce @ 2006-11-07 18:26 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, Junio C Hamano, git
Shawn Pearce <spearce@spearce.org> wrote:
> Alex Riesen <raa.lkml@gmail.com> wrote:
> > >The patches are on the mailing list archives somewhere around
> > >Sept. 5th timeframe from me; as I recall we dropped them as they
> > >didn't apply on top of Junio's 64 bit index changes (which were
> > >reverted out of next anyway).
> >
> > I seem to be unable to find them. Does anyone still has the
> > patches/branch please? Junio, you did sound interested?
> > (God, I wish I have paid attention then...)
You can't find them because I never sent them. *sigh*
Too f'ing bad this .patch I created with format-patch doesn't say
what commits it was based on, 'cause I can't find anything it will
apply to. Or better, too f'ing bad I deleted that branch. *sigh*
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 18:26 ` Shawn Pearce
@ 2006-11-07 18:56 ` Shawn Pearce
2006-11-07 23:11 ` Alex Riesen
2006-11-07 19:27 ` Alex Riesen
1 sibling, 1 reply; 21+ messages in thread
From: Shawn Pearce @ 2006-11-07 18:56 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, Junio C Hamano, git
Shawn Pearce <spearce@spearce.org> wrote:
> Shawn Pearce <spearce@spearce.org> wrote:
> > Alex Riesen <raa.lkml@gmail.com> wrote:
> > > >The patches are on the mailing list archives somewhere around
> > > >Sept. 5th timeframe from me; as I recall we dropped them as they
> > > >didn't apply on top of Junio's 64 bit index changes (which were
> > > >reverted out of next anyway).
> > >
> > > I seem to be unable to find them. Does anyone still has the
> > > patches/branch please? Junio, you did sound interested?
> > > (God, I wish I have paid attention then...)
>
> You can't find them because I never sent them. *sigh*
>
> Too f'ing bad this .patch I created with format-patch doesn't say
> what commits it was based on, 'cause I can't find anything it will
> apply too. Or better, too f'ing bad I deleted that branch. *sigh*
This thread has now proven without any shadow of a doubt that I can
be an idiot sometimes.
In this case my patch didn't apply because it was 3rd in a series;
I had two of those patches but lacked the third (a simple cleanup
patch). Redoing that simple cleanup let everything apply.
I've pushed the changes out to repo.or.cz:
http://repo.or.cz/w/git/fastimport.git
in the window-mapping branch. Note that this is based on a slightly
older version of Git (v1.4.2). There are two "tuneables" on line 376
of sha1_file.c: the maximum amount of memory (in bytes) to
devote to packs, and the maximum chunk size of each pack (in bytes).
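For orientation, the two tuneables amount to something like this (the
identifier names here are guesses; check sha1_file.c in the branch for
the real ones, and note the 32 MiB / 128 MiB figures quoted later in
this thread):

    /* Guessed names and defaults -- see line 376 of sha1_file.c in the
     * window-mapping branch for the actual identifiers. */
    static size_t packed_git_limit       = 128 * 1024 * 1024; /* max bytes mapped across all packs */
    static size_t packed_git_window_size =  32 * 1024 * 1024; /* max bytes per mapped pack window  */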
I planned on making these configuration options but didn't get to
that yet.
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 18:26 ` Shawn Pearce
2006-11-07 18:56 ` Shawn Pearce
@ 2006-11-07 19:27 ` Alex Riesen
1 sibling, 0 replies; 21+ messages in thread
From: Alex Riesen @ 2006-11-07 19:27 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jakub Narebski, Junio C Hamano, git
Shawn Pearce, Tue, Nov 07, 2006 19:26:36 +0100:
> Too f'ing bad this .patch I created with format-patch doesn't say
> what commits it was based on, 'cause I can't find anything it will
> apply too. Or better, too f'ing bad I deleted that branch. *sigh*
Maybe you could just send the patch anyway? It could give someone an idea
of where to start, or what not to do...
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 18:56 ` Shawn Pearce
@ 2006-11-07 23:11 ` Alex Riesen
2006-11-08 5:19 ` Shawn Pearce
0 siblings, 1 reply; 21+ messages in thread
From: Alex Riesen @ 2006-11-07 23:11 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jakub Narebski, Junio C Hamano, git
Shawn Pearce, Tue, Nov 07, 2006 19:56:48 +0100:
> I've pushed the changes out to repo.or.cz:
>
> http://repo.or.cz/w/git/fastimport.git
>
> in the window-mapping branch. Note that this is based on a slightly
> older version of Git (v1.4.2). There are two "tuneables" on line 376
> of sha1_file.c, this is the maximum amount of memory (in bytes) to
> denote to packs and the maximum chunk size of each pack (in bytes).
Thanks.
I couldn't help noticing that the interface to the packs data is
a bit complex:
    unsigned char *use_pack(struct packed_git *p,
                            struct pack_window **window,
                            unsigned long offset,
                            unsigned int *left);
    void unuse_pack(struct pack_window **w);
Can I suggest something like below?
    unsigned char *use_pack(struct packed_git *p,
                            off_t offset, size_t size, size_t *mapped);
    void unuse_pack(struct packed_git *p, off_t offset, size_t size);
or
    void unuse_pack(struct packed_git *p, unsigned char *data);
(where size/maxsize is the number of bytes at the given offset in the
pack file the caller is asking for. use_pack would fail if the offset
or size does not pass sanity checks (offset past end of file, size
of 0). The mapped argument would receive the length of the data
actually mapped, which can be less than requested, subject to the end
of the data.) The window sliding code seems to have all the information
it needs with this set of arguments.
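To make the intent concrete, a caller of the suggested interface might
look like this (a sketch of the proposal only, reusing git's existing
die() and unpack_object_header_gently(); the 32-byte request size and
the parameter types are arbitrary):

    /* Sketch of the *proposed* calling convention, not code from the branch. */
    static unsigned long read_header(struct packed_git *p, unsigned long offset,
                                     unsigned long *type, unsigned long *sizep)
    {
        size_t mapped;
        unsigned long used;
        unsigned char *data = use_pack(p, offset, 32, &mapped);
        if (!data)
            die("object offset outside of pack file");
        used = unpack_object_header_gently(data, mapped, type, sizep);
        unuse_pack(p, offset, 32);
        if (!used)
            die("truncated object header in pack");
        return offset + used;
    }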
Or am I missing something very obvious, and something like this
is just not feasible for some reasons?
I'm asking because I tried to slowly rebase the window-mapping branch up and
merge the newer branches into it (to get it working with more recent
code). At some point I ran into conflicts and one of them got me
thinking about the interface. This is the part:
<<<<<<< HEAD/sha1_file.c
    c = *use_pack(p, w, offset++, NULL);
    *type = (c >> 4) & 7;
    size = c & 15;
    shift = 4;
    while (c & 0x80) {
        c = *use_pack(p, w, offset++, NULL);
        size += (c & 0x7f) << shift;
        shift += 7;
    }
    *sizep = size;
    return offset;
=======
    used = unpack_object_header_gently((unsigned char *)p->pack_base + offset,
                                       p->pack_size - offset, type, sizep);
    if (!used)
        die("object offset outside of pack file");
    return offset + used;
>>>>>>> f685d07de045423a69045e42e72d2efc22a541ca/sha1_file.c
I was almost about to move your code into unpack_object_header_gently,
but... The header isn't that big, is it? It is variable-length in the pack,
but the implementation of the parser is at the moment restricted by
the type we use for the object size (unsigned long on the particular
platform). For example:
    /* Object size is in the first 4 bits, and in the low 7 bits
     * of the subsequent bytes which have the high bit set */
    #define MAX_LOCAL_HDRSIZE ((sizeof(long) * 8 - 4) / 7 + 1)
    unsigned long size;
    unsigned char *header = use_pack(p, offset, MAX_LOCAL_HDRSIZE, &size);
    if (!header)
        die("object header offset out of range");
    /* unpack_object_header_gently takes care of truncated
     * headers by returning 0 if it encounters one */
    used = unpack_object_header_gently(header, size, type, sizep);
Wouldn't have to change unpack_object_header_gently at all.
(BTW, the current unpack_object_header_gently does not use its len
argument to check whether there actually is enough data to hold at least
a minimal header. Is the size of the mapped data checked for correctness
somewhere before?)
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 23:11 ` Alex Riesen
@ 2006-11-08 5:19 ` Shawn Pearce
2006-11-08 13:37 ` Alex Riesen
0 siblings, 1 reply; 21+ messages in thread
From: Shawn Pearce @ 2006-11-08 5:19 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, Junio C Hamano, git
Alex Riesen <fork0@t-online.de> wrote:
> I couldn't help noticing that the interface to the packs data is
> a bit complex:
>
> unsigned char *use_pack(struct packed_git *p,
> struct pack_window **window,
> unsigned long offset,
> unsigned int *left);
> void unuse_pack(struct pack_window **w);
>
> Or am I missing something very obvious, and something like this
> is just not feasible for some reasons?
The use counter. Every time someone asks for a pointer into the
pack they need to lock that window into memory to prevent us from
garbage collecting it by unmapping it to make room for another
window that the application needs.
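In outline, the pinning looks like this (simplified sketch; the real
struct in the window-mapping branch has more fields):

    #include <sys/types.h>   /* off_t, size_t */

    /* Simplified sketch of the pinning scheme described above. */
    struct pack_window {
        struct pack_window *next;
        unsigned char *base;      /* mmap'd chunk of the pack file    */
        off_t offset;             /* pack offset this chunk starts at */
        size_t len;               /* number of bytes mapped           */
        unsigned int inuse_cnt;   /* >0 while a caller holds a pointer into it */
    };

    /* use_pack() bumps inuse_cnt on the window it hands out a pointer
     * into; unuse_pack() drops the pin so the window becomes
     * reclaimable again. */
    static void unuse_pack(struct pack_window **w_cursor)
    {
        struct pack_window *w = *w_cursor;
        if (w) {
            w->inuse_cnt--;
            *w_cursor = NULL;
        }
    }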
> I was almost about to move your code into unpack_object_header_gently,
> but ... The header isn't that big, is it? It is variable in the pack,
> but the implementation of the parser is at the moment restricted by
> the type we use for object size (unsigned long for the particular
> platform). For example:
All true. However, what happens when the header spans two windows?
Let's say I have the first 4 MiB mapped and the next 4 MiB mapped in
a different window; these are not necessarily at the same locations
within memory. Now if an object header is split over these two,
then some bytes are at the end of the first window and the rest
are at the start of the next window.
I can't just say "make sure we have at least X bytes available
before starting to decode the header", as to do that in this case
we'd have to unmap BOTH windows and remap a new one which keeps
that very small header fully contiguous in memory. That's thrashing
the VM page tables for really no benefit.
> (BTW, current unpack_object_header_gently does not use it's len
> argument to check if there actually is enough data to hold at least
> minimal header. Is the size of mapped data checked for correctness
> somewhere before?)
Yes. Somewhere. I think we make sure there are at least 20 bytes
remaining in the pack before we start to decode a header. We must
have at least 20, as that's the trailing SHA1 checksum of the entire
pack. :-)
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-08 5:19 ` Shawn Pearce
@ 2006-11-08 13:37 ` Alex Riesen
2006-11-08 17:11 ` Shawn Pearce
0 siblings, 1 reply; 21+ messages in thread
From: Alex Riesen @ 2006-11-08 13:37 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jakub Narebski, Junio C Hamano, git
> > I couldn't help noticing that the interface to the packs data is
> > a bit complex:
> >
> > unsigned char *use_pack(struct packed_git *p,
> > struct pack_window **window,
> > unsigned long offset,
> > unsigned int *left);
> > void unuse_pack(struct pack_window **w);
> >
> > Or am I missing something very obvious, and something like this
> > is just not feasible for some reasons?
>
> The use counter. Every time someone asks for a pointer into the
> pack they need to lock that window into memory to prevent us from
> garbage collecting it by unmapping it to make room for another
> window that the application needs.
I think the counters can be kept in struct packed_git somewhere. Given the
mmap granularity, and the fact that not all of the pack is used in the normal
case (and the granularity helps us in the worst case), the memory used up by
the page counters shouldn't be too much.
> > I was almost about to move your code into unpack_object_header_gently,
> > but ... The header isn't that big, is it? It is variable in the pack,
> > but the implementation of the parser is at the moment restricted by
> > the type we use for object size (unsigned long for the particular
> > platform). For example:
>
> All true. However what happens when the header spans two windows?
> Lets say I have the first 4 MiB mapped and the next 4 MiB mapped in
> a different window; these are not necessarily at the same locations
> within memory. Now if an object header is split over these two
> then some bytes are at the end of the first window and the rest
> are at the start of the next window.
Assuming these are adjacent windows, we can just increment the counters on
all the touched pages (at least the two together) and return a pointer into
the lowest page. Otherwise it's time for garbage collection (why produce the
garbage at all, btw?) and a remap.
> I can't just say "make sure we have at least X bytes available
> before starting to decode the header, as to do that in this case
> we'd have to unmap BOTH windows and remap a new one which keeps
> that very small header fully contiguous in memory. That's thrashing
> the VM page tables for really no benefit.
You can't mmap less than a page, can you? So it's actually never a small
portion, but at least 4k on x86.
> > (BTW, current unpack_object_header_gently does not use it's len
> > argument to check if there actually is enough data to hold at least
> > minimal header. Is the size of mapped data checked for correctness
> > somewhere before?)
>
> Yes. Somewhere. I think we make sure there's at least 20 bytes
> in the pack remaining before we start to decode a header. We must
> have at least 20 as that's the trailing SHA1 checksum of the entire
> pack. :-)
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-08 13:37 ` Alex Riesen
@ 2006-11-08 17:11 ` Shawn Pearce
2006-11-08 21:33 ` Alex Riesen
0 siblings, 1 reply; 21+ messages in thread
From: Shawn Pearce @ 2006-11-08 17:11 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, Junio C Hamano, git
Alex Riesen <raa.lkml@gmail.com> wrote:
> >The use counter. Every time someone asks for a pointer into the
> >pack they need to lock that window into memory to prevent us from
> >garbage collecting it by unmapping it to make room for another
> >window that the application needs.
>
> I think the counters can be kept in struct packed_git somewhere. Given mmap
> granularity, and the fact that not all of the pack is used in normal case
> (and the granularity help us in the worst case) the memory used up by the
> page counters shouldn't be too much.
We already have a counter in struct packed_git (pack_use_cnt),
but this counter and struct permit only one mmap per pack.
We actually want at least two, but really four, sliding windows
per pack. Here's why:
The commits are at the front of the pack, trees are somewhere just
behind them, and the blobs are behind that. The objects are also
loosely ordered by time, so the further back in the commit graph
you go, the closer the associated data is to the end of the pack file.
If a pack file is larger than our mmap unit (32 MiB in my
implementation) then we will mmap it in at least two different
non-contiguous windows. In the case of the Linux kernel pack
(>128 MiB now) that's at least 4 windows to mmap the entire file.
If we are a commit parsing application (merge-base, blame, log,
rev-list, etc...) we need to walk back along the commit DAG to
identify objects of interest. So we need the front of the pack
mmap'd. But if we have a path filter then we also need to mmap the
middle of the pack to get access to the trees; and if we need blob
data then we need to mmap near the back too to get those.
So really an application wants 2-3 sliding windows of a pack file;
one focused on the front covering the unparsed commits; one slightly
behind that focused on the trees; and one further back focused on
the blobs. Oh and you may actually need a 4th to do delta base
decompression. If you work with fewer than 4 sliding windows then
you are going to be thrashing the page tables somewhat as you toss
out one window to load another, then toss that window out just to
go back and re-access the one you previously tossed.
If the sliding window size is large enough that nearly all commits
and trees fit into it then this is mostly a non-issue. Heck the
git.git pack is hovering around 10 MiB these days; that's less than
one window. But when the pack is larger than a couple of windows
this really starts to matter... and that's specifically the type
of repository this feature is designed for.
> >All true. However what happens when the header spans two windows?
> >Lets say I have the first 4 MiB mapped and the next 4 MiB mapped in
> >a different window; these are not necessarily at the same locations
> >within memory. Now if an object header is split over these two
> >then some bytes are at the end of the first window and the rest
> >are at the start of the next window.
>
> Assuming these are adjacent windows, we can just increment counters on the
> all touched pages (at least the two together) and return the pointer into
> the lowest page. Otherwise - time for garbage collection (why produce the
> garbage at all, btw?) and remap.
They are adjacent in the pack file but not necessarily in virtual memory!
I'm just asking the OS to return me a mapping for a chunk of a file;
I'm not also trying to make them contiguous in virtual memory.
Making adjacent pack file chunks contiguous in virtual memory is even
more code and probably going to cause problems at runtime. How do
we know what area of virtual memory is mostly unused on any given
platform? Any application that I have seen that tries to manage
its own virtual memory layout is usually full of platform-specific
hacks to make it work everywhere. Let's not go there with Git.
The garbage creation is to account for the 2-4 windows required
by most applications. Most of the time each window is unused;
we really only have two windows in use during delta decompression,
at all other times we really only have 1 window in use. The commit
parsing applications don't keep the commit window in use when they
go access a tree or a blob.
Consequently we want the garbage there. Actually I shouldn't have
used garbage: the correct term would be LRU managed cache. :-)
When we need a new window and we would exceed our maximum limit
(128 MiB in my implementation) we unmap the least recently used
window which is not currently in use.
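A sketch of that eviction step (field names like 'windows' and
'last_used' and the 'pack_mapped' accounting are assumptions about the
bookkeeping, not necessarily the branch's exact names):

    #include <sys/mman.h>   /* munmap */
    #include <stdlib.h>     /* free   */

    static size_t pack_mapped;   /* total bytes currently mmap'd */

    /* When mapping a new window would push the total past the limit,
     * unmap the least-recently-used window that is not currently pinned. */
    static void unuse_one_window(struct packed_git *p)
    {
        struct pack_window *w, **w_l, *lru = NULL, **lru_l = NULL;

        for (w_l = &p->windows; (w = *w_l) != NULL; w_l = &w->next)
            if (!w->inuse_cnt && (!lru || w->last_used < lru->last_used)) {
                lru = w;
                lru_l = w_l;
            }
        if (!lru)
            return;                    /* every window is currently pinned */
        munmap(lru->base, lru->len);   /* give the address space back      */
        pack_mapped -= lru->len;
        *lru_l = lru->next;
        free(lru);
    }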
> >I can't just say "make sure we have at least X bytes available
> >before starting to decode the header, as to do that in this case
> >we'd have to unmap BOTH windows and remap a new one which keeps
> >that very small header fully contiguous in memory. That's thrashing
> >the VM page tables for really no benefit.
>
> You can't mmap less than a page, can you? So it's actually never a small
> portion, but at least 4k on x86.
No. But we always mmap even more than that per window; e.g. 32
MiB. Since the header is way smaller than 4k we're talking about
unmapping two 32 MiB windows and remapping one of them just one
page earlier to get the whole object header in a contiguous region.
I'm not a kernel hacker (nor do I pretend to be) but from what I
know of the x86 page table structure that's not exactly the fastest
operation we can ask the system to perform. :)
I could be wrong. It may not matter. But I think it's crazy to
unmap otherwise valid mappings just because 2 bytes are on the
wrong side of an arbitrary boundary.
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 13:55 ` Alex Riesen
2006-11-07 15:50 ` Jakub Narebski
@ 2006-11-08 19:22 ` Christopher Faylor
1 sibling, 0 replies; 21+ messages in thread
From: Christopher Faylor @ 2006-11-08 19:22 UTC (permalink / raw)
To: Alex Riesen, git
On Tue, Nov 07, 2006 at 02:55:01PM +0100, Alex Riesen wrote:
>>So the problem is probably memory fragmentation.
>
>probably.
>
>>You might have more joy if you allocated one HUGE chunk immediately on
>>startup to use for the pack, and then kept re-using that chunk.
>
>Well, it is not _one_ chunk. The windows/cygwin abomin...combination
I would like to ask you, once again, to exercise some adult self-control
when you feel compelled to answer questions about Cygwin. If you need
to vent your frustration in some direction, I'd suggest getting a dog.
They can look very contrite when you yell at them even if they didn't
actually do anything wrong.
>may take an issue with this: it seem to copy complete address space
>at fork, which even for such a small packs I have here takes system
>down lightly (yes, I tried it).
Yes, Cygwin copies the heap on fork since Windows doesn't implement fork.
FWIW, Cygwin's malloc is based on Doug Lea's malloc. It reverts to
using mmap when allocating memory above a certain threshold.
cgf
--
Christopher Faylor spammer? -> aaaspam@sourceware.org
Cygwin Co-Project Leader aaaspam@duffek.com
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-08 17:11 ` Shawn Pearce
@ 2006-11-08 21:33 ` Alex Riesen
2006-11-08 22:28 ` Shawn Pearce
0 siblings, 1 reply; 21+ messages in thread
From: Alex Riesen @ 2006-11-08 21:33 UTC (permalink / raw)
To: Shawn Pearce; +Cc: Jakub Narebski, Junio C Hamano, git
Shawn Pearce, Wed, Nov 08, 2006 18:11:31 +0100:
> > >All true. However what happens when the header spans two windows?
> > >Lets say I have the first 4 MiB mapped and the next 4 MiB mapped in
> > >a different window; these are not necessarily at the same locations
> > >within memory. Now if an object header is split over these two
> > >then some bytes are at the end of the first window and the rest
> > >are at the start of the next window.
> >
> > Assuming these are adjacent windows, we can just increment counters on the
> > all touched pages (at least the two together) and return the pointer into
> > the lowest page. Otherwise - time for garbage collection (why produce the
> > garbage at all, btw?) and remap.
>
> They are adjacent in the pack file but not necessarily in virtual memory!
Oh, right! Don't know why I thought the mapped regions would be
connected.
> The garbage creation is to account for the 2-4 windows required
> by most applications. Most of the time each window is unused;
> we really only have two windows in use during delta decompression,
> at all other times we really only have 1 window in use. The commit
> parsing applications don't keep the commit window in use when they
> go access a tree or a blob.
So they actually can call unuse_pack to unmap the window,
but it's kept for caching reasons?
> Consequently we want the garbage there. Actually I shouldn't have
> used garbage: the correct term would be LRU managed cache. :-)
> When we need a new window and we would exceed our maximum limit
> (128 MiB in my implementation) we unmap the least recently used
> window which is not currently in use.
Yep, noticed that :) Just wondered why.
> I could be wrong. It may not matter. But I think its crazy to
> unmap otherwise valid mappings just because 2 bytes are on the
> wrong side of an arbitrary boundary.
You're right, it would be unfortunate to remap too often.
use_pack always maps at least 20 bytes, if I understand in_window and
its use correctly. Actually, now that I've stared at it longer, I think the
interface I suggested does almost the same thing; it just allows the caller
to configure (well, hint at) the number of bytes to be mapped in.
I still can't let go of the idea of getting as much data as possible with
just one call to the sliding window code. Calling use_pack for every byte
just does not seem right.
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-08 21:33 ` Alex Riesen
@ 2006-11-08 22:28 ` Shawn Pearce
0 siblings, 0 replies; 21+ messages in thread
From: Shawn Pearce @ 2006-11-08 22:28 UTC (permalink / raw)
To: Alex Riesen; +Cc: Jakub Narebski, Junio C Hamano, git
Alex Riesen <fork0@t-online.de> wrote:
> Shawn Pearce, Wed, Nov 08, 2006 18:11:31 +0100:
> > The garbage creation is to account for the 2-4 windows required
> > by most applications. Most of the time each window is unused;
> > we really only have two windows in use during delta decompression,
> > at all other times we really only have 1 window in use. The commit
> > parsing applications don't keep the commit window in use when they
> > go access a tree or a blob.
>
> So they actually can call unuse_pack to unmap the window,
> but it's kept for caching reasons?
Actually very few parts of the code even know about the windows.
Really the only parts that know about it are the ones that directly
access the pack file, which is mostly restricted to sha1_file.c.
So since all access is through the more public interfaces what
you find is that the application code never keeps the window.
We are always doing use_pack/unuse_pack on every object access.
So the window is almost never in use. So if we didn't hang onto
it in an LRU we would be in a world of hurt performance-wise.
> > I could be wrong. It may not matter. But I think its crazy to
> > unmap otherwise valid mappings just because 2 bytes are on the
> > wrong side of an arbitrary boundary.
>
> You're right, would be unfortunate to remap too often.
>
> use_pack always maps at least 20 bytes, if I understand in_window and
> its use correctly. Actually, now I'm staring at it longer, I think the
> interface I suggested does almost the same, just allows to configure
> (well, hint at) the amount of bytes to be mapped in.
True; but if you look nobody wants more than 20 bytes. They either
want <20 for the object header or 20 for the base object id in
a delta. Otherwise they are shoving the data into zlib which
doesn't care. No need to configure it, just shove it in.
> I still can't let go of the idea to get as much data as possible with
> just one call to sliding window code. Calling use_pack for every byte
> just does not seem right.
True. But the only other idea I have is to copy the data into a
buffer for the caller, which we would use only for the header section,
being that it's small... we already copy the delta base (20 bytes)
onto the stack during decompression. Might as well copy the header
to decompress it. Then you can batch up the range checks to at
worst no more than 2 range checks per header.
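A sketch of that copy-out helper, using the use_pack()/unuse_pack()
signatures quoted earlier in the thread (the helper name and the
error handling are made up):

    #include <string.h>   /* memcpy */

    /* Gather up to 'n' bytes starting at 'offset' into the caller's
     * buffer, even if they straddle two windows, so a header can be
     * decoded from contiguous memory with at most two range checks. */
    static unsigned int copy_pack_bytes(struct packed_git *p, unsigned long offset,
                                        unsigned char *buf, unsigned int n)
    {
        unsigned int copied = 0;
        while (copied < n) {
            struct pack_window *w = NULL;
            unsigned int avail;
            unsigned char *src = use_pack(p, &w, offset + copied, &avail);
            if (!avail)
                break;                 /* ran off the end of the pack */
            if (avail > n - copied)
                avail = n - copied;
            memcpy(buf + copied, src, avail);
            copied += avail;
            unuse_pack(&w);
        }
        return copied;                 /* may be less than n at end of pack */
    }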
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-07 11:02 win2k/cygwin cannot handle even moderately sized packs Alex Riesen
2006-11-07 12:17 ` Noel Grandin
@ 2006-11-13 12:45 ` Johannes Schindelin
2006-11-13 17:34 ` Alex Riesen
1 sibling, 1 reply; 21+ messages in thread
From: Johannes Schindelin @ 2006-11-13 12:45 UTC (permalink / raw)
To: Alex Riesen; +Cc: git
Hi,
On Tue, 7 Nov 2006, Alex Riesen wrote:
> For me, it fails even on 332Mb pack:
>
> $ git reset --hard 61bb7fcb
> fatal: packfile .git/objects/pack/pack-ad37...pack cannot be mapped.
>
> Instrumenting the code reveals that it fails on 348876870 bytes.
> Strangely enough, a cygwin program which just reads that pack
> many times without freeing the mem goes up to 1395507480 (1330Mb).
>
> If I replace the malloc (cygwin) with native VirtualAlloc(MEM_COMMIT)
> it reports that "Not enough storage is available to process this command",
> which is just ENOMEM, I think.
This looks to me as if you have NO_MMAP=1 set in your Makefile (which I
also do automatically when compiling on cygwin).
The old problem: Windows does not have fork.
<digression> And before somebody starts cygwin bashing: don't. It is not
cygwin's problem, it is Windows'. The cygwin people worked long and hard
on something truly useful, and it helps me _every_ time I have to work on
a Windows platform (which _is_ utter crap). Some problems of Windows are
so unhideable though, that even cygwin cannot work around them.
</digression>
Cygwin provides an mmap(), which works remarkably well even with the
emulated fork(), but one thing is not possible: since mmap()ed files
have to be _reopened_ after a fork(), and git uses the
open-temp-file-then-delete-it-but-continue-to-use-it paradigm quite often,
we work around it by setting NO_MMAP=1. Again, this is _not_ Cygwin's
fault!
And I think that an mmap() of 332MB would not fail.
A long time ago (July 18th, to be precise), Linus suggested (in Message-Id:
<Pine.LNX.4.64.0607180837260.3386@evo.osdl.org>) finding out which mmap()
calls are _not_ used before a fork(), and not emulating those with malloc().
I never got around to doing that, but maybe others will?
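A sketch of that idea (the wrapper, its 'survives_fork' flag and the
pread()-based fallback are purely illustrative; today's NO_MMAP build
takes the fallback path for every mapping):

    #include <sys/mman.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Use the real (Cygwin) mmap() for mappings that never have to
     * survive a fork(), and fall back to the malloc()+read() emulation
     * only for the ones that do. */
    void *git_mmap(size_t length, int fd, off_t offset, int survives_fork)
    {
        if (!survives_fork)
            return mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, offset);

        void *buf = malloc(length);
        if (!buf)
            return MAP_FAILED;
        if (pread(fd, buf, length, offset) != (ssize_t)length) {
            free(buf);
            return MAP_FAILED;
        }
        return buf;
    }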
Ciao,
Dscho
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-13 12:45 ` Johannes Schindelin
@ 2006-11-13 17:34 ` Alex Riesen
2006-11-13 17:36 ` Alex Riesen
0 siblings, 1 reply; 21+ messages in thread
From: Alex Riesen @ 2006-11-13 17:34 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: git
> This looks to me as if you have NO_MMAP=1 set in your Makefile (which I
> also do automatically when compiling on cygwin).
Kind of. I use mmap (from cygwin) in specially selected places.
I remember posting the patches once.
> The old problem: Windows does not have fork.
As if it had anything non-fubar at all...
> <digression> And before somebody starts cygwin bashing: don't. It is not
> cygwin's problem, it is Windows'.
I didn't bash cygwin. I just pity the whole effort (and myself, atm).
> And I think that a mmap() of 332MB would not fail.
It does not in an isolated environment. It just fails in my particular context.
And before anyone suggests it: yes, CreateFileMapping and VirtualAlloc were
tried. They return errors suggesting the same reason (ENOMEM).
> Long time ago (to be precise, July 18th), Linus suggested (in Message-Id:
> <Pine.LNX.4.64.0607180837260.3386@evo.osdl.org>) to find out which mmap()
> calls are _not_ used before a fork(), and not emulate them by malloc().
>
> I never came around to do that, but maybe others do?
I'm trying to find some time to make Shawn's sliding window code work.
It looks promising (the patches, not the time).
* Re: win2k/cygwin cannot handle even moderately sized packs
2006-11-13 17:34 ` Alex Riesen
@ 2006-11-13 17:36 ` Alex Riesen
0 siblings, 0 replies; 21+ messages in thread
From: Alex Riesen @ 2006-11-13 17:36 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: git
oh, damn. Sorry for the appended copy. It's behind a broken corporate