* limit on number of kmapped pages
From: David Wragg @ 2001-01-23 13:56 UTC
To: linux-kernel

While testing some kernel code of mine on a machine with CONFIG_HIGHMEM
enabled, I've run into the limit on the number of pages that can be
kmapped at once.  I was surprised to find it was so low -- only 2MB/4MB
of address space for kmap (according to the value of LAST_PKMAP; vmalloc
gets a much more generous 128MB!).

My code allocates a large number of pages (4MB-worth would be typical)
to act as a buffer; interrupt handlers/BHs copy data into this buffer,
then a kernel thread moves filled pages into the page cache and replaces
them with newly allocated pages.  To avoid overhead on IRQs/BHs, all the
pages in the buffer are kmapped.  But with CONFIG_HIGHMEM, if I try to
kmap 512 pages or more at once, the kernel locks up (fork() starts
blocking inside kmap(), etc.).

There are ways I could work around this (either by using kmap_atomic, or
by adding another kernel thread that maintains a window of kmapped pages
within the buffer).  But I'd prefer not to have to add a lot of code
specific to the CONFIG_HIGHMEM case.

So why is LAST_PKMAP so low, and what would the consequences of raising
it be?  (I don't think kernel address space is that scarce in the
CONFIG_HIGHMEM case, so I suspect that the main reason is to limit the
amount of searching needed for kmap to find a free slot.  Is this
right?)

David Wragg
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: limit on number of kmapped pages
From: Eric W. Biederman @ 2001-01-23 18:23 UTC
To: David Wragg; +Cc: linux-kernel, linux-mm

David Wragg <dpw@doc.ic.ac.uk> writes:

> While testing some kernel code of mine on a machine with
> CONFIG_HIGHMEM enabled, I've run into the limit on the number of pages
> that can be kmapped at once. I was surprised to find it was so low --
> only 2MB/4MB of address space for kmap (according to the value of
> LAST_PKMAP; vmalloc gets a much more generous 128MB!).

kmap is for quick transitory mappings.  kmap is not for permanent
mappings.  At least that was my impression.  The persistence is intended
just to kill error-prone cases.

> My code allocates a large number of pages (4MB-worth would be typical)
> to act as a buffer; interrupt handlers/BHs copy data into this buffer,
> then a kernel thread moves filled pages into the page cache and
> replaces them with newly allocated pages. To avoid overhead on
> IRQs/BHs, all the pages in the buffer are kmapped. But with
> CONFIG_HIGHMEM if I try to kmap 512 pages or more at once, the kernel
> locks up (fork() starts blocking inside kmap(), etc.).

This may be a reasonable use; I'm not certain.  It wasn't the
application kmap was designed to deal with, though...

> There are ways I could work around this (either by using kmap_atomic,
> or by adding another kernel thread that maintains a window of kmapped
> pages within the buffer). But I'd prefer not to have to add a lot of
> code specific to the CONFIG_HIGHMEM case.

Why do you need such a large buffer?  And why do the pages need to be
kmapped?  If you are doing dma there is no such requirement...  And
unless you are running on something faster than a PCI bus I can't
imagine why you need a buffer that big.

My hunch is that it makes sense to do the kmap, and the i/o, in the
bottom_half.  What is wrong with that?

kmap should be quick and fast because it is for transitory mappings.  It
shouldn't be something whose overhead you are trying to avoid.  If kmap
is that expensive then kmap needs to be fixed, instead of your code
working around a perceived problem.

At least that is what it looks like from here.

Eric
* Re: limit on number of kmapped pages
From: David Wragg @ 2001-01-24 0:35 UTC
To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

ebiederm@xmission.com (Eric W. Biederman) writes:

> Why do you need such a large buffer?

ext2 doesn't guarantee sustained write bandwidth (in particular, writing
a page to an ext2 file can have a high latency due to reading the block
bitmap synchronously).  To deal with this I need at least a 2MB buffer.

I've modified ext2 slightly to avoid that problem, but I still expect to
need a 512KB buffer (though the usual requirements are much lower).
While that wouldn't hit the kmap limit, it would bring the system closer
to it.  Perhaps further tuning could reduce the buffer needs of my
application, but it is better to have the buffer too big than too small.

> And why do the pages need to be kmapped?

They only need to be kmapped while data is being copied into them.

> If you are doing dma there is no such requirement... And unless you
> are running on something faster than a PCI bus I can't imagine why
> you need a buffer that big.

Gigabit ethernet.

> My hunch is that it makes sense to do the kmap, and the i/o in the
> bottom_half. What is wrong with that?

Do you mean kmap_atomic?  The comments around kmap don't mention
avoiding it in BHs, but I don't see what prevents kmap -> kmap_high ->
map_new_virtual -> schedule.

> kmap should be quick and fast because it is for transitory mappings.
> It shouldn't be something whose overhead you are trying to avoid. If
> kmap is that expensive then kmap needs to be fixed, instead of your
> code working around a perceived problem.
>
> At least that is what it looks like from here.

When adding the kmap/kunmap calls to my code I arranged them so they
would be used as infrequently as possible.  After working on making the
critical paths in my code fast, I didn't want to add operations with an
uncertain cost to those paths unless there was a good reason.  Which is
why I'm asking how significant the kmap limit is.

David Wragg
* Re: limit on number of kmapped pages
From: Benjamin C.R. LaHaise @ 2001-01-24 2:03 UTC
To: David Wragg; +Cc: Eric W. Biederman, linux-kernel, linux-mm

On 24 Jan 2001, David Wragg wrote:

> ebiederm@xmission.com (Eric W. Biederman) writes:
> > Why do you need such a large buffer?
>
> ext2 doesn't guarantee sustained write bandwidth (in particular,
> writing a page to an ext2 file can have a high latency due to reading
> the block bitmap synchronously). To deal with this I need at least a
> 2MB buffer.

This is the wrong way of going about things -- you should probably
insert the pages into the page cache and write them into the filesystem
via writepage.  That way the pages don't need to be mapped while being
written out.  For incoming data from a network socket, make use of the
data_ready callbacks and copy directly from the skbs in one pass,
kmapping only one page at a time.

Maybe I'm guessing incorrectly at what is being attempted, but kmap
should be used sparingly and as briefly as possible.

-ben
* Re: limit on number of kmapped pages
From: David Wragg @ 2001-01-24 10:09 UTC
To: Benjamin C.R. LaHaise; +Cc: Eric W. Biederman, linux-kernel, linux-mm

"Benjamin C.R. LaHaise" <blah@kvack.org> writes:

> On 24 Jan 2001, David Wragg wrote:
>
> > ebiederm@xmission.com (Eric W. Biederman) writes:
> > > Why do you need such a large buffer?
> >
> > ext2 doesn't guarantee sustained write bandwidth (in particular,
> > writing a page to an ext2 file can have a high latency due to
> > reading the block bitmap synchronously). To deal with this I need
> > at least a 2MB buffer.
>
> This is the wrong way of going about things -- you should probably
> insert the pages into the page cache and write them into the
> filesystem via writepage.

I currently use prepare_write/commit_write, but I think writepage would
have the same issue: when ext2 allocates a block and has to allocate
from a new block group, it may do a synchronous read of the new block
group bitmap.  So before the writepage (or whatever) that causes this
completes, it has to wait for the read to get picked by the elevator,
the seek for the read, etc.  By the time it gets back to writing
normally, I've buffered a couple of MB of data.  But I do have a
workaround for the ext2 issue.

> That way the pages don't need to be mapped while being written out.

Point taken, though the kmap needed before prepare_write is much less
significant than the kmap I need to do before copying data into the
page.

> For incoming data from a network socket, make use of the data_ready
> callbacks and copy directly from the skbs in one pass, kmapping only
> one page at a time.
>
> Maybe I'm guessing incorrectly at what is being attempted, but kmap
> should be used sparingly and as briefly as possible.

I'm going to see if the one-page-kmapped approach makes a measurable
difference.

I'd still like to know what the basis for the current kmap limit
setting is.

David Wragg
* Re: limit on number of kmapped pages
From: Eric W. Biederman @ 2001-01-24 14:27 UTC
To: David Wragg; +Cc: Benjamin C.R. LaHaise, linux-kernel, linux-mm

David Wragg <dpw@doc.ic.ac.uk> writes:

> I'd still like to know what the basis for the current kmap limit
> setting is.

Mostly that at one point kmap_atomic was all there was.  It was only the
difficulty of implementing copy_from_user with kmap_atomic that
convinced people we needed something more.  So actually, seen from that
starting point, being able to kmap several megabytes at once makes the
kmap limit quite high.

Eric
* Random thoughts on sustained write performance
From: Daniel Phillips @ 2001-01-25 10:06 UTC
To: David Wragg, Benjamin C.R. LaHaise, Eric W. Biederman
Cc: linux-kernel, linux-mm

On Wed, 24 Jan 2001, David Wragg wrote:

> "Benjamin C.R. LaHaise" <blah@kvack.org> writes:
> > On 24 Jan 2001, David Wragg wrote:
> > > ext2 doesn't guarantee sustained write bandwidth (in particular,
> > > writing a page to an ext2 file can have a high latency due to
> > > reading the block bitmap synchronously). To deal with this I need
> > > at least a 2MB buffer.
> >
> > This is the wrong way of going about things -- you should probably
> > insert the pages into the page cache and write them into the
> > filesystem via writepage.
>
> I currently use prepare_write/commit_write, but I think writepage
> would have the same issue: When ext2 allocates a block, and has to
> allocate from a new block group, it may do a synchronous read of the
> new block group bitmap. So before the writepage (or whatever) that
> causes this completes, it has to wait for the read to get picked by
> the elevator, the seek for the read, etc. By the time it gets back to
> writing normally, I've buffered a couple of MB of data.

I'll add my $0.02 here.  Besides reading the block bitmap you may have
to read up to three levels of file index blocks as well.  If we stop
pinning the group descriptors in memory you might need to read those
too.  So this adds up to 4-5 reads, all synchronous.  Worse, the bitmap
block may be widely separated from the index blocks, in turn widely
separated from the data block, and if group descriptors get added to
the mix you may have to seek across the entire disk.

This all adds up to a pretty horrible worst-case latency.  Mapping
index blocks into the page cache should produce a noticeable
average-case improvement, because we can change from a top-down
traversal of the index hierarchy:

  - get the triple-indirect index block nr out of the inode, getblk(tind)
  - get the double-indirect nr, getblk(dind)
  - get the indirect nr, getblk(ind)

to bottom-up:

  - is the indirect index block in the page cache?
  - no? is it mapped and just needs to be reread?
  - no? then is the double-indirect block there?
  - yes? ah, now we know the block nr of the indirect block, map it and
    read it in and we're done.

The common case for the page cache is even better:

  - is the indirect index block in the page cache?
  - yes, we're done.

The page cache approach is so much better because we directly compute
the page cache index at which we should find the bottom-level index
block, while the buffers-only approach has to traverse the whole chain
every time.  So we do considerably less hashing and avoid some reads.
Note: in case anybody thinks avoiding hashing is unimportant, we *are*
wasting too much cpu in ext2 -- just look at the cpu numbers for dbench
runs and you'll see it clearly.

Getting back on-topic, we don't improve the worst-case behaviour at all
with the page-cache approach, and that is what matters in
rate-guaranteed io.  So the big buffer is still needed, and it might
need to be even bigger than suggested.  If we are *really* unlucky and
everything is not only out of cache but widely separated on disk, we
could be hit with 4 reads at 5 ms each, a total of 20 ms.  If the disk
transfers 16 meg/sec (4 blocks/ms) and we're generating io at 8 meg/sec
(2 blocks/ms), then during that stall a backlog of 40 blocks builds up,
which will take a further 20 ms to clear -- hope we don't hit more
synchronous reads during that time.
Clearly, we can construct a worst case that will overflow any size of
buffer.  Even though the chances of that happening may be very small,
we have so many users that somebody, somewhere will get hit by it.
It's worth having a good think to see if there's a nice way to come up
with a rate guarantee for ext2.  Mapping metadata into the page cache
seems to be heading in the right design direction, but we also need to
think about some organized way of memory-locking the higher-level,
infrequently accessed index blocks to prevent them from contributing to
the worst case.

Another part of the rate guarantee is physical layout on disk: we
*must* update index blocks from time to time.  With streaming writes,
after a while they tend to get very far from where the write activity
is taking place, and seek times start to hurt.  The solution is to
relocate the whole chain of index blocks from time to time, which
sounds a lot like what Tux2 does in normal operation.  This behaviour
could be added to Ext2/Ext3 quite easily.

--
Daniel
* Re: limit on number of kmapped pages
From: Stephen C. Tweedie @ 2001-01-25 18:16 UTC
To: David Wragg; +Cc: Eric W. Biederman, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Wed, Jan 24, 2001 at 12:35:12AM +0000, David Wragg wrote:

> > And why do the pages need to be kmapped?
>
> They only need to be kmapped while data is being copied into them.

But you only need to kmap one page at a time during the copy.  There is
absolutely no need to copy the whole chunk at once.

--Stephen
* Re: limit on number of kmapped pages
From: David Wragg @ 2001-01-25 23:53 UTC
To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-kernel, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> On Wed, Jan 24, 2001 at 12:35:12AM +0000, David Wragg wrote:
> > > And why do the pages need to be kmapped?
> >
> > They only need to be kmapped while data is being copied into them.
>
> But you only need to kmap one page at a time during the copy. There
> is absolutely no need to copy the whole chunk at once.

The chunks I'm copying are always smaller than a page; usually they are
a few hundred bytes.  Though because I'm copying into the pages in a
bottom half, I'll have to use kmap_atomic.

After a page is filled, it is put into the page cache.  So the pages
have to be allocated with page_cache_alloc(), hence __GFP_HIGHMEM and
the reason I'm bothering with kmap at all.

David Wragg