* limit on number of kmapped pages
From: David Wragg @ 2001-01-23 13:56 UTC
To: linux-kernel

While testing some kernel code of mine on a machine with CONFIG_HIGHMEM
enabled, I've run into the limit on the number of pages that can be
kmapped at once.  I was surprised to find it was so low -- only 2MB/4MB
of address space for kmap (according to the value of LAST_PKMAP; vmalloc
gets a much more generous 128MB!).

My code allocates a large number of pages (4MB-worth would be typical)
to act as a buffer; interrupt handlers/BHs copy data into this buffer,
then a kernel thread moves filled pages into the page cache and replaces
them with newly allocated pages.  To avoid overhead on IRQs/BHs, all the
pages in the buffer are kmapped.  But with CONFIG_HIGHMEM, if I try to
kmap 512 pages or more at once, the kernel locks up (fork() starts
blocking inside kmap(), etc.).

There are ways I could work around this (either by using kmap_atomic, or
by adding another kernel thread that maintains a window of kmapped pages
within the buffer).  But I'd prefer not to have to add a lot of code
specific to the CONFIG_HIGHMEM case.

So why is LAST_PKMAP so low, and what would the consequences of raising
it be?  (I don't think kernel address space is that scarce in the
CONFIG_HIGHMEM case, so I suspect that the main reason is to limit the
amount of searching needed for kmap to find a free slot.  Is this
right?)

David Wragg
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: limit on number of kmapped pages
From: Eric W. Biederman @ 2001-01-23 18:23 UTC
To: David Wragg; +Cc: linux-kernel, linux-mm

David Wragg <dpw@doc.ic.ac.uk> writes:

> While testing some kernel code of mine on a machine with
> CONFIG_HIGHMEM enabled, I've run into the limit on the number of pages
> that can be kmapped at once. I was surprised to find it was so low --
> only 2MB/4MB of address space for kmap (according to the value of
> LAST_PKMAP; vmalloc gets a much more generous 128MB!).

kmap is for quick transitory mappings.  kmap is not for permanent
mappings.  At least that was my impression.  The persistence is intended
just to kill error-prone cases.

> My code allocates a large number of pages (4MB-worth would be typical)
> to act as a buffer; interrupt handlers/BHs copy data into this buffer,
> then a kernel thread moves filled pages into the page cache and
> replaces them with newly allocated pages. To avoid overhead on
> IRQs/BHs, all the pages in the buffer are kmapped. But with
> CONFIG_HIGHMEM if I try to kmap 512 pages or more at once, the kernel
> locks up (fork() starts blocking inside kmap(), etc.).

This may be a reasonable use; I'm not certain.  It wasn't the
application kmap was designed to deal with, though...

> There are ways I could work around this (either by using kmap_atomic,
> or by adding another kernel thread that maintains a window of kmapped
> pages within the buffer). But I'd prefer not to have to add a lot of
> code specific to the CONFIG_HIGHMEM case.

Why do you need such a large buffer?  And why do the pages need to be
kmapped?  If you are doing dma there is no such requirement...  And
unless you are running on something faster than a PCI bus I can't
imagine why you need a buffer that big.

My hunch is that it makes sense to do the kmap, and the i/o, in the
bottom_half.  What is wrong with that?

kmap should be quick and fast because it is for transitory mappings.  It
shouldn't be something whose overhead you are trying to avoid.  If kmap
is that expensive then kmap needs to be fixed, instead of your code
working around a perceived problem.

At least that is what it looks like from here.

Eric
* Re: limit on number of kmapped pages
From: David Wragg @ 2001-01-24 0:35 UTC
To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

ebiederm@xmission.com (Eric W. Biederman) writes:

> Why do you need such a large buffer?

ext2 doesn't guarantee sustained write bandwidth (in particular, writing
a page to an ext2 file can have a high latency due to reading the block
bitmap synchronously).  To deal with this I need at least a 2MB buffer.

I've modified ext2 slightly to avoid that problem, but I still expect to
need a 512KB buffer (though the usual requirements are much lower).
While that wouldn't hit the kmap limit, it would bring the system closer
to it.  Perhaps further tuning could reduce the buffer needs of my
application, but it is better to have the buffer too big than too small.

> And why do the pages need to be kmapped?

They only need to be kmapped while data is being copied into them.

> If you are doing dma there is no such requirement... And unless you
> are running on something faster than a PCI bus I can't imagine why
> you need a buffer that big.

Gigabit ethernet.

> My hunch is that it makes sense to do the kmap, and the i/o in the
> bottom_half. What is wrong with that?

Do you mean kmap_atomic?  The comments around kmap don't mention
avoiding it in BHs, but I don't see what prevents kmap -> kmap_high ->
map_new_virtual -> schedule.

> kmap should be quick and fast because it is for transitory mappings.
> It shouldn't be something whose overhead you are trying to avoid. If
> kmap is that expensive then kmap needs to be fixed, instead of your
> code working around a perceived problem.
>
> At least that is what it looks like from here.

When adding the kmap/kunmap calls to my code I arranged them so they
would be used as infrequently as possible.  After working on making the
critical paths in my code fast, I didn't want to add operations with an
uncertain cost to those paths unless there was a good reason.  Which is
why I'm asking how significant the kmap limit is.

David Wragg
* Re: limit on number of kmapped pages
From: Benjamin C.R. LaHaise @ 2001-01-24 2:03 UTC
To: David Wragg; +Cc: Eric W. Biederman, linux-kernel, linux-mm

On 24 Jan 2001, David Wragg wrote:

> ebiederm@xmission.com (Eric W. Biederman) writes:
> > Why do you need such a large buffer?
>
> ext2 doesn't guarantee sustained write bandwidth (in particular,
> writing a page to an ext2 file can have a high latency due to reading
> the block bitmap synchronously). To deal with this I need at least a
> 2MB buffer.

This is the wrong way of going about things -- you should probably
insert the pages into the page cache and write them into the filesystem
via writepage.  That way the pages don't need to be mapped while being
written out.  For incoming data from a network socket, make use of the
data_ready callbacks and copy directly from the skbs in one pass,
kmapping only one page at a time.

Maybe I'm guessing incorrectly at what is being attempted, but kmap
should be used sparingly and as briefly as possible.

-ben
* Re: limit on number of kmapped pages
From: David Wragg @ 2001-01-24 10:09 UTC
To: Benjamin C.R. LaHaise; +Cc: Eric W. Biederman, linux-kernel, linux-mm

"Benjamin C.R. LaHaise" <blah@kvack.org> writes:

> On 24 Jan 2001, David Wragg wrote:
>
> > ebiederm@xmission.com (Eric W. Biederman) writes:
> > > Why do you need such a large buffer?
> >
> > ext2 doesn't guarantee sustained write bandwidth (in particular,
> > writing a page to an ext2 file can have a high latency due to
> > reading the block bitmap synchronously). To deal with this I need
> > at least a 2MB buffer.
>
> This is the wrong way of going about things -- you should probably
> insert the pages into the page cache and write them into the
> filesystem via writepage.

I currently use prepare_write/commit_write, but I think writepage would
have the same issue: when ext2 allocates a block and has to allocate
from a new block group, it may do a synchronous read of the new block
group bitmap.  So before the writepage (or whatever) that causes this
completes, it has to wait for the read to get picked by the elevator,
the seek for the read, etc.  By the time it gets back to writing
normally, I've buffered a couple of MB of data.  But I do have a
workaround for the ext2 issue.

> That way the pages don't need to be mapped while being written out.

Point taken, though the kmap needed before prepare_write is much less
significant than the kmap I need to do before copying data into the
page.

> For incoming data from a network socket, make use of the data_ready
> callbacks and copy directly from the skbs in one pass, kmapping only
> one page at a time.
>
> Maybe I'm guessing incorrectly at what is being attempted, but kmap
> should be used sparingly and as briefly as possible.

I'm going to see if the one-page-kmapped approach makes a measurable
difference.

I'd still like to know what the basis for the current kmap limit
setting is.

David Wragg
* Re: limit on number of kmapped pages
From: Eric W. Biederman @ 2001-01-24 14:27 UTC
To: David Wragg; +Cc: Benjamin C.R. LaHaise, linux-kernel, linux-mm

David Wragg <dpw@doc.ic.ac.uk> writes:

> I'd still like to know what the basis for the current kmap limit
> setting is.

Mostly that at one point kmap_atomic was all there was.  It was only the
difficulty of implementing copy_from_user with kmap_atomic that
convinced people we needed something more.  So actually, seen from that
starting point, being able to kmap several megabytes at once makes the
kmap limit quite high.

Eric
* Random thoughts on sustained write performance
From: Daniel Phillips @ 2001-01-25 10:06 UTC
To: David Wragg, Benjamin C.R. LaHaise, Eric W. Biederman
Cc: linux-kernel, linux-mm

On Wed, 24 Jan 2001, David Wragg wrote:

> "Benjamin C.R. LaHaise" <blah@kvack.org> writes:
> > On 24 Jan 2001, David Wragg wrote:
> > > ext2 doesn't guarantee sustained write bandwidth (in particular,
> > > writing a page to an ext2 file can have a high latency due to
> > > reading the block bitmap synchronously). To deal with this I need
> > > at least a 2MB buffer.
> >
> > This is the wrong way of going about things -- you should probably
> > insert the pages into the page cache and write them into the
> > filesystem via writepage.
>
> I currently use prepare_write/commit_write, but I think writepage
> would have the same issue: When ext2 allocates a block, and has to
> allocate from a new block group, it may do a synchronous read of the
> new block group bitmap. So before the writepage (or whatever) that
> causes this completes, it has to wait for the read to get picked by
> the elevator, the seek for the read, etc. By the time it gets back to
> writing normally, I've buffered a couple of MB of data.

I'll add my $0.02 here.  Besides reading the block bitmap you may have
to read up to three levels of file index blocks as well.  If we stop
pinning the group descriptors in memory you might need to read those
too.  So this adds up to 4-5 reads, all synchronous.  Worse, the bitmap
block may be widely separated from the index blocks, in turn widely
separated from the data block, and if group descriptors get added to
the mix you may have to seek across the entire disk.

This all adds up to a pretty horrible worst-case latency.  Mapping
index blocks into the page cache should produce a noticeable
average-case improvement, because we can change from a top-down
traversal of the index hierarchy:

  - get the triple-indirect index block nr out of the inode, getblk(tind)
  - get the double-indirect nr, getblk(dind)
  - get the indirect nr, getblk(ind)

to bottom-up:

  - is the indirect index block in the page cache?
  - no? is it mapped and just needs to be reread?
  - no? then is the double-indirect block there?
  - yes? ah, now we know the block nr of the indirect block, map it and
    read it in and we're done.

The common case for the page cache is even better:

  - is the indirect index block in the page cache?
  - yes, we're done.

The page cache approach is so much better because we directly compute
the page cache index at which we should find the bottom-level index
block, while the buffers-only approach has to traverse the whole chain
every time.  So we do considerably less hashing and avoid some reads.
Note: in case anybody thinks avoiding hashing is unimportant, we *are*
wasting too much cpu in ext2 -- just look at the cpu numbers for dbench
runs and you'll see it clearly.

Getting back on-topic, we don't improve the worst-case behaviour at all
with the page-cache approach, and that is what matters in
rate-guaranteed io.  So the big buffer is still needed, and it might
need to be even bigger than suggested.  If we are *really* unlucky and
everything is not only out of cache but widely separated on disk, we
could be hit with 4 reads at 5 ms each, a total of 20 ms.  If the disk
transfers 16 meg/sec (4 blocks/ms) and we're generating io at 8 meg/sec
(2 blocks/ms), then during that stall a backlog of 40 blocks builds up,
which will take a further 20 ms to clear -- hope we don't hit more
synchronous reads during that time.
Clearly, we can construct a worst case that will overflow any size of
buffer.  Even though the chances of that happening may be very small,
we have so many users that somebody, somewhere will get hit by it.
It's worth having a good think to see if there's a nice way to come up
with a rate guarantee for ext2.  Mapping metadata into the page cache
seems to be heading in the right design direction, but we also need to
think about some organized way of memory-locking the higher-level,
infrequently accessed index blocks to prevent them from contributing to
the worst case.

Another part of the rate guarantee is physical layout on disk: we
*must* update index blocks from time to time.  With streaming writes,
after a while they tend to get very far from where the write activity
is taking place, and seek times start to hurt.  The solution is to
relocate the whole chain of index blocks from time to time, which
sounds a lot like what Tux2 does in normal operation.  This behaviour
could be added to Ext2/Ext3 quite easily.

--
Daniel
* Re: limit on number of kmapped pages
From: Stephen C. Tweedie @ 2001-01-25 18:16 UTC
To: David Wragg; +Cc: Eric W. Biederman, linux-kernel, linux-mm, Stephen Tweedie

Hi,

On Wed, Jan 24, 2001 at 12:35:12AM +0000, David Wragg wrote:

> > And why do the pages need to be kmapped?
>
> They only need to be kmapped while data is being copied into them.

But you only need to kmap one page at a time during the copy.  There is
absolutely no need to copy the whole chunk at once.

--Stephen
* Re: limit on number of kmapped pages
From: David Wragg @ 2001-01-25 23:53 UTC
To: Stephen C. Tweedie; +Cc: Eric W. Biederman, linux-kernel, linux-mm

"Stephen C. Tweedie" <sct@redhat.com> writes:

> On Wed, Jan 24, 2001 at 12:35:12AM +0000, David Wragg wrote:
> > > And why do the pages need to be kmapped?
> >
> > They only need to be kmapped while data is being copied into them.
>
> But you only need to kmap one page at a time during the copy. There
> is absolutely no need to copy the whole chunk at once.

The chunks I'm copying are always smaller than a page; usually they are
a few hundred bytes.  Though because I'm copying into the pages in a
bottom half, I'll have to use kmap_atomic.

After a page is filled, it is put into the page cache.  So the pages
have to be allocated with page_cache_alloc(), hence __GFP_HIGHMEM and
the reason I'm bothering with kmap at all.

David Wragg