public inbox for linux-kernel@vger.kernel.org
* [patch] 4GB I/O, cut three
@ 2001-05-29 14:07 Jens Axboe
  2001-05-29 14:11 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jens Axboe @ 2001-05-29 14:07 UTC (permalink / raw)
  To: Linux Kernel

Hi,

Another day, another version.

Bugs fixed in this version: none
Known bugs in this version: none

In other words, it's perfect of course.

Changes:

- Added ide-dma segment coalescing
- Only print highmem I/O enable info when HIGHMEM is actually set

Please give it a test spin, especially if you have 1GB of RAM or more.
You should see something like this when booting:

hda: enabling highmem I/O
...
SCSI: channel 0, id 0: enabling highmem I/O

depending on drive configuration etc.

Plea to maintainers of the different architectures: could you please add
the arch parts to support this? This includes:

- memory zoning at init time
- page_to_bus
- pci_map_page / pci_unmap_page
- set_bh_sg
- KM_BH_IRQ (for HIGHMEM archs)

I think that's it, feel free to send me questions and (even better)
patches.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] 4GB I/O, cut three
  2001-05-29 14:07 [patch] 4GB I/O, cut three Jens Axboe
@ 2001-05-29 14:11 ` Jens Axboe
  2001-05-30  9:43 ` Mark Hemment
  2001-05-30 13:03 ` Mark Hemment
  2 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2001-05-29 14:11 UTC (permalink / raw)
  To: Linux Kernel

On Tue, May 29 2001, Jens Axboe wrote:
> Hi,
> 
> Another day, another version.

Hrmpf, let me point out where it is too...

*.kernel.org/pub/linux/kernel/axboe/patches/2.4.5/

the README there details what is in each patch, or you can grab
block-highmem-all-3 which has it all.

-- 
Jens Axboe



* Re: [patch] 4GB I/O, cut three
  2001-05-29 14:07 [patch] 4GB I/O, cut three Jens Axboe
  2001-05-29 14:11 ` Jens Axboe
@ 2001-05-30  9:43 ` Mark Hemment
  2001-05-30  9:55   ` Jens Axboe
  2001-05-30 13:03 ` Mark Hemment
  2 siblings, 1 reply; 18+ messages in thread
From: Mark Hemment @ 2001-05-30  9:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Hi Jens,

  I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
modified qlogic fibre channel driver with 32 disks hanging off it, without
any problems.  The test used was SpecFS 2.0.

  Performance is definitely up - but I can't give an exact number, as the
run with this patch was compiled with -fno-omit-frame-pointer for debugging
any probs.

  I did change the patch so that bounce-pages always come from the NORMAL
zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
I'm not 100% sure the VM is capable of keeping the zones it already has
balanced - and adding another one might break the camel's back.  But as the
test box has 4GB, it wasn't bouncing anyway.

Mark


On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
> 
> Bugs fixed in this version: none
> Known bugs in this version: none
> 
> In other words, it's perfect of course.
> 
> Changes:
> 
> - Added ide-dma segment coalescing
> - Only print highmem I/O enable info when HIGHMEM is actually set
> 
> Please give it a test spin, especially if you have 1GB of RAM or more.
> You should see something like this when booting:
> 
> hda: enabling highmem I/O
> ...
> SCSI: channel 0, id 0: enabling highmem I/O
> 
> depending on drive configuration etc.
> 
> Plea to maintainers of the different architectures: could you please add
> the arch parts to support this? This includes:
> 
> - memory zoning at init time
> - page_to_bus
> - pci_map_page / pci_unmap_page
> - set_bh_sg
> - KM_BH_IRQ (for HIGHMEM archs)
> 
> I think that's it, feel free to send me questions and (even better)
> patches.



* Re: [patch] 4GB I/O, cut three
  2001-05-30  9:43 ` Mark Hemment
@ 2001-05-30  9:55   ` Jens Axboe
  2001-05-30 10:59     ` Mark Hemment
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jens Axboe @ 2001-05-30  9:55 UTC (permalink / raw)
  To: Mark Hemment; +Cc: Linux Kernel, riel, andrea

On Wed, May 30 2001, Mark Hemment wrote:
> Hi Jens,
> 
>   I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> modified qlogic fibre channel driver with 32 disks hanging off it, without
> any problems.  The test used was SpecFS 2.0.

Cool, could you send me the qlogic diff? It's the one-liner can_dma32
change I'm interested in, I'm just not sure what driver you used :-)
I'll add that to the patch then. Basically all the PCI cards should
work, I'm just being cautious and only enabling highmem I/O for the ones
that have been tested.

>   Performance is definitely up - but I can't give an exact number, as the
> run with this patch was compiled with -fno-omit-frame-pointer for debugging
> any probs.

Good

>   I did change the patch so that bounce-pages always come from the NORMAL
> zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
> I'm not 100% sure the VM is capable of keeping the zones it already has
> balanced - and adding another one might break the camel's back.  But as the
> test box has 4GB, it wasn't bouncing anyway.

You are right, this is definitely something that needs checking. I
really want this to work though. Rik, Andrea? Will the balancing handle
the extra zone?

-- 
Jens Axboe



* Re: [patch] 4GB I/O, cut three
  2001-05-30  9:55   ` Jens Axboe
@ 2001-05-30 10:59     ` Mark Hemment
  2001-05-30 14:26       ` andrea
  2001-05-30 14:00     ` Andrea Arcangeli
  2001-05-30 18:36     ` Rik van Riel
  2 siblings, 1 reply; 18+ messages in thread
From: Mark Hemment @ 2001-05-30 10:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel, Rik van Riel, andrea

On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> > 
> >   I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it, without
> > any problems.  The test used was SpecFS 2.0.
> 
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-)

  The qlogic driver is the one from;
	http://www.feral.com/isp.html
I find this much more stable than the one already in the kernel.
  It did just need the one-liner change, but as the driver isn't in the
kernel there isn't much point adding the change to your patch. :)


> >   I did change the patch so that bounce-pages always come from the NORMAL
> > zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
> > I'm not 100% sure the VM is capable of keeping the zones it already has
> > balanced - and adding another one might break the camel's back.  But as the
> > test box has 4GB, it wasn't bouncing anyway.
> 
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing handle
> the extra zone?

  In theory it should do - ie. there isn't anything to stop it.

  With NFS loads, over a ported VxFS filesystem, I do see some problems
between the NORMAL and HIGH zones.  Thinking about it, ZONE_DMA32
shouldn't make this any worse.

  Rik, Andrea, quick description of a balancing problem;
	Consider a VM which is under load (but not stressed), such that
	all zone free-page pools are between their MIN and LOW marks, with
	pages in the inactive_clean lists.

	The NORMAL zone has non-zero page order allocations thrown at
	it.  This causes __alloc_pages() to reap pages from the NORMAL
	inactive_clean list until the required buddy is built.  The blind
	reaping causes the NORMAL zone to have a large number of free pages
	(greater than ->pages_low).

	Now, when HIGHMEM allocations come in (for page cache pages), they
	skip the HIGH zone and use the NORMAL zone (as it now has plenty
	of free pages) - the code at the top of __alloc_pages(), which
	checks against ->pages_low.

	But the NORMAL zone is usually under more pressure than the HIGH
	zone - as many more allocations need ready-mapped memory.  This
	causes the page-cache pages from the NORMAL zone to come under
	more pressure, and to be "re-cycled" quicker than page-cache pages
	in the HIGHMEM zone.

  OK, we shouldn't be throwing too many non-zero page allocations at
__alloc_pages(), but it does happen.
  Also, the problem isn't as bad as it first looks - HIGHMEM page-cache
pages do get "recycled" (reclaimed), but there is a slight imbalance.

Mark



* Re: [patch] 4GB I/O, cut three
  2001-05-29 14:07 [patch] 4GB I/O, cut three Jens Axboe
  2001-05-29 14:11 ` Jens Axboe
  2001-05-30  9:43 ` Mark Hemment
@ 2001-05-30 13:03 ` Mark Hemment
  2001-05-30 13:24   ` Jens Axboe
  2 siblings, 1 reply; 18+ messages in thread
From: Mark Hemment @ 2001-05-30 13:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel

Hi again, :)

On Tue, 29 May 2001, Jens Axboe wrote:
> Another day, another version.
> 
> Bugs fixed in this version: none
> Known bugs in this version: none
> 
> In other words, it's perfect of course.

  With the scsi-high patch, I'm not sure about the removal of the line
from __scsi_end_request();

	req->buffer = bh->b_data;

  A requeued request is not always processed immediately, so new
buffer-heads arriving at the block-layer can be merged against it.  A
requeued request is placed at the head of a request list, so
nothing can merge with it - but what if multiple requests are
requeued on the same queue?

  In Linus's tree, requests requeued via the SCSI layer can cause problems
(corruption).  I sent out a patch to cover this a few months back, which
got picked up by Alan (it's in the -ac series - see the changes to
scsi_lib.c and scsi_merge.c) but no one posted any feedback.
  I've included some of the original message below.

Mark


------------------------------------------------------------------
From markhe@veritas.com Sat Mar 31 16:07:14 2001 +0100
Date: Sat, 31 Mar 2001 16:07:13 +0100 (BST)
From: Mark Hemment <markhe@veritas.com>
Subject: [PATCH] Possible SCSI + block-layer bugs

Hi,

  I've never seen these trigger, but they look theoretically possible.

  When processing the completion of a SCSI request in a bottom-half,
__scsi_end_request() can find all the buffers associated with the request
haven't been completed (ie. leftovers).

  One question is; can this ever happen?
  If it can't then the code should be removed from __scsi_end_request(),
if it can happen then there appear to be a few problems;

  The request is re-queued to the block layer via 
scsi_queue_next_request(), which uses the "special" pointer in the request
structure to remember the Scsi_Cmnd associated with the request.  The SCSI
request function is then called, but doesn't guarantee to immediately
process the re-queued request even though it was added at the head (say,
the queue has become plugged).  This can trigger two possible bugs.

  The first is that __scsi_end_request() doesn't decrement the
hard_nr_sectors count in the request.  As the request is back on the
queue, it is possible for newly arriving buffer-heads to merge with the
heads already hanging off the request.  This merging uses the
hard_nr_sectors when calculating both the merged hard_nr_sectors and
nr_sectors counts.
  As the request is at the head, only back-merging can occur, but if
__scsi_end_request() triggers another uncompleted request to be re-queued,
it is possible to get front merging as well.

  The merging of a re-queued request looks safe, except for the
hard_nr_sectors.  This patch corrects the hard_nr_sectors accounting.


  The second bug is from request merging in attempt_merge().

  For a re-queued request, the request structure is the one embedded in
the Scsi_Cmnd (which is a copy of the request taken in the 
scsi_request_fn).
  In attempt_merge(), q->merge_requests_fn() is called to see if the requests
are allowed to merge.  __scsi_merge_requests_fn() checks number of
segments, etc, but doesn't check if one of the requests is a re-queued one
(ie. no test against ->special).
  This can lead to attempt_merge() releasing the embedded request
structure (which, as an exact copy, has the ->q set, so to
blkdev_release_request() it looks like a request which originated from
the block layer).  This isn't too healthy.

  The fix here is to add a check in __scsi_merge_requests_fn() to check
for ->special being non-NULL.



* Re: [patch] 4GB I/O, cut three
  2001-05-30 13:03 ` Mark Hemment
@ 2001-05-30 13:24   ` Jens Axboe
  2001-05-30 13:37     ` Mark Hemment
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2001-05-30 13:24 UTC (permalink / raw)
  To: Mark Hemment; +Cc: Linux Kernel

On Wed, May 30 2001, Mark Hemment wrote:
> Hi again, :)
> 
> On Tue, 29 May 2001, Jens Axboe wrote:
> > Another day, another version.
> > 
> > Bugs fixed in this version: none
> > Known bugs in this version: none
> > 
> > In other words, it's perfect of course.
> 
>   With the scsi-high patch, I'm not sure about the removal of the line
> from __scsi_end_request();
> 
> 	req->buffer = bh->b_data;

Why?

>   A requeued request is not always processed immediately, so new
> buffer-heads arriving at the block-layer can be merged against it.  A
> requeued request is placed at the head of a request list, so
> nothing can merge with it - but what if multiple requests are
> requeued on the same queue?

You forget that SCSI is not head-active, so there can indeed be merges
against a request that was re-added to the queue list.

>   When processing the completion of a SCSI request in a bottom-half,
> __scsi_end_request() can find all the buffers associated with the request
> haven't been completed (ie. leftovers).
> 
>   One question is; can this ever happen?

Yes it can happen.

>   The request is re-queued to the block layer via 
> scsi_queue_next_request(), which uses the "special" pointer in the request
> structure to remember the Scsi_Cmnd associated with the request.  The SCSI
> request function is then called, but doesn't guarantee to immediately
> process the re-queued request even though it was added at the head (say,
> the queue has become plugged).  This can trigger two possible bugs.
> 
>   The first is that __scsi_end_request() doesn't decrement the
> hard_nr_sectors count in the request.  As the request is back on the
> queue, it is possible for newly arriving buffer-heads to merge with the
> heads already hanging off the request.  This merging uses the
> hard_nr_sectors when calculating both the merged hard_nr_sectors and
> nr_sectors counts.

Right, that looks like a bug. I would prefer SCSI using
end_that_request_first here actually.

>   As the request is at the head, only back-merging can occur, but if
> __scsi_end_request() triggers another uncompleted request to be re-queued,
> it is possible to get front merging as well.

There can be front merges too. If a head is active, then no merging can
occur. But for SCSI, the front request must always be in a sane state,
or bad things can happen, like you describe.

>   The merging of a re-queued request looks safe, except for the
> hard_nr_sectors.  This patch corrects the hard_nr_sectors accounting.

Right

>   The second bug is from request merging in attempt_merge().
> 
>   For a re-queued request, the request structure is the one embedded in
> the Scsi_Cmnd (which is a copy of the request taken in the 
> scsi_request_fn).
>   In attempt_merge(), q->merge_requests_fn() is called to see if the requests
> are allowed to merge.  __scsi_merge_requests_fn() checks number of
> segments, etc, but doesn't check if one of the requests is a re-queued one
> (ie. no test against ->special).
>   This can lead to attempt_merge() releasing the embedded request
> structure (which, as an exact copy, has the ->q set, so to
> blkdev_release_request() it looks like a request which originated from
> the block layer).  This isn't too healthy.
> 
>   The fix here is to add a check in __scsi_merge_requests_fn() to check
> for ->special being non-NULL.

How about just adding 

	if (req->cmd != next->cmd
	    || req->rq_dev != next->rq_dev
	    || req->nr_sectors + next->nr_sectors > q->max_sectors
	    || next->sem || req->special)
                return;

ie check for special too, that would make sense to me. Either way would
work, but I'd rather make this explicit in the block layer that 'not
normal' requests are left alone. That includes stuff with the sem set,
or special.

-- 
Jens Axboe



* Re: [patch] 4GB I/O, cut three
  2001-05-30 13:24   ` Jens Axboe
@ 2001-05-30 13:37     ` Mark Hemment
  2001-05-30 13:40       ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Mark Hemment @ 2001-05-30 13:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel


On Wed, 30 May 2001, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> >   This can lead to attempt_merge() releasing the embedded request
> > structure (which, as an exact copy, has the ->q set, so to
> > blkdev_release_request() it looks like a request which originated from
> > the block layer).  This isn't too healthy.
> > 
> >   The fix here is to add a check in __scsi_merge_requests_fn() to check
> > for ->special being non-NULL.
> 
> How about just adding 
> 
> 	if (req->cmd != next->cmd
> 	    || req->rq_dev != next->rq_dev
> 	    || req->nr_sectors + next->nr_sectors > q->max_sectors
> 	    || next->sem || req->special)
>                 return;
> 
> ie check for special too, that would make sense to me. Either way would
> work, but I'd rather make this explicit in the block layer that 'not
> normal' requests are left alone. That includes stuff with the sem set,
> or special.


  Yes, that is an equivalent fix.

  In the original patch I wanted to keep the change local (ie. in the SCSI
layer).  Pushing the check up to the generic block layer makes sense.

  Are you going to push this change to Linus, or should I?
  I'm assuming the other scsi-layer changes in Alan's tree will eventually
be pushed.

Mark



* Re: [patch] 4GB I/O, cut three
  2001-05-30 13:37     ` Mark Hemment
@ 2001-05-30 13:40       ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2001-05-30 13:40 UTC (permalink / raw)
  To: Mark Hemment; +Cc: Linux Kernel

On Wed, May 30 2001, Mark Hemment wrote:
> On Wed, 30 May 2001, Jens Axboe wrote:
> > On Wed, May 30 2001, Mark Hemment wrote:
> > >   This can lead to attempt_merge() releasing the embedded request
> > > structure (which, as an exact copy, has the ->q set, so to
> > > blkdev_release_request() it looks like a request which originated from
> > > the block layer).  This isn't too healthy.
> > > 
> > >   The fix here is to add a check in __scsi_merge_requests_fn() to check
> > > for ->special being non-NULL.
> > 
> > How about just adding 
> > 
> > 	if (req->cmd != next->cmd
> > 	    || req->rq_dev != next->rq_dev
> > 	    || req->nr_sectors + next->nr_sectors > q->max_sectors
> > 	    || next->sem || req->special)
> >                 return;
> > 
> > ie check for special too, that would make sense to me. Either way would
> > work, but I'd rather make this explicit in the block layer that 'not
> > normal' requests are left alone. That includes stuff with the sem set,
> > or special.
> 
> 
>   Yes, that is an equivalent fix.
> 
>   In the original patch I wanted to keep the change local (ie. in the SCSI
> layer).  Pushing the check up to the generic block layer makes sense.

Ok, so we agree.

>   Are you going to push this change to Linus, or should I?
>   I'm assuming the other scsi-layer changes in Alan's tree will eventually
> be pushed.

I'll push it, I'll do the end_that_request_first thing too.

-- 
Jens Axboe



* Re: [patch] 4GB I/O, cut three
  2001-05-30  9:55   ` Jens Axboe
  2001-05-30 10:59     ` Mark Hemment
@ 2001-05-30 14:00     ` Andrea Arcangeli
  2001-05-30 14:06       ` Jens Axboe
  2001-05-30 18:36     ` Rik van Riel
  2 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2001-05-30 14:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mark Hemment, Linux Kernel, riel, andrea

[ my usual email is offline at the moment, please CC to andrea@e-mind.com
  for anything urgent until the problem is fixed ]

On Wed, May 30, 2001 at 11:55:38AM +0200, Jens Axboe wrote:
> On Wed, May 30 2001, Mark Hemment wrote:
> > Hi Jens,
> > 
> >   I ran this (well, cut-two) on a 4-way box with 4GB of memory and a
> > modified qlogic fibre channel driver with 32 disks hanging off it, without
> > any problems.  The test used was SpecFS 2.0.
> 
> Cool, could you send me the qlogic diff? It's the one-liner can_dma32
> change I'm interested in, I'm just not sure what driver you used :-)
> I'll add that to the patch then. Basically all the PCI cards should
> work, I'm just being cautious and only enabling highmem I/O for the ones
> that have been tested.
> 
> >   Performance is definitely up - but I can't give an exact number, as the
> > run with this patch was compiled with -fno-omit-frame-pointer for debugging
> > any probs.
> 
> Good
> 
> >   I did change the patch so that bounce-pages always come from the NORMAL
> > zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
> > I'm not 100% sure the VM is capable of keeping the zones it already has
> > balanced - and adding another one might break the camel's back.  But as the
> > test box has 4GB, it wasn't bouncing anyway.
> 
> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing handle
> the extra zone?

The bounces can come from ZONE_NORMAL without problems, however the
ZONE_DMA32 way is fine too, but yes probably it isn't needed in real
life unless you do a huge amount of I/O at the same time. If you want
to reduce the amount of changes you can defer the zone_dma32 patch and
possibly plug it in later.

Andrea


* Re: [patch] 4GB I/O, cut three
  2001-05-30 14:00     ` Andrea Arcangeli
@ 2001-05-30 14:06       ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2001-05-30 14:06 UTC (permalink / raw)
  To: andrea; +Cc: Jens Axboe, Mark Hemment, Linux Kernel, riel

On Wed, May 30 2001, Andrea Arcangeli wrote:
> > >   I did change the patch so that bounce-pages always come from the NORMAL
> > > zone, hence the ZONE_DMA32 zone isn't needed.  I avoided the new zone, as
> > > I'm not 100% sure the VM is capable of keeping the zones it already has
> > > balanced - and adding another one might break the camels back.  But as the
> > > test box has 4GB, it wasn't bouncing anyway.
> > 
> > You are right, this is definitely something that needs checking. I
> > really want this to work though. Rik, Andrea? Will the balancing handle
> > the extra zone?
> 
> The bounces can come from ZONE_NORMAL without problems, however the

Of course

> ZONE_DMA32 way is fine too, but yes probably it isn't needed in real
> life unless you do a huge amount of I/O at the same time. If you want

It's not strictly needed, but on a PAE-enabled x86 it does buy us 3
extra gigs to do I/O from.

> to reduce the amount of changes you can defer the zone_dma32 patch and
> possibly plug it in later.

Yes, I did modular patches for this reason.

Thanks!

-- 
Jens Axboe



* Re: [patch] 4GB I/O, cut three
  2001-05-30 10:59     ` Mark Hemment
@ 2001-05-30 14:26       ` andrea
  2001-05-30 18:42         ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: andrea @ 2001-05-30 14:26 UTC (permalink / raw)
  To: Mark Hemment; +Cc: Jens Axboe, Linux Kernel, Rik van Riel

On Wed, May 30, 2001 at 11:59:50AM +0100, Mark Hemment wrote:
> 	Now, when HIGHMEM allocations come in (for page cache pages), they
> 	skip the HIGH zone and use the NORMAL zone (as it now has plenty
> 	of free pages) - the code at the top of __alloc_pages(), which
> 	checks against ->pages_low.

btw, I think such a heuristic is horribly broken ;), the highmem zone
simply needs to be balanced if it is under the pages_low mark, just
skipping it and falling back into the normal zone that happens to be
above the low mark is the wrong thing to do.

>   Also, the problem isn't as bad as it first looks - HIGHMEM page-cache
> pages do get "recycled" (reclaimed), but there is a slight imbalance.

there will always be some imbalance unless all allocations were
capable of highmem (which will never happen). The only thing we can do
is to optimize the zone usage so we won't run out of normal pages unless
there was a good reason. Once we run out of normal pages we'll simply
return NULL and the reserved pool of highmem bounces will be used
instead (other callers will behave differently).

Andrea


* Re: [patch] 4GB I/O, cut three
  2001-05-30  9:55   ` Jens Axboe
  2001-05-30 10:59     ` Mark Hemment
  2001-05-30 14:00     ` Andrea Arcangeli
@ 2001-05-30 18:36     ` Rik van Riel
  2 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2001-05-30 18:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mark Hemment, Linux Kernel, andrea

On Wed, 30 May 2001, Jens Axboe wrote:

> You are right, this is definitely something that needs checking. I
> really want this to work though. Rik, Andrea? Will the balancing
> handle the extra zone?

In as far as it handles balancing the current zones,
it'll also work with one more. In places where it's
currently broken it will probably also break with one
extra zone, though the fact that the DMA32 zone takes
the pressure off the NORMAL zone might actually help.

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [patch] 4GB I/O, cut three
  2001-05-30 14:26       ` andrea
@ 2001-05-30 18:42         ` Rik van Riel
  2001-05-30 18:57           ` Andrea Arcangeli
  2001-05-30 18:57           ` Yoann Vandoorselaere
  0 siblings, 2 replies; 18+ messages in thread
From: Rik van Riel @ 2001-05-30 18:42 UTC (permalink / raw)
  To: andrea; +Cc: Mark Hemment, Jens Axboe, Linux Kernel

On Wed, 30 May 2001 andrea@e-mind.com wrote:

> btw, I think such a heuristic is horribly broken ;), the highmem zone
> simply needs to be balanced if it is under the pages_low mark, just
> skipping it and falling back into the normal zone that happens to be
> above the low mark is the wrong thing to do.

2.3.51 did this, we all know the result.

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [patch] 4GB I/O, cut three
  2001-05-30 18:42         ` Rik van Riel
@ 2001-05-30 18:57           ` Andrea Arcangeli
  2001-05-30 18:57           ` Yoann Vandoorselaere
  1 sibling, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2001-05-30 18:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Mark Hemment, Jens Axboe, Linux Kernel

On Wed, May 30, 2001 at 03:42:51PM -0300, Rik van Riel wrote:
> On Wed, 30 May 2001 andrea@e-mind.com wrote:
> 
> > btw, I think such a heuristic is horribly broken ;), the highmem zone
> > simply needs to be balanced if it is under the pages_low mark, just
> > skipping it and falling back into the normal zone that happens to be
> > above the low mark is the wrong thing to do.
> 
> 2.3.51 did this, we all know the result.

I've no idea what 2.3.51 did, but I was obviously wrong about
that. Forget what I said above.

Andrea


* Re: [patch] 4GB I/O, cut three
  2001-05-30 18:42         ` Rik van Riel
  2001-05-30 18:57           ` Andrea Arcangeli
@ 2001-05-30 18:57           ` Yoann Vandoorselaere
  2001-05-30 19:18             ` Andrea Arcangeli
  1 sibling, 1 reply; 18+ messages in thread
From: Yoann Vandoorselaere @ 2001-05-30 18:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: andrea, Mark Hemment, Jens Axboe, Linux Kernel

Rik van Riel <riel@conectiva.com.br> writes:

> On Wed, 30 May 2001 andrea@e-mind.com wrote:
> 
> > btw, I think such a heuristic is horribly broken ;), the highmem zone
> > simply needs to be balanced if it is under the pages_low mark, just
> > skipping it and falling back into the normal zone that happens to be
> > above the low mark is the wrong thing to do.
> 
> 2.3.51 did this, we all know the result.

Just a note, 
I remember the 2.3.51 kernel as the most usable kernel I ever used,
as far as the VM goes.

-- 
Yoann Vandoorselaere | C makes it easy to shoot yourself in the foot. C++ makes
MandrakeSoft         | it harder, but when you do, it blows away your whole
                     | leg. - Bjarne Stroustrup


* Re: [patch] 4GB I/O, cut three
  2001-05-30 18:57           ` Yoann Vandoorselaere
@ 2001-05-30 19:18             ` Andrea Arcangeli
  2001-05-30 19:23               ` Rik van Riel
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2001-05-30 19:18 UTC (permalink / raw)
  To: Yoann Vandoorselaere
  Cc: Rik van Riel, andrea, Mark Hemment, Jens Axboe, Linux Kernel

On Wed, May 30, 2001 at 08:57:50PM +0200, Yoann Vandoorselaere wrote:
> I remember the 2.3.51 kernel as the most usable kernel I ever used,
> as far as the VM goes.

I also don't remember anything strange in that kernel about the VM (I
instead remember well the VM breakage introduced in 2.3.99-pre).

Regardless of what 2.3.51 was doing, falling back into the lower
zones before starting the balancing is fine.

Andrea


* Re: [patch] 4GB I/O, cut three
  2001-05-30 19:18             ` Andrea Arcangeli
@ 2001-05-30 19:23               ` Rik van Riel
  0 siblings, 0 replies; 18+ messages in thread
From: Rik van Riel @ 2001-05-30 19:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Yoann Vandoorselaere, andrea, Mark Hemment, Jens Axboe,
	Linux Kernel

On Wed, 30 May 2001, Andrea Arcangeli wrote:
> On Wed, May 30, 2001 at 08:57:50PM +0200, Yoann Vandoorselaere wrote:
> > I remember the 2.3.51 kernel as the most usable kernel I ever used,
> > as far as the VM goes.
> 
> I also don't remember anything strange in that kernel about the VM (I
> instead remember well the VM breakage introduced in 2.3.99-pre).
> 
> Regardless of what 2.3.51 was doing, falling back into the lower
> zones before starting the balancing is fine.

The problem with 2.3.51 was that it started balancing
the HIGHMEM zone before falling back.

On a 1GB system this led not only to the system starting
to swap as soon as the 128MB highmem zone was filled up,
it also resulted in the other 900MB being essentially
unused.

Having your 1GB system running as if it had 128MB definitely
can be classified as Not Fun.

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


