public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
@ 2003-06-19 17:36 David Mosberger
  2003-06-19 20:48 ` Jesse Barnes
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-19 17:36 UTC (permalink / raw)
  To: linux-ia64

Since we enabled block-layer-level merging of buffers via
BIO_VMERGE_BOUNDARY (in asm-ia64/io.h), I started to notice SCSI
errors on one of our machines (an rx5670 with 16GB of memory).  If I
set BIO_VMERGE_BOUNDARY to 0, the errors disappear.  My suspicion is
that something is wrong in the sba_iommu.c code with respect to block
merging.  Unfortunately, I won't be able to investigate this further
at the moment, since I'm trying to get out a 2.5.72 patch and will be
on vacation on and off over the next 2 weeks.  One thing that would be
interesting to know is whether similar SCSI errors show up on SN2
machines (assuming the SGI guys can get the latest 2.5 bits to boot,
of course).  This should tell us whether the bug is really in the I/O
MMU or somewhere else.

For reference, I attached a couple of sample error message below.

Can someone look into this?

	--david

Jun 19 06:31:15 magma kernel: SCSI error : <3 0 0 0> return code = 0x70000
Jun 19 06:31:15 magma kernel: end_request: I/O error, dev sda, sector 2573060
Jun 19 06:31:15 magma kernel: SCSI error : <3 0 0 0> return code = 0x70000
Jun 19 06:31:15 magma kernel: end_request: I/O error, dev sda, sector 2573068
Jun 19 06:31:15 magma kernel: SCSI error : <3 0 0 0> return code = 0x70000
Jun 19 06:31:15 magma kernel: end_request: I/O error, dev sda, sector 2573076

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
@ 2003-06-19 20:48 ` Jesse Barnes
  2003-06-19 21:47 ` Grant Grundler
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Jesse Barnes @ 2003-06-19 20:48 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jun 19, 2003 at 10:36:34AM -0700, David Mosberger wrote:
> machines (assuming the SGI guys can get the latest 2.5 bits to boot,
> of course).  This should tell us whether the bug is really in the I/O

Almost there, I promise!  Once the discontig stuff is in, the only piece
left will be PCI initialization.  We used to drive it via pcibios_init
and platform_pci_fixup, but now that the ia64 PCI layer is totally ACPI
driven, I have to figure out something else.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
  2003-06-19 20:48 ` Jesse Barnes
@ 2003-06-19 21:47 ` Grant Grundler
  2003-06-23  7:11 ` Grant Grundler
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-19 21:47 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jun 19, 2003 at 10:36:34AM -0700, David Mosberger wrote:
> My suspicion is that something is wrong in the sba_iommu.c code
> with respect to block merging.

That's probably a good place to start. The DMA block merging code
is the most fragile peice in sba_iommu.c. I still have fixes from
Peter Chubb on fixing some of that in the partial rewrite I started
last year. I might have some time short-term to revisit that a bit.

grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
  2003-06-19 20:48 ` Jesse Barnes
  2003-06-19 21:47 ` Grant Grundler
@ 2003-06-23  7:11 ` Grant Grundler
  2003-06-23 15:04 ` Alex Williamson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-23  7:11 UTC (permalink / raw)
  To: linux-ia64

On Thu, Jun 19, 2003 at 10:36:34AM -0700, David Mosberger wrote:
> If I set BIO_VMERGE_BOUNDARY to 0, the errors disappear.

Hmm...convoluted code.
./arch/ia64/kernel/setup.c:unsigned long ia64_max_iommu_merge_mask = ~0UL;

arch/ia64/hp/common/sba_iommu.c:
	#define IOVP_MASK	PAGE_MASK
...
        if ((long) ~IOVP_MASK > (long) ia64_max_iommu_merge_mask)
			ia64_max_iommu_merge_mask = ~IOVP_MASK;


AFAICT, that's the equivalent to (PAGE_SIZE-1) > (-1) on the first round.
Or did I missed a tilde?

I'd think that ia64 wants:

	#define BIO_VMERGE_BOUNDARY (~ia64_max_iommu_merge_mask + 1)
or
	#define BIO_VMERGE_BOUNDARY PAGE_SIZE

(If I've accounted for tildes properly.)

It looks like a few other arches do something like that:
./include/asm-ppc64/io.h:#define BIO_VMERGE_BOUNDARY    4096
./include/asm-sparc64/io.h:#define BIO_VMERGE_BOUNDARY  8192

(BTW, PARISC probably wants to use 4K pages as well).


> My suspicion is that something is wrong in the sba_iommu.c code
> with respect to block merging.

There might well be. It's easy to get the block merging code
wrong when merging random chunks which can be bigger or smaller
than PAGE_SIZE. My gut feeling this can all get ripped out of
2.5 sba_iommu.c if the bio code is going to merge for us. That
means pci_map_sg() support only needs to map each address/len
pair and forget trying to merge them.

hth,
grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (2 preceding siblings ...)
  2003-06-23  7:11 ` Grant Grundler
@ 2003-06-23 15:04 ` Alex Williamson
  2003-06-23 16:52 ` David Mosberger
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2003-06-23 15:04 UTC (permalink / raw)
  To: linux-ia64

Grant Grundler wrote:
> 
> There might well be. It's easy to get the block merging code
> wrong when merging random chunks which can be bigger or smaller
> than PAGE_SIZE. My gut feeling this can all get ripped out of
> 2.5 sba_iommu.c if the bio code is going to merge for us. That
> means pci_map_sg() support only needs to map each address/len
> pair and forget trying to merge them.

   I don't think the bio code is doing a sufficient level of
merging to rip the coalescing out of sba_iommu.  Running w/
some debugging turned on and the bio code enabled, I'm
regularly seeing long sg lists get colapsed (99 to 1 is not
terribly uncommon).

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (3 preceding siblings ...)
  2003-06-23 15:04 ` Alex Williamson
@ 2003-06-23 16:52 ` David Mosberger
  2003-06-23 20:24 ` Grant Grundler
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-23 16:52 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Mon, 23 Jun 2003 00:11:19 -0700, Grant Grundler <iod00d@hp.com> said:

  Grant> Hmm...convoluted code.  ./arch/ia64/kernel/setup.c:unsigned
  Grant> long ia64_max_iommu_merge_mask = ~0UL;

  Grant> arch/ia64/hp/common/sba_iommu.c: #define IOVP_MASK PAGE_MASK
  Grant> ...  if ((long) ~IOVP_MASK > (long)
  Grant> ia64_max_iommu_merge_mask) ia64_max_iommu_merge_mask   Grant> ~IOVP_MASK;

  Grant> AFAICT, that's the equivalent to (PAGE_SIZE-1) > (-1) on the
  Grant> first round.  Or did I missed a tilde?

No, that's what it is.

  Grant> I'd think that ia64 wants:

  Grant> 	#define BIO_VMERGE_BOUNDARY (~ia64_max_iommu_merge_mask + 1)
  Grant>     or #define BIO_VMERGE_BOUNDARY PAGE_SIZE

No, BIO_VMERGE_BOUNDARY must be 0xffffffffffffffff if there is no
hardware I/O MMU present (i.e., the I/O MMU page size is 2^64) and
PAGE_SIZE if an I/O MMU is present (which can support PAGE_SIZE
pages).

  Grant> It's easy to get the block merging code wrong when merging
  Grant> random chunks which can be bigger or smaller than
  Grant> PAGE_SIZE. My gut feeling this can all get ripped out of 2.5
  Grant> sba_iommu.c if the bio code is going to merge for us. That
  Grant> means pci_map_sg() support only needs to map each address/len
  Grant> pair and forget trying to merge them.

No, you got this backwards: the bio-level code _assumes_ that
discontiguous physical pages can be remapped linearly by the I/O MMU
code.  If the I/O MMU code doesn't actually do the merging, the kernel
will fall flat on its face.

	--david

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (4 preceding siblings ...)
  2003-06-23 16:52 ` David Mosberger
@ 2003-06-23 20:24 ` Grant Grundler
  2003-06-23 20:41 ` David Mosberger
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-23 20:24 UTC (permalink / raw)
  To: linux-ia64

On Mon, Jun 23, 2003 at 09:52:06AM -0700, David Mosberger wrote:
> No, BIO_VMERGE_BOUNDARY must be 0xffffffffffffffff if there is no
> hardware I/O MMU present (i.e., the I/O MMU page size is 2^64) and
> PAGE_SIZE if an I/O MMU is present (which can support PAGE_SIZE
> pages).

yes - that's what I thought. I'm just having problems counting tildes.

#define BIO_VMERGE_BOUNDARY   (0UL)//(ia64_max_iommu_merge_mask + 1)

Which for sba_iommu should have been:
   (ia64_max_iommu_merge_mask + 1)
		= (~IOVP_MASK + 1)
		= (~PAGE_MASK + 1)
		= (~(~(PAGE_SIZE-1)) + 1)
		= PAGE_SIZE

(I hope I have that right now)

> No, you got this backwards:

:^(

> the bio-level code _assumes_ that
> discontiguous physical pages can be remapped linearly by the I/O MMU
> code.  If the I/O MMU code doesn't actually do the merging, the kernel
> will fall flat on its face.

uhmm...why does the bio-level code care what can/can't be merged if
it's not going to do it?

Seems like a waste of CPU cycles to walk the sg_list an extra time
in the IOMMU code to figure what can (and will) be merged.  My gut
feeling is bio-level code doesn't know enough to do it efficiently
and the IOMMU code needs to walk the list at least once to program
the HW (effectively twice if sba_iommu wants to attempt coalescing).

thanks,
grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (5 preceding siblings ...)
  2003-06-23 20:24 ` Grant Grundler
@ 2003-06-23 20:41 ` David Mosberger
  2003-06-23 22:05 ` Grant Grundler
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-23 20:41 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Mon, 23 Jun 2003 13:24:03 -0700, Grant Grundler <iod00d@hp.com> said:

  Grant> On Mon, Jun 23, 2003 at 09:52:06AM -0700, David Mosberger
  Grant> wrote:
  >> No, BIO_VMERGE_BOUNDARY must be 0xffffffffffffffff if there is no
  >> hardware I/O MMU present (i.e., the I/O MMU page size is 2^64)
  >> and PAGE_SIZE if an I/O MMU is present (which can support
  >> PAGE_SIZE pages).

  Grant> yes - that's what I thought. I'm just having problems
  Grant> counting tildes.

  Grant> #define BIO_VMERGE_BOUNDARY (0UL)//(ia64_max_iommu_merge_mask
  Grant> + 1)

  Grant> Which for sba_iommu should have been:
  Grant> (ia64_max_iommu_merge_mask + 1) = (~IOVP_MASK + 1)   Grant> (~PAGE_MASK + 1) = (~(~(PAGE_SIZE-1)) + 1) = PAGE_SIZE

  Grant> (I hope I have that right now)

Yes.  You can thank Linus for the "inverted" sense of PAGE_MASK
(though it does make sense in the VM layer).

  >> the bio-level code _assumes_ that discontiguous physical pages
  >> can be remapped linearly by the I/O MMU code.  If the I/O MMU
  >> code doesn't actually do the merging, the kernel will fall flat
  >> on its face.

  Grant> uhmm...why does the bio-level code care what can/can't be
  Grant> merged if it's not going to do it?

  Grant> Seems like a waste of CPU cycles to walk the sg_list an extra
  Grant> time in the IOMMU code to figure what can (and will) be
  Grant> merged.  My gut feeling is bio-level code doesn't know enough
  Grant> to do it efficiently and the IOMMU code needs to walk the
  Grant> list at least once to program the HW (effectively twice if
  Grant> sba_iommu wants to attempt coalescing).

Well, I'm not a disk person (if it doesn't fit in memory, you don't
have enough of it! ;-), but the basic assumption is that it is
worthwhile to spend a few CPU cycles on forming fewer, but larger disk
requests whenever possible.  Intuitively, that certainly makes sense
to me, though I haven't seen any performance numbers on how much of a
difference this can make.  You'd certainly need a disk-heavy workload
to see any difference.  Perhaps Rohit could try it on TPC-C (once the
merging is working)?

The decision has to be split across BIO and I/O MMU: only the
BIO-level knows what to do if merging _cannot_ take place and only the
I/O MMU code knows how to map physically discontiguous pages linearly
into I/O MMU space.

	--david

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (6 preceding siblings ...)
  2003-06-23 20:41 ` David Mosberger
@ 2003-06-23 22:05 ` Grant Grundler
  2003-06-23 22:14 ` David Mosberger
  2003-06-24 21:07 ` Grant Grundler
  9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-23 22:05 UTC (permalink / raw)
  To: linux-ia64

On Mon, Jun 23, 2003 at 01:41:01PM -0700, David Mosberger wrote:
> Well, I'm not a disk person (if it doesn't fit in memory, you don't
> have enough of it! ;-), but the basic assumption is that it is
> worthwhile to spend a few CPU cycles on forming fewer, but larger disk
> requests whenever possible.

Yes - fewer interupts/timers/sleep/wakeup() calls.
Sometimes that also means fewer disk rotations too.

> Intuitively, that certainly makes sense
> to me, though I haven't seen any performance numbers on how much of a
> difference this can make.

It's substantial.
Same thing netperf tries to measure: CPU cost per KB of data transferred.

gsyprf3:~# for i in 2 4 8 16 32 64 128 ; do time sgp_dd if=/dev/sg10 of=/dev/null bpt=$i count 00000 ; done

	real          user       sys
1K	1m45.300s  0m1.120s  0m15.633s
2k	0m55.700s  0m4.399s  0m6.701s
4K	0m31.124s  0m0.830s  0m3.119s
8K	0m19.044s  0m0.511s  0m1.884s
16K	0m19.016s  0m0.175s  0m0.765s
32K	0m19.008s  0m0.089s  0m0.544s
64K	0m19.010s  0m0.050s  0m0.438s

vmstat reported 12% sys for 1k down to <2% for 32K blocks.
Context switches went from ~48K/s to 4130/s.

Oh...sg10 is a HW mirror'd device.
Here's a re-run with a ST336732LC (u320) disk:

	real          user     sys
1K	1m54.822s  0m2.828s  0m10.289s
2K	0m57.704s  0m1.386s  0m5.207s
4K	0m41.239s  0m0.736s  0m2.911s
8K	0m20.284s  0m0.373s  0m1.589s
16K	0m16.924s  0m0.192s  0m0.865s
32K	0m16.900s  0m0.088s  0m0.563s
64K	0m16.873s  0m0.057s  0m0.430s

Not too much different from sg10. ~44k context switches/second down
to ~4700 CS/s. Similar CPU utilization numbers.

> You'd certainly need a disk-heavy workload
> to see any difference.  Perhaps Rohit could try it on TPC-C (once the
> merging is working)?

AFAIK, TPC-C cares more about latency and CPU cycles/IO.
TPC/C is "random" IO.  My example above is sequential IO
but useful to measure the CPU cost of different block sizes
and raw disk throughput.

I'm skeptical TPC/C will see the benefit of block merging,
just the cost of trying to do it. That's why I don't want
to make block merging too smart. The rest of us using buffered
IO (eg file system) and have read-ahead will benefit from block merging.

> The decision has to be split across BIO and I/O MMU: only the
> BIO-level knows what to do if merging _cannot_ take place and
> only the I/O MMU code knows how to map physically discontiguous
> pages linearly into I/O MMU space.

I understand the latter. Not the former.

It looks like blk_recount_segments() is only used to gather
stastics about how many segments are in the transaction.
I tracked back to fs/bio.c to find the consumer of this
information (# of segments) but didn't find it.
Anyone know off hand?

thanks,
grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (7 preceding siblings ...)
  2003-06-23 22:05 ` Grant Grundler
@ 2003-06-23 22:14 ` David Mosberger
  2003-06-24 21:07 ` Grant Grundler
  9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-23 22:14 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Mon, 23 Jun 2003 15:05:49 -0700, Grant Grundler <iod00d@hp.com> said:

  >> The decision has to be split across BIO and I/O MMU: only the
  >> BIO-level knows what to do if merging _cannot_ take place and
  >> only the I/O MMU code knows how to map physically discontiguous
  >> pages linearly into I/O MMU space.

  Grant> I understand the latter. Not the former.

  Grant> It looks like blk_recount_segments() is only used to gather
  Grant> stastics about how many segments are in the transaction.  I
  Grant> tracked back to fs/bio.c to find the consumer of this
  Grant> information (# of segments) but didn't find it.  Anyone know
  Grant> off hand?

I think you're talking to the wrong people.  If you want to change the
BIO layer, talk to Jens Axboe et al.

	--david

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
  2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
                   ` (8 preceding siblings ...)
  2003-06-23 22:14 ` David Mosberger
@ 2003-06-24 21:07 ` Grant Grundler
  9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-24 21:07 UTC (permalink / raw)
  To: linux-ia64

On Mon, Jun 23, 2003 at 03:05:49PM -0700, Grant Grundler wrote:
> It looks like blk_recount_segments() is only used to gather
> stastics about how many segments are in the transaction.
> I tracked back to fs/bio.c to find the consumer of this
> information (# of segments) but didn't find it.
> Anyone know off hand?

James Bottomley explained it to me:

| The second is this nr_phys_segments in the elevator queuing code.  This   
| counts the number of SG entries that will be required to describe the
| request.  A lot of devices have fixed size SG tables, so the SG table is
| the only thing that limits the size of the transfer.  Thus the block
| layer has code that tries to merge physically contiguous segments of a
| bio so it can squeeze extra blocks into the transfer.

I don't know if LSI SCSI controllers/drivers have limits in this space
and what they might be. Will try to figure that out.

thanks,
grant

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2003-06-24 21:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
2003-06-19 20:48 ` Jesse Barnes
2003-06-19 21:47 ` Grant Grundler
2003-06-23  7:11 ` Grant Grundler
2003-06-23 15:04 ` Alex Williamson
2003-06-23 16:52 ` David Mosberger
2003-06-23 20:24 ` Grant Grundler
2003-06-23 20:41 ` David Mosberger
2003-06-23 22:05 ` Grant Grundler
2003-06-23 22:14 ` David Mosberger
2003-06-24 21:07 ` Grant Grundler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox