* SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
@ 2003-06-19 17:36 David Mosberger
2003-06-19 20:48 ` Jesse Barnes
` (9 more replies)
0 siblings, 10 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-19 17:36 UTC (permalink / raw)
To: linux-ia64
Since we enabled block-layer-level merging of buffers via
BIO_VMERGE_BOUNDARY (in asm-ia64/io.h), I started to notice SCSI
errors on one of our machines (an rx5670 with 16GB of memory). If I
set BIO_VMERGE_BOUNDARY to 0, the errors disappear. My suspicion is
that something is wrong in the sba_iommu.c code with respect to block
merging. Unfortunately, I won't be able to investigate this further
at the moment, since I'm trying to get out a 2.5.72 patch and will be
on vacation on and off over the next 2 weeks. One thing that would be
interesting to know is whether similar SCSI errors show up on SN2
machines (assuming the SGI guys can get the latest 2.5 bits to boot,
of course). This should tell us whether the bug is really in the I/O
MMU or somewhere else.
For reference, I attached a couple of sample error message below.
Can someone look into this?
--david
Jun 19 06:31:15 magma kernel: SCSI error : <3 0 0 0> return code = 0x70000
Jun 19 06:31:15 magma kernel: end_request: I/O error, dev sda, sector 2573060
Jun 19 06:31:15 magma kernel: SCSI error : <3 0 0 0> return code = 0x70000
Jun 19 06:31:15 magma kernel: end_request: I/O error, dev sda, sector 2573068
Jun 19 06:31:15 magma kernel: SCSI error : <3 0 0 0> return code = 0x70000
Jun 19 06:31:15 magma kernel: end_request: I/O error, dev sda, sector 2573076
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
@ 2003-06-19 20:48 ` Jesse Barnes
2003-06-19 21:47 ` Grant Grundler
` (8 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Jesse Barnes @ 2003-06-19 20:48 UTC (permalink / raw)
To: linux-ia64
On Thu, Jun 19, 2003 at 10:36:34AM -0700, David Mosberger wrote:
> machines (assuming the SGI guys can get the latest 2.5 bits to boot,
> of course). This should tell us whether the bug is really in the I/O
Almost there, I promise! Once the discontig stuff is in, the only piece
left will be PCI initialization. We used to drive it via pcibios_init
and platform_pci_fixup, but now that the ia64 PCI layer is totally ACPI
driven, I have to figure out something else.
Thanks,
Jesse
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
2003-06-19 20:48 ` Jesse Barnes
@ 2003-06-19 21:47 ` Grant Grundler
2003-06-23 7:11 ` Grant Grundler
` (7 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-19 21:47 UTC (permalink / raw)
To: linux-ia64
On Thu, Jun 19, 2003 at 10:36:34AM -0700, David Mosberger wrote:
> My suspicion is that something is wrong in the sba_iommu.c code
> with respect to block merging.
That's probably a good place to start. The DMA block merging code
is the most fragile peice in sba_iommu.c. I still have fixes from
Peter Chubb on fixing some of that in the partial rewrite I started
last year. I might have some time short-term to revisit that a bit.
grant
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
2003-06-19 20:48 ` Jesse Barnes
2003-06-19 21:47 ` Grant Grundler
@ 2003-06-23 7:11 ` Grant Grundler
2003-06-23 15:04 ` Alex Williamson
` (6 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-23 7:11 UTC (permalink / raw)
To: linux-ia64
On Thu, Jun 19, 2003 at 10:36:34AM -0700, David Mosberger wrote:
> If I set BIO_VMERGE_BOUNDARY to 0, the errors disappear.
Hmm...convoluted code.
./arch/ia64/kernel/setup.c:unsigned long ia64_max_iommu_merge_mask = ~0UL;
arch/ia64/hp/common/sba_iommu.c:
#define IOVP_MASK PAGE_MASK
...
if ((long) ~IOVP_MASK > (long) ia64_max_iommu_merge_mask)
ia64_max_iommu_merge_mask = ~IOVP_MASK;
AFAICT, that's the equivalent to (PAGE_SIZE-1) > (-1) on the first round.
Or did I missed a tilde?
I'd think that ia64 wants:
#define BIO_VMERGE_BOUNDARY (~ia64_max_iommu_merge_mask + 1)
or
#define BIO_VMERGE_BOUNDARY PAGE_SIZE
(If I've accounted for tildes properly.)
It looks like a few other arches do something like that:
./include/asm-ppc64/io.h:#define BIO_VMERGE_BOUNDARY 4096
./include/asm-sparc64/io.h:#define BIO_VMERGE_BOUNDARY 8192
(BTW, PARISC probably wants to use 4K pages as well).
> My suspicion is that something is wrong in the sba_iommu.c code
> with respect to block merging.
There might well be. It's easy to get the block merging code
wrong when merging random chunks which can be bigger or smaller
than PAGE_SIZE. My gut feeling this can all get ripped out of
2.5 sba_iommu.c if the bio code is going to merge for us. That
means pci_map_sg() support only needs to map each address/len
pair and forget trying to merge them.
hth,
grant
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (2 preceding siblings ...)
2003-06-23 7:11 ` Grant Grundler
@ 2003-06-23 15:04 ` Alex Williamson
2003-06-23 16:52 ` David Mosberger
` (5 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2003-06-23 15:04 UTC (permalink / raw)
To: linux-ia64
Grant Grundler wrote:
>
> There might well be. It's easy to get the block merging code
> wrong when merging random chunks which can be bigger or smaller
> than PAGE_SIZE. My gut feeling this can all get ripped out of
> 2.5 sba_iommu.c if the bio code is going to merge for us. That
> means pci_map_sg() support only needs to map each address/len
> pair and forget trying to merge them.
I don't think the bio code is doing a sufficient level of
merging to rip the coalescing out of sba_iommu. Running w/
some debugging turned on and the bio code enabled, I'm
regularly seeing long sg lists get colapsed (99 to 1 is not
terribly uncommon).
Alex
--
Alex Williamson HP Linux & Open Source Lab
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (3 preceding siblings ...)
2003-06-23 15:04 ` Alex Williamson
@ 2003-06-23 16:52 ` David Mosberger
2003-06-23 20:24 ` Grant Grundler
` (4 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-23 16:52 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 23 Jun 2003 00:11:19 -0700, Grant Grundler <iod00d@hp.com> said:
Grant> Hmm...convoluted code. ./arch/ia64/kernel/setup.c:unsigned
Grant> long ia64_max_iommu_merge_mask = ~0UL;
Grant> arch/ia64/hp/common/sba_iommu.c: #define IOVP_MASK PAGE_MASK
Grant> ... if ((long) ~IOVP_MASK > (long)
Grant> ia64_max_iommu_merge_mask) ia64_max_iommu_merge_mask Grant> ~IOVP_MASK;
Grant> AFAICT, that's the equivalent to (PAGE_SIZE-1) > (-1) on the
Grant> first round. Or did I missed a tilde?
No, that's what it is.
Grant> I'd think that ia64 wants:
Grant> #define BIO_VMERGE_BOUNDARY (~ia64_max_iommu_merge_mask + 1)
Grant> or #define BIO_VMERGE_BOUNDARY PAGE_SIZE
No, BIO_VMERGE_BOUNDARY must be 0xffffffffffffffff if there is no
hardware I/O MMU present (i.e., the I/O MMU page size is 2^64) and
PAGE_SIZE if an I/O MMU is present (which can support PAGE_SIZE
pages).
Grant> It's easy to get the block merging code wrong when merging
Grant> random chunks which can be bigger or smaller than
Grant> PAGE_SIZE. My gut feeling this can all get ripped out of 2.5
Grant> sba_iommu.c if the bio code is going to merge for us. That
Grant> means pci_map_sg() support only needs to map each address/len
Grant> pair and forget trying to merge them.
No, you got this backwards: the bio-level code _assumes_ that
discontiguous physical pages can be remapped linearly by the I/O MMU
code. If the I/O MMU code doesn't actually do the merging, the kernel
will fall flat on its face.
--david
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (4 preceding siblings ...)
2003-06-23 16:52 ` David Mosberger
@ 2003-06-23 20:24 ` Grant Grundler
2003-06-23 20:41 ` David Mosberger
` (3 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-23 20:24 UTC (permalink / raw)
To: linux-ia64
On Mon, Jun 23, 2003 at 09:52:06AM -0700, David Mosberger wrote:
> No, BIO_VMERGE_BOUNDARY must be 0xffffffffffffffff if there is no
> hardware I/O MMU present (i.e., the I/O MMU page size is 2^64) and
> PAGE_SIZE if an I/O MMU is present (which can support PAGE_SIZE
> pages).
yes - that's what I thought. I'm just having problems counting tildes.
#define BIO_VMERGE_BOUNDARY (0UL)//(ia64_max_iommu_merge_mask + 1)
Which for sba_iommu should have been:
(ia64_max_iommu_merge_mask + 1)
= (~IOVP_MASK + 1)
= (~PAGE_MASK + 1)
= (~(~(PAGE_SIZE-1)) + 1)
= PAGE_SIZE
(I hope I have that right now)
> No, you got this backwards:
:^(
> the bio-level code _assumes_ that
> discontiguous physical pages can be remapped linearly by the I/O MMU
> code. If the I/O MMU code doesn't actually do the merging, the kernel
> will fall flat on its face.
uhmm...why does the bio-level code care what can/can't be merged if
it's not going to do it?
Seems like a waste of CPU cycles to walk the sg_list an extra time
in the IOMMU code to figure what can (and will) be merged. My gut
feeling is bio-level code doesn't know enough to do it efficiently
and the IOMMU code needs to walk the list at least once to program
the HW (effectively twice if sba_iommu wants to attempt coalescing).
thanks,
grant
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (5 preceding siblings ...)
2003-06-23 20:24 ` Grant Grundler
@ 2003-06-23 20:41 ` David Mosberger
2003-06-23 22:05 ` Grant Grundler
` (2 subsequent siblings)
9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-23 20:41 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 23 Jun 2003 13:24:03 -0700, Grant Grundler <iod00d@hp.com> said:
Grant> On Mon, Jun 23, 2003 at 09:52:06AM -0700, David Mosberger
Grant> wrote:
>> No, BIO_VMERGE_BOUNDARY must be 0xffffffffffffffff if there is no
>> hardware I/O MMU present (i.e., the I/O MMU page size is 2^64)
>> and PAGE_SIZE if an I/O MMU is present (which can support
>> PAGE_SIZE pages).
Grant> yes - that's what I thought. I'm just having problems
Grant> counting tildes.
Grant> #define BIO_VMERGE_BOUNDARY (0UL)//(ia64_max_iommu_merge_mask
Grant> + 1)
Grant> Which for sba_iommu should have been:
Grant> (ia64_max_iommu_merge_mask + 1) = (~IOVP_MASK + 1) Grant> (~PAGE_MASK + 1) = (~(~(PAGE_SIZE-1)) + 1) = PAGE_SIZE
Grant> (I hope I have that right now)
Yes. You can thank Linus for the "inverted" sense of PAGE_MASK
(though it does make sense in the VM layer).
>> the bio-level code _assumes_ that discontiguous physical pages
>> can be remapped linearly by the I/O MMU code. If the I/O MMU
>> code doesn't actually do the merging, the kernel will fall flat
>> on its face.
Grant> uhmm...why does the bio-level code care what can/can't be
Grant> merged if it's not going to do it?
Grant> Seems like a waste of CPU cycles to walk the sg_list an extra
Grant> time in the IOMMU code to figure what can (and will) be
Grant> merged. My gut feeling is bio-level code doesn't know enough
Grant> to do it efficiently and the IOMMU code needs to walk the
Grant> list at least once to program the HW (effectively twice if
Grant> sba_iommu wants to attempt coalescing).
Well, I'm not a disk person (if it doesn't fit in memory, you don't
have enough of it! ;-), but the basic assumption is that it is
worthwhile to spend a few CPU cycles on forming fewer, but larger disk
requests whenever possible. Intuitively, that certainly makes sense
to me, though I haven't seen any performance numbers on how much of a
difference this can make. You'd certainly need a disk-heavy workload
to see any difference. Perhaps Rohit could try it on TPC-C (once the
merging is working)?
The decision has to be split across BIO and I/O MMU: only the
BIO-level knows what to do if merging _cannot_ take place and only the
I/O MMU code knows how to map physically discontiguous pages linearly
into I/O MMU space.
--david
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (6 preceding siblings ...)
2003-06-23 20:41 ` David Mosberger
@ 2003-06-23 22:05 ` Grant Grundler
2003-06-23 22:14 ` David Mosberger
2003-06-24 21:07 ` Grant Grundler
9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-23 22:05 UTC (permalink / raw)
To: linux-ia64
On Mon, Jun 23, 2003 at 01:41:01PM -0700, David Mosberger wrote:
> Well, I'm not a disk person (if it doesn't fit in memory, you don't
> have enough of it! ;-), but the basic assumption is that it is
> worthwhile to spend a few CPU cycles on forming fewer, but larger disk
> requests whenever possible.
Yes - fewer interupts/timers/sleep/wakeup() calls.
Sometimes that also means fewer disk rotations too.
> Intuitively, that certainly makes sense
> to me, though I haven't seen any performance numbers on how much of a
> difference this can make.
It's substantial.
Same thing netperf tries to measure: CPU cost per KB of data transferred.
gsyprf3:~# for i in 2 4 8 16 32 64 128 ; do time sgp_dd if=/dev/sg10 of=/dev/null bpt=$i count 00000 ; done
real user sys
1K 1m45.300s 0m1.120s 0m15.633s
2k 0m55.700s 0m4.399s 0m6.701s
4K 0m31.124s 0m0.830s 0m3.119s
8K 0m19.044s 0m0.511s 0m1.884s
16K 0m19.016s 0m0.175s 0m0.765s
32K 0m19.008s 0m0.089s 0m0.544s
64K 0m19.010s 0m0.050s 0m0.438s
vmstat reported 12% sys for 1k down to <2% for 32K blocks.
Context switches went from ~48K/s to 4130/s.
Oh...sg10 is a HW mirror'd device.
Here's a re-run with a ST336732LC (u320) disk:
real user sys
1K 1m54.822s 0m2.828s 0m10.289s
2K 0m57.704s 0m1.386s 0m5.207s
4K 0m41.239s 0m0.736s 0m2.911s
8K 0m20.284s 0m0.373s 0m1.589s
16K 0m16.924s 0m0.192s 0m0.865s
32K 0m16.900s 0m0.088s 0m0.563s
64K 0m16.873s 0m0.057s 0m0.430s
Not too much different from sg10. ~44k context switches/second down
to ~4700 CS/s. Similar CPU utilization numbers.
> You'd certainly need a disk-heavy workload
> to see any difference. Perhaps Rohit could try it on TPC-C (once the
> merging is working)?
AFAIK, TPC-C cares more about latency and CPU cycles/IO.
TPC/C is "random" IO. My example above is sequential IO
but useful to measure the CPU cost of different block sizes
and raw disk throughput.
I'm skeptical TPC/C will see the benefit of block merging,
just the cost of trying to do it. That's why I don't want
to make block merging too smart. The rest of us using buffered
IO (eg file system) and have read-ahead will benefit from block merging.
> The decision has to be split across BIO and I/O MMU: only the
> BIO-level knows what to do if merging _cannot_ take place and
> only the I/O MMU code knows how to map physically discontiguous
> pages linearly into I/O MMU space.
I understand the latter. Not the former.
It looks like blk_recount_segments() is only used to gather
stastics about how many segments are in the transaction.
I tracked back to fs/bio.c to find the consumer of this
information (# of segments) but didn't find it.
Anyone know off hand?
thanks,
grant
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (7 preceding siblings ...)
2003-06-23 22:05 ` Grant Grundler
@ 2003-06-23 22:14 ` David Mosberger
2003-06-24 21:07 ` Grant Grundler
9 siblings, 0 replies; 11+ messages in thread
From: David Mosberger @ 2003-06-23 22:14 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 23 Jun 2003 15:05:49 -0700, Grant Grundler <iod00d@hp.com> said:
>> The decision has to be split across BIO and I/O MMU: only the
>> BIO-level knows what to do if merging _cannot_ take place and
>> only the I/O MMU code knows how to map physically discontiguous
>> pages linearly into I/O MMU space.
Grant> I understand the latter. Not the former.
Grant> It looks like blk_recount_segments() is only used to gather
Grant> stastics about how many segments are in the transaction. I
Grant> tracked back to fs/bio.c to find the consumer of this
Grant> information (# of segments) but didn't find it. Anyone know
Grant> off hand?
I think you're talking to the wrong people. If you want to change the
BIO layer, talk to Jens Axboe et al.
--david
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
` (8 preceding siblings ...)
2003-06-23 22:14 ` David Mosberger
@ 2003-06-24 21:07 ` Grant Grundler
9 siblings, 0 replies; 11+ messages in thread
From: Grant Grundler @ 2003-06-24 21:07 UTC (permalink / raw)
To: linux-ia64
On Mon, Jun 23, 2003 at 03:05:49PM -0700, Grant Grundler wrote:
> It looks like blk_recount_segments() is only used to gather
> stastics about how many segments are in the transaction.
> I tracked back to fs/bio.c to find the consumer of this
> information (# of segments) but didn't find it.
> Anyone know off hand?
James Bottomley explained it to me:
| The second is this nr_phys_segments in the elevator queuing code. This
| counts the number of SG entries that will be required to describe the
| request. A lot of devices have fixed size SG tables, so the SG table is
| the only thing that limits the size of the transfer. Thus the block
| layer has code that tries to merge physically contiguous segments of a
| bio so it can squeeze extra blocks into the transfer.
I don't know if LSI SCSI controllers/drivers have limits in this space
and what they might be. Will try to figure that out.
thanks,
grant
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2003-06-24 21:07 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-06-19 17:36 SCSI ERRORS triggered by BIO_VMERGE_BOUNDARY David Mosberger
2003-06-19 20:48 ` Jesse Barnes
2003-06-19 21:47 ` Grant Grundler
2003-06-23 7:11 ` Grant Grundler
2003-06-23 15:04 ` Alex Williamson
2003-06-23 16:52 ` David Mosberger
2003-06-23 20:24 ` Grant Grundler
2003-06-23 20:41 ` David Mosberger
2003-06-23 22:05 ` Grant Grundler
2003-06-23 22:14 ` David Mosberger
2003-06-24 21:07 ` Grant Grundler
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox