public inbox for linux-kernel@vger.kernel.org
* [BUG] BUG_ON in I/O scheduler, bugme # 288
@ 2003-01-23 21:54 Dave Olien
  2003-01-23 22:38 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dave Olien @ 2003-01-23 21:54 UTC (permalink / raw)
  To: axboe, akpm; +Cc: linux-kernel, markw, cliffw, maryedie, jenny


Jens, Andrew

The group here doing dbt2 workload measurements have hit a couple of
problems APPARENTLY in the block I/O scheduler when doing write-intensive
raw disk I/O through a DAC960 extremeraid 2000 controller.
This wasn't a problem in 2.5.49.  It has appeared since then.

I've filed a bug on the OSDL bugme database.  You can read it at:

	http://bugme.osdl.org/show_bug.cgi?id=288

I've also put a more complete report in my web site:

	http://www.osdl.org/archive/dmo/deadline_bugon.

Begin with the README file.

For some reason, the README file isn't appearing on my web page.
I'll look into that. In the meantime, I've included the contents
of the README file below.

I'm about to try reproducing the problem on a smaller hardware
configuration.  Then, I'll test whether the same problem occurs with
read intensive I/O.



Dave Olien


-------------README---------------------------------------------------


Summary:

BUG_ON and system hangs occurring while doing write-intensive RAW disk I/O
to disks on a DAC960 extremeRAID 2000 controller.

It's possible the BUG_ON and system hangs are different, possibly
even unrelated problems.  But I'm grouping them together for now
until I've had more time to investigate.

In this directory are:

	DOT_CONFIG: the .config file for the kernel that was running.
		
	disktest.tar: a tar archive of the source for the disktest program.


	disktest_2.5.59.sh: a script that runs the disktest program to
			reproduce the system hanging events.

	BUG_ON: A console listing of the BUG_ON event.

	DISKTEST_STACKS: stack listings of the disktest threads that are
		hung in I/O, taken by the sysrq stack trace command.


The kernel being run was 2.5.59.  We also tried 2.5.59-mm2, and it
failed as well.

This was NOT a problem on linux 2.5.49.

The distribution on that system was Red Hat 7.3, so the gcc compiler version
was 2.96.

The hardware configuration originally used to produce the failure is:

	8 Pentium III Xeon processors.
	16 GB of memory, but the kernel is configured to use only 4 GB.
	DAC960 extremeRAID 2000, with 2 SCSI channels, 11 disks on
		each channel.  Each disk is 70 gigabytes. Each disk is
		its own logical device.


The BUG_ON() was encountered running the sapdb database with the dbt2 workload.
The BUG_ON occurred while the database was performing a checkpoint.
This is a random, write-intensive activity that is spread over many
disk devices.

The I/O is done on RAW devices.

Other times, the operating system didn't BUG_ON, but the system effectively
hung during these checkpoint episodes.

We discovered that disktest could reproduce the problem
when run on the same hardware platform.  I'm in the process of trying
to reproduce the problem on a smaller configuration.  I'll also see
if it's only a write-intensive problem, or if there is a similar problem
with read-intensive I/O.


* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-23 22:38 ` Andrew Morton
@ 2003-01-23 22:34   ` Dave Olien
  2003-01-24  7:50     ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Olien @ 2003-01-23 22:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: axboe, linux-kernel, markw, cliffw, maryedie, jenny


Yup, that should be 2.5.59-mm2.  My typo.

> > I've filed a bug on the OSDL bugme database.  You can read it at:
> > 
> > 	http://bugme.osdl.org/show_bug.cgi?id=288
> 
> The title is "2.5.59 and 2.5.50-mm2".  I assume it should be 2.5.59-mm2??

My test system's down right now.  As soon as it comes up, I'll
get onto reproducing it there.


* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-23 21:54 [BUG] BUG_ON in I/O scheduler, bugme # 288 Dave Olien
@ 2003-01-23 22:38 ` Andrew Morton
  2003-01-23 22:34   ` Dave Olien
  2003-01-24  1:38 ` Nick Piggin
  2003-01-24  7:50 ` Jens Axboe
  2 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2003-01-23 22:38 UTC (permalink / raw)
  To: Dave Olien; +Cc: axboe, linux-kernel, markw, cliffw, maryedie, jenny

Dave Olien <dmo@osdl.org> wrote:
>
> 
> Jens, Andrew
> 
> The group here doing dbt2 workload measurements have hit a couple of
> problems APPARENTLY in the block I/O scheduler when doing write-intensive
> raw disk I/O through a DAC960 extremeraid 2000 controller.
> This wasn't a problem in 2.5.49.  It has appeared since then.
> 
> I've filed a bug on the OSDL bugme database.  You can read it at:
> 
> 	http://bugme.osdl.org/show_bug.cgi?id=288

The title is "2.5.59 and 2.5.50-mm2".  I assume it should be 2.5.59-mm2??


> I've also put a more complete report in my web site:
> 
> 	http://www.osdl.org/archive/dmo/deadline_bugon.

oooh, goody.  A new stresstest tool.

> Begin with the README file.
> 
> For some reason, the README file isn't appearing on my web page.
> I'll look into that. In the meantime, I've included the contents
> of the README file below.
> 
> I'm about to try reproducing the problem on a smaller hardware
> configuration.  Then, I'll test whether the same problem occurs with
> read intensive I/O.

OK, thanks.

The important thing about direct-io is that it will frequently cause multiple
I/Os to be in flight against the same disk sector.  That will never happen
with regular I/O because the pagecache acts as a synchronisation point.

Probably, this has tickled a bug in the I/O scheduler.  Possibly in
direct-io, too - that code's fairly fresh, and quite complex.
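
To make the aliasing point concrete, here is a toy model in Python. It is not kernel code: the function names and the idea of keying the cache by sector are assumptions made purely for illustration. It only shows why a cache in front of the queue collapses repeated writes to one sector, while a direct path lets several requests for the same sector be in flight at once.

```python
from collections import Counter


def submit_buffered(writes):
    """Buffered path (toy): dirty data is keyed by sector, so repeated
    writes to one sector collapse into a single pending request."""
    dirty = {}
    for sector, data in writes:
        dirty[sector] = data  # a later write replaces the cached data
    return sorted(dirty.items())


def submit_direct(writes):
    """Direct path (toy): every write becomes its own request."""
    return list(writes)


def aliased_sectors(requests):
    """Return the sectors that have more than one request in flight."""
    counts = Counter(sector for sector, _ in requests)
    return sorted(s for s, n in counts.items() if n > 1)


# Hypothetical workload: sector 100 is written twice.
writes = [(100, b"a"), (200, b"b"), (100, b"c")]
buffered = submit_buffered(writes)
direct = submit_direct(writes)

print("buffered aliases:", aliased_sectors(buffered))
print("direct aliases:  ", aliased_sectors(direct))
```

In this sketch only the direct path reports sector 100 as aliased, which is the situation the I/O scheduler has to cope with under raw or direct I/O.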




* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-23 21:54 [BUG] BUG_ON in I/O scheduler, bugme # 288 Dave Olien
  2003-01-23 22:38 ` Andrew Morton
@ 2003-01-24  1:38 ` Nick Piggin
  2003-01-24  2:00   ` Andrew Morton
  2003-01-24  7:50 ` Jens Axboe
  2 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2003-01-24  1:38 UTC (permalink / raw)
  To: Dave Olien; +Cc: axboe, akpm, linux-kernel, markw, cliffw, maryedie, jenny

Dave Olien wrote:

>Jens, Andrew
>
>The group here doing dbt2 workload measurements have hit a couple of
>problems APPARENTLY in the block I/O scheduler when doing write-intensive
>raw disk I/O through a DAC960 extremeraid 2000 controller.
> This wasn't a problem in 2.5.49.  It has appeared since then.
>
>I've filed a bug on the OSDL bugme database.  You can read it at:
>
>	http://bugme.osdl.org/show_bug.cgi?id=288
>
>I've also put a more complete report in my web site:
>
>	http://www.osdl.org/archive/dmo/deadline_bugon.
>
>Begin with the README file.
>
>For some reason, the README file isn't appearing on my web page.
>I'll look into that. In the meantime, I've included the contents
>of the README file below.
>
>I'm about to try reproducing the problem on a smaller hardware
>configuration.  Then, I'll test whether the same problem occurs with
>read intensive I/O.
>
Thanks for the report. Andrew, I think this may be because
deadline_add_drq_rb puts "aliased" requests in the next_drq although they
are not put on the sort or fifo lists. This is the problem I described to
you before and exists in mm4.

Nick



* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-24  1:38 ` Nick Piggin
@ 2003-01-24  2:00   ` Andrew Morton
  2003-01-24  2:20     ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2003-01-24  2:00 UTC (permalink / raw)
  To: Nick Piggin; +Cc: dmo, axboe, linux-kernel, markw, cliffw, maryedie, jenny

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> I think this may be because
> deadline_add_drq_rb puts "aliased" requests in the next_drq although they
> are not put on the sort or fifo lists. This is the problem I described to
> you before and exists in mm4.

Yes, but 2.5.59 doesn't do that.




* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-24  2:00   ` Andrew Morton
@ 2003-01-24  2:20     ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2003-01-24  2:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dmo, axboe, linux-kernel, markw, cliffw, maryedie, jenny

Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>I think this may be because
>>deadline_add_drq_rb puts "aliased" requests in the next_drq although they
>>are not put on the sort or fifo lists. This is the problem I described to
>>you before and exists in mm4.
>>
>
>Yes, but 2.5.59 doesn't do that.
>
OK yeah, I thought he said 2.5.59 was OK.



* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-23 21:54 [BUG] BUG_ON in I/O scheduler, bugme # 288 Dave Olien
  2003-01-23 22:38 ` Andrew Morton
  2003-01-24  1:38 ` Nick Piggin
@ 2003-01-24  7:50 ` Jens Axboe
  2003-01-24 15:53   ` Dave Olien
  2 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2003-01-24  7:50 UTC (permalink / raw)
  To: Dave Olien, akpm; +Cc: linux-kernel, markw, cliffw, maryedie, jenny

On Thu, Jan 23 2003, Dave Olien wrote:
> 
> Jens, Andrew
> 
> The group here doing dbt2 workload measurements have hit a couple of
> problems APPARENTLY in the block I/O scheduler when doing write-intensive
> raw disk I/O through a DAC960 extremeraid 2000 controller.
> This wasn't a problem in 2.5.49.  It has appeared since then.
> 
> I've filed a bug on the OSDL bugme database.  You can read it at:
> 
> 	http://bugme.osdl.org/show_bug.cgi?id=288
> 
> I've also put a more complete report in my web site:
> 
> 	http://www.osdl.org/archive/dmo/deadline_bugon.

A request got on the fifo, but not in the sort tree. This is most likely
an alias. Ah yes, I see it: it can happen when two requests are merged.
I'll be back with a fix for this soon.
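
The failure mode described here can be sketched with a toy model in Python. This is not the deadline scheduler's code: the class, its method names, and the dict standing in for the sort tree are all hypothetical, and the merge path that produces the alias is not modelled. The sketch only shows how a request that reaches the fifo without a sort-tree entry trips a consistency check later, at dispatch time.

```python
class ToyDeadlineQueue:
    """Toy queue with two indexes that are supposed to stay in sync."""

    def __init__(self):
        self.fifo = []       # requests in arrival order
        self.sort_tree = {}  # start sector -> request, one entry per key

    def add_request(self, rq):
        # The sorted index holds one request per start sector, so an
        # alias (same start sector as an existing entry) is not inserted.
        inserted = rq["sector"] not in self.sort_tree
        if inserted:
            self.sort_tree[rq["sector"]] = rq
        # Assumed bug for illustration: the fifo is updated regardless.
        self.fifo.append(rq)
        return inserted

    def dispatch_from_fifo(self):
        rq = self.fifo.pop(0)
        # Stand-in for a BUG_ON(): a request taken from the fifo is
        # expected to be present in the sort tree as well.
        if self.sort_tree.get(rq["sector"]) is not rq:
            raise AssertionError("request %d on fifo but not in sort tree"
                                 % rq["id"])
        del self.sort_tree[rq["sector"]]
        return rq


queue = ToyDeadlineQueue()
first = {"id": 1, "sector": 100}
alias = {"id": 2, "sector": 100}  # same start sector as `first`
queue.add_request(first)
queue.add_request(alias)

queue.dispatch_from_fifo()  # `first` is in both indexes, so this is fine
try:
    queue.dispatch_from_fifo()  # `alias` never made it into the tree
except AssertionError as err:
    print("consistency check failed:", err)
```

In the toy model the check fires only when the aliased request is dispatched, well after it was queued, which would be consistent with the problem showing up only under sustained load.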

-- 
Jens Axboe



* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-23 22:34   ` Dave Olien
@ 2003-01-24  7:50     ` Jens Axboe
  2003-01-24 15:47       ` Dave Olien
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2003-01-24  7:50 UTC (permalink / raw)
  To: Dave Olien, Andrew Morton; +Cc: linux-kernel, markw, cliffw, maryedie, jenny

On Thu, Jan 23 2003, Dave Olien wrote:
> 
> Yup, that should be 2.5.59-mm2.  My typo.
> 
> > > I've filed a bug on the OSDL bugme database.  You can read it at:
> > > 
> > > 	http://bugme.osdl.org/show_bug.cgi?id=288
> > 
> > The title is "2.5.59 and 2.5.50-mm2".  I assume it should be 2.5.59-mm2??
> 
> My test system's down right now.  As soon as it comes up, I'll
> get onto reproducing it there.

I'm assuming vanilla 2.5.59; there's no BUG_ON() on that line in -mm5.

-- 
Jens Axboe



* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-24  7:50     ` Jens Axboe
@ 2003-01-24 15:47       ` Dave Olien
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Olien @ 2003-01-24 15:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-kernel, markw, cliffw, maryedie, jenny


Yup, the BUG_ON was in vanilla 2.5.59.  The mm patch still shows
the non-completion issue.

On Fri, Jan 24, 2003 at 08:50:30AM +0100, Jens Axboe wrote:
> 
> I'm assuming vanilla 2.5.59; there's no BUG_ON() on that line in -mm5.
> 
> -- 
> Jens Axboe
> 


* Re: [BUG] BUG_ON in I/O scheduler, bugme # 288
  2003-01-24  7:50 ` Jens Axboe
@ 2003-01-24 15:53   ` Dave Olien
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Olien @ 2003-01-24 15:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: akpm, linux-kernel, markw, cliffw, maryedie, jenny



OK, I was able to reproduce at least the problem with I/O apparently
never completing on my smaller test machine.  I've been assuming this
was related to whatever was causing the BUG_ON().  But in case it
isn't, I'm going to keep looking into what's happening on the running
system and I'll let you know what I find.

In the meantime, once you've generated a patch, I'll give it a try
as soon as I get it.  I'll also pass it on to the dbt2 workload guys.

On Fri, Jan 24, 2003 at 08:50:01AM +0100, Jens Axboe wrote:
> 
> A request got on the fifo, but not in the sort tree. This is most likely
> an alias. Ah yes, I see it: it can happen when two requests are merged.
> I'll be back with a fix for this soon.
> 
> -- 
> Jens Axboe
> 


Thread overview: 10+ messages
2003-01-23 21:54 [BUG] BUG_ON in I/O scheduler, bugme # 288 Dave Olien
2003-01-23 22:38 ` Andrew Morton
2003-01-23 22:34   ` Dave Olien
2003-01-24  7:50     ` Jens Axboe
2003-01-24 15:47       ` Dave Olien
2003-01-24  1:38 ` Nick Piggin
2003-01-24  2:00   ` Andrew Morton
2003-01-24  2:20     ` Nick Piggin
2003-01-24  7:50 ` Jens Axboe
2003-01-24 15:53   ` Dave Olien
