* Raw devices broken in 2.6.1?
@ 2004-01-29 22:44 Curt Hartung
2004-01-30 0:38 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Curt Hartung @ 2004-01-29 22:44 UTC (permalink / raw)
To: linux-kernel
New to the list, checked the FAQ and nothing on this. I'm using raw devices
for a large database application (highwinds-software) and under 2.4 it runs
fine, but under 2.6 I get: Program terminated with signal 25, File size
limit exceeded. (SIGXFSZ) As soon as it tries to grow the raw device past 2G
(might be 4G, I'll go back and check).
ulimit reports: file size (blocks) unlimited
but running the process as root and setting RLIMIT_FSIZE to RLIM_INFINITY
just to be sure yields the same result.
I can easily provide a short test program to trigger it, the call I'm using
is pwrite64(...);
-Curt
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Raw devices broken in 2.6.1?
2004-01-29 22:44 Raw devices broken in 2.6.1? Curt Hartung
@ 2004-01-30 0:38 ` Andrew Morton
2004-01-30 1:30 ` Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded? Curt Hartung
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-01-30 0:38 UTC (permalink / raw)
To: Curt Hartung; +Cc: linux-kernel
"Curt Hartung" <curt@northarc.com> wrote:
>
> New to the list, checked the FAQ and nothing on this. I'm using raw devices
> for a large database application (highwinds-software) and under 2.4 it runs
> fine, but under 2.6 I get: Program terminated with signal 25, File size
> limit exceeded. (SIGXFSZ) As soon as it tries to grow the raw device past 2G
> (might be 4G, I'll go back and check)
>
> ulimit reports: file size (blocks) unlimited
> but running the process as root and setrlimit RLIMIT_FSIZE to RLIM_INFINITY
> just to be sure yields the same result.
Possibly whatever version of 2.4 you're using forgot to check for
O_LARGEFILE. But the code looks to be OK.
> I can easily provide a short test program to trigger it, the call I'm using
> is pwrite64(...);
Yes, please.
* Re: Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded?
2004-01-30 0:38 ` Andrew Morton
@ 2004-01-30 1:30 ` Curt Hartung
2004-01-30 4:56 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Curt Hartung @ 2004-01-30 1:30 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
> "Curt Hartung" <curt@northarc.com> wrote:
> >
> > New to the list, checked the FAQ and nothing on this. I'm using raw
> > devices for a large database application (highwinds-software) and under
> > 2.4 it runs
> Possibly whatever version of 2.4 you're using forgot to check for
> O_LARGEFILE. But the code looks to be OK.
>
Ah yes, I remember running across "O_LARGEFILE" but my 2.4 was ignoring it,
so I figured it was optional; compiling with -D_FILE_OFFSET_BITS=64
-D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE seemed to do the trick.
Setting O_LARGEFILE fixed the problem, thanks. Glad it was a simple
solution.
I have a separate, far more serious problem with the 2.6 I/O system but I'm
still working through FAQs to see if it's my error.
Long story short- I have a simple test program you can try, to see for
yourself (compare a 2.4 build with 2.6) :
http://66.118.69.159/~curt/bigfile_test.C compile with: gcc -o bf -lpthread
bigfile_test.C
Test platform is 512M of RAM, an Athlon 1.33GHz and a generic IBM drive,
ext2 (no journaling); results at the bottom. It's a vanilla installation of
Red Hat 7.2.
Long story slightly longer. I am the lead developer at one of the
"enterprise software vendors who have been clamoring for the new threading
model" (highwinds-software, UseNet server software)
I have been on pins and needles waiting for a stable release so I could
test/certify our software on it; our customers are screaming for it and I
want to give it to them, but the performance of our software on this kernel
was pathetic. Stalls, halts, terrible.
I finally narrowed it down to the disk subsystem, and the test program shows
the meat of it. When there is massive contention for a file, or just heavy
(VERY heavy) volume, the 2.6.1 kernel (presumably the filesystem portion)
falls over dead. The test program doesn't show death, but could by just
upping the thrasher count a bit.
Where do I go with this? Anyone have any idea who I can take this test
program to? I have been telling our customers for over a year now "Don't
worry, Linux will be able to rock with the new threading model" and then...
this. I want to be constructive here. Any advice would be appreciated; I'm
new to the Linux community per se, though I've been developing on it for
years.
RESULTS:
Changing ONLY the kernel and rebooting, I ran the program twice to make sure
any buffers were flushed. This had a dramatic effect, as the second (and all
subsequent attempts; these results are representative) were consistently
better, although the 2.6.1 implementation was still worse.
This test program accurately models the largest job our UseNet software
does, randomly accessing ENORMOUS files. It creates a 2G file and then
accesses it with and without contention.
[root|/usr/local/tornado_be/bin]$ uname -a
Linux professor.highwinds-software.com 2.6.1 #0 SMP Wed Jan 28 01:24:07 EST
2004 i686 unknown
bytes to write[2000027648]
time [227]
1000 random 8192-byte accesses (single threaded)
time [12]
1000 random 2048-byte accesses (single threaded)
time [11]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [65]
1000 random 2048-byte accesses (with thrashers)
time [56]
[curt|/usr/local/test]$ ./bf bigfile
bytes to write[0]
time [0]
1000 random 8192-byte accesses (single threaded)
time [10]
1000 random 2048-byte accesses (single threaded)
time [10]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [57]
1000 random 2048-byte accesses (with thrashers)
time [48]
[curt|/usr/local/test]$ uname -a
Linux professor.highwinds-software.com 2.4.7-10 #1 Thu Sep 6 16:46:36 EDT
2001 i686 unknown
bytes to write[2000027648]
time[139]
1000 random 8192-byte accesses (single threaded)
time[33]
1000 random 2048-byte accesses (single threaded)
time[13]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time[42]
1000 random 2048-byte accesses (with thrashers)
time[50]
[curt|/usr/local/test]$ ./bf bigfile
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[10]
1000 random 2048-byte accesses (single threaded)
time[9]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time[44]
1000 random 2048-byte accesses (with thrashers)
time[40]
* Re: Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded?
2004-01-30 1:30 ` Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded? Curt Hartung
@ 2004-01-30 4:56 ` Andrew Morton
2004-01-30 6:34 ` Curt
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-01-30 4:56 UTC (permalink / raw)
To: Curt Hartung; +Cc: linux-kernel
"Curt Hartung" <curt@northarc.com> wrote:
>
> I have a separate, far more serious problem with the 2.6 I/O system but I'm
> still working through FAQs to see if it's my error.
>
> Long story short- I have a simple test program you can try, to see for
> yourself (compare a 2.4 build with 2.6) :
> http://66.118.69.159/~curt/bigfile_test.C compile with: gcc -o bf -lpthread
> bigfile_test.C
>
> Test platform is 512M of RAM, an Athlon 1.33GHz and a generic IBM drive,
> ext2 (no journaling); results at the bottom. It's a vanilla installation of
> Red Hat 7.2.
>
> Long story slightly longer. I am the lead developer at one of the
> "enterprise software vendors who have been clamoring for the new threading
> model" (highwinds-software, UseNet server software)
>
> I have been on pins and needles waiting for a stable release so I could
> test/certify our software on it; our customers are screaming for it and I
> want to give it to them, but the performance of our software on this kernel
> was pathetic. Stalls, halts, terrible.
>
> I finally narrowed it down to the disk subsystem, and the test program shows
> the meat of it. When there is massive contention for a file, or just heavy
> (VERY heavy) volume, the 2.6.1 kernel (presumably the filesystem portion)
> falls over dead. The test program doesn't show death, but could by just
> upping the thrasher count a bit.
2.6.1 had a readahead bug which will adversely affect workloads such as
this.  Apart from that, there is still quite a lot of tuning work to be
done in 2.6 as the tree settles down and as more testers come on board.
People are working on the VM and readahead code as we speak. It's always
best to test the most up-to-date tree.
> Where do I go with this? Anyone have any idea who I can take this test
> program to?
You came to the right place.
> I have been telling our customers for over a year now "Don't
> worry, Linux will be able to rock with the new threading model" and then...
> this. I want to be constructive here. Any advice would be appreciated; I'm
> new to the Linux community per se, though I've been developing on it for
> years.
You wouldn't expect a huge gain from 2.6 in the I/O department.  Your
workload is seek-limited. I am seeing some small benefits from the
anticipatory I/O scheduler with your test though.
> RESULTS:
>
> Changing ONLY the kernel and rebooting, I ran the program twice to make sure
> any buffers were flushed. This had a dramatic effect, as the second (and all
> subsequent attempts; these results are representative) were consistently
> better, although the 2.6.1 implementation was still worse.
>
> This test program accurately models the largest job our UseNet software
> does, randomly accessing ENORMOUS files. It creates a 2G file and then
> accesses it with and without contention.
>
> [root|/usr/local/tornado_be/bin]$ uname -a
> Linux professor.highwinds-software.com 2.6.1 #0 SMP Wed Jan 28 01:24:07 EST
> 2004 i686 unknown
>
> bytes to write[2000027648]
> time [227]
> 1000 random 8192-byte accesses (single threaded)
> time [12]
> 1000 random 2048-byte accesses (single threaded)
> time [11]
> Now spawning 4 threads to kill the same file
> 1000 random 8192-byte accesses (with thrashers)
> time [65]
> 1000 random 2048-byte accesses (with thrashers)
> time [56]
> [curt|/usr/local/test]$ ./bf bigfile
> bytes to write[0]
> time [0]
> 1000 random 8192-byte accesses (single threaded)
> time [10]
> 1000 random 2048-byte accesses (single threaded)
> time [10]
> Now spawning 4 threads to kill the same file
> 1000 random 8192-byte accesses (with thrashers)
> time [57]
> 1000 random 2048-byte accesses (with thrashers)
> time [48]
>
>
> [curt|/usr/local/test]$ uname -a
> Linux professor.highwinds-software.com 2.4.7-10 #1 Thu Sep 6 16:46:36 EDT
> 2001 i686 unknown
>
> bytes to write[2000027648]
> time[139]
> 1000 random 8192-byte accesses (single threaded)
> time[33]
> 1000 random 2048-byte accesses (single threaded)
> time[13]
> Now spawning 4 threads to kill the same file
> 1000 random 8192-byte accesses (with thrashers)
> time[42]
> 1000 random 2048-byte accesses (with thrashers)
> time[50]
> [curt|/usr/local/test]$ ./bf bigfile
> bytes to write[0]
> time[0]
> 1000 random 8192-byte accesses (single threaded)
> time[10]
> 1000 random 2048-byte accesses (single threaded)
> time[9]
> Now spawning 4 threads to kill the same file
> 1000 random 8192-byte accesses (with thrashers)
> time[44]
> 1000 random 2048-byte accesses (with thrashers)
> time[40]
I'm fairly suspicious about the disparity between the time taken for those
initial large writes. Two possible reasons come to mind:
1) Your disk isn't using DMA. Use `hdparm' to check it, and check your
kernel IDE config if it is not using DMA.
2) 2.4 sets the dirty memory writeback thresholds much higher: 40%/60%
vs 10%/40%. So on a 512M box it is possible that there is much more
dirty, unwritten-back memory after the timing period has completed than
under 2.6. Although this difference in tuning can affect real-world
workloads, it is really an error in the testing methodology. Generally,
the timing should include an fsync() so that all I/O which the program
issues has completed.
Or you can put 2.6 on par by setting
/proc/sys/vm/dirty_background_ratio to 40 and dirty_ratio to 60.
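As a concrete sketch, those two knobs can be set at runtime (root required):

```shell
# Raise 2.6's writeback thresholds to the 2.4-style 40%/60%
# (2.6.1 defaults were 10 and 40)
echo 40 > /proc/sys/vm/dirty_background_ratio
echo 60 > /proc/sys/vm/dirty_ratio
```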
It doesn't happen in my 256MB/2CPU/IDE testing here. 2.4 and 2.6 are
showing the same throughput. 2.6 maybe a shade quicker.
2.6.2-rc2-mm2:
vmm:/home/akpm> ./bigfile_test /mnt/hda5/1
bytes to write[2000027648]
time[51]
1000 random 8192-byte accesses (single threaded)
time[11]
1000 random 2048-byte accesses (single threaded)
time [8]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [32]
1000 random 2048-byte accesses (with thrashers)
time [35]
vmm:/home/akpm> sync
vmm:/home/akpm> ./bigfile_test /mnt/hda5/1
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[8]
1000 random 2048-byte accesses (single threaded)
time [7]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [32]
1000 random 2048-byte accesses (with thrashers)
time [37]
2.4.20:
vmm:/home/akpm> ./bigfile_test /mnt/hda5/1
bytes to write[2000027648]
time[56]
1000 random 8192-byte accesses (single threaded)
time[11]
1000 random 2048-byte accesses (single threaded)
time [7]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [35]
1000 random 2048-byte accesses (with thrashers)
time [33]
vmm:/home/akpm> ./bigfile_test /mnt/hda5/1
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[8]
1000 random 2048-byte accesses (single threaded)
time [7]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [38]
1000 random 2048-byte accesses (with thrashers)
time [35]
2.4.25-pre8:
vmm:/home/akpm> ./bigfile_test /mnt/hda5/1
bytes to write[2000027648]
time[50]
1000 random 8192-byte accesses (single threaded)
time[10]
1000 random 2048-byte accesses (single threaded)
time [7]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [39]
1000 random 2048-byte accesses (with thrashers)
time [34]
vmm:/home/akpm> sync
vmm:/home/akpm> ./bigfile_test /mnt/hda5/1
bytes to write[0]
time[0]
1000 random 8192-byte accesses (single threaded)
time[7]
1000 random 2048-byte accesses (single threaded)
time [7]
Now spawning 4 threads to kill the same file
1000 random 8192-byte accesses (with thrashers)
time [36]
1000 random 2048-byte accesses (with thrashers)
time [34]
Longer-term, if your customers are using scsi, you should ensure that the
disks do not use a tag queue depth of more than 4 or 8. More than that and
the anticipatory scheduler becomes ineffective and you won't get that
multithreaded-read goodness.
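A sketch of capping the tag queue depth via sysfs (the path and the disk name `sda` are assumptions; the exact location varies by driver and kernel version, and some drivers take a module or boot parameter instead):

```shell
# Cap the tagged queue depth so the anticipatory scheduler stays
# effective ('sda' and the sysfs layout are illustrative)
echo 4 > /sys/block/sda/device/queue_depth
```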
Please stay in touch, btw. If we cannot get applications such as yours
working well, we've wasted our time...
* Re: Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded?
2004-01-30 4:56 ` Andrew Morton
@ 2004-01-30 6:34 ` Curt
2004-01-30 6:46 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Curt @ 2004-01-30 6:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
> > I finally narrowed it down to the disk subsystem, and the test program
> > shows the meat of it. When there is massive contention for a file, or
> > just heavy (VERY heavy) volume, the 2.6.1 kernel (presumably the
> > filesystem portion) falls over dead. The test program doesn't show
> > death, but could by just upping the thrasher count a bit.
>
> 2.6.1 had a readahead bug which will adversely affect workloads such as
> this. Apart from that, there is still quite a lot of tuning work to be
> done in 2.6 as the tree settles down and as more testers come on board.
> People are working on the VM and readahead code as we speak. It's always
> best to test the most up-to-date tree.
Will do.
> > I have been telling our customers for over a year now "Don't
> > worry, Linux will be able to rock with the new threading model" and
> > then...
>
> You wouldn't expect a huge gain from 2.6 in the I/O department.  Your
> workload is seek-limited. I am seeing some small benefits from the
> anticipatory I/O scheduler with your test though.
Nice to be amongst people who know that; usually I'm the one beating my head
against customers, trying to explain to them how even the fanciest
super-drives will be brought to their knees by our random-access patterns.
My kingdom for a quantum leap in seek penalties.
The gains we (Highwinds-Software) were expecting to see in 2.6.x were in the
thread scheduling, as a massively-multithreaded database application (1000
threads without breaking a sweat) we were very anxious for true kernel-space
LWPs in Linux. Our I/O world is well outside the CPU in RAID stacks; not a
whole lot any kernel can do to help us there, 99% of the time it's driver
issues.
> I'm fairly suspicious about the disparity between the time taken for those
> initial large writes. Two possible reasons come to mind:
>
> 1) Your disk isn't using DMA. Use `hdparm' to check it, and check your
> kernel IDE config if it is not using DMA.
To my great surprise DMA was NOT enabled, even though support for it was
(apparently) compiled into the kernel. That's a puzzle for another day. On
different hardware the problem did seem to go away.
> 2) 2.4 sets the dirty memory writeback thresholds much higher: 40%/60%
> vs 10%/40%. So on a 512M box it is possible that there is much more
> dirty, unwritten-back memory after the timing period has completed than
> under 2.6. Although this difference in tuning can affect real-world
> workloads, it is really an error in the testing methodology.  Generally,
> the timing should include an fsync() so that all I/O which the program
> issues has completed.
Yeah I know, but it was late, I just waited for the drive light to turn off
before I ran it the second time ;)
> Or you can put 2.6 on par by setting
> /proc/sys/vm/dirty_background_ratio to 40 and dirty_ratio to 60.
Okay will do, is there a good comprehensive resource where I can read up on
these (and presumably many other I/O related) variables?
> Longer-term, if your customers are using scsi, you should ensure that the
> disks do not use a tag queue depth of more than 4 or 8.  More than that
> and the anticipatory scheduler becomes ineffective and you won't get that
> multithreaded-read goodness.
I've heard-tell of tweaking the elevator parameter to 'deadline', again could
you point me to a resource where I can read up on this? And forgive the
newbie-question, but is this a boot-time parameter, or a bit I can set in
the /proc system, or both?
> Please stay in touch, btw. If we cannot get applications such as yours
> working well, we've wasted our time...
I'll do what I can to provide real-world feedback, I want this to work too.
-Curt
* Re: Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded?
2004-01-30 6:34 ` Curt
@ 2004-01-30 6:46 ` Andrew Morton
2004-01-30 7:13 ` Nick Piggin
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2004-01-30 6:46 UTC (permalink / raw)
To: Curt; +Cc: linux-kernel
"Curt" <curt@northarc.com> wrote:
>
> > Or you can put 2.6 on par by setting
> > /proc/sys/vm/dirty_background_ratio to 40 and dirty_ratio to 60.
>
> Okay will do, is there a good comprehensive resource where I can read up on
> these (and presumably many other I/O related) variables?
We've been relatively good about keeping the in-kernel documentation up to
date. For this stuff, see Documentation/filesystems/proc.txt and
Documentation/sysctl/vm.txt.
> > Longer-term, if your customers are using scsi, you should ensure that the
> > disks do not use a tag queue depth of more than 4 or 8.  More than that
> > and the anticipatory scheduler becomes ineffective and you won't get that
> > multithreaded-read goodness.
>
> I've heard-tell of tweaking the elevator parameter to 'deadline', again could
> you point me to a resource where I can read up on this? And forgive the
> newbie-question, but is this a boot-time parameter, or a bit I can set in
> the /proc system, or both?
It's boot-time only. We were working on making it per-disk but that was
quite complex and we really didn't get there in time.
So add `elevator=deadline' to your kernel boot command line. From my
(brief) testing, it was a significant lose. It needs more work though:
2.6+deadline shouldn't be slower than 2.4.x
> > Please stay in touch, btw. If we cannot get applications such as yours
> > working well, we've wasted our time...
>
> I'll do what I can to provide real-world feedback, I want this to work too.
Thanks.
* Re: Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded?
2004-01-30 6:46 ` Andrew Morton
@ 2004-01-30 7:13 ` Nick Piggin
0 siblings, 0 replies; 7+ messages in thread
From: Nick Piggin @ 2004-01-30 7:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Curt, linux-kernel
Andrew Morton wrote:
>"Curt" <curt@northarc.com> wrote:
>
>> > Or you can put 2.6 on par by setting
>> > /proc/sys/vm/dirty_background_ratio to 40 and dirty_ratio to 60.
>>
>> Okay will do, is there a good comprehensive resource where I can read up on
>> these (and presumably many other I/O related) variables?
>>
>
>We've been relatively good about keeping the in-kernel documentation up to
>date. For this stuff, see Documentation/filesystems/proc.txt and
>Documentation/sysctl/vm.txt.
>
>
>> > Longer-term, if your customers are using scsi, you should ensure that the
>> > disks do not use a tag queue depth of more than 4 or 8.  More than that
>> > and the anticipatory scheduler becomes ineffective and you won't get that
>> > multithreaded-read goodness.
>>
>> I've heard-tell of tweaking the elevator parameter to 'deadline', again could
>> you point me to a resource where I can read up on this? And forgive the
>> newbie-question, but is this a boot-time parameter, or a bit I can set in
>> the /proc system, or both?
>>
>
>It's boot-time only. We were working on making it per-disk but that was
>quite complex and we really didn't get there in time.
>
>So add `elevator=deadline' to your kernel boot command line. From my
>(brief) testing, it was a significant lose. It needs more work though:
>2.6+deadline shouldn't be slower than 2.4.x
>
>
Another thing you can do, which is runtime and per-disk, is to set
/sys/block/???/queue/iosched/antic_expire to 0, which gives you
something quite like deadline.
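Spelled out as a command (the disk name is illustrative; substitute your own device):

```shell
# Disable anticipation on one disk at runtime, approximating deadline
echo 0 > /sys/block/hda/queue/iosched/antic_expire
```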
>> > Please stay in touch, btw. If we cannot get applications such as yours
>> > working well, we've wasted our time...
>>
>> I'll do what I can to provide real-world feedback, I want this to work too.
>>
>
>Thanks.
>
I'd be interested in taking a look at the I/O scheduler if you have
problems with these workloads in the future.
Thread overview: 7+ messages
2004-01-29 22:44 Raw devices broken in 2.6.1? Curt Hartung
2004-01-30 0:38 ` Andrew Morton
2004-01-30 1:30 ` Raw devices broken in 2.6.1? AND- 2.6.1 I/O degraded? Curt Hartung
2004-01-30 4:56 ` Andrew Morton
2004-01-30 6:34 ` Curt
2004-01-30 6:46 ` Andrew Morton
2004-01-30 7:13 ` Nick Piggin