Problem with reiserfs volume


* Problem with reiserfs volume
@ 2009-04-04 17:25 Lelsie Rhorer
  2009-04-06 20:04 ` Corey Hickey
  0 siblings, 1 reply; 12+ messages in thread
From: Lelsie Rhorer @ 2009-04-04 17:25 UTC (permalink / raw)
  To: reiserfs-devel

I know this is a development list, so if I am posting in the wrong list,
please forgive me and point me toward the correct one.

I'm having a severe problem whose root cause I cannot determine.  I have a
RAID 6 array managed by mdadm running on Debian "Lenny" with a 3.2GHz AMD
Athlon 64 x 2 processor and 8G of RAM.  The kernel is 2.6.26-1-amd64.  There
are ten 1 Terabyte SATA drives, unpartitioned, fully allocated to the
/dev/md0 device. The drives are served by 3 Silicon Image SATA port
multipliers and a Silicon Image 4 port eSATA controller.  The /dev/md0
device is also unpartitioned, and all 8T of active space is formatted as a
single Reiserfs file system.  The entire volume is mounted to /RAID.
Various directories on the volume are shared using both NFS and SAMBA.

Performance of the RAID system is very good.  The array can read and write
at over 450 Mbps, and I don't know if the limit is the array itself or the
network, but since the performance is more than adequate I really am not
concerned which is the case.

The issue is the entire array will occasionally pause completely for about
40 seconds when a file is created.  This does not always happen, but the
situation is easily reproducible.  The frequency at which the symptom occurs
seems to be somewhat related to the transfer load on the array.  If no other
transfers are in process, then the failure seems somewhat more rare, perhaps
accompanying less than 1 file creation in 10.  During heavy file transfer
activity, sometimes the system halts with every other file creation.
Although I have observed many dozens of these events, I have never once
observed it to happen except when a file creation occurs. 
Reading and writing existing files never triggers the event, although any
read or write occurring during the event is halted for the duration. 
(There is one cron job which runs every half-hour that creates a tiny file;
this is the most common failure vector.)  There are other drives formatted
with other file systems on the machine, but the issue has never been seen on
any of the other drives.  When the array runs its regularly scheduled health
check, the problem is much worse.  Not only does it lock up with almost
every single file creation, but the lock-up time is much longer - sometimes
in excess of 2 minutes.
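
One way to put numbers on the pauses is to time each file creation directly.
The following is a minimal Python sketch, not something used in this thread:
the function name, the probe file names, and the one-second threshold are all
assumptions chosen for illustration.

```python
import os
import tempfile
import time


def time_file_creations(directory, count=20, stall_threshold=1.0):
    """Create `count` tiny files in `directory`, timing each creation.

    Returns a list of (path, seconds) pairs and prints every creation
    that took longer than `stall_threshold` seconds.
    """
    results = []
    for i in range(count):
        path = os.path.join(directory, "stall-probe-%04d.tmp" % i)
        start = time.monotonic()
        # O_EXCL makes open() fail if the file already exists, so every
        # successful call really is a new file creation.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            os.write(fd, b"x")
            os.fsync(fd)  # push the new file out to the device
        finally:
            os.close(fd)
        elapsed = time.monotonic() - start
        results.append((path, elapsed))
        if elapsed > stall_threshold:
            print("stall: %s took %.1f s" % (path, elapsed))
        os.remove(path)  # clean up; deletion is not part of the timing
    return results


if __name__ == "__main__":
    # A temporary directory keeps the demo harmless; to test the array,
    # pass a directory on the affected volume instead.
    with tempfile.TemporaryDirectory() as d:
        timings = time_file_creations(d, count=5)
        print("slowest create: %.3f s" % max(t for _, t in timings))
```

Pointed at a directory on the affected volume (for example somewhere under
/RAID), any line starting with "stall:" marks a creation that exceeded the
threshold.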

Transfers via Linux based utilities (ftp, NFS, cp, mv, rsync, etc) all
recover after the event, but SAMBA based transfers frequently fail, both
reads and writes.

I discussed the matter over on the linux-raid list, but so far none of the
suggestions there have yielded any great progress in fixing the issue.


* Re: Problem with reiserfs volume
  2009-04-04 17:25 Problem with reiserfs volume Lelsie Rhorer
@ 2009-04-06 20:04 ` Corey Hickey
  2009-04-28 23:53   ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Corey Hickey @ 2009-04-06 20:04 UTC (permalink / raw)
  To: lrhorer, reiserfs-devel

Lelsie Rhorer wrote:
> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created.  This does not always happen, but the
> situation is easily reproducible.  The frequency at which the symptom occurs
> seems to be somewhat related to the transfer load on the array.  If no other
> transfers are in process, then the failure seems somewhat more rare, perhaps
> accompanying less than 1 file creation in 10.  During heavy file transfer
> activity, sometimes the system halts with every other file creation.
> Although I have observed many dozens of these events, I have never once
> observed it to happen except when a file creation occurs. 
> Reading and writing existing files never triggers the event, although any
> read or write occurring during the event is halted for the duration. 
> (There is one cron job which runs every half-hour that creates a tiny file;
> this is the most common failure vector.)  There are other drives formatted
> with other file systems on the machine, but the issue has never been seen on
> any of the other drives.  When the array runs its regularly scheduled health
> check, the problem is much worse.  Not only does it lock up with almost
> every single file creation, but the lock-up time is much longer - sometimes
> in excess of 2 minutes.

This sounds somewhat like an intermittent problem I reported on 2008-02-20:

http://www.spinics.net/lists/reiserfs-devel/msg00702.html

The gist of the issue, apparently, was that writing files would cause
those files to be cached and the kernel would drop reiserfs bitmap data
to make room in the page cache. Once those bitmaps were dropped from the
cache and another file needed to be written, many bitmaps needed to be
read back from the disk in order to find free space. The bitmaps are
small, but spaced every 128 MB, so very many seeks were needed and the
read speed was quite slow.
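
To get a feel for the scale, the arithmetic can be sketched in Python. This is
only a rough estimate under stated assumptions: the 128 MB spacing comes from
the paragraph above, while the 10 ms per-seek cost and the reading of "8T" as
binary units are assumptions, and a real allocation would rarely need to read
every bitmap.

```python
BITMAP_SPACING_MB = 128  # one bitmap every 128 MB, as described above


def bitmap_count(fs_size_mb, spacing_mb=BITMAP_SPACING_MB):
    """Number of bitmaps on a filesystem of fs_size_mb megabytes."""
    return -(-fs_size_mb // spacing_mb)  # ceiling division


def worst_case_scan_seconds(fs_size_mb, seek_ms=10.0):
    """Time to read every bitmap if each read costs one full seek.

    seek_ms is an assumed figure, not a measurement from this thread.
    """
    return bitmap_count(fs_size_mb) * seek_ms / 1000.0


if __name__ == "__main__":
    size_mb = 8 * 1024 * 1024  # 8T, taken here as binary units
    print(bitmap_count(size_mb))             # 65536
    print(worst_case_scan_seconds(size_mb))  # 655.36
```

Even if only a fraction of those bitmaps had to be re-read, the seek time
alone could plausibly add up to a pause of many seconds.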

All that seeking caused the disk to buzz distinctively. Try listening
for that, or looking at the disk read/write activity with something like
dstat.

You can force bitmap data to be dropped and then re-read, in order to
find out what to look/listen for (change sdc4 to md0 or whatever):

# echo 1 > /proc/sys/vm/drop_caches
# debugreiserfs -m /dev/sdc4 > /dev/null

Here's what dstat looks like when I run the above commands:

-------------------
$ dstat -d -D sdc
--dsk/sdc--
 read  writ
 914k  221k
   0    16k
   0     0
   0     0
   0     0
  92k    0
 780k    0
 412k    0
 608k    0
 528k    0
 552k    0
 440k    0
 444k    0
 432k    0
 432k    0
 608k    0
 500k    0
 556k    0
 520k    0
 208k    0
   0     0
   0     0
   0     0
   0     0
-------------------
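
For reference, the read burst in that sample can be totalled with a few lines
of Python. The values are copied from the read column above; the only
assumption is that dstat was running at its default one-second interval.

```python
# Read column (KB per sample) from the burst in the dstat output above,
# from the first non-zero sample to the last.
reads_kb = [92, 780, 412, 608, 528, 552, 440, 444,
            432, 432, 608, 500, 556, 520, 208]

total_kb = sum(reads_kb)
seconds = len(reads_kb)  # one sample per second (assumed default interval)
print("total read: %d KB over %d s" % (total_kb, seconds))
print("average:    %.0f KB/s" % (total_kb / seconds))
```

That works out to roughly 7 MB read in 15 seconds, under 500 KB/s on average,
which is the kind of rate one would expect from a seek-bound workload.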

That might or might not be what's happening to you; my machine had much
less RAM, but also a much smaller array.

Jeff Mahoney was helpful and informative when I reported the issue, but
wasn't able to reproduce it on his system (neither could I, on a machine
with a larger filesystem and less RAM). I ended up switching to ext4 for
the problematic array, but most of my other filesystems are still
reiserfs and have never had that problem.

Good luck,
Corey


* RE: Problem with reiserfs volume
  2009-04-06 20:04 ` Corey Hickey
@ 2009-04-28 23:53   ` Leslie Rhorer
  2009-04-29  0:00     ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Leslie Rhorer @ 2009-04-28 23:53 UTC (permalink / raw)
  To: reiserfs-devel

> This sounds somewhat like an intermittent problem I reported on 2008-02-20:
> 
> http://www.spinics.net/lists/reiserfs-devel/msg00702.html
> 
> The gist of the issue, apparently, was that writing files would cause
> those files to be cached and the kernel would drop reiserfs bitmap data
> to make room in the page cache. Once those bitmaps were dropped from the
> cache and another file needed to be written, many bitmaps needed to be
> read back from the disk in order to find free space. The bitmaps are
> small, but spaced every 128 MB, so very many seeks were needed and the
> read speed was quite slow.
> 
> All that seeking caused the disk to buzz distinctively. Try listening
> for that, or looking at the disk read/write activity with something like
> dstat.

No, I did a fair bit of additional investigation, and the symptoms were
fairly odd.  When a halt would occur, all writes at every level would fall
to dead zero.  The reads at the array level would fall to zero on 5 of the
10 drives, while the other 5 would report a very low level of read activity,
but not zero.  It would always be the same 5 drives which dropped to zero
and the same 5 which still reported some reads going on.  Note if a RAID
resync was occurring, then all 10 drives would continue to report
significant read rates at the drive level, but array level read / writes
would stop altogether.  The likelihood of a halt event was fairly low if
there was no drive activity, and increased as the level of drive activity
(read or write) increased.  During a RAID resync, almost every file create
causes a halt.  After exhausting all my abilities to troubleshoot the issue,
I finally erased the entire array and reformatted it as XFS.  I am still
transferring the data from the backup to the RAID array, but with over 30%
of the data transferred and over 10,000 files created in the last several
days, I have not been able to trigger a halt event.  What's more, my file
delete performance for large files was very poor under Reiserfs.  A 20G file
could take upwards of 30 seconds to delete, although deleting a file never
caused a file system halt like creating a file did.  Under the new file
system, deleting a 20G file takes typically 0.1 seconds or less.

This definitely suggests there may be a problem with Reiserfs.  The only
things which changed from the last array to this one were the physical drive
locations in the array (I had swapped drives around to try to pinpoint the
issue), a Version 1.2 Superblock in the new array vs. 0.9 in the old array,
and a 256K chunk size rather than the default 64K to improve performance.


* RE: Problem with reiserfs volume
  2009-04-28 23:53   ` Leslie Rhorer
@ 2009-04-29  0:00     ` Leslie Rhorer
  2009-04-30  6:47       ` Corey Hickey
  0 siblings, 1 reply; 12+ messages in thread
From: Leslie Rhorer @ 2009-04-29  0:00 UTC (permalink / raw)
  To: reiserfs-devel



> > The gist of the issue, apparently, was that writing files would cause
> > those files to be cached and the kernel would drop reiserfs bitmap data
> > to make room in the page cache. Once those bitmaps were dropped from the
> > cache and another file needed to be written, many bitmaps needed to be
> > read back from the disk in order to find free space. The bitmaps are
> > small, but spaced every 128 MB, so very many seeks were needed and the
> > read speed was quite slow.
> >
> > All that seeking caused the disk to buzz distinctively. Try listening
> > for that, or looking at the disk read/write activity with something like
> > dstat.
> 
> No, I did a fair bit of additional investigation, and the symptoms were
> fairly odd.  When a halt would occur, all writes at every level would fall
> to dead zero.  The reads at the array level would fall to zero on 5 of the
> 10 drives, while the other 5 would report a very low level of read
> activity,
> but not zero.

Oops!  I'm sorry.  I mis-typed the sentences just above.  What I meant to
say was the write activity at both the array and drive level fell to zero.
The read activity at the array level also fell to zero, but at the drive
level 5 of the drives would still show activity.

> It would always be the same 5 drives which dropped to zero
> and the same 5 which still reported some reads going on.  Note if a RAID
> resync was occurring, then all 10 drives would continue to report
> significant read rates at the drive level, but array level read / writes
> would stop altogether.  The likelihood of a halt event was fairly low if
> there was no drive activity, and increased as the level of drive activity
> (read or write) increased.  During a RAID resync, almost every file create
> causes a halt.  After exhausting all my abilities to troubleshoot the
> issue,
> I finally erased the entire array and reformatted it as XFS.  I am still
> transferring the data from the backup to the RAID array, but with over 30%
> of the data transferred and over 10,000 files created in the last several
> days, I have not been able to trigger a halt event.  What's more, my file
> delete performance for large files was very poor under Reiserfs.  A 20G
> file
> could take upwards of 30 seconds to delete, although deleting a file never
> caused a file system halt like creating a file did.  Under the new file
> system, deleting a 20G file takes typically 0.1 seconds or less.
> 
> This definitely suggests there may be a problem with Reiserfs.  The only
> things which changed from the last array to this one were the physical
> drive
> locations in the array (I had swapped drives around to try to pinpoint the
> issue), a Version 1.2 Superblock in the new array vs. 0.9 in the old
> array,
> and a 256K chunk size rather than the default 64K to improve performance.
> 



* Re: Problem with reiserfs volume
  2009-04-29  0:00     ` Leslie Rhorer
@ 2009-04-30  6:47       ` Corey Hickey
  2009-05-03  1:58         ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Corey Hickey @ 2009-04-30  6:47 UTC (permalink / raw)
  To: lrhorer, reiserfs-devel

Leslie Rhorer wrote:
>>> The gist of the issue, apparently, was that writing files would cause
>>> those files to be cached and the kernel would drop reiserfs bitmap data
>>> to make room in the page cache. Once those bitmaps were dropped from the
>>> cache and another file needed to be written, many bitmaps needed to be
>>> read back from the disk in order to find free space. The bitmaps are
>>> small, but spaced every 128 MB, so very many seeks were needed and the
>>> read speed was quite slow.
>>>
>>> All that seeking caused the disk to buzz distinctively. Try listening
>>> for that, or looking at the disk read/write activity with something like
>>> dstat.
>> No, I did a fair bit of additional investigation, and the symptoms were
>> fairly odd.  When a halt would occur, all writes at every level would fall
>> to dead zero.  The reads at the array level would fall to zero on 5 of the
>> 10 drives, while the other 5 would report a very low level of read
>> activity,
>> but not zero.
> 
> Oops!  I'm sorry.  I mis-typed the sentences just above.  What I meant to
> say was the write activity at both the array and drive level fell to zero.
> The read activity at the array level also fell to zero, but at the drive
> level 5 of the drives would still show activity.

Are you sure the read activity for the array was 0? If the array wasn't
doing anything but the individual drives were, that would indicate a
lower-level problem than the filesystem; unless I'm missing something,
the filesystem can't do anything to the individual drives without it
showing up as read/write from/to the array device.
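
One way to check this independently of iostat is to compare two snapshots of
/proc/diskstats. The sketch below is an illustration only: the function names
are hypothetical, and it assumes the usual field order (major, minor, device
name, reads completed, reads merged, sectors read, ...) and that the md device
has its own entry there.

```python
import os
import time


def read_sectors(diskstats_text):
    """Map device name -> sectors read, from /proc/diskstats content."""
    counts = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 6:
            # fields[2] is the device name, fields[5] the sectors read
            counts[fields[2]] = int(fields[5])
    return counts


def read_deltas(before_text, after_text):
    """Sectors read per device between two snapshots."""
    before = read_sectors(before_text)
    after = read_sectors(after_text)
    return {name: after[name] - before[name]
            for name in after if name in before}


if __name__ == "__main__":
    path = "/proc/diskstats"
    if os.path.exists(path):
        with open(path) as f:
            first = f.read()
        time.sleep(1)
        with open(path) as f:
            second = f.read()
        for name, delta in sorted(read_deltas(first, second).items()):
            if name.startswith(("md", "sd")):
                print("%-8s %d sectors read" % (name, delta))
    else:
        print("no /proc/diskstats on this system")
```

If md0 shows a delta of zero during a stall while some sd devices do not,
that would confirm the iostat observation.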

Aside from that, everything you've written seems to be consistent with
my hypothesis that you had a bitmap caching problem. Or maybe I'm just
falling prey to confirmation bias.

Did you ever test with dstat and debugreiserfs like I mentioned earlier
in this thread?

>> It would always be the same 5 drives which dropped to zero
>> and the same 5 which still reported some reads going on. 

I did the math and (if a couple reasonable assumptions I made are
correct), then the reiserfs bitmaps would indeed be distributed among
five of 10 drives in a RAID-6.

If you're interested, ask, and I'll write it up.

>> Note if a RAID
>> resync was occurring, then all 10 drives would continue to report
>> significant read rates at the drive level, but array level read / writes
>> would stop altogether.  The likelihood of a halt event was fairly low if
>> there was no drive activity, and increased as the level of drive activity
>> (read or write) increased.  During a RAID resync, almost every file create
>> causes a halt. 

Perhaps because the resync I/O caused the bitmap data to fall off the
page cache.

>> After exhausting all my abilities to troubleshoot the
>> issue,
>> I finally erased the entire array and reformatted it as XFS.  I am still
>> transferring the data from the backup to the RAID array, but with over 30%
>> of the data transferred and over 10,000 files created in the last several
>> days, I have not been able to trigger a halt event.  What's more, my file
>> delete performance for large files was very poor under Reiserfs.  A 20G
>> file
>> could take upwards of 30 seconds to delete, although deleting a file never
>> caused a file system halt like creating a file did.  Under the new file
>> system, deleting a 20G file takes typically 0.1 seconds or less.

I remember being annoyed by large file deletion performance before, but
I can't reproduce it right now (with kernel 2.6.28.2).

In any case, I've switched my large filesystem to ext4, so far without
any regrets. My remaining filesystems are mostly still reiserfs, and
I'll eventually migrate them, but I'm in no hurry.

-Corey


* RE: Problem with reiserfs volume
  2009-04-30  6:47       ` Corey Hickey
@ 2009-05-03  1:58         ` Leslie Rhorer
  2009-05-03 23:54           ` Corey Hickey
  0 siblings, 1 reply; 12+ messages in thread
From: Leslie Rhorer @ 2009-05-03  1:58 UTC (permalink / raw)
  To: reiserfs-devel

> >> No, I did a fair bit of additional investigation, and the symptoms were
> >> fairly odd.  When a halt would occur, all writes at every level would fall
> >> to dead zero.  The reads at the array level would fall to zero on 5 of the
> >> 10 drives, while the other 5 would report a very low level of read
> >> activity, but not zero.
> >
> > Oops!  I'm sorry.  I mis-typed the sentences just above.  What I meant to
> > say was the write activity at both the array and drive level fell to zero.
> > The read activity at the array level also fell to zero, but at the drive
> > level 5 of the drives would still show activity.
> 
> Are you sure the read activity for the array was 0?

Yep.  According to iostat, absolute zilch.

> If the array wasn't
> doing anything but the individual drives were, that would indicate a
> lower-level problem than the filesystem;

It could, yes.  In fact, it is not unlikely to be an interaction failure
between the file system and the RAID device management system (/dev/md0, or
whatever).

> unless I'm missing something,
> the filesystem can't do anything to the individual drives without it
> showing up as read/write from/to the array device.

I don't know if that's true or not.  Certainly if the FS is RAID aware, it
can query the RAID system for details about the array and its member
elements (XFS, for example, does just this in order to automatically set up
stripe width during format).  There's nothing to prevent the FS from
issuing commands directly to the drive management system (/dev/sda, /dev/sdb,
etc.).


> Aside from that, everything you've written seems to be consistent with
> my hypothesis that you had a bitmap caching problem. Or maybe I'm just
> falling prey to confirmation bias.
> 
> Did you ever test with dstat and debugreiserfs like I mentioned earlier
> in this thread?

Yes to the first and no to the second.  I must have missed the reference in
all the correspondence.  'Sorry about that.

> >> It would always be the same 5 drives which dropped to zero
> >> and the same 5 which still reported some reads going on.
> 
> I did the math and (if a couple reasonable assumptions I made are
> correct), then the reiserfs bitmaps would indeed be distributed among
> five of 10 drives in a RAID-6.
> 
> If you're interested, ask, and I'll write it up.

It's academic, but I'm curious.  Why would the default parameters have
failed?

> >> Note if a RAID
> >> resync was occurring, then all 10 drives would continue to report
> >> significant read rates at the drive level, but array level read / writes
> >> would stop altogether.  The likelihood of a halt event was fairly low if
> >> there was no drive activity, and increased as the level of drive activity
> >> (read or write) increased.  During a RAID resync, almost every file create
> >> causes a halt.
> 
> Perhaps because the resync I/O caused the bitmap data to fall off the
> page cache.

How would that happen?  More to the point, how would it happen without
triggering activity in the FS?

> >> After exhausting all my abilities to troubleshoot the issue,
> >> I finally erased the entire array and reformatted it as XFS.  I am still
> >> transferring the data from the backup to the RAID array, but with over 30%
> >> of the data transferred and over 10,000 files created in the last several
> >> days, I have not been able to trigger a halt event.  What's more, my file
> >> delete performance for large files was very poor under Reiserfs.  A 20G file
> >> could take upwards of 30 seconds to delete, although deleting a file never
> >> caused a file system halt like creating a file did.  Under the new file
> >> system, deleting a 20G file takes typically 0.1 seconds or less.
> 
> I remember being annoyed by large file deletion performance before, but
> I can't reproduce it right now (with kernel 2.6.28.2).

Certainly I'm not having the problem, now.  With more than half the data (3T
out of 5.8T) transferred, I haven't had a single halt and deleting a 23G
file takes less than 0.9 seconds, where before it took up to 30 seconds.




* Re: Problem with reiserfs volume
  2009-05-03  1:58         ` Leslie Rhorer
@ 2009-05-03 23:54           ` Corey Hickey
  2009-05-05  8:43             ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Corey Hickey @ 2009-05-03 23:54 UTC (permalink / raw)
  To: lrhorer, reiserfs-devel

Leslie Rhorer wrote:
>>> The read activity at the array level also fell to zero, but at the drive
>>> level 5 of the drives would still show activity.
>> Are you sure the read activity for the array was 0?
> 
> Yep.  According to iostat, absolute zilch.

Peculiar. I cannot explain that.

>> If the array wasn't
>> doing anything but the individual drives were, that would indicate a
>> lower-level problem than the filesystem;
> 
> It could, yes.  In fact, it is not unlikely to be an interaction failure
> between the file system and the RAID device management system (/dev/md0, or
> whatever).
> 
>> unless I'm missing something,
>> the filesystem can't do anything to the individual drives without it
>> showing up as read/write from/to the array device.
> 
> I don't know if that's true or not.  Certainly if the FS is RAID aware, it
> can query the RAID system for details about the array and its member
> elements (XFS, for example, does just this in order to automatically set up
> stripe width during format).

For XFS, this appears to be done by mkfs.xfs via a GET_ARRAY_INFO ioctl
on the md block device. See the xfsprogs source, libdisk/md.c,
md_get_subvol_stripe().

> There's nothing to prevent the FS from
> issuing commands directly to the drive management system (/dev/sda, /dev/sdb,
> etc.).

That seems to me like it would be opening a can of worms. It's the job
of md (or lvm, dm, etc.) to figure out which disk (or partition, or
file, etc.) to read/write; otherwise, the filesystem would have to
consider a number of factors, even besides RAID layout. Someone please
correct me if I'm mistaken....

>> Did you ever test with dstat and debugreiserfs like I mentioned earlier
>> in this thread?
> 
> Yes to the first and no to the second.  I must have missed the reference in
> all the correspondence.  'Sorry about that.

That's ok.

>>>> It would always be the same 5 drives which dropped to zero
>>>> and the same 5 which still reported some reads going on.
>> I did the math and (if a couple reasonable assumptions I made are
>> correct), then the reiserfs bitmaps would indeed be distributed among
>> five of 10 drives in a RAID-6.
>>
>> If you're interested, ask, and I'll write it up.
> 
> It's academic, but I'm curious.  Why would the default parameters have
> failed?

It's not exactly a "failure"--it's just that the bitmaps are placed
every 128 MB, and that results in a certain distribution among your disks.

bitmap_freq = 128 MB * 1024 KB/MB = 131072 KB

For a simple example, first consider a 2-disk RAID-0 with the default 64
KB chunk size.

num_data_disks = 2
chunk_size = 64 KB
stripe_size = chunk_size * num_data_disks = 128 KB
stripe_offset = bitmap_freq / stripe_size = 1024

131072 is a multiple of 128, so the bitmaps are all on the same disk,
1024 stripes apart.

Now consider a 3-disk RAID-0. 131072 is not a multiple of 192.

num_data_disks = 3
chunk_size = 64 KB
stripe_size = chunk_size * num_data_disks = 192 KB
stripe_offset = bitmap_freq / stripe_size = 682.6666....

Bitmaps are 682 and 2/3 stripes apart. 2/3 of a 3-chunk stripe is 2
chunks, so if the first bitmap is on the first disk, the next bitmap
would be on the third disk, then the second disk, then back to the
first: (A,C,B,...). In this case the bitmaps would be spread among all
three disks.
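
The two RAID-0 cases above can be checked with a short Python sketch. It
assumes, as the examples do, that bitmaps sit at exact multiples of 128 MB
from the start of the array and that chunks are laid out round-robin across
the disks; the function names are made up for illustration.

```python
BITMAP_FREQ_KB = 128 * 1024  # one bitmap every 128 MB


def raid0_bitmap_disks(num_disks, chunk_kb=64, bitmaps=6):
    """Disk index (0 = A) holding each of the first `bitmaps` bitmaps."""
    disks = []
    for k in range(bitmaps):
        # Which chunk of the array the k-th bitmap falls in
        chunk_number = (k * BITMAP_FREQ_KB) // chunk_kb
        # RAID-0 hands out chunks to the disks in round-robin order
        disks.append(chunk_number % num_disks)
    return disks


def letters(indices):
    """Render disk indices as the letters used in the text (0 -> A)."""
    return ",".join("ABCDEFGHIJ"[i] for i in indices)


if __name__ == "__main__":
    print(letters(raid0_bitmap_disks(2)))  # A,A,A,A,A,A
    print(letters(raid0_bitmap_disks(3)))  # A,C,B,A,C,B
```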

Now let's look at your 10-disk RAID-6. This is more complicated because
we have to consider that two chunks out of each stripe hold parity, and
that the chunk layout changes with each stripe. Here's where I have to
make an assumption: I can't find out whether the layout methods for
RAID-6 are the same as for RAID-5. If they are, the layout for your RAID
will be like this (the default left-symmetric) or at least substantially
similar.

         disk  ABCDEFGHIJ

stripe 0:      abcdefghPP
stripe 1:      bcdefghPPa
stripe 2:      cdefghPPab
stripe 3:      defghPPabc
stripe 4:      efghPPabcd
stripe 5:      fghPPabcde
stripe 6:      ghPPabcdef
stripe 7:      hPPabcdefg
stripe 8:      PPabcdefgh
stripe 9:      PabcdefghP

Note that the layout repetition period is the same as the number of
disks. So...

chunk_size = 64 KB
num_disks = 10
num_data_disks = num_disks - 2 = 8
stripe_size = chunk_size * num_data_disks = 512 KB
stripe_offset = bitmap_freq / stripe_size = 256

131072 is a multiple of 512, so the bitmaps are all on the first chunk
of a stripe, 256 stripes apart; however, 256 is not a multiple of the
chunk layout period, so, for each stripe that holds a bitmap, the
position of the first chunk will vary.

chunk_layout_period = num_disks = 10
stripe_layout_offset = stripe_offset % chunk_layout_period = 6

That means each subsequent bitmap will be 6 stripes later within the
stripe layout pattern: 0,6,2,8,4,...

The first chunk is chunk "a", so, for each of those stripes, find which
disk chunk "a" is on in the layout table above. That yields disks
A,E,I,C,G: five disks out of the ten, just like you reported.
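
The same walk through the layout table can be written as a Python sketch. It
relies on the same assumptions as the text: the RAID-6 layout rotates as in
the table above, and bitmaps sit at exact multiples of 128 MB from the start
of the array. The function names are hypothetical.

```python
BITMAP_FREQ_KB = 128 * 1024   # one bitmap every 128 MB
BASE_ROW = "abcdefghPP"       # stripe 0 of the layout table above
DATA_LETTERS = "abcdefgh"     # data chunks within a stripe, in order
DISK_NAMES = "ABCDEFGHIJ"


def layout_row(stripe):
    """Row of the layout table for a stripe: BASE_ROW rotated left."""
    shift = stripe % len(BASE_ROW)
    return BASE_ROW[shift:] + BASE_ROW[:shift]


def raid6_bitmap_disks(chunk_kb=64, bitmaps=5):
    """Disk letter holding each of the first `bitmaps` bitmaps."""
    stripe_kb = chunk_kb * len(DATA_LETTERS)  # data per stripe
    disks = []
    for k in range(bitmaps):
        offset_kb = k * BITMAP_FREQ_KB
        stripe = offset_kb // stripe_kb
        chunk_in_stripe = (offset_kb % stripe_kb) // chunk_kb
        chunk_letter = DATA_LETTERS[chunk_in_stripe]
        # Find which disk holds that data chunk in this stripe's row
        disks.append(DISK_NAMES[layout_row(stripe).index(chunk_letter)])
    return disks


if __name__ == "__main__":
    print(",".join(raid6_bitmap_disks()))                    # A,E,I,C,G
    print(sorted(set(raid6_bitmap_disks(bitmaps=100))))      # five disks
```

After the first five bitmaps the pattern repeats, so no other disk ever
receives a bitmap under these assumptions.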

(Hopefully I didn't screw up too much of that.)

>>>> During a RAID resync, almost every file create causes a halt.
>> Perhaps because the resync I/O caused the bitmap data to fall off the
>> page cache.
> 
> How would that happen?  More to the point, how would it happen without
> triggering activity in the FS?

That was sort of a speculative statement, and I can't really back it up
because I don't know the details of how the page cache fits in, but IF
the data read and written during a resync gets cached, then the page
cache might prefer to retain that data rather than the bitmap data.

If the bitmap data never stays in the page cache for long, then a file
write would pretty much always require some bitmaps to be re-read.

-Corey


* RE: Problem with reiserfs volume
  2009-05-03 23:54           ` Corey Hickey
@ 2009-05-05  8:43             ` Leslie Rhorer
  2009-05-05 23:40               ` Corey Hickey
  0 siblings, 1 reply; 12+ messages in thread
From: Leslie Rhorer @ 2009-05-05  8:43 UTC (permalink / raw)
  To: reiserfs-devel

> >> If the array wasn't
> >> doing anything but the individual drives were, that would indicate a
> >> lower-level problem than the filesystem;
> >
> > It could, yes.  In fact, it is not unlikely to be an interaction failure
> > between the file system and the RAID device management system (/dev/md0, or
> > whatever).
> >
> >> unless I'm missing something,
> >> the filesystem can't do anything to the individual drives without it
> >> showing up as read/write from/to the array device.
> >
> > I don't know if that's true or not.  Certainly if the FS is RAID aware, it
> > can query the RAID system for details about the array and its member
> > elements (XFS, for example, does just this in order to automatically set up
> > stripe width during format).
> 
> For XFS, this appears to be done by mkfs.xfs via a GET_ARRAY_INFO ioctl
> on the md block device. See the xfsprogs source, libdisk/md.c,
> md_get_subvol_stripe().
> 
> > There's nothing to prevent the FS from
> > issuing commands directly to the drive management system (/dev/sda, /dev/sdb,
> > etc.).
> 
> That seems to me like it would be opening a can of worms.

It surely would.  'Doesn't necessarily mean someone didn't.  I have an idea,
though...

> >> Did you ever test with dstat and debugreiserfs like I mentioned earlier
> >> in this thread?
> >
> > Yes to the first and no to the second.  I must have missed the reference in
> > all the correspondence.  'Sorry about that.
> 
> That's ok.
> 
> >>>> It would always be the same 5 drives which dropped to zero
> >>>> and the same 5 which still reported some reads going on.
> >> I did the math and (if a couple reasonable assumptions I made are
> >> correct), then the reiserfs bitmaps would indeed be distributed among
> >> five of 10 drives in a RAID-6.
> >>
> >> If you're interested, ask, and I'll write it up.
> >
> > It's academic, but I'm curious.  Why would the default parameters have
> > failed?
> 
> It's not exactly a "failure"--it's just that the bitmaps are placed
> every 128 MB, and that results in a certain distribution among your disks.

This triggered a thought.  When I built the array, it was physically in a
temporary configuration, so that while /dev/sda was drive 0 in the array
and /dev/sdj was drive 9 in the array when it was built, the drives were
moved in a piecemeal fashion to the new chassis, so that the order was
something like /dev/sdf, /dev/sdg, /dev/sdh, /dev/sdi, /dev/sdj, /dev/sda,
/dev/sde, /dev/sdd, /dev/sdc, /dev/sdb, or something like that.  This
shouldn't create a problem, as md handles RAID assembly based upon the drive
superblock, not the udev assignment.  Is it possible the re-arrangement
caused a failure of the bitmap somehow?

It still doesn't quite explain to me how a high read rate strictly at the
drive level (e.g. ckarray) causes severe problems at the FS level, while an
idle system did not exhibit nearly the frequency of problems nor did the
hang last even a fraction as long (40 seconds vs. 20 minutes).
 
> That means each subsequent bitmap will be 6 stripes later within the
> stripe layout pattern: 0,6,2,8,4,...
> 
> The first chunk is chunk "a", so, for each of those stripes, find which
> disk chunk "a" is on in the layout table above. That yields disks
> A,E,I,C,G: five disks out of the ten, just like you reported.

Yeah, that's about right.

> 
> 
> (Hopefully I didn't screw up too much of that.)
> 
> >>>> During a RAID resync, almost every file create causes a halt.
> >> Perhaps because the resync I/O caused the bitmap data to fall off the
> >> page cache.
> >
> > How would that happen?  More to the point, how would it happen without
> > triggering activity in the FS?
> 
> That was sort of a speculative statement, and I can't really back it up
> because I don't know the details of how the page cache fits in, but IF
> the data read and written during a resync gets cached, then the page
> cache might prefer to retain that data rather than the bitmap data.
> 
> If the bitmap data never stays in the page cache for long, then a file
> write would pretty much always require some bitmaps to be re-read.

Except this happened without any file writes or reads other than the file
creation itself and with no disk activity other than the array re-sync.




* Re: Problem with reiserfs volume
  2009-05-05  8:43             ` Leslie Rhorer
@ 2009-05-05 23:40               ` Corey Hickey
  2009-05-06  2:04                 ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Corey Hickey @ 2009-05-05 23:40 UTC (permalink / raw)
  To: lrhorer; +Cc: reiserfs-devel

Leslie Rhorer wrote:
>>>>>> It would always be the same 5 drives which dropped to zero
>>>>>> and the same 5 which still reported some reads going on.
>>>> I did the math and (if a couple reasonable assumptions I made are
>>>> correct), then the reiserfs bitmaps would indeed be distributed among
>>>> five of 10 drives in a RAID-6.
>>>>
>>>> If you're interested, ask, and I'll write it up.
>>> It's academic, but I'm curious.  Why would the default parameters have
>>> failed?
>> It's not exactly a "failure"--it's just that the bitmaps are placed
>> every 128 MB, and that results in a certain distribution among your disks.
> 
> This triggered a thought.  When I built the array, it was physically in a
> temporary configuration, so that while /dev/sda was drive 0 in the array
> and /dev/sdj was drive 9 in the array when it was built, the drives were
> moved in a piecemeal fashion to the new chassis, so that the order was
> something like /dev/sdf, /dev/sdg, /dev/sdh, /dev/sdi, /dev/sdj, /dev/sda,
> /dev/sde, /dev/sdd, /dev/sdc, /dev/sdb, or something like that.  This
> shouldn't create a problem, as md handles RAID assembly based upon the drive
> superblock, not the udev assignment.  Is it possible the re-arrangement
> caused a failure of the bitmap somehow?

That should be fine.

I might not have been clear on this before: reading the bitmap data is
slow because it is distributed every 128 MB across the filesystem; this
means that in order to read lots of bitmaps, the disk spends most of its
time seeking rather than reading. For me, that's what was causing the
disk to "buzz", and that's why dstat showed read rates of only 400-600
KB/sec.

I just ran a quick test on my single-disk reiserfs and calculated the
average seek rate:

fs_size = 242341144 KB
bitmap_spacing = 128 MB = 131072 KB
num_bitmaps = fs_size / bitmap_spacing = 1849
bitmaps_read_time = 15.5 sec   (from debugreiserfs -m)
bitmap_read_rate = num_bitmaps / bitmaps_read_time = 119 bitmaps/sec
seek_rate = bitmap_read_rate = 119 seeks/sec  (seek to every bitmap)

That's a lot of seeking!
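
The same arithmetic as a small Python sketch, in case anyone wants to plug in
their own numbers. The function names are made up for illustration, and it
assumes one seek per bitmap and a bitmap count rounded up to cover a partial
final 128 MB interval.

```python
import math

BITMAP_SPACING_KB = 128 * 1024   # one bitmap per 128 MB of filesystem


def bitmap_count(fs_size_kb):
    """Number of bitmaps, rounding up for a partial final interval."""
    return math.ceil(fs_size_kb / BITMAP_SPACING_KB)


def seek_rate(fs_size_kb, read_time_sec):
    """Bitmaps read per second, assuming one seek per bitmap."""
    return bitmap_count(fs_size_kb) / read_time_sec


num_bitmaps = bitmap_count(242341144)   # fs_size from above, in KB
rate = seek_rate(242341144, 15.5)       # 15.5 sec from debugreiserfs -m
print(num_bitmaps, round(rate))         # 1849 bitmaps, about 119 seeks/sec
```

Feeding in a larger fs_size gives a rough idea of how long a cold read of
every bitmap might take on a bigger filesystem, but only if the per-seek cost
is similar, which is a guess.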

Having the bitmaps spread out among several disks of a RAID probably
wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
that would be bad unless it knew the RAID layout. So, each disk would
just be idle when it wasn't its turn to seek and read another bitmap.

Remember how in the old days (before 2.6.19, I think) large reiserfs
filesystems took forever to mount? That's because reiserfs was reading
all the bitmap data and caching it internally. Eventually Jeff Mahoney
wrote a patch to make reiserfs read bitmap data on-demand and just let
the kernel cache them (or not).

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5065227b46235ec0131b383cc2f537069b55c6b6

> It still doesn't quite explain to me how a high read rate strictly at the
> drive level (e.g. ckarray) causes severe problems at the FS level, while an
> idle system did not exhibit nearly the frequency of problems nor did the
> hang last even a fraction as long (40 seconds vs. 20 minutes).

20 minutes sounds excessive, even when competing with a resync. I
couldn't say, and can't test it here.

>>>>>> During a RAID resync, almost every file create causes a halt.
>>>> Perhaps because the resync I/O caused the bitmap data to fall off the
>>>> page cache.
>>> How would that happen?  More to the point, how would it happen without
>>> triggering activity in the FS?
>> That was sort of a speculative statement, and I can't really back it up
>> because I don't know the details of how the page cache fits in, but IF
>> the data read and written during a resync gets cached, then the page
>> cache might prefer to retain that data rather than the bitmap data.
>>
>> If the bitmap data never stays in the page cache for long, then a file
>> write would pretty much always require some bitmaps to be re-read.
> 
> Except this happened without any file writes or reads other than the file
> creation itself and with no disk activity other than the array re-sync.

I remember even 0-byte files taking a long time to write. My guess would
be that reiserfs doesn't know the file will end up being empty when the
file is created, or perhaps it tries to find some contiguous space
anyway so the file can be appended to without excessive fragmentation.

In order to find contiguous space, reiserfs needs to look at the
bitmaps; if enough bitmap data isn't cached, reiserfs will have to read
some, which, as we know, can take a long time.

-Corey


* RE: Problem with reiserfs volume
  2009-05-05 23:40               ` Corey Hickey
@ 2009-05-06  2:04                 ` Leslie Rhorer
  2009-05-07  5:59                   ` Corey Hickey
  0 siblings, 1 reply; 12+ messages in thread
From: Leslie Rhorer @ 2009-05-06  2:04 UTC (permalink / raw)
  To: reiserfs-devel

> I might not have been clear on this before: reading the bitmap data is
> slow because it is distributed every 128 MB across the filesystem; this
> means that in order to read lots of bitmaps, the disk spends most of its
> time seeking rather than reading. For me, that's what was causing the
> disk to "buzz", and that's why dstat showed read rates of only 400-600
> KB/sec.

Yeah, but reads and writes worked just fine: up to 450 Mbps.  Appending to
an existing file (or writing several GB to a file once the create was done)
ran like a racehorse on one or several files without ever a burp.  Reading
could be accomplished flat-out no matter what, but with total disk activity
well in excess of 500Mbps, everything would suddenly halt if a file was
created on an intermittent basis.  Perhaps one create in five or so would
trigger the issue if high volumes of data were being read and / or written,
except when a resync was under way, in which case almost every file create
would generate a pause.  During normal operation the pause would almost
always last exactly 40 seconds.  During resync, the pause lasted as much as
20 minutes.

> I just ran a quick test on my single-disk reiserfs and calculated the
> average seek rate:
> 
> fs_size = 242341144 KB
> bitmap_spacing = 128 MB = 131072 KB
> num_bitmaps = fs_size / bitmap_spacing = 1849
> bitmaps_read_time = 15.5 sec   (from debugreiserfs -m)
> bitmap_read_rate = num_bitmaps / bitmaps_read_time = 119 bitmaps/sec
> seek_rate = bitmap_read_rate = 119 seeks/sec  (seek to every bitmap)
> 
> That's a lot of seeking!

No question, but under ordinary read and write loads, the system handled the
situation with aplomb.  Create ten 20 byte files over a period of 30
minutes, however, and it would halt perhaps 3 - 5 times.  Under light loads,
perhaps 1 in 10 times, although sometimes even with heavy loads I would
create 30 or 40 files or more with no symptoms.  During a resync, however, a
halt was all but guaranteed with every creation.
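
A minimal sketch of how the create stalls could be measured, for anyone who
wants to reproduce this. It is not the procedure I actually used; the function
name, the 1 second stall threshold, and the file names are all made up for
illustration. It creates small files one at a time, forces each to disk, and
records how long each create took.

```python
import os
import tempfile
import time


def time_file_creates(directory, count=10, payload=b"x" * 20, stall_sec=1.0):
    """Create `count` small files; return (durations, indices of slow creates)."""
    durations = []
    for i in range(count):
        path = os.path.join(directory, f"create_test_{i}.tmp")
        start = time.monotonic()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())   # push the new file's data out to disk
        durations.append(time.monotonic() - start)
        os.remove(path)            # clean up so the test leaves nothing behind
    stalled = [i for i, d in enumerate(durations) if d >= stall_sec]
    return durations, stalled


if __name__ == "__main__":
    # A temporary directory is used here; point this at a directory on the
    # array (for example somewhere under /RAID) to test the real volume.
    with tempfile.TemporaryDirectory() as d:
        durations, stalled = time_file_creates(d)
        print(f"max create time: {max(durations):.3f}s, stalls: {stalled}")
```

Running it while other transfers or a resync are in progress, and again on an
idle array, should show whether the stall frequency follows the load.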

> Having the bitmaps spread out among several disks of a RAID probably
> wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
> that would be bad unless it knew the RAID layout. So, each disk would
> just be idle when it wasn't its turn to seek and read another bitmap.

With 400+ Mbps of data being read and written, the discs weren't idle very
much.

> Remember how in the old days (before 2.6.19, I think) large reiserfs
> filesystems took forever to mount?

I have only been using reiserfs for a short time.

> > It still doesn't quite explain to me how a high read rate strictly at the
> > drive level (e.g. ckarray) causes severe problems at the FS level, while an
> > idle system did not exhibit nearly the frequency of problems nor did the
> > hang last even a fraction as long (40 seconds vs. 20 minutes).
> 
> 20 minutes sounds excessive, even when competing with a resync. I
> couldn't say, and can't test it here.

More to the point, reads and writes didn't have any problem competing with
the resync.  When accessing a file for either read or write, the data
transfer would begin in earnest within 2 or 3 seconds, with other activity
continuing unabated. An ls would return in a fraction of a second.  Once the
halt occurred, however, an ls would not return until the event had resolved.

> > Except this happened without any file writes or reads other than the file
> > creation itself and with no disk activity other than the array re-sync.
> 
> I remember even 0-byte files taking a long time to write. My guess would
> be that reiserfs doesn't know the file will end up being empty when the
> file is created, or perhaps it tries to find some contiguous space
> anyway so the file can be appended to without excessive fragmentation.

So why didn't it happen when appending data to an existing file?  Once a
file was created, large or small, I could write freely to it over and over,
either appending data or writing over data.


* Re: Problem with reiserfs volume
  2009-05-06  2:04                 ` Leslie Rhorer
@ 2009-05-07  5:59                   ` Corey Hickey
  2009-05-11 16:37                     ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Corey Hickey @ 2009-05-07  5:59 UTC (permalink / raw)
  To: lrhorer; +Cc: reiserfs-devel

Leslie Rhorer wrote:
>> I might not have been clear on this before: reading the bitmap data is
>> slow because it is distributed every 128 MB across the filesystem; this
>> means that in order to read lots of bitmaps, the disk spends most of its
>> time seeking rather than reading. For me, that's what was causing the
>> disk to "buzz", and that's why dstat showed read rates of only 400-600
>> KB/sec.
> 
> Yeah, but reads and writes worked just fine: up to 450 Mbps. 

I mean, above, that read rates would fall to 400-600 KB/sec when the
filesystem was busy reading bitmap data. That at least roughly
corresponds to what you wrote on 2009-04-28: "The reads at the array
level would fall to zero on 5 of the 10 drives, while the other 5 would
report a very low level of read activity, but not zero."

> Appending to
> an existing file (or writing several GB to a file once the create was done)
> ran like a racehorse on one or several files without ever a burp.  Reading
> could be accomplished flat-out no matter what, but with total disk activity
> well in excess of 500Mbps, everything would suddenly halt if a file was
> created on an intermittent basis. 

That's just like what was happening to me. The filesystem would drop
everything else it was doing and read bitmaps for a while.

>> Having the bitmaps spread out among several disks of a RAID probably
>> wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
>> that would be bad unless it knew the RAID layout. So, each disk would
>> just be idle when it wasn't its turn to seek and read another bitmap.
> 
> With 400+ Mbps of data being read and written, the discs weren't idle very
> much.

Except that when the filesystem is busy reading bitmaps, it isn't doing
anything else.... :)

>> Remember how in the old days (before 2.6.19, I think) large reiserfs
>> filesystems took forever to mount?
> 
> I have only been using reiserfs for a short time.

Well, mounting did take forever. :)
http://lkml.org/lkml/2006/1/14/223
http://linuxgazette.net/122/TWDT.html#piszcz
(scroll down a bit to the graphs)

>>> Except this happened without any file writes or reads other than the file
>>> creation itself and with no disk activity other than the array re-sync.
>> I remember even 0-byte files taking a long time to write. My guess would
>> be that reiserfs doesn't know the file will end up being empty when the
>> file is created, or perhaps it tries to find some contiguous space
>> anyway so the file can be appended to without excessive fragmentation.
> 
> So why didn't it happen when appending data to an existing file?  Once a
> file was created, large or small, I could write freely to it over and over,
> either appending data or writing over data.

I don't know how appends or overwrites are handled. The scheme for
finding free space may differ.

-Corey


* RE: Problem with reiserfs volume
  2009-05-07  5:59                   ` Corey Hickey
@ 2009-05-11 16:37                     ` Leslie Rhorer
  0 siblings, 0 replies; 12+ messages in thread
From: Leslie Rhorer @ 2009-05-11 16:37 UTC (permalink / raw)
  To: reiserfs-devel

> >> I might not have been clear on this before: reading the bitmap data is
> >> slow because it is distributed every 128 MB across the filesystem; this
> >> means that in order to read lots of bitmaps, the disk spends most of its
> >> time seeking rather than reading. For me, that's what was causing the
> >> disk to "buzz", and that's why dstat showed read rates of only 400-600
> >> KB/sec.
> >
> > Yeah, but reads and writes worked just fine: up to 450 Mbps.
> 
> I mean, above, that read rates would fall to 400-600 KB/sec when the
> filesystem was busy reading bitmap data.

Well, first of all, it would drop to more like 4 KB/sec, not 400 KB/sec.

> That at least roughly
> corresponds to what you wrote on 2009-04-28: "The reads at the array
> level would fall to zero on 5 of the 10 drives, while the other 5 would
> report a very low level of read activity, but not zero."
> 
> > Appending to
> > an existing file (or writing several GB to a file once the create was done)
> > ran like a racehorse on one or several files without ever a burp.  Reading
> > could be accomplished flat-out no matter what, but with total disk activity
> > well in excess of 500Mbps, everything would suddenly halt if a file was
> > created on an intermittent basis.
> 
> That's just like what was happening to me. The filesystem would drop
> everything else it was doing and read bitmaps for a while.
> 
> >> Having the bitmaps spread out among several disks of a RAID probably
> >> wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
> >> that would be bad unless it knew the RAID layout. So, each disk would
> >> just be idle when it wasn't its turn to seek and read another bitmap.
> >
> > With 400+ Mbps of data being read and written, the discs weren't idle very
> > much.
> 
> Except that when the filesystem is busy reading bitmaps, it isn't doing
> anything else.... :)

Are you saying it doesn't read the bitmaps during reads and writes?


> >>> Except this happened without any file writes or reads other than the file
> >>> creation itself and with no disk activity other than the array re-sync.
> >> I remember even 0-byte files taking a long time to write. My guess would
> >> be that reiserfs doesn't know the file will end up being empty when the
> >> file is created, or perhaps it tries to find some contiguous space
> >> anyway so the file can be appended to without excessive fragmentation.
> >
> > So why didn't it happen when appending data to an existing file?  Once a
> > file was created, large or small, I could write freely to it over and over,
> > either appending data or writing over data.
> 
> I don't know how appends or overwrites are handled. The scheme for
> finding free space may differ.

Yes, of course that's true, but I wouldn't think it would be so by design.
It also doesn't explain why the event was more likely during heavy activity.



Thread overview: 12+ messages
2009-04-04 17:25 Problem with reiserfs volume Lelsie Rhorer
2009-04-06 20:04 ` Corey Hickey
2009-04-28 23:53   ` Leslie Rhorer
2009-04-29  0:00     ` Leslie Rhorer
2009-04-30  6:47       ` Corey Hickey
2009-05-03  1:58         ` Leslie Rhorer
2009-05-03 23:54           ` Corey Hickey
2009-05-05  8:43             ` Leslie Rhorer
2009-05-05 23:40               ` Corey Hickey
2009-05-06  2:04                 ` Leslie Rhorer
2009-05-07  5:59                   ` Corey Hickey
2009-05-11 16:37                     ` Leslie Rhorer
