* Problem with reiserfs volume
From: Lelsie Rhorer @ 2009-04-04 17:25 UTC
To: reiserfs-devel

I know this is a development list, so if I am posting in the wrong list,
please forgive me and point me toward the correct one.

I'm having a severe problem whose root cause I cannot determine. I have a
RAID 6 array managed by mdadm running on Debian "Lenny" with a 3.2GHz AMD
Athlon 64 x 2 processor and 8G of RAM. The kernel is 2.6.26-1-amd64. There
are ten 1 Terabyte SATA drives, unpartitioned, fully allocated to the
/dev/md0 device. The drives are served by 3 Silicon Image SATA port
multipliers and a Silicon Image 4 port eSATA controller. The /dev/md0 device
is also unpartitioned, and all 8T of active space is formatted as a single
Reiserfs file system. The entire volume is mounted to /RAID. Various
directories on the volume are shared using both NFS and SAMBA.

Performance of the RAID system is very good. The array can read and write at
over 450 Mbps, and I don't know if the limit is the array itself or the
network, but since the performance is more than adequate I really am not
concerned which is the case.

The issue is the entire array will occasionally pause completely for about
40 seconds when a file is created. This does not always happen, but the
situation is easily reproducible. The frequency at which the symptom occurs
seems to be somewhat related to the transfer load on the array. If no other
transfers are in process, then the failure seems somewhat more rare, perhaps
accompanying less than 1 file creation in 10. During heavy file transfer
activity, sometimes the system halts with every other file creation.
Although I have observed many dozens of these events, I have never once
observed it to happen except when a file creation occurs. Reading and
writing existing files never triggers the event, although any read or write
occurring during the event is halted for the duration. (There is one cron
job which runs every half-hour that creates a tiny file; this is the most
common failure vector.) There are other drives formatted with other file
systems on the machine, but the issue has never been seen on any of the
other drives.

When the array runs its regularly scheduled health check, the problem is
much worse. Not only does it lock up with almost every single file creation,
but the lock-up time is much longer - sometimes in excess of 2 minutes.

Transfers via Linux based utilities (ftp, NFS, cp, mv, rsync, etc) all
recover after the event, but SAMBA based transfers frequently fail, both
reads and writes.

I discussed the matter over on the linux-raid list, but so far none of the
suggestions there have yielded any great progress in fixing the issue.
* Re: Problem with reiserfs volume
From: Corey Hickey @ 2009-04-06 20:04 UTC
To: lrhorer, reiserfs-devel

Lelsie Rhorer wrote:
> The issue is the entire array will occasionally pause completely for about
> 40 seconds when a file is created. This does not always happen, but the
> situation is easily reproducible. The frequency at which the symptom occurs
> seems to be somewhat related to the transfer load on the array. If no other
> transfers are in process, then the failure seems somewhat more rare, perhaps
> accompanying less than 1 file creation in 10. During heavy file transfer
> activity, sometimes the system halts with every other file creation.
> Although I have observed many dozens of these events, I have never once
> observed it to happen except when a file creation occurs.
> Reading and writing existing files never triggers the event, although any
> read or write occurring during the event is halted for the duration.
> (There is one cron job which runs every half-hour that creates a tiny file;
> this is the most common failure vector.) There are other drives formatted
> with other file systems on the machine, but the issue has never been seen on
> any of the other drives. When the array runs its regularly scheduled health
> check, the problem is much worse. Not only does it lock up with almost
> every single file creation, but the lock-up time is much longer - sometimes
> in excess of 2 minutes.

This sounds somewhat like an intermittent problem I reported on 2008-02-20:

http://www.spinics.net/lists/reiserfs-devel/msg00702.html

The gist of the issue, apparently, was that writing files would cause
those files to be cached and the kernel would drop reiserfs bitmap data
to make room in the page cache. Once those bitmaps were dropped from the
cache and another file needed to be written, many bitmaps needed to be
read back from the disk in order to find free space. The bitmaps are
small, but spaced every 128 MB, so very many seeks were needed and the
read speed was quite slow.

All that seeking caused the disk to buzz distinctively. Try listening
for that, or looking at the disk read/write activity with something like
dstat.

You can force bitmap data to be dropped and then re-read, in order to
find out what to look/listen for (change sdc4 to md0 or whatever):

# echo 1 > /proc/sys/vm/drop_caches
# debugreiserfs -m /dev/sdc4 > /dev/null

Here's what dstat looks like when I run the above commands:

-------------------
$ dstat -d -D sdc
--dsk/sdc--
 read  writ
 914k  221k
    0    16k
    0     0
    0     0
    0     0
  92k     0
 780k     0
 412k     0
 608k     0
 528k     0
 552k     0
 440k     0
 444k     0
 432k     0
 432k     0
 608k     0
 500k     0
 556k     0
 520k     0
 208k     0
    0     0
    0     0
    0     0
    0     0
-------------------

That might or might not be what's happening to you; my machine had much
less RAM, but also a much smaller array.

Jeff Mahoney was helpful and informative when I reported the issue, but
wasn't able to reproduce it on his system (neither could I, on a machine
with a larger filesystem and less RAM).

I ended up switching to ext4 for the problematic array, but most of my
other filesystems are still reiserfs and have never had that problem.

Good luck,
Corey
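
The read rate during the bitmap re-read in the dstat sample above can be
averaged with a few lines of Python. This is only a rough sketch: the
helper name parse_size_kb is hypothetical, the sample values are copied by
hand from the "read" column (the non-zero run from 92k through 208k), and it
assumes the "k" suffix means kilobytes per one-second sample.

```python
def parse_size_kb(field):
    """Convert a dstat size field such as '780k' or '0' to kilobytes.

    Only the plain and 'k' forms seen in the sample are handled
    (assumption: no 'B' or 'M' suffixes appear).
    """
    if field.endswith("k"):
        return float(field[:-1])
    return float(field)


# "read" column during the burst, copied from the dstat output in the thread.
burst = ("92k 780k 412k 608k 528k 552k 440k 444k "
         "432k 432k 608k 500k 556k 520k 208k").split()

rates = [parse_size_kb(f) for f in burst]
mean_kb_s = sum(rates) / len(rates)
print(f"{len(rates)} samples, mean {mean_kb_s:.0f} KB/s")
```

A sustained rate of a few hundred KB/s on a disk that can stream tens of
MB/s is the kind of signature to look for: the drive is spending its time
seeking, not reading.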
* RE: Problem with reiserfs volume
From: Leslie Rhorer @ 2009-04-28 23:53 UTC
To: reiserfs-devel

> This sounds somewhat like an intermittent problem I reported on
> 2008-02-20:
>
> http://www.spinics.net/lists/reiserfs-devel/msg00702.html
>
> The gist of the issue, apparently, was that writing files would cause
> those files to be cached and the kernel would drop reiserfs bitmap data
> to make room in the page cache. Once those bitmaps were dropped from the
> cache and another file needed to be written, many bitmaps needed to be
> read back from the disk in order to find free space. The bitmaps are
> small, but spaced every 128 MB, so very many seeks were needed and the
> read speed was quite slow.
>
> All that seeking caused the disk to buzz distinctively. Try listening
> for that, or looking at the disk read/write activity with something like
> dstat.

No, I did a fair bit of additional investigation, and the symptoms were
fairly odd. When a halt would occur, all writes at every level would fall
to dead zero. The reads at the array level would fall to zero on 5 of the
10 drives, while the other 5 would report a very low level of read activity,
but not zero. It would always be the same 5 drives which dropped to zero
and the same 5 which still reported some reads going on. Note if a RAID
resync was occurring, then all 10 drives would continue to report
significant read rates at the drive level, but array level read / writes
would stop altogether. The likelihood of a halt event was fairly low if
there was no drive activity, and increased as the level of drive activity
(read or write) increased. During a RAID resync, almost every file create
causes a halt.

After exhausting all my abilities to troubleshoot the issue, I finally
erased the entire array and reformatted it as XFS. I am still transferring
the data from the backup to the RAID array, but with over 30% of the data
transferred and over 10,000 files created in the last several days, I have
not been able to trigger a halt event. What's more, my file delete
performance for large files was very poor under Reiserfs. A 20G file could
take upwards of 30 seconds to delete, although deleting a file never caused
a file system halt like creating a file did. Under the new file system,
deleting a 20G file takes typically 0.1 seconds or less.

This definitely suggests there may be a problem with Reiserfs. The only
things which changed from the last array to this one were the physical drive
locations in the array (I had swapped drives around to try to pinpoint the
issue), a Version 1.2 Superblock in the new array vs. 0.9 in the old array,
and a 256K chunk size rather than the default 64K to improve performance.
* RE: Problem with reiserfs volume
From: Leslie Rhorer @ 2009-04-29 0:00 UTC
To: reiserfs-devel

> > The gist of the issue, apparently, was that writing files would cause
> > those files to be cached and the kernel would drop reiserfs bitmap data
> > to make room in the page cache. Once those bitmaps were dropped from the
> > cache and another file needed to be written, many bitmaps needed to be
> > read back from the disk in order to find free space. The bitmaps are
> > small, but spaced every 128 MB, so very many seeks were needed and the
> > read speed was quite slow.
> >
> > All that seeking caused the disk to buzz distinctively. Try listening
> > for that, or looking at the disk read/write activity with something like
> > dstat.
>
> No, I did a fair bit of additional investigation, and the symptoms were
> fairly odd. When a halt would occur, all writes at every level would fall
> to dead zero. The reads at the array level would fall to zero on 5 of the
> 10 drives, while the other 5 would report a very low level of read
> activity, but not zero.

Oops! I'm sorry. I mis-typed the sentences just above. What I meant to
say was the write activity at both the array and drive level fell to zero.
The read activity at the array level also fell to zero, but at the drive
level 5 of the drives would still show activity.

> It would always be the same 5 drives which dropped to zero
> and the same 5 which still reported some reads going on. Note if a RAID
> resync was occurring, then all 10 drives would continue to report
> significant read rates at the drive level, but array level read / writes
> would stop altogether. The likelihood of a halt event was fairly low if
> there was no drive activity, and increased as the level of drive activity
> (read or write) increased. During a RAID resync, almost every file create
> causes a halt. After exhausting all my abilities to troubleshoot the
> issue, I finally erased the entire array and reformatted it as XFS. I am
> still transferring the data from the backup to the RAID array, but with
> over 30% of the data transferred and over 10,000 files created in the last
> several days, I have not been able to trigger a halt event. What's more,
> my file delete performance for large files was very poor under Reiserfs.
> A 20G file could take upwards of 30 seconds to delete, although deleting a
> file never caused a file system halt like creating a file did. Under the
> new file system, deleting a 20G file takes typically 0.1 seconds or less.
>
> This definitely suggests there may be a problem with Reiserfs. The only
> things which changed from the last array to this one were the physical
> drive locations in the array (I had swapped drives around to try to
> pinpoint the issue), a Version 1.2 Superblock in the new array vs. 0.9 in
> the old array, and a 256K chunk size rather than the default 64K to
> improve performance.
* Re: Problem with reiserfs volume
From: Corey Hickey @ 2009-04-30 6:47 UTC
To: lrhorer, reiserfs-devel

Leslie Rhorer wrote:
>>> The gist of the issue, apparently, was that writing files would cause
>>> those files to be cached and the kernel would drop reiserfs bitmap data
>>> to make room in the page cache. Once those bitmaps were dropped from the
>>> cache and another file needed to be written, many bitmaps needed to be
>>> read back from the disk in order to find free space. The bitmaps are
>>> small, but spaced every 128 MB, so very many seeks were needed and the
>>> read speed was quite slow.
>>>
>>> All that seeking caused the disk to buzz distinctively. Try listening
>>> for that, or looking at the disk read/write activity with something like
>>> dstat.
>> No, I did a fair bit of additional investigation, and the symptoms were
>> fairly odd. When a halt would occur, all writes at every level would fall
>> to dead zero. The reads at the array level would fall to zero on 5 of the
>> 10 drives, while the other 5 would report a very low level of read
>> activity, but not zero.
>
> Oops! I'm sorry. I mis-typed the sentences just above. What I meant to
> say was the write activity at both the array and drive level fell to zero.
> The read activity at the array level also fell to zero, but at the drive
> level 5 of the drives would still show activity.

Are you sure the read activity for the array was 0? If the array wasn't
doing anything but the individual drives were, that would indicate a
lower-level problem than the filesystem; unless I'm missing something,
the filesystem can't do anything to the individual drives without it
showing up as read/write from/to the array device.

Aside from that, everything you've written seems to be consistent with
my hypothesis that you had a bitmap caching problem. Or maybe I'm just
falling prey to confirmation bias.

Did you ever test with dstat and debugreiserfs like I mentioned earlier
in this thread?

>> It would always be the same 5 drives which dropped to zero
>> and the same 5 which still reported some reads going on.

I did the math and (if a couple reasonable assumptions I made are
correct), then the reiserfs bitmaps would indeed be distributed among
five of 10 drives in a RAID-6.

If you're interested, ask, and I'll write it up.

>> Note if a RAID
>> resync was occurring, then all 10 drives would continue to report
>> significant read rates at the drive level, but array level read / writes
>> would stop altogether. The likelihood of a halt event was fairly low if
>> there was no drive activity, and increased as the level of drive activity
>> (read or write) increased. During a RAID resync, almost every file create
>> causes a halt.

Perhaps because the resync I/O caused the bitmap data to fall off the
page cache.

>> After exhausting all my abilities to troubleshoot the issue,
>> I finally erased the entire array and reformatted it as XFS. I am still
>> transferring the data from the backup to the RAID array, but with over 30%
>> of the data transferred and over 10,000 files created in the last several
>> days, I have not been able to trigger a halt event. What's more, my file
>> delete performance for large files was very poor under Reiserfs. A 20G
>> file could take upwards of 30 seconds to delete, although deleting a file
>> never caused a file system halt like creating a file did. Under the new
>> file system, deleting a 20G file takes typically 0.1 seconds or less.

I remember being annoyed by large file deletion performance before, but
I can't reproduce it right now (with kernel 2.6.28.2).

In any case, I've switched my large filesystem to ext4, so far without
any regrets. My remaining filesystems are mostly still reiserfs, and
I'll eventually migrate them, but I'm in no hurry.

-Corey
* RE: Problem with reiserfs volume
From: Leslie Rhorer @ 2009-05-03 1:58 UTC
To: reiserfs-devel

> >> No, I did a fair bit of additional investigation, and the symptoms were
> >> fairly odd. When a halt would occur, all writes at every level would
> >> fall to dead zero. The reads at the array level would fall to zero on
> >> 5 of the 10 drives, while the other 5 would report a very low level of
> >> read activity, but not zero.
> >
> > Oops! I'm sorry. I mis-typed the sentences just above. What I meant to
> > say was the write activity at both the array and drive level fell to
> > zero. The read activity at the array level also fell to zero, but at
> > the drive level 5 of the drives would still show activity.
>
> Are you sure the read activity for the array was 0?

Yep. According to iostat, absolute zilch.

> If the array wasn't
> doing anything but the individual drives were, that would indicate a
> lower-level problem than the filesystem;

It could, yes. In fact, it is not unlikely to be an interaction failure
between the file system and the RAID device management system (/dev/md0, or
whatever).

> unless I'm missing something,
> the filesystem can't do anything to the individual drives without it
> showing up as read/write from/to the array device.

I don't know if that's true or not. Certainly if the FS is RAID aware, it
can query the RAID system for details about the array and its member
elements (XFS, for example does just this in order to automatically set up
stripe width during format). There's nothing to prevent the FS from
issuing commands directly to the drive management system (/dev/sda,
/dev/sdb, etc.).

> Aside from that, everything you've written seems to be consistent with
> my hypothesis that you had a bitmap caching problem. Or maybe I'm just
> falling prey to confirmation bias.
>
> Did you ever test with dstat and debugreiserfs like I mentioned earlier
> in this thread?

Yes to the first and no to the second. I must have missed the reference in
all the correspondence. 'Sorry about that.

> >> It would always be the same 5 drives which dropped to zero
> >> and the same 5 which still reported some reads going on.
>
> I did the math and (if a couple reasonable assumptions I made are
> correct), then the reiserfs bitmaps would indeed be distributed among
> five of 10 drives in a RAID-6.
>
> If you're interested, ask, and I'll write it up.

It's academic, but I'm curious. Why would the default parameters have
failed?

> >> Note if a RAID
> >> resync was occurring, then all 10 drives would continue to report
> >> significant read rates at the drive level, but array level read /
> >> writes would stop altogether. The likelihood of a halt event was
> >> fairly low if there was no drive activity, and increased as the level
> >> of drive activity (read or write) increased. During a RAID resync,
> >> almost every file create causes a halt.
>
> Perhaps because the resync I/O caused the bitmap data to fall off the
> page cache.

How would that happen? More to the point, how would it happen without
triggering activity in the FS?

> >> After exhausting all my abilities to troubleshoot the issue,
> >> I finally erased the entire array and reformatted it as XFS. I am
> >> still transferring the data from the backup to the RAID array, but
> >> with over 30% of the data transferred and over 10,000 files created in
> >> the last several days, I have not been able to trigger a halt event.
> >> What's more, my file delete performance for large files was very poor
> >> under Reiserfs. A 20G file could take upwards of 30 seconds to delete,
> >> although deleting a file never caused a file system halt like creating
> >> a file did. Under the new file system, deleting a 20G file takes
> >> typically 0.1 seconds or less.
>
> I remember being annoyed by large file deletion performance before, but
> I can't reproduce it right now (with kernel 2.6.28.2).

Certainly I'm not having the problem, now. With more than half the data
(3T out of 5.8T) transferred, I haven't had a single halt and deleting a 23G
file takes less than 0.9 seconds, where before it took up to 30 seconds.
* Re: Problem with reiserfs volume
From: Corey Hickey @ 2009-05-03 23:54 UTC
To: lrhorer, reiserfs-devel

Leslie Rhorer wrote:
>>> The read activity at the array level also fell to zero, but at the drive
>>> level 5 of the drives would still show activity.
>> Are you sure the read activity for the array was 0?
>
> Yep. According to iostat, absolute zilch.

Peculiar. I cannot explain that.

>> If the array wasn't
>> doing anything but the individual drives were, that would indicate a
>> lower-level problem than the filesystem;
>
> It could, yes. In fact, it is not unlikely to be an interaction failure
> between the file system and the RAID device management system (/dev/md0, or
> whatever).
>
>> unless I'm missing something,
>> the filesystem can't do anything to the individual drives without it
>> showing up as read/write from/to the array device.
>
> I don't know if that's true or not. Certainly if the FS is RAID aware, it
> can query the RAID system for details about the array and its member
> elements (XFS, for example does just this in order to automatically set up
> stripe width during format).

For XFS, this appears to be done by mkfs.xfs via a GET_ARRAY_INFO ioctl
on the md block device. See the xfsprogs source, libdisk/md.c,
md_get_subvol_stripe().

> There's nothing to prevent the FS from
> issuing commands directly to the drive management system (/dev/sda,
> /dev/sdb, etc.).

That seems to me like it would be opening a can of worms. It's the job
of md (or lvm, dm, etc.) to figure out which disk (or partition, or
file, etc.) to read/write; otherwise, the filesystem would have to
consider a number of factors, even besides RAID layout. Someone please
correct me if I'm mistaken....

>> Did you ever test with dstat and debugreiserfs like I mentioned earlier
>> in this thread?
>
> Yes to the first and no to the second. I must have missed the reference in
> all the correspondence. 'Sorry about that.

That's ok.

>>>> It would always be the same 5 drives which dropped to zero
>>>> and the same 5 which still reported some reads going on.
>> I did the math and (if a couple reasonable assumptions I made are
>> correct), then the reiserfs bitmaps would indeed be distributed among
>> five of 10 drives in a RAID-6.
>>
>> If you're interested, ask, and I'll write it up.
>
> It's academic, but I'm curious. Why would the default parameters have
> failed?

It's not exactly a "failure"--it's just that the bitmaps are placed
every 128 MB, and that results in a certain distribution among your disks.

bitmap_freq = 128 MB * 1024 KB/MB = 131072 KB

For a simple example, first consider a 2-disk RAID-0 with the default
64 KB chunk size.

num_data_disks = 2
chunk_size     = 64 KB
stripe_size    = chunk_size * num_data_disks = 128 KB
stripe_offset  = bitmap_freq / stripe_size = 1024

131072 is a multiple of 128, so the bitmaps are all on the same disk,
1024 stripes apart.

Now consider a 3-disk RAID-0. 131072 is not a multiple of 192.

num_data_disks = 3
chunk_size     = 64 KB
stripe_size    = chunk_size * num_data_disks = 192 KB
stripe_offset  = bitmap_freq / stripe_size = 682.6666....

Bitmaps are 682 and 2/3 stripes apart. 2/3 of a 3-chunk stripe is 2
chunks, so if the first bitmap is on the first disk, the next bitmap
would be on the third disk, then the second disk, then back to the
first: (A,C,B,...). In this case the bitmaps would be spread among all
three disks.

Now let's look at your 10-disk RAID-6. This is more complicated because
we have to consider that two chunks out of each stripe hold parity, and
that the chunk layout changes with each stripe.

Here's where I have to make an assumption: I can't find out whether the
layout methods for RAID-6 are the same as for RAID-5. If they are, the
layout for your RAID will be like this (the default left-symmetric) or
at least substantially similar.

     disk  ABCDEFGHIJ

stripe 0:  abcdefghPP
stripe 1:  bcdefghPPa
stripe 2:  cdefghPPab
stripe 3:  defghPPabc
stripe 4:  efghPPabcd
stripe 5:  fghPPabcde
stripe 6:  ghPPabcdef
stripe 7:  hPPabcdefg
stripe 8:  PPabcdefgh
stripe 9:  PabcdefghP

Note that the layout repetition period is the same as the number of disks.

So...

chunk_size     = 64 KB
num_disks      = 10
num_data_disks = num_disks - 2 = 8
stripe_size    = chunk_size * num_data_disks = 512 KB
stripe_offset  = bitmap_freq / stripe_size = 256

131072 is a multiple of 512, so the bitmaps are all on the first chunk
of a stripe, 256 stripes apart; however, 256 is not a multiple of the
chunk layout period, so, for each stripe that holds a bitmap, the
position of the first chunk will vary.

chunk_layout_period  = num_disks = 10
stripe_layout_offset = stripe_offset % chunk_layout_period = 6

That means each subsequent bitmap will be 6 stripes later within the
stripe layout pattern: 0,6,2,8,4,...

The first chunk is chunk "a", so, for each of those stripes, find which
disk chunk "a" is on in the layout table above. That yields disks
A,E,I,C,G: five disks out of the ten, just like you reported.

(Hopefully I didn't screw up too much of that.)

>>>> During a RAID resync, almost every file create causes a halt.
>> Perhaps because the resync I/O caused the bitmap data to fall off the
>> page cache.
>
> How would that happen? More to the point, how would it happen without
> triggering activity in the FS?

That was sort of a speculative statement, and I can't really back it up
because I don't know the details of how the page cache fits in, but IF
the data read and written during a resync gets cached, then the page
cache might prefer to retain that data rather than the bitmap data.

If the bitmap data never stays in the page cache for long, then a file
write would pretty much always require some bitmaps to be re-read.

-Corey
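
The layout arithmetic in the message above can be checked mechanically. The
following Python is a minimal sketch, not a model of the real md driver: the
function name bitmap_disks is hypothetical, the first bitmap is assumed to
sit at offset 0, and the RAID-6 rotation simply reproduces the table in the
message (each stripe shifted left by one disk), which the message itself
flags as an assumption.

```python
BITMAP_SPACING_KB = 128 * 1024  # one reiserfs bitmap every 128 MB (from the thread)


def bitmap_disks(num_disks, parity_disks, chunk_kb, num_bitmaps=50):
    """Return the disks (as letters) holding bitmap blocks, in first-seen order."""
    data_disks = num_disks - parity_disks
    # Stripe 0 layout: data chunks a, b, c, ... followed by the parity chunks.
    base = [chr(ord("a") + i) for i in range(data_disks)] + ["P"] * parity_disks
    seen = []
    for n in range(num_bitmaps):
        # Logical data chunk that contains the n-th bitmap.
        chunk_index = (n * BITMAP_SPACING_KB) // chunk_kb
        stripe, chunk_in_stripe = divmod(chunk_index, data_disks)
        # Assumed rotation: the layout shifts left by one disk per stripe.
        # Without parity (RAID-0) the layout does not rotate.
        shift = stripe % num_disks if parity_disks else 0
        row = base[shift:] + base[:shift]
        disk = chr(ord("A") + row.index(base[chunk_in_stripe]))
        if disk not in seen:
            seen.append(disk)
    return seen


print(bitmap_disks(2, 0, 64))   # 2-disk RAID-0  -> ['A']
print(bitmap_disks(3, 0, 64))   # 3-disk RAID-0  -> ['A', 'C', 'B']
print(bitmap_disks(10, 2, 64))  # 10-disk RAID-6 -> ['A', 'E', 'I', 'C', 'G']
```

Under these assumptions the sketch lands on the same five disks as the hand
calculation. Other chunk sizes can be tried by changing chunk_kb, though the
result is only as good as the assumed layout table.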
* RE: Problem with reiserfs volume
From: Leslie Rhorer @ 2009-05-05 8:43 UTC
To: reiserfs-devel

> >> If the array wasn't
> >> doing anything but the individual drives were, that would indicate a
> >> lower-level problem than the filesystem;
> >
> > It could, yes. In fact, it is not unlikely to be an interaction failure
> > between the file system and the RAID device management system (/dev/md0,
> > or whatever).
> >
> >> unless I'm missing something,
> >> the filesystem can't do anything to the individual drives without it
> >> showing up as read/write from/to the array device.
> >
> > I don't know if that's true or not. Certainly if the FS is RAID aware,
> > it can query the RAID system for details about the array and its member
> > elements (XFS, for example does just this in order to automatically set
> > up stripe width during format).
>
> For XFS, this appears to be done by mkfs.xfs via a GET_ARRAY_INFO ioctl
> on the md block device. See the xfsprogs source, libdisk/md.c,
> md_get_subvol_stripe().
>
> > There's nothing to prevent the FS from
> > issuing commands directly to the drive management system (/dev/sda,
> > /dev/sdb, etc.).
>
> That seems to me like it would be opening a can of worms.

It surely would. 'Doesn't necessarily mean someone didn't. I have an idea,
though...

> >> Did you ever test with dstat and debugreiserfs like I mentioned earlier
> >> in this thread?
> >
> > Yes to the first and no to the second. I must have missed the reference
> > in all the correspondence. 'Sorry about that.
>
> That's ok.
>
> >>>> It would always be the same 5 drives which dropped to zero
> >>>> and the same 5 which still reported some reads going on.
> >> I did the math and (if a couple reasonable assumptions I made are
> >> correct), then the reiserfs bitmaps would indeed be distributed among
> >> five of 10 drives in a RAID-6.
> >>
> >> If you're interested, ask, and I'll write it up.
> >
> > It's academic, but I'm curious. Why would the default parameters have
> > failed?
>
> It's not exactly a "failure"--it's just that the bitmaps are placed
> every 128 MB, and that results in a certain distribution among your disks.

This triggered a thought. When I built the array, it was physically in a
temporary configuration, so that while /dev/sda was drive 0 in the array
and /dev/sdj was drive 9 in the array when it was built, the drives were
moved in a piecemeal fashion to the new chassis, so that the order was
something like /dev/sdf, /dev/sdg, /dev/sdh, /dev/sdi, /dev/sdj, /dev/sda,
/dev/sde, /dev/sdd, /dev/sdc, /dev/sdb, or something like that. This
shouldn't create a problem, as md handles RAID assembly based upon the drive
superblock, not the udev assignment. Is it possible the re-arrangement
caused a failure of the bitmap somehow?

It still doesn't quite explain to me how a high read rate strictly at the
drive level (e.g. ckarray) causes severe problems at the FS level, while an
idle system did not exhibit nearly the frequency of problems nor did the
hang last even a fraction as long (40 seconds vs. 20 minutes).

> That means each subsequent bitmap will be 6 stripes later within the
> stripe layout pattern: 0,6,2,8,4,...
>
> The first chunk is chunk "a", so, for each of those stripes, find which
> disk chunk "a" is on in the layout table above. That yields disks
> A,E,I,C,G: five disks out of the ten, just like you reported.

Yeah, that's about right.

> (Hopefully I didn't screw up too much of that.)
>
> >>>> During a RAID resync, almost every file create causes a halt.
> >> Perhaps because the resync I/O caused the bitmap data to fall off the
> >> page cache.
> >
> > How would that happen? More to the point, how would it happen without
> > triggering activity in the FS?
>
> That was sort of a speculative statement, and I can't really back it up
> because I don't know the details of how the page cache fits in, but IF
> the data read and written during a resync gets cached, then the page
> cache might prefer to retain that data rather than the bitmap data.
>
> If the bitmap data never stays in the page cache for long, then a file
> write would pretty much always require some bitmaps to be re-read.

Except this happened without any file writes or reads other than the file
creation itself and with no disk activity other than the array re-sync.
* Re: Problem with reiserfs volume 2009-05-05 8:43 ` Leslie Rhorer @ 2009-05-05 23:40 ` Corey Hickey 2009-05-06 2:04 ` Leslie Rhorer 0 siblings, 1 reply; 12+ messages in thread From: Corey Hickey @ 2009-05-05 23:40 UTC (permalink / raw) To: lrhorer; +Cc: reiserfs-devel Leslie Rhorer wrote: >>>>>> It would always be the same 5 drives which dropped to zero >>>>>> and the same 5 which still reported some reads going on. >>>> I did the math and (if a couple reasonable assumptions I made are >>>> correct), then the reiserfs bitmaps would indeed be distributed among >>>> five of 10 drives in a RAID-6. >>>> >>>> If you're interested, ask, and I'll write it up. >>> It's academic, but I'm curious. Why would the default parameters have >>> failed? >> It's not exactly a "failure"--it's just that the bitmaps are placed >> every 128 MB, and that results in a certain distribution among your disks. > > This triggered a thought. When I built the array, it was physically in a > termporary configuration, so that while /dev/sda was drive 0 in the array > and /dev/sdj was drive 9 in the array when it was built, the drives were > moved in a piecemeal fashion to the new chassis, so that the order was > something like /dev/sdf, /dev/sdg, /dev/sdh, /dev/sdi, /dev/sdj, /dev/sda, > /dev/sde, /dev/sdd, /dev/sdc, /dev/sb, or something like that. This > shouldn't create a problem, as md handles RAID assembly based upon the drive > superblock, not the udev assignment. Is it possible the re-arrangement > caused a failure of the bitmap somehow? That should be fine. I might not have been clear on this before: reading the bitmap data is slow because it is distributed every 128 MB across the filesystem; this means that in order to read lots of bitmaps, the disk spends most of its time seeking rather than reading. For me, that's what was causing the disk to "buzz", and that's why dstat showed read rates of only 400-600 KB/sec. 
I just ran a quick test on my single-disk reiserfs and calculated the
average seek rate:

fs_size = 242341144 KB
bitmap_spacing = 128 MB = 131072 KB
num_bitmaps = fs_size / bitmap_spacing = 1849
bitmaps_read_time = 15.5 sec (from debugreiserfs -m)
bitmap_read_rate = num_bitmaps / bitmaps_read_time = 119 bitmaps/sec
seek_rate = bitmap_read_rate = 119 seeks/sec (seek to every bitmap)

That's a lot of seeking!

Having the bitmaps spread out among several disks of a RAID probably
wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
that would be bad unless it knew the RAID layout. So, each disk would
just be idle when it wasn't its turn to seek and read another bitmap.

Remember how in the old days (before 2.6.19, I think) large reiserfs
filesystems took forever to mount? That's because reiserfs was reading
all the bitmap data and caching it internally. Eventually Jeff Mahoney
wrote a patch to make reiserfs read bitmap data on-demand and just let
the kernel cache them (or not).

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5065227b46235ec0131b383cc2f537069b55c6b6

> It still doesn't quite explain to me how a high read rate strictly at the
> drive level (e.g. ckarray) causes severe problems at the FS level, while an
> idle system did not exhibit nearly the frequency of problems nor did the
> hang last even a fraction as long (40 seconds vs. 20 minutes).

20 minutes sounds excessive, even when competing with a resync. I
couldn't say, and can't test it here.

>>>>>> During a RAID resync, almost every file create causes a halt.
>>>> Perhaps because the resync I/O caused the bitmap data to fall off the
>>>> page cache.
>>> How would that happen? More to the point, how would it happen without
>>> triggering activity in the FS?
>> That was sort of a speculative statement, and I can't really back it up
>> because I don't know the details of how the page cache fits in, but IF
>> the data read and written during a resync gets cached, then the page
>> cache might prefer to retain that data rather than the bitmap data.
>>
>> If the bitmap data never stays in the page cache for long, then a file
>> write would pretty much always require some bitmaps to be re-read.
>
> Except this happened without any file writes or reads other than the file
> creation itself and with no disk activity other than the array re-sync.

I remember even 0-byte files taking a long time to write. My guess would
be that reiserfs doesn't know the file will end up being empty when the
file is created, or perhaps it tries to find some contiguous space
anyway so the file can be appended to without excessive fragmentation.
In order to find contiguous space, reiserfs needs to look at the
bitmaps; if enough bitmap data isn't cached, reiserfs will have to read
some, which, as we know, can take a long time.

-Corey
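The seek-rate arithmetic in the message above (fs_size, bitmap_spacing,
bitmaps_read_time) can be re-run as a short script. This is only a sketch in
Python, which the thread itself does not use. The function name and the
choice to round the bitmap count up are assumptions made for illustration;
the input figures are the ones quoted in the message.

```python
import math

KB_PER_MB = 1024


def bitmap_seek_rate(fs_size_kb, read_time_s, bitmap_spacing_mb=128):
    """Return (num_bitmaps, seeks_per_sec, ms_per_seek).

    Assumes one bitmap per `bitmap_spacing_mb` of filesystem and one
    seek per bitmap read, as in the message above. Rounding the bitmap
    count up is an assumption of this sketch.
    """
    num_bitmaps = math.ceil(fs_size_kb / (bitmap_spacing_mb * KB_PER_MB))
    seeks_per_sec = num_bitmaps / read_time_s
    ms_per_seek = 1000.0 / seeks_per_sec
    return num_bitmaps, seeks_per_sec, ms_per_seek


# Figures quoted in the message: a 242341144 KB filesystem and 15.5 s to
# read all bitmaps with debugreiserfs -m.
num_bitmaps, seeks_per_sec, ms_per_seek = bitmap_seek_rate(242341144, 15.5)
print(f"{num_bitmaps} bitmaps, {seeks_per_sec:.0f} seeks/sec, "
      f"{ms_per_seek:.1f} ms per bitmap")
```

With the quoted figures this gives 1849 bitmaps and about 119 seeks/sec,
or roughly 8.4 ms per bitmap, which is on the order of a single disk seek
and so fits the "seek to every bitmap" reading.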
* RE: Problem with reiserfs volume
  2009-05-05 23:40 ` Corey Hickey
@ 2009-05-06 2:04 ` Leslie Rhorer
  2009-05-07 5:59 ` Corey Hickey
  0 siblings, 1 reply; 12+ messages in thread
From: Leslie Rhorer @ 2009-05-06 2:04 UTC
To: reiserfs-devel

> I might not have been clear on this before: reading the bitmap data is
> slow because it is distributed every 128 MB across the filesystem; this
> means that in order to read lots of bitmaps, the disk spends most of its
> time seeking rather than reading. For me, that's what was causing the
> disk to "buzz", and that's why dstat showed read rates of only 400-600
> KB/sec.

Yeah, but reads and writes worked just fine: up to 450 Mbps. Appending to
an existing file (or writing several GB to a file once the create was done)
ran like a racehorse on one or several files without ever a burp. Reading
could be accomplished flat-out no matter what, but with total disk activity
well in excess of 500Mbps, everything would suddenly halt if a file was
created on an intermittent basis. Perhaps one create in five or so would
trigger the issue if high volumes of data were being read and / or written,
except when a resync was under way, in which case almost every file create
would generate a pause. During normal operation the pause would almost
always last exactly 40 seconds. During resync, the pause lasted as much as
20 minutes.

> I just ran a quick test on my single-disk reiserfs and calculated the
> average seek rate:
>
> fs_size = 242341144 KB
> bitmap_spacing = 128 MB = 131072 KB
> num_bitmaps = fs_size / bitmap_spacing = 1849
> bitmaps_read_time = 15.5 sec (from debugreiserfs -m)
> bitmap_read_rate = num_bitmaps / bitmaps_read_time = 119 bitmaps/sec
> seek_rate = bitmap_read_rate = 119 seeks/sec (seek to every bitmap)
>
> That's a lot of seeking!

No question, but under ordinary read and write loads, the system handled the
situation with aplomb.
Create ten 20 byte files over a period of 30 minutes, however, and it
would halt perhaps 3 - 5 times. Under light loads, perhaps 1 in 10 times,
although sometimes even with heavy loads I would create 30 or 40 files or
more with no symptoms. During a resync, however, a halt was all but
guaranteed with every creation.

> Having the bitmaps spread out among several disks of a RAID probably
> wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
> that would be bad unless it knew the RAID layout. So, each disk would
> just be idle when it wasn't its turn to seek and read another bitmap.

With 400+ Mbps of data being read and written, the discs weren't idle very
much.

> Remember how in the old days (before 2.6.19, I think) large reiserfs
> filesystems took forever to mount?

I have only been using reiserfs for a short time.

> > It still doesn't quite explain to me how a high read rate strictly at the
> > drive level (e.g. ckarray) causes severe problems at the FS level, while an
> > idle system did not exhibit nearly the frequency of problems nor did the
> > hang last even a fraction as long (40 seconds vs. 20 minutes).
>
> 20 minutes sounds excessive, even when competing with a resync. I
> couldn't say, and can't test it here.

More to the point, reads and writes didn't have any problem competing with
the resync. When accessing a file for either read or write, the data
transfer would begin in earnest within 2 or 3 seconds, with other activity
continuing unabated. An ls would return in a fraction of a second. Once the
halt occurred, however, an ls would not return until the event had resolved.

> > Except this happened without any file writes or reads other than the file
> > creation itself and with no disk activity other than the array re-sync.
>
> I remember even 0-byte files taking a long time to write.
> My guess would be that reiserfs doesn't know the file will end up being
> empty when the file is created, or perhaps it tries to find some
> contiguous space anyway so the file can be appended to without excessive
> fragmentation.

So why didn't it happen when appending data to an existing file? Once a
file was created, large or small, I could write freely to it over and over,
either appending data or writing over data.
* Re: Problem with reiserfs volume
  2009-05-06 2:04 ` Leslie Rhorer
@ 2009-05-07 5:59 ` Corey Hickey
  2009-05-11 16:37 ` Leslie Rhorer
  0 siblings, 1 reply; 12+ messages in thread
From: Corey Hickey @ 2009-05-07 5:59 UTC
To: lrhorer; +Cc: reiserfs-devel

Leslie Rhorer wrote:
>> I might not have been clear on this before: reading the bitmap data is
>> slow because it is distributed every 128 MB across the filesystem; this
>> means that in order to read lots of bitmaps, the disk spends most of its
>> time seeking rather than reading. For me, that's what was causing the
>> disk to "buzz", and that's why dstat showed read rates of only 400-600
>> KB/sec.
>
> Yeah, but reads and writes worked just fine: up to 450 Mbps.

I mean, above, that read rates would fall to 400-600 KB/sec when the
filesystem was busy reading bitmap data. That at least roughly
corresponds to what you wrote on 2009-04-28: "The reads at the array
level would fall to zero on 5 of the 10 drives, while the other 5 would
report a very low level of read activity, but not zero."

> Appending to
> an existing file (or writing several GB to a file once the create was done)
> ran like a racehorse on one or several files without ever a burp. Reading
> could be accomplished flat-out no matter what, but with total disk activity
> well in excess of 500Mbps, everything would suddenly halt if a file was
> created on an intermittent basis.

That's just like what was happening to me. The filesystem would drop
everything else it was doing and read bitmaps for a while.

>> Having the bitmaps spread out among several disks of a RAID probably
>> wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
>> that would be bad unless it knew the RAID layout. So, each disk would
>> just be idle when it wasn't its turn to seek and read another bitmap.
>
> With 400+ Mbps of data being read and written, the discs weren't idle very
> much.
Except that when the filesystem is busy reading bitmaps, it isn't doing
anything else.... :)

>> Remember how in the old days (before 2.6.19, I think) large reiserfs
>> filesystems took forever to mount?
>
> I have only been using reiserfs for a short time.

Well, mounting did take forever. :)

http://lkml.org/lkml/2006/1/14/223
http://linuxgazette.net/122/TWDT.html#piszcz
(scroll down a bit to the graphs)

>>> Except this happened without any file writes or reads other than the file
>>> creation itself and with no disk activity other than the array re-sync.
>> I remember even 0-byte files taking a long time to write. My guess would
>> be that reiserfs doesn't know the file will end up being empty when the
>> file is created, or perhaps it tries to find some contiguous space
>> anyway so the file can be appended to without excessive fragmentation.
>
> So why didn't it happen when appending data to an existing file? Once a
> file was created, large or small, I could write freely to it over and over,
> either appending data or writing over data.

I don't know how appends or overwrites are handled. The scheme for
finding free space may differ.

-Corey
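The "5 of the 10 drives" observation quoted in the message above can be
reproduced with a small model of where each bitmap lands in the array. This
is a sketch, not the calculation Corey mentions having done (that write-up
does not appear in this part of the thread). It assumes a 64 KiB chunk size,
md's left-symmetric RAID-6 layout, and that bitmap k sits at byte offset
k * 128 MB from the start of /dev/md0. None of those values is stated in the
thread, the offset rule is a simplification, and the function names are
hypothetical.

```python
def disk_for_offset(offset_kib, n_disks=10, chunk_kib=64):
    """Map an array offset to the member disk (md slot) holding it.

    Assumed layout: RAID-6 with left-symmetric rotation, i.e. for stripe s
    the P parity is on disk (n_disks - 1 - s % n_disks), Q is on the next
    disk, and the data chunks follow Q, wrapping around.
    """
    data_disks = n_disks - 2                 # RAID-6: two parity chunks per stripe
    stripe, within = divmod(offset_kib, data_disks * chunk_kib)
    chunk_idx = within // chunk_kib          # which data chunk inside the stripe
    p_disk = n_disks - 1 - (stripe % n_disks)
    return (p_disk + 2 + chunk_idx) % n_disks


def disks_holding_bitmaps(n_bitmaps, bitmap_spacing_kib=128 * 1024, **layout):
    """Set of member disks touched when reading the first n_bitmaps bitmaps."""
    return {disk_for_offset(k * bitmap_spacing_kib, **layout)
            for k in range(n_bitmaps)}


used = disks_holding_bitmaps(1000)
print(sorted(used))   # five of the ten slots under the assumptions above
```

Under these assumptions 128 MB is a whole number of stripes (256 of them),
so every bitmap starts a stripe, and stepping the stripe number by 256 only
ever visits every other rotation position on a ten-disk array. The slot
numbers themselves depend on the assumed layout; the count of five is the
part that matches what was reported.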
* RE: Problem with reiserfs volume
  2009-05-07 5:59 ` Corey Hickey
@ 2009-05-11 16:37 ` Leslie Rhorer
  0 siblings, 0 replies; 12+ messages in thread
From: Leslie Rhorer @ 2009-05-11 16:37 UTC
To: reiserfs-devel

> >> I might not have been clear on this before: reading the bitmap data is
> >> slow because it is distributed every 128 MB across the filesystem; this
> >> means that in order to read lots of bitmaps, the disk spends most of its
> >> time seeking rather than reading. For me, that's what was causing the
> >> disk to "buzz", and that's why dstat showed read rates of only 400-600
> >> KB/sec.
> >
> > Yeah, but reads and writes worked just fine: up to 450 Mbps.
>
> I mean, above, that read rates would fall to 400-600 KB/sec when the
> filesystem was busy reading bitmap data.

Well, first of all, it would drop to more like 4KBps, not 400KBps.

> That at least roughly
> corresponds to what you wrote on 2009-04-28: "The reads at the array
> level would fall to zero on 5 of the 10 drives, while the other 5 would
> report a very low level of read activity, but not zero."
>
> > Appending to
> > an existing file (or writing several GB to a file once the create was done)
> > ran like a racehorse on one or several files without ever a burp. Reading
> > could be accomplished flat-out no matter what, but with total disk activity
> > well in excess of 500Mbps, everything would suddenly halt if a file was
> > created on an intermittent basis.
>
> That's just like what was happening to me. The filesystem would drop
> everything else it was doing and read bitmaps for a while.
>
> >> Having the bitmaps spread out among several disks of a RAID probably
> >> wouldn't help. Reiserfs doesn't try to read the bitmaps in parallel;
> >> that would be bad unless it knew the RAID layout. So, each disk would
> >> just be idle when it wasn't its turn to seek and read another bitmap.
> >
> > With 400+ Mbps of data being read and written, the discs weren't idle very
> > much.
>
> Except that when the filesystem is busy reading bitmaps, it isn't doing
> anything else.... :)

Are you saying it doesn't read the bitmaps during reads and writes?

> >>> Except this happened without any file writes or reads other than the file
> >>> creation itself and with no disk activity other than the array re-sync.
> >> I remember even 0-byte files taking a long time to write. My guess would
> >> be that reiserfs doesn't know the file will end up being empty when the
> >> file is created, or perhaps it tries to find some contiguous space
> >> anyway so the file can be appended to without excessive fragmentation.
> >
> > So why didn't it happen when appending data to an existing file? Once a
> > file was created, large or small, I could write freely to it over and over,
> > either appending data or writing over data.
>
> I don't know how appends or overwrites are handled. The scheme for
> finding free space may differ.

Yes, of course that's true, but I wouldn't think it would be so by design.
It also doesn't explain why the event was more likely during heavy activity.