* data corruption with 2.4.25 and datalogging patches
@ 2006-07-12  6:16 Francisco Javier Cabello
  2006-07-12  8:24 ` Hans Reiser
  2006-07-13 14:34 ` Vladimir V. Saveliev
  0 siblings, 2 replies; 17+ messages in thread
From: Francisco Javier Cabello @ 2006-07-12  6:16 UTC (permalink / raw)
To: reiserfs-list

Hello,
My company develops a video recorder system. Basically we work with linux
boxes running kernel 2.4.25. The system captures analogue video and, after
processing and compressing it, stores the digital video on hard disk. We are
recording continuously (24x7).

We have found that roughly 10% of our systems suffer data corruption in the
reiserfs partition. Sometimes it can be fixed by running
'reiserfsck --rebuild-tree', but not always.

More information:
-Kernel 2.4.25 + v4l2 patches
-Reiserfsprogs 3.6.19
-Datalogging patches.
(http://mirror.mcs.anl.gov/suse-people/mason/patches/data-logging/2.4.25/)

I have checked the datalogging patches from the Reiserfs website and they
seem identical to the SUSE ones.

I have no idea what is happening. The disk bandwidth is not high
(300-500kb/sec). The disk is always about 90% full (we have a process
deleting old video).

I have been thinking about removing the datalogging patches, but I would like
to have a solid reason first. It is not easy to check whether the problem is
solved, because we cannot reproduce the error at our headquarters.

Regards,

Paco
--
One of my most productive days was throwing away 1000 lines of code
(Ken Thompson)

^ permalink raw reply [flat|nested] 17+ messages in thread

* Re: data corruption with 2.4.25 and datalogging patches
From: Hans Reiser @ 2006-07-12  8:24 UTC
To: Francisco Javier Cabello; +Cc: reiserfs-list, Chris Mason

Francisco Javier Cabello wrote:

>Hello,
>My company develops a video recorder system. Basically we work with linux
>boxes running kernel 2.4.25. The system captures analogue video and, after
>processing and compressing it, stores the digital video on hard disk. We are
>recording continuously (24x7).
>
>We have found that roughly 10% of our systems suffer data corruption in the
>reiserfs partition. Sometimes it can be fixed by running
>'reiserfsck --rebuild-tree', but not always.
>
>More information:
>-Kernel 2.4.25 + v4l2 patches
>-Reiserfsprogs 3.6.19
>-Datalogging patches.
>(http://mirror.mcs.anl.gov/suse-people/mason/patches/data-logging/2.4.25/)
>
>I have checked the datalogging patches from the Reiserfs website and they
>seem identical to the SUSE ones.
>
>I have no idea what is happening. The disk bandwidth is not high
>(300-500kb/sec). The disk is always about 90% full (we have a process
>deleting old video).
>
>I have been thinking about removing the datalogging patches, but I would like
>to have a solid reason first. It is not easy to check whether the problem is
>solved, because we cannot reproduce the error at our headquarters.
>
>Regards,
>
>Paco

Unless Chris has an idea what might be going on off the top of his head,
this sounds like a paid support problem, and I must say it sounds like it
could be a lot of work to reproduce it. If solving it still interests you,
we charge $150/hr.

I have made no effort to test and verify the datalogging patches myself,
and most users do meta-data journaling, so I cannot say much without
assigning someone the task of reviewing them and your problem. Reiser4
might be a desirable answer for you.

* Re: data corruption with 2.4.25 and datalogging patches
From: Vladimir V. Saveliev @ 2006-07-13 14:34 UTC
To: Francisco Javier Cabello; +Cc: reiserfs-list

Hello

On Wed, 2006-07-12 at 08:16 +0200, Francisco Javier Cabello wrote:
> Hello,
> My company develops a video recorder system. Basically we work with linux
> boxes running kernel 2.4.25. The system captures analogue video and, after
> processing and compressing it, stores the digital video on hard disk. We are
> recording continuously (24x7).
>
> We have found that roughly 10% of our systems suffer data corruption in the
> reiserfs partition.

Did unclean shutdowns take place on those systems?
If you let us see what reiserfsck reports in those cases, that could
help us understand what is happening.

> Sometimes it can be fixed by running
> 'reiserfsck --rebuild-tree', but not always.

* Re: data corruption with 2.4.25 and datalogging patches
From: Francisco Javier Cabello @ 2006-07-14  8:25 UTC
To: Vladimir V. Saveliev; +Cc: reiserfs-list

Hello,
I am almost sure that unclean shutdowns happen on those systems. We tried to
reproduce the problem by cutting power every 5 minutes, but the filesystem
did not become corrupted. Perhaps it is related, but I don't know.

I mentioned the datalogging patches because it is the only thing that is
different in our systems. I have searched a lot, and few people report
corruption with plain reiserfs, so it may be the datalogging patches.

What do you need from reiserfsck? I guess the output of
'reiserfsck --check device', or perhaps you need the output of
'reiserfsck --rebuild-tree'.

Regards,

Paco

On Thursday, 13 de July de 2006 16:34, Vladimir V. Saveliev wrote:
> Did unclean shutdowns take place on those systems?
> If you let us see what reiserfsck reports in those cases, that could
> help us understand what is happening.

--
-----------------
PGP fingerprint: AF69 62B4 97EB F5BB 2C60 B802 568A E122 BBBE 5820
PGP Key available at http://pgp.mit.edu
-----------------

* Re: data corruption with 2.4.25 and datalogging patches
From: Vladimir V. Saveliev @ 2006-07-14 11:48 UTC
To: Francisco Javier Cabello; +Cc: reiserfs-list

Hello

On Fri, 2006-07-14 at 10:25 +0200, Francisco Javier Cabello wrote:
> Hello,
> I am almost sure that unclean shutdowns happen on those systems. We tried to
> reproduce the problem by cutting power every 5 minutes, but the filesystem
> did not become corrupted. Perhaps it is related, but I don't know.
>
> I mentioned the datalogging patches because it is the only thing that is
> different in our systems.

Sorry, I am confused. Am I correct that you have a set of systems that all
run a similar load on the same kernel, and only ~10% of them encounter
reiserfs corruption? Do they have identical hardware?

> I have searched a lot, and few people report corruption with plain
> reiserfs, so it may be the datalogging patches.
>
> What do you need from reiserfsck? I guess the output of
> 'reiserfsck --check device',

Yes. There is a -l option to redirect the output to a log file.

> or perhaps you need the output of 'reiserfsck --rebuild-tree'.

* Re: data corruption with 2.4.25 and datalogging patches
From: Francisco Javier Cabello @ 2006-07-14 12:03 UTC
To: Vladimir V. Saveliev; +Cc: reiserfs-list

Yes. I have a set of systems with the same main board, memory,
microprocessor... They are identical. The difference is the conditions under
which they work: perhaps the CPU load average, the amount of data they
write, or the number of power failures differs.

I am going to send you the reiserfsck output from some of the systems.

Regards,

Paco

On Friday, 14 de July de 2006 13:48, Vladimir V. Saveliev wrote:
> Sorry, I am confused. Am I correct that you have a set of systems that all
> run a similar load on the same kernel, and only ~10% of them encounter
> reiserfs corruption? Do they have identical hardware?

* Re: data corruption with 2.4.25 and datalogging patches
From: Francisco Javier Cabello @ 2006-07-14 12:20 UTC
To: reiserfs-list; +Cc: Vladimir V. Saveliev

Hello Vladimir,

# reiserfsck -l /tmp/reiserfsck.log -y --check /dev/hdc1

Standard output:
======================================================
Will read-only check consistency of the filesystem on /dev/hdc1
Will put log info to '/tmp/reiserfsck.log'
###########
reiserfsck --check started at Fri Jul 14 14:09:33 2006
###########
Replaying journal..
Reiserfs journal '/dev/hdc1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..finished
Comparing bitmaps..Bad nodes were found, Semantic pass skipped
1 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Fri Jul 14 14:13:29 2006
###########
======================================================

/tmp/reiserfsck.log:
======================================================
bad_internal: vpf-10320: block 23868569, items 91 and 92: The wrong order of items: [410810496 11321 0x16abca00 ??? (15)], [11312 11321 0x22f1c880 DIR (3)]
the problem in the internal node occured (23868569), whole subtree is skipped
vpf-10640: The on-disk and the correct bitmaps differs.
======================================================

Regards,

Paco

* Re: data corruption with 2.4.25 and datalogging patches
From: Vladimir V. Saveliev @ 2006-07-14 12:59 UTC
To: Francisco Javier Cabello; +Cc: reiserfs-list

Hello

On Fri, 2006-07-14 at 14:20 +0200, Francisco Javier Cabello wrote:
> Hello Vladimir,
>
> # reiserfsck -l /tmp/reiserfsck.log -y --check /dev/hdc1
>
> Standard output:
> ======================================================
> Will read-only check consistency of the filesystem on /dev/hdc1
> Will put log info to '/tmp/reiserfsck.log'
> ###########
> reiserfsck --check started at Fri Jul 14 14:09:33 2006
> ###########
> Replaying journal..
> Reiserfs journal '/dev/hdc1' in blocks [18..8211]: 0 transactions replayed
> Checking internal tree..finished
> Comparing bitmaps..Bad nodes were found, Semantic pass skipped
> 1 found corruptions can be fixed only when running with --rebuild-tree
> ###########
> reiserfsck finished at Fri Jul 14 14:13:29 2006
> ###########
> ======================================================
>
> /tmp/reiserfsck.log:
> ======================================================
> bad_internal: vpf-10320: block 23868569, items 91 and 92: The wrong order of
> items: [410810496 11321 0x16abca00 ??? (15)], [11312 11321 0x22f1c880 DIR
> (3)]

Such corruptions are usually considered to be caused by hardware faults,
memory failure for instance. Have you ever run memtest on your systems?

> the problem in the internal node occured (23868569), whole subtree is skipped
> vpf-10640: The on-disk and the correct bitmaps differs.
> ======================================================

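The "wrong order of items" message in the log above can be illustrated with a
small sketch. It assumes that the four values printed for each key are
compared field by field in the order shown; the field names used here
(directory id, object id, offset, type) are assumptions for illustration and
are not taken from the thread, and this is not reiserfsck's actual code.

```python
from typing import NamedTuple, Optional, Sequence


class ItemKey(NamedTuple):
    """The four values printed per key in the log (field names are assumed)."""
    dir_id: int
    object_id: int
    offset: int
    item_type: int


def first_out_of_order(keys: Sequence[ItemKey]) -> Optional[int]:
    """Return the index i of the first pair (keys[i], keys[i + 1]) that is not
    strictly ascending, or None if the whole sequence is ordered."""
    for i in range(len(keys) - 1):
        # NamedTuple comparison is lexicographic, field by field.
        if keys[i] >= keys[i + 1]:
            return i
    return None


# The two keys reported for items 91 and 92 of block 23868569.
item_91 = ItemKey(410810496, 11321, 0x16ABCA00, 15)
item_92 = ItemKey(11312, 11321, 0x22F1C880, 3)

# Prints 0: under the assumed ordering, item 91 sorts after item 92.
print(first_out_of_order([item_91, item_92]))
```

Under that assumption the first field alone decides the comparison: 410810496
is far larger than 11312, so the pair cannot be in ascending order.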
* Re: data corruption with 2.4.25 and datalogging patches
From: Francisco Javier Cabello @ 2006-07-17  8:53 UTC
To: Vladimir V. Saveliev; +Cc: reiserfs-list

Hello Vladimir,

> Such corruptions are usually considered to be caused by hardware faults,
> memory failure for instance. Have you ever run memtest on your systems?

Yes, we have run memtest on our systems. It is very rare to find a running
system with a hardware memory problem; when we find a memory problem, the
kernel doesn't boot. I am going to run memtest on some of the systems with
the reiserfs corruption problem.

Can I give you more information? Would it be useful if I ran
'reiserfsck --rebuild-tree' and sent you the traces?

Regards,

Paco

* Re: data corruption with 2.4.25 and datalogging patches
From: Vladimir V. Saveliev @ 2006-07-17 17:55 UTC
To: Francisco Javier Cabello; +Cc: reiserfs-list

Hello

On Mon, 2006-07-17 at 10:53 +0200, Francisco Javier Cabello wrote:
> Hello Vladimir,
> > Such corruptions are usually considered to be caused by hardware faults,
> > memory failure for instance. Have you ever run memtest on your systems?
>
> Yes, we have run memtest on our systems. It is very rare to find a running
> system with a hardware memory problem; when we find a memory problem, the
> kernel doesn't boot. I am going to run memtest on some of the systems with
> the reiserfs corruption problem.

Please let it run for a few hours at least.

> Can I give you more information? Would it be useful if I ran
> 'reiserfsck --rebuild-tree' and sent you the traces?

OK, although you have already sent the reiserfsck --check log. The
corruption looks as if the contents of a block were overwritten with random
characters. We usually consider such corruption to be caused by hardware
faults, especially since most of your systems run flawlessly under similar
circumstances.

* Re: data corruption with 2.4.25 and datalogging patches
From: Brad Dameron @ 2006-07-17 18:14 UTC
To: reiserfs-list

On Mon, 2006-07-17 at 21:55 +0400, Vladimir V. Saveliev wrote:
> Hello
>
> On Mon, 2006-07-17 at 10:53 +0200, Francisco Javier Cabello wrote:
> > Hello Vladimir,
> > > Such corruptions are usually considered to be caused by hardware faults,
> > > memory failure for instance. Have you ever run memtest on your systems?
> >
> > Yes, we have run memtest on our systems. It is very rare to find a running
> > system with a hardware memory problem; when we find a memory problem, the
> > kernel doesn't boot. I am going to run memtest on some of the systems with
> > the reiserfs corruption problem.

This is not true. There are certain memory issues that can still allow the
system to boot and appear to run OK. I had a system that didn't show a
memory error until the 4th pass of memtest; I just happened to let it run
over the weekend. I have seen other issues on my larger systems with 64GB of
RAM, where memtest detected nothing after a week but the kernel mcelog
reported weird ECC memory issues. I replaced several DIMMs and the issue
went away. But who knows what could have occurred had I not replaced the
memory.

Brad Dameron
SeaTab Software
www.seatab.com

* Re: data corruption with 2.4.25 and datalogging patches
From: Hans Reiser @ 2006-07-17 19:12 UTC
To: Brad Dameron; +Cc: reiserfs-list

It seems like bad memory is growing as a percentage of user filesystem
problem sources. Do others have that feeling also?

Hans

* Re: data corruption with 2.4.25 and datalogging patches
From: Valdis.Kletnieks @ 2006-07-17 20:09 UTC
To: Hans Reiser; +Cc: Brad Dameron, reiserfs-list

On Mon, 17 Jul 2006 12:12:22 PDT, Hans Reiser said:
> It seems like bad memory is growing as a percentage of user filesystem
> problem sources. Do others have that feeling also?

Assuming that every 16 megabit (or whatever size it is) RAM chip has the
same chance of having a flaky bit, the chance of bad memory in any given
gigabyte of RAM is the same, so if you have 1/2G of memory, you have 1/8
the chance of a bad bit compared to having 4G installed.

The bigger question is why ECC isn't catching this stuff....

(and yes, I know some hardware doesn't do ECC on all data paths, which
is the point :)

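The scaling argument above can be checked with a few lines of arithmetic.
This is only a sketch: the per-chip failure probability is a made-up value,
the chips are assumed to fail independently, and the 16-megabit chip size is
the one guessed at in the message.

```python
import math


def p_any_bad_chip(p_chip: float, n_chips: int) -> float:
    """Probability that at least one of n independent chips is flaky."""
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_chips * math.log1p(-p_chip))


CHIP_MIB = 2     # a 16-megabit chip holds 2 MiB
P_CHIP = 1e-6    # hypothetical per-chip probability of a flaky bit

chips_half_gib = 512 // CHIP_MIB     # 256 chips in 1/2G
chips_four_gib = 4096 // CHIP_MIB    # 2048 chips in 4G

p_half = p_any_bad_chip(P_CHIP, chips_half_gib)
p_four = p_any_bad_chip(P_CHIP, chips_four_gib)
ratio = p_four / p_half

print(f"P(bad chip, 1/2G) = {p_half:.3e}")
print(f"P(bad chip, 4G)   = {p_four:.3e}")
print(f"ratio             = {ratio:.2f}")
```

For a small per-chip probability the ratio comes out just under 8, matching
the "1/8 the chance" estimate; it falls further below 8 only when the
per-chip probability is large enough that several bad chips per machine
become likely.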
* Re: data corruption with 2.4.25 and datalogging patches @ 2006-07-17 21:01 ` Toby Thain 0 siblings, 0 replies; 17+ messages in thread From: Toby Thain @ 2006-07-17 21:01 UTC (permalink / raw) To: reiserfs-list On 17-Jul-06, at 2:14 PM, Brad Dameron wrote: > On Mon, 2006-07-17 at 21:55 +0400, Vladimir V. Saveliev wrote: >> Hello >> >> On Mon, 2006-07-17 at 10:53 +0200, Francisco Javier Cabello wrote: >>> Hello Vladimir, >>>> such corruptions used to be considered as hardware bugs. Memory >>>> failure, >>>> for instance. Did you ever run memtest on your systems? >>> >>> Yes, We have run memtest in our system. It's very seldom to find >>> a system with >>> a hardware memory problem running. When we find a memory problem >>> the kernel >>> doesn't boot. I am going to pass memtest in some of the system >>> with reiserfs >>> corruption problem. >>> > > This is not true. There are certain memory issues that can still allow > the system to boot and appear to run ok. I had a system that didn't > show > a memory error until the 4th pass on memtest. I just happened to > let it > run over the weekend. I have seen other issues with my larger systems > that have 64GB of ram. To where memtest after a week didn't detect > anything but the kernel mcelog reported weird ECC memory issues. I > replaced several DIMM's and the issue went away. But who knows what > could of occured had I not replaced the memory. I agree with Brad. Memory problems can certainly manifest in obvious or obscure ways that don't prevent boot. I spent months chasing down what I thought was an IDE controller chipset problem (corrupt disk I/O invisible to the kernel, hence corrupt filesystems, etc.) that was simply bad RAM. --T > > Brad Dameron > SeaTab Software > www.seatab.com > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: data corruption with 2.4.25 and datalogging patches 2006-07-14 12:59 ` Vladimir V. Saveliev 2006-07-17 8:53 ` Francisco Javier Cabello @ 2006-07-17 10:49 ` Francisco Javier Cabello 2006-07-19 12:33 ` Francisco Javier Cabello 2 siblings, 0 replies; 17+ messages in thread From: Francisco Javier Cabello @ 2006-07-17 10:49 UTC (permalink / raw) To: Vladimir V. Saveliev; +Cc: reiserfs-list [-- Attachment #1: Type: text/plain, Size: 2027 bytes --] Hello, do you think it could be related to low-memory conditions? Perhaps the systems with the problem have high memory use with a lot of swapping. I am just brainstorming... Regards, Paco On Friday, 14 July 2006 14:59, Vladimir V. Saveliev wrote: > Hello > > On Fri, 2006-07-14 at 14:20 +0200, Francisco Javier Cabello wrote: > > Hello Vladimir, > > > > # reiserfsck -l /tmp/reiserfsck.log -y --check /dev/hdc1 > > > > Standard output: > > ====================================================== > > Will read-only check consistency of the filesystem on /dev/hdc1 > > Will put log info to '/tmp/reiserfsck.log' > > ########### > > reiserfsck --check started at Fri Jul 14 14:09:33 2006 > > ########### > > Replaying journal.. > > Reiserfs journal '/dev/hdc1' in blocks [18..8211]: 0 transactions > > replayed Checking internal tree..finished > > Comparing bitmaps..Bad nodes were found, Semantic pass skipped > > 1 found corruptions can be fixed only when running with --rebuild-tree > > ########### > > reiserfsck finished at Fri Jul 14 14:13:29 2006 > > ########### > > ====================================================== > > > > /tmp/reiserfsck.log: > > ====================================================== > > bad_internal: vpf-10320: block 23868569, items 91 and 92: The wrong order > > of items: [410810496 11321 0x16abca00 ??? (15)], [11312 11321 0x22f1c880 > > DIR (3)] > > such corruptions used to be considered as hardware bugs. Memory failure, > for instance. Did you ever run memtest on your systems? 
> > > the problem in the internal node occured (23868569), whole subtree is > > skipped vpf-10640: The on-disk and the correct bitmaps differs. > > ====================================================== -- One of my most productive days was throwing away 1000 lines of code (Ken Thompson) ----------------- PGP fingerprint: AF69 62B4 97EB F5BB 2C60 B802 568A E122 BBBE 5820 PGP Key available at http://pgp.mit.edu ----------------- [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
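The "wrong order of items" report quoted in this message can be illustrated with a small ordering check. This is a simplified sketch, not reiserfsck's actual logic: it assumes the bracketed fields are directory id, object id, and offset in that order, compares them lexicographically, and ignores the item type. `Key` and `first_disorder` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Key(NamedTuple):
    """Simplified item key; the real on-disk key also carries a type."""
    dir_id: int
    objectid: int
    offset: int


def first_disorder(keys: Sequence[Key]) -> Optional[int]:
    """Return the index i of the first pair with keys[i] > keys[i + 1].

    Returns None when the keys are in non-decreasing order.
    """
    for i in range(len(keys) - 1):
        if keys[i] > keys[i + 1]:  # NamedTuple compares field by field
            return i
    return None


# The two keys from the log line for block 23868569 (items 91 and 92).
reported = [
    Key(410810496, 11321, 0x16ABCA00),
    Key(11312, 11321, 0x22F1C880),
]

print(first_disorder(reported))
```

Under these assumptions the first key sorts after the second because its directory id (410810496) is far larger than 11312, which is the kind of out-of-order pair the checker flags.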
* Re: data corruption with 2.4.25 and datalogging patches 2006-07-14 12:59 ` Vladimir V. Saveliev 2006-07-17 8:53 ` Francisco Javier Cabello 2006-07-17 10:49 ` Francisco Javier Cabello @ 2006-07-19 12:33 ` Francisco Javier Cabello 2006-07-20 7:29 ` Francisco Javier Cabello 2 siblings, 1 reply; 17+ messages in thread From: Francisco Javier Cabello @ 2006-07-19 12:33 UTC (permalink / raw) To: Vladimir V. Saveliev; +Cc: reiserfs-list [-- Attachment #1.1: Type: text/plain, Size: 2296 bytes --] Hello, I don't think it's a memory problem. I have run memtest for more than 24 hours and the memory is OK. We use the best memory on the market (Kingston). When a system fails we check the hardware devices first. If we find a hardware problem we usually don't check for software problems. I have another system with reiserfs corruption. I have attached the reiserfsck output for 'reiserfsck --check' and 'reiserfsck --check --fix-fixable'. Regards, Paco On Friday, 14 July 2006 14:59, Vladimir V. Saveliev wrote: > Hello > > On Fri, 2006-07-14 at 14:20 +0200, Francisco Javier Cabello wrote: > > Hello Vladimir, > > > > # reiserfsck -l /tmp/reiserfsck.log -y --check /dev/hdc1 > > > > Standard output: > > ====================================================== > > Will read-only check consistency of the filesystem on /dev/hdc1 > > Will put log info to '/tmp/reiserfsck.log' > > ########### > > reiserfsck --check started at Fri Jul 14 14:09:33 2006 > > ########### > > Replaying journal.. 
> > Reiserfs journal '/dev/hdc1' in blocks [18..8211]: 0 transactions > > replayed Checking internal tree..finished > > Comparing bitmaps..Bad nodes were found, Semantic pass skipped > > 1 found corruptions can be fixed only when running with --rebuild-tree > > ########### > > reiserfsck finished at Fri Jul 14 14:13:29 2006 > > ########### > > ====================================================== > > > > /tmp/reiserfsck.log: > > ====================================================== > > bad_internal: vpf-10320: block 23868569, items 91 and 92: The wrong order > > of items: [410810496 11321 0x16abca00 ??? (15)], [11312 11321 0x22f1c880 > > DIR (3)] > > such corruptions used to be considered as hardware bugs. Memory failure, > for instance. Did you ever run memtest on your systems? > > > the problem in the internal node occured (23868569), whole subtree is > > skipped vpf-10640: The on-disk and the correct bitmaps differs. > > ====================================================== -- One of my most productive days was throwing away 1000 lines of code (Ken Thompson) ----------------- PGP fingerprint: AF69 62B4 97EB F5BB 2C60 B802 568A E122 BBBE 5820 PGP Key available at http://pgp.mit.edu ----------------- [-- Attachment #1.2: reiserfsck_check_with_fixfixable --] [-- Type: text/plain, Size: 1233 bytes --] bad_stat_data: The objectid (14933) is shared by at least two files. Can be fixed with --rebuild-tree only. vpf-10630: The on-disk and the correct bitmaps differs. Will be fixed later. 
vpf-10650: The directory [14056 14057] has the wrong size in the StatData (16320) - corrected to (16248) check_semantic_pass: Name "C04_060228T175015298+000-175337903+000_0000240_792.mpeg" in directory [14056 14063] points to nowhere - removed check_semantic_pass: Name "C04_060228T173711821+000-174033786+000_0000240_773.mpeg" in directory [14056 14063] points to nowhere - removed check_semantic_pass: Name "C04_060228T173404188+000-173711581+000_0000240_758.mpeg" in directory [14056 14063] points to nowhere - removed check_semantic_pass: Name "C04_060228T174034066+000-174357422+000_0000240_782.mpeg" in directory [14056 14063] points to nowhere - removed check_semantic_pass: Name "C04_060228T174357662+000-174701364+000_0000240_762.mpeg" in directory [14056 14063] points to nowhere - removed check_semantic_pass: Name "C04_060228T174701604+000-175015018+000_0000240_773.mpeg" in directory [14056 14063] points to nowhere - removed vpf-10650: The directory [14056 14063] has the wrong size in the StatData (7104) - corrected to (6672) [-- Attachment #1.3: reiserfsck_check_without_fixfixable --] [-- Type: text/plain, Size: 1144 bytes --] bad_stat_data: The objectid (14933) is shared by at least two files. Can be fixed with --rebuild-tree only. vpf-10640: The on-disk and the correct bitmaps differs. 
vpf-10650: The directory [14056 14057] has the wrong size in the StatData (16320), should be (16248) check_semantic_pass: Name "C04_060228T175015298+000-175337903+000_0000240_792.mpeg" in directory [14056 14063] points to nowhere check_semantic_pass: Name "C04_060228T173711821+000-174033786+000_0000240_773.mpeg" in directory [14056 14063] points to nowhere check_semantic_pass: Name "C04_060228T173404188+000-173711581+000_0000240_758.mpeg" in directory [14056 14063] points to nowhere check_semantic_pass: Name "C04_060228T174034066+000-174357422+000_0000240_782.mpeg" in directory [14056 14063] points to nowhere check_semantic_pass: Name "C04_060228T174357662+000-174701364+000_0000240_762.mpeg" in directory [14056 14063] points to nowhere check_semantic_pass: Name "C04_060228T174701604+000-175015018+000_0000240_773.mpeg" in directory [14056 14063] points to nowhere vpf-10650: The directory [14056 14063] has the wrong size in the StatData (7104), should be (6672) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
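Deciding whether a check log calls for `--rebuild-tree` can be automated with a simple text scan. This is a minimal sketch, assuming the log contains phrases like the ones quoted in this thread; it is not an exhaustive list of reiserfsck messages, and `needs_rebuild_tree` is a hypothetical helper name.

```python
def needs_rebuild_tree(log_text: str) -> bool:
    """Return True if any log line says a fix requires --rebuild-tree.

    Matches only on wording seen in this thread, for example
    "Can be fixed with --rebuild-tree only".
    """
    for line in log_text.splitlines():
        lowered = line.lower()
        if "--rebuild-tree" in lowered and "fixed" in lowered:
            return True
    return False


# Lines taken from the attached 'reiserfsck --check' output.
sample_log = "\n".join([
    "bad_stat_data: The objectid (14933) is shared by at least two files. "
    "Can be fixed with --rebuild-tree only.",
    "vpf-10640: The on-disk and the correct bitmaps differs.",
    "vpf-10650: The directory [14056 14057] has the wrong size in the "
    "StatData (16320), should be (16248)",
])

print(needs_rebuild_tree(sample_log))
```

A scan like this only triages logs; the decision to actually run `reiserfsck --rebuild-tree` on a production disk should still be made by a person, ideally after taking a backup.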
* Re: data corruption with 2.4.25 and datalogging patches 2006-07-19 12:33 ` Francisco Javier Cabello @ 2006-07-20 7:29 ` Francisco Javier Cabello 0 siblings, 0 replies; 17+ messages in thread From: Francisco Javier Cabello @ 2006-07-20 7:29 UTC (permalink / raw) To: reiserfs-list; +Cc: Vladimir V. Saveliev [-- Attachment #1.1: Type: text/plain, Size: 2595 bytes --] In the end I was able to fix the filesystem. I have attached the output of 'reiserfsck --rebuild-tree'. Do you have any idea what caused it? Regards, Paco On Wednesday, 19 July 2006 14:33, Francisco Javier Cabello wrote: > Hello, > I don't think it's a memory problem. I have run memtest for more than 24 > hours and the memory is OK. > We use the best memory on the market (Kingston). When a system > fails we check the hardware devices first. If we find a hardware problem we > usually don't check for software problems. > > I have another system with reiserfs corruption. I have attached the reiserfsck > output for 'reiserfsck --check' and 'reiserfsck --check --fix-fixable'. > > Regards, > > Paco > > On Friday, 14 July 2006 14:59, Vladimir V. Saveliev wrote: > > Hello > > > > On Fri, 2006-07-14 at 14:20 +0200, Francisco Javier Cabello wrote: > > > Hello Vladimir, > > > > > > # reiserfsck -l /tmp/reiserfsck.log -y --check /dev/hdc1 > > > > > > Standard output: > > > ====================================================== > > > Will read-only check consistency of the filesystem on /dev/hdc1 > > > Will put log info to '/tmp/reiserfsck.log' > > > ########### > > > reiserfsck --check started at Fri Jul 14 14:09:33 2006 > > > ########### > > > Replaying journal.. 
> > > Reiserfs journal '/dev/hdc1' in blocks [18..8211]: 0 transactions > > > replayed Checking internal tree..finished > > > Comparing bitmaps..Bad nodes were found, Semantic pass skipped > > > 1 found corruptions can be fixed only when running with --rebuild-tree > > > ########### > > > reiserfsck finished at Fri Jul 14 14:13:29 2006 > > > ########### > > > ====================================================== > > > > > > /tmp/reiserfsck.log: > > > ====================================================== > > > bad_internal: vpf-10320: block 23868569, items 91 and 92: The wrong > > > order of items: [410810496 11321 0x16abca00 ??? (15)], [11312 11321 > > > 0x22f1c880 DIR (3)] > > > > such corruptions used to be considered as hardware bugs. Memory failure, > > for instance. Did you ever run memtest on your systems? > > > > > the problem in the internal node occured (23868569), whole subtree is > > > skipped vpf-10640: The on-disk and the correct bitmaps differs. > > > ====================================================== -- One of my most productive days was throwing away 1000 lines of code (Ken Thompson) ----------------- PGP fingerprint: AF69 62B4 97EB F5BB 2C60 B802 568A E122 BBBE 5820 PGP Key available at http://pgp.mit.edu ----------------- [-- Attachment #1.2: reiserfsck_rebuildtree --] [-- Type: text/plain, Size: 257 bytes --] ####### Pass 0 ####### 24011 directory entries were hashed with "r5" hash. ####### Pass 1 ####### ####### Pass 2 ####### ####### Pass 3 ######### ####### Pass 3a (lost+found pass) ######### rewrite_file: 3 items of file [14057 14933] moved to [14057 23960] [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
End of thread. Thread overview (17+ messages): 2006-07-12 6:16 data corruption with 2.4.25 and datalogging patches Francisco Javier Cabello 2006-07-12 8:24 ` Hans Reiser 2006-07-13 14:34 ` Vladimir V. Saveliev 2006-07-14 8:25 ` Francisco Javier Cabello 2006-07-14 11:48 ` Vladimir V. Saveliev 2006-07-14 12:03 ` Francisco Javier Cabello 2006-07-14 12:20 ` Francisco Javier Cabello 2006-07-14 12:59 ` Vladimir V. Saveliev 2006-07-17 8:53 ` Francisco Javier Cabello 2006-07-17 17:55 ` Vladimir V. Saveliev 2006-07-17 18:14 ` Brad Dameron 2006-07-17 19:12 ` Hans Reiser 2006-07-17 20:09 ` Valdis.Kletnieks 2006-07-17 21:01 ` Toby Thain 2006-07-17 10:49 ` Francisco Javier Cabello 2006-07-19 12:33 ` Francisco Javier Cabello 2006-07-20 7:29 ` Francisco Javier Cabello