* Many open/close on same files yeilds "No such file or directory".
@ 2008-05-01 15:34 Jesper Krogh
2008-05-02 5:39 ` Andrew Morton
2008-05-09 5:22 ` Jesper Krogh
0 siblings, 2 replies; 28+ messages in thread
From: Jesper Krogh @ 2008-05-01 15:34 UTC (permalink / raw)
To: linux-kernel
Hi list.
I have a "fairly" reproducible problem. When a program opens and closes
the same file many times, it eventually ends up with a "no such file or
directory". Test program that can reproduce the problem on my setup:
root@hest:~# cat test-file-c.c
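#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
    unsigned long i = 0;
    int fh;
    char *filename;
    /* Guard against a missing file argument */
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        exit(1);
    }
    filename = argv[1];
    /* Open and close the same file until open() fails */
    while (1) {
        fh = open(filename, O_RDONLY);
        if (fh == -1) {
            printf("Failed to open %s\n", filename);
            printf("Open number: %lu\n", i); /* %lu matches unsigned long */
            exit(10);
        }
        close(fh);
        i++;
    }
    exit(0);
}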
root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
Failed to open /z/bio/databases/online/index/index-by-accno
Open number: 61785000
root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
Failed to open /z/bio/databases/online/index/index-by-accno
Open number: 120929685
(The problem is not isolated to a single file on the filesystem).
strace on the program reveals that the system indeed returns a "No such
file or directory" (ENOENT) to the program.
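A minimal sketch, assuming a hypothetical helper named open_close_loop
(it is not part of the original test program), of how the loop could
record errno itself, so that the ENOENT seen in strace also shows up in
the program's own output:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open and close `path` up to `max_iter` times.
 * Returns 0 if every open() succeeded, otherwise the errno of the first
 * failure; *failed_at then holds the index of the failing iteration. */
int open_close_loop(const char *path, unsigned long max_iter,
                    unsigned long *failed_at)
{
    assert(path != NULL && failed_at != NULL);

    for (unsigned long i = 0; i < max_iter; i++) {
        int fd = open(path, O_RDONLY);
        if (fd == -1) {
            int saved = errno; /* save before another call clobbers it */
            *failed_at = i;
            fprintf(stderr, "open(%s) failed at iteration %lu: %s (errno %d)\n",
                    path, i, strerror(saved), saved);
            return saved;
        }
        close(fd);
    }
    return 0;
}
```

A caller would compare the return value against ENOENT to tell this
failure apart from, say, EMFILE or EACCES.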
This is run on Ubuntu Gutsy (vendor kernel): 2.6.22-14-server on a
4.5TB ext3 filesystem on an LVM volume. The volume was created on
Dapper (2 releases back) and has simply been carried along through the
upgrades.
I cannot reproduce it on other disks attached to the same server or on
other servers attached to similar disk systems.
The filesystem was taken offline yesterday for a forced fsck and it was
found to be clean.
The disk array is a quite old Fibrenetix FX1200 with 12 PATA disks
in RAID5 (with hot spare), exposed to the OS as 3 SCSI disks of
2+2+0.5TB and assembled with LVM afterwards. The SCSI controller is a:
05:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
Fusion-MPT Dual Ultra320 SCSI (rev c1)
What suggestions do you have to solve this problem?
I'm about to mkfs.ext3 the volume and spool it back in from the backup,
but somehow I'm not convinced that it will solve the problem at all.
It may just be a hardware problem, but dmesg doesn't show anything.
We actually got the problem from a perl-script, but this seems to be the
minimal program that reproduces the problem.
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-01 15:34 Many open/close on same files yeilds "No such file or directory" Jesper Krogh
@ 2008-05-02 5:39 ` Andrew Morton
2008-05-02 8:20 ` Jesper Krogh
2008-05-09 5:22 ` Jesper Krogh
1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2008-05-02 5:39 UTC (permalink / raw)
To: Jesper Krogh; +Cc: linux-kernel
On Thu, 01 May 2008 17:34:46 +0200 Jesper Krogh <jesper@krogh.cc> wrote:
> Hi list.
>
> I have a "fairly" reproducible problem. When a program opens and closes
> the same file many times, it eventually ends up with a "no such file or
> directory". Test program that can reproduce the problem on my setup:
>
> root@hest:~# cat test-file-c.c
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[]) {
> unsigned long i=0;
> int fh;
> char *filename;
>
> filename=argv[1];
>
> while(1) {
> fh=open(filename, O_RDONLY);
> if (fh==-1) {
> printf("Failed to open %s\n", filename);
> printf("Open number: %ld\n",i);
> exit(10);
> }
> close(fh);
> i++;
> }
>
> exit(0);
> }
> root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 61785000
> root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 120929685
> (The problem is not isolate to a single file on the filesystem).
>
What an amazing bug.
> strace on the program reviel that the system indeed return a "No such
> file or directory" to the program.
>
> This is run on an Ubuntu Gutsy (vendor kernel): 2.6.22-14-server on an
> 4.5TB ext3 filesystem on an LVM volume. The volume was created on a
> dapper (2 releases back) and has just followed with during upgrades.
The test program is (almost) all in RAM and won't care about the hardware.
> I cannot reproduce it on other disks attached to the same server or on
> other servers attached to similar disksystems.
hmm.
I guess it would be interesting to remount that filesystem with `noatime'
to eliminate the last bit of I/O and block-related code.
> The filesystem was taken offline yesterday for a forced fsck and it was
> found to be clean.
>
> The diskarray is a quite old Fibrenetix FX1200 with 12xPATA disk
> in raid5 (with hotspare) exposed to the OS as 3 SCSI-disks of
> 2+2+0.5TB assembled with LVM afterwards. The SCSI-controller is a:
> 05:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
> Fusion-MPT Dual Ultra320 SCSI (rev c1)
>
> What suggestions do you have to solve this problem?
>
> I'm about to mkfs.ext3 the volume and spool it back in from the backup,
> but somehow I'm not convinced that it will solve the problem at all.
> It may just be a hardware problem, but dmesg doesnt tell anything.
>
> We actually got the problem from a perl-script, but this seems to be the
> minimal program that reproduces the problem.
I'd suspect that after 1e8 loops your CPU got too hot and started to
misbehave.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 5:39 ` Andrew Morton
@ 2008-05-02 8:20 ` Jesper Krogh
2008-05-01 12:15 ` Arjan van de Ven
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 8:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Andrew Morton wrote:
>> I cannot reproduce it on other disks attached to the same server or on
>> other servers attached to similar disksystems.
>
> hmm.
>
> I guess it would be interesting to remount that filesystem with `noatime'
> to eliminate the last bit of I/O and block-=realted code.
It is already mounted noatime:
/dev/mapper/fx1200_vg-fx1200_lv on /z/fx1200 type ext3 (rw,noatime)
>> I'm about to mkfs.ext3 the volume and spool it back in from the backup,
>> but somehow I'm not convinced that it will solve the problem at all.
>> It may just be a hardware problem, but dmesg doesnt tell anything.
>>
>> We actually got the problem from a perl-script, but this seems to be the
>> minimal program that reproduces the problem.
>
> I'd suspect that after 1e8 loops your CPU got too hot and started to
> misbehave.
Hardware is a Sun Fire X4600 (8 dual-core AMD64 processors). The
problem seems to be tied to this filesystem. (I haven't been able
to reproduce it on the /-mounted disk of the same system.) So if it is
a CPU problem, then it shouldn't be tied to a specific filesystem?
This is the only activity on the system, so a load of 1 on 16 CPUs.
The system is generally rock-solid.
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 8:20 ` Jesper Krogh
@ 2008-05-01 12:15 ` Arjan van de Ven
2008-05-02 11:03 ` Many open/close on same files yeilds Jesper Krogh
2008-05-02 15:19 ` Many open/close on same files yeilds "No such file or directory" Jesper Krogh
2008-05-02 15:21 ` Jesper Krogh
2 siblings, 1 reply; 28+ messages in thread
From: Arjan van de Ven @ 2008-05-01 12:15 UTC (permalink / raw)
To: Jesper Krogh; +Cc: Andrew Morton, linux-kernel
On Fri, 02 May 2008 10:20:20 +0200
Jesper Krogh <jesper@krogh.cc> wrote:
> Hardware is an Sun Fire X4600 (8xdual-core AMD64 processors). The
> problem seem to be tied to this filesystem. (I cannot havent been able
> to reproduce it on the /-mounted disk of the same system. So if a cpu
> problem.. then it shouldn't be tied to a specific filesystem?
>
> This is the only activity on the system .. so a load of 1 / 16cpus.
Have you run fsck on the filesystem?
(And it might be interesting to dump the metadata to a file first
to save the broken state for later analysis.)
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds
2008-05-01 12:15 ` Arjan van de Ven
@ 2008-05-02 11:03 ` Jesper Krogh
2008-05-01 14:07 ` Arjan van de Ven
0 siblings, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 11:03 UTC (permalink / raw)
To: linux-kernel
Arjan van de Ven <arjan <at> infradead.org> writes:
> On Fri, 02 May 2008 10:20:20 +0200
> Jesper Krogh <jesper <at> krogh.cc> wrote:
> > This is the only activity on the system .. so a load of 1 / 16cpus.
>
> have you run fsck on the filesystem?
Yes, the first thing I did was a forced fsck, and it was clean.
> (and it might be interesting to use dump the metadata to a file first
> to save the broken state for later analysis)
All 4.5TB?
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds
2008-05-02 11:03 ` Many open/close on same files yeilds Jesper Krogh
@ 2008-05-01 14:07 ` Arjan van de Ven
0 siblings, 0 replies; 28+ messages in thread
From: Arjan van de Ven @ 2008-05-01 14:07 UTC (permalink / raw)
To: Jesper Krogh; +Cc: linux-kernel
On Fri, 2 May 2008 11:03:16 +0000 (UTC)
Jesper Krogh <jesper@krogh.cc> wrote:
> Arjan van de Ven <arjan <at> infradead.org> writes:
> > On Fri, 02 May 2008 10:20:20 +0200
> > Jesper Krogh <jesper <at> krogh.cc> wrote:
> > > This is the only activity on the system .. so a load of 1 /
> > > 16cpus.
> >
> > have you run fsck on the filesystem?
>
> Yes.. the first thing i did was a forced fsck .. and it was clean.
>
> > (and it might be interesting to use dump the metadata to a file
> > first to save the broken state for later analysis)
>
> All 4.5TB?
Metadata only; the "e2image" program saves only metadata, not data, so
it will be a lot less than 4.5TB.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 8:20 ` Jesper Krogh
2008-05-01 12:15 ` Arjan van de Ven
@ 2008-05-02 15:19 ` Jesper Krogh
2008-05-02 15:47 ` Ray Lee
2008-05-02 15:21 ` Jesper Krogh
2 siblings, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 15:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Jesper Krogh wrote:
>> I'd suspect that after 1e8 loops your CPU got too hot and started to
>> misbehave.
>
> Hardware is an Sun Fire X4600 (8xdual-core AMD64 processors). The
> problem seem to be tied to this filesystem. (I cannot havent been able
> to reproduce it on the /-mounted disk of the same system. So if a cpu
> problem.. then it shouldn't be tied to a specific filesystem?
>
> This is the only activity on the system .. so a load of 1 / 16cpus.
I've tried to explore this suggestion (as best I could).
There are 2 ext3 filesystems locally mounted: / and this one. Running 16
parallel runs of this program on a file on the /-mounted filesystem
cannot reproduce the problem. If it were linked to hot hardware, I
believe I should be able to reproduce it this way. The servers are in a
17-degree server room.
It varies a lot when it actually happens. The "earliest ones" have been
at 200000 cycles.
I've tried to upgrade, so the system is now running a 2.6.24-16-server
(Ubuntu/Hardy) kernel. But except for different output in uname,
everything else seems to be the same.
All suggestions/ideas are welcome. The general feeling is that it won't
change a thing to reformat the filesystem?
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 15:19 ` Many open/close on same files yeilds "No such file or directory" Jesper Krogh
@ 2008-05-02 15:47 ` Ray Lee
2008-05-02 15:55 ` Jesper Krogh
2008-05-02 19:52 ` Jesper Krogh
0 siblings, 2 replies; 28+ messages in thread
From: Ray Lee @ 2008-05-02 15:47 UTC (permalink / raw)
To: Jesper Krogh; +Cc: Andrew Morton, linux-kernel
On Fri, May 2, 2008 at 8:19 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> Jesper Krogh wrote:
>
> >
> > > I'd suspect that after 1e8 loops your CPU got too hot and started to
> > > misbehave.
> > >
> >
> > Hardware is an Sun Fire X4600 (8xdual-core AMD64 processors). The
> > problem seem to be tied to this filesystem. (I cannot havent been able
> > to reproduce it on the /-mounted disk of the same system. So if a cpu
> > problem.. then it shouldn't be tied to a specific filesystem?
> >
> > This is the only activity on the system .. so a load of 1 / 16cpus.
> >
>
> I've tried to explore this suggestion (the best I could).
>
> There are 2 ext3 filesystems locally mounted. / and this one. Running 16
> parallel runs of this program on a file on the /-mounted filesystem cannot
> reproduce the problem. If it was linked to hot hardware, I believe I should
> be able to reproduce it this way. The servers are in a 17 degress
> serverroom.
>
> It changes alot when.. it actually happens. The "earliest ones" has been
> from 200000 cycles.
Run 16 in parallel on /, and another 16 simultaneously on the trouble
filesystem? If you continue to get errors only on the 'trouble'
filesystem, and no errors start occurring on / coincident, then it
sounds pretty localized.
BTW, I may have missed this earlier, but does it happen *anywhere* on
the troublesome filesystem (ie, in a newly created subdirectory)?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 15:47 ` Ray Lee
@ 2008-05-02 15:55 ` Jesper Krogh
2008-05-02 16:45 ` Ray Lee
2008-05-02 19:52 ` Jesper Krogh
1 sibling, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 15:55 UTC (permalink / raw)
To: Ray Lee; +Cc: Andrew Morton, linux-kernel
Ray Lee wrote:
> On Fri, May 2, 2008 at 8:19 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>> Jesper Krogh wrote:
>>
>>>> I'd suspect that after 1e8 loops your CPU got too hot and started to
>>>> misbehave.
>>>>
>>> Hardware is an Sun Fire X4600 (8xdual-core AMD64 processors). The
>>> problem seem to be tied to this filesystem. (I cannot havent been able
>>> to reproduce it on the /-mounted disk of the same system. So if a cpu
>>> problem.. then it shouldn't be tied to a specific filesystem?
>>>
>>> This is the only activity on the system .. so a load of 1 / 16cpus.
>>>
>> I've tried to explore this suggestion (the best I could).
>>
>> There are 2 ext3 filesystems locally mounted. / and this one. Running 16
>> parallel runs of this program on a file on the /-mounted filesystem cannot
>> reproduce the problem. If it was linked to hot hardware, I believe I should
>> be able to reproduce it this way. The servers are in a 17 degress
>> serverroom.
>>
>> It changes alot when.. it actually happens. The "earliest ones" has been
>> from 200000 cycles.
>
> Run 16 in parallel on /, and another 16 simultaneously on the trouble
> filesystem? If you continue to get errors only on the 'trouble'
> filesystem, and no errors start occurring on / coincident, then it
> sounds pretty localized.
That test has been done. I can only reproduce it on this filesystem. But
I cannot really conclude that it is only present there, since sometimes
my test program just goes on and only dies past 1 billion cycles. But I
have never gotten errors from the / filesystem on the same installation.
> BTW, I may have missed this earlier, but does it happen *anywhere* on
> the troublesome filesystem (ie, in a newly created subdirectory)?
I'll run that test now.
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 15:55 ` Jesper Krogh
@ 2008-05-02 16:45 ` Ray Lee
2008-05-02 19:53 ` Jesper Krogh
0 siblings, 1 reply; 28+ messages in thread
From: Ray Lee @ 2008-05-02 16:45 UTC (permalink / raw)
To: Jesper Krogh; +Cc: Andrew Morton, linux-kernel
On Fri, May 2, 2008 at 8:55 AM, Jesper Krogh <jesper@krogh.cc> wrote:
>
> Ray Lee wrote:
>
> > On Fri, May 2, 2008 at 8:19 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> >
> > > Jesper Krogh wrote:
> > >
> > >
> > > >
> > > > > I'd suspect that after 1e8 loops your CPU got too hot and started to
> > > > > misbehave.
> > > > >
> > > > >
> > > > Hardware is an Sun Fire X4600 (8xdual-core AMD64 processors). The
> > > > problem seem to be tied to this filesystem. (I cannot havent been able
> > > > to reproduce it on the /-mounted disk of the same system. So if a cpu
> > > > problem.. then it shouldn't be tied to a specific filesystem?
> > > >
> > > > This is the only activity on the system .. so a load of 1 / 16cpus.
> > > >
> > > >
> > > I've tried to explore this suggestion (the best I could).
> > >
> > > There are 2 ext3 filesystems locally mounted. / and this one. Running
> 16
> > > parallel runs of this program on a file on the /-mounted filesystem
> cannot
> > > reproduce the problem. If it was linked to hot hardware, I believe I
> should
> > > be able to reproduce it this way. The servers are in a 17 degress
> > > serverroom.
> > >
> > > It changes alot when.. it actually happens. The "earliest ones" has
> been
> > > from 200000 cycles.
> > >
> >
> > Run 16 in parallel on /, and another 16 simultaneously on the trouble
> > filesystem? If you continue to get errors only on the 'trouble'
> > filesystem, and no errors start occurring on / coincident, then it
> > sounds pretty localized.
> >
>
> That test has been done. I can only reproduce it on this filesystem. But I
> cannot really conclude that it is only present there.. since sometimes my
> testprogram just goes on .. and dies past 1 billion cycles. But I have never
> gotten errors from the / filesystem on the same installation.
Sorry for belaboring the point, but reading up-thread I see you ran
both / and /troublefs one after the other, but you're saying you also
ran them at the same time? If so, that would seem to conclusively rule
out hardware overheating.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 16:45 ` Ray Lee
@ 2008-05-02 19:53 ` Jesper Krogh
0 siblings, 0 replies; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 19:53 UTC (permalink / raw)
To: Ray Lee; +Cc: Andrew Morton, linux-kernel
Ray Lee wrote:
> Sorry for belaboring the point, but reading up-thread I see you ran
> both / and /troublefs one after the other, but you're saying you also
> ran them at the same time? If so, that would seem to conclusively rule
> out hardware overheating.
I've run up to 32 instances of the program at the same time. Both on the
/ volumes and the "trouble" volume. Only programs from the trouble
volume failed.
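A rough sketch of such a parallel run, assuming hypothetical helpers
named hammer and run_parallel and an arbitrary MAX_PROCS limit (none of
these come from the thread; the actual runs used separate instances of
the test program). Each forked child repeats the open/close loop, and
the parent counts how many children saw a failure:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 64 /* arbitrary upper bound chosen for this sketch */

/* Child body: 0 if every open() succeeded, 1 on the first failure. */
static int hammer(const char *path, unsigned long iterations)
{
    for (unsigned long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return 1;
        close(fd);
    }
    return 0;
}

/* Fork `nproc` workers that all hammer `path`. Returns the number of
 * workers that failed, or -1 if not all workers could be started. */
int run_parallel(const char *path, int nproc, unsigned long iterations)
{
    pid_t pids[MAX_PROCS];
    int started = 0, failures = 0;

    assert(path != NULL && nproc > 0 && nproc <= MAX_PROCS);

    for (int k = 0; k < nproc; k++) {
        pid_t pid = fork();
        if (pid < 0)
            break; /* could not start this worker */
        if (pid == 0)
            _exit(hammer(path, iterations));
        pids[started++] = pid;
    }
    /* Reap every worker that was started, even after a fork failure. */
    for (int k = 0; k < started; k++) {
        int status;
        if (waitpid(pids[k], &status, 0) == -1 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return started == nproc ? failures : -1;
}
```

Calling run_parallel once with a file on / and once with a file on the
trouble volume, from two processes started together, would approximate
the simultaneous test discussed above.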
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 15:47 ` Ray Lee
2008-05-02 15:55 ` Jesper Krogh
@ 2008-05-02 19:52 ` Jesper Krogh
2008-05-05 17:43 ` Jesper Krogh
1 sibling, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 19:52 UTC (permalink / raw)
To: Ray Lee; +Cc: Andrew Morton, linux-kernel
Ray Lee wrote:
> BTW, I may have missed this earlier, but does it happen *anywhere* on
> the troublesome filesystem (ie, in a newly created subdirectory)?
Yes. It is reproducible on a newly created subdirectory on the
filesystem.
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 19:52 ` Jesper Krogh
@ 2008-05-05 17:43 ` Jesper Krogh
2008-05-05 17:51 ` Randy.Dunlap
0 siblings, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-05 17:43 UTC (permalink / raw)
To: Ray Lee; +Cc: Andrew Morton, linux-kernel
Jesper Krogh wrote:
> Ray Lee wrote:
>> BTW, I may have missed this earlier, but does it happen *anywhere* on
>> the troublesome filesystem (ie, in a newly created subdirectory)?
>
> Yes. It is reproducible on a newly created subdirectory on the
> filesystem.
Now I've created a new filesystem (resized the original LVM volume to
half and created a new ext3 one next to it). The problem persists on
this new filesystem.
Just guessing: can there be something timing-related in clearing the
FS cache for the filesystem?
Any other suggestions/pointers?
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-05 17:43 ` Jesper Krogh
@ 2008-05-05 17:51 ` Randy.Dunlap
2008-05-05 17:54 ` Jesper Krogh
0 siblings, 1 reply; 28+ messages in thread
From: Randy.Dunlap @ 2008-05-05 17:51 UTC (permalink / raw)
To: Jesper Krogh; +Cc: Ray Lee, Andrew Morton, linux-kernel
On Mon, 5 May 2008, Jesper Krogh wrote:
> Jesper Krogh wrote:
> > Ray Lee wrote:
> > > BTW, I may have missed this earlier, but does it happen *anywhere* on
> > > the troublesome filesystem (ie, in a newly created subdirectory)?
> >
> > Yes. It is reproducible on a newly created subdirectory on the
> > filesystem.
>
> Now I've created a new filesystem. (resized the original LVM-volume to half
> and created a new ext3 one next to it). The problem persists on this new
> filesystem.
>
> Just guessing, can there be something timing related in clearing the FS-cache
> for the filesystem?
>
> Any other suggestions/pointers.
Hi,
Has anyone else been able to reproduce this problem?
I ran the test program overnight with no problem.
--
~Randy
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-05 17:51 ` Randy.Dunlap
@ 2008-05-05 17:54 ` Jesper Krogh
[not found] ` <2c0942db0805051121r47cc97d2jb71cc8ab9eaa7981@mail.gmail.com>
0 siblings, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-05 17:54 UTC (permalink / raw)
To: Randy.Dunlap; +Cc: Ray Lee, Andrew Morton, linux-kernel
Randy.Dunlap wrote:
> On Mon, 5 May 2008, Jesper Krogh wrote:
>
>> Jesper Krogh wrote:
>>> Ray Lee wrote:
>>>> BTW, I may have missed this earlier, but does it happen *anywhere* on
>>>> the troublesome filesystem (ie, in a newly created subdirectory)?
>>> Yes. It is reproducible on a newly created subdirectory on the
>>> filesystem.
>> Now I've created a new filesystem. (resized the original LVM-volume to half
>> and created a new ext3 one next to it). The problem persists on this new
>> filesystem.
>>
>> Just guessing, can there be something timing related in clearing the FS-cache
>> for the filesystem?
>>
>> Any other suggestions/pointers.
>
> Hi,
>
> Has anyone else been able to reproduce this problem?
No, and of my systems, this external SCSI-IDE RAID device is the only
one I can reproduce it on. The internal system disks don't allow me
to reproduce it on the same installation.
--
Jesper Krogh
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-02 8:20 ` Jesper Krogh
2008-05-01 12:15 ` Arjan van de Ven
2008-05-02 15:19 ` Many open/close on same files yeilds "No such file or directory" Jesper Krogh
@ 2008-05-02 15:21 ` Jesper Krogh
2 siblings, 0 replies; 28+ messages in thread
From: Jesper Krogh @ 2008-05-02 15:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Jesper Krogh wrote:
> Andrew Morton wrote:
>>> I cannot reproduce it on other disks attached to the same server or on
>>> other servers attached to similar disksystems.
>>
>> I guess it would be interesting to remount that filesystem with `noatime'
>> to eliminate the last bit of I/O and block-=realted code.
>
> It is allready mounted noatime:
> /dev/mapper/fx1200_vg-fx1200_lv on /z/fx1200 type ext3 (rw,noatime)
Watching vmstat 1 while running the program confirms that there is
absolutely no block I/O. Everything happens purely in the cache of the OS.
Jesper
--
Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-01 15:34 Many open/close on same files yeilds "No such file or directory" Jesper Krogh
2008-05-02 5:39 ` Andrew Morton
@ 2008-05-09 5:22 ` Jesper Krogh
2008-05-09 5:36 ` Andrew Morton
1 sibling, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-09 5:22 UTC (permalink / raw)
To: linux-kernel, Andrew Morton
Hi.
A week has passed, the problem still persists, and I have done a fair
amount of testing.
I have now been able to reproduce it on 2 different servers: an
8-CPU dual-core Sun X4600 (amd64) and a 2-CPU single-core Sun V20Z.
It has been reproduced against 4 different disk arrays, one of them being
the IDE-SCSI array mentioned before (we have 2 of them). Another one is
an iSCSI SAN. And lately, when trying to move off the "trouble systems",
we hit the bug on an FC SAN.
I believe that we can rule out hardware now.
I have both reproduced it on "old" ext3 volumes and freshly created
ones.
I haven't been able to reproduce it on the internal system disks of
the servers, but thinking a bit more about the setup, it seems that
the filesystem needs to have a significant load on its
filesystem structures (not data) to trigger the bug. The typical
environment where I found it was an NFS server serving (the same) ~5GB
of files to 48 dual-core machines; this basically never hits the
actual disk (with 32GB of RAM in the machine). In this setup
a single process could hit the bug on the server (the problem could
also be seen over NFS from the clients).
It is worth keeping in mind that the problem is "rare". Thus
applications that only open and read a single file every 10 minutes
or so will probably never hit it.
After altering the test program so that it simply tries to open the
file again, it succeeds on the second pass.
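The altered test program was not posted, so the following is only a
sketch of what such a retry could look like. The name open_with_retry
and its parameters are assumptions made for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical helper: open `path` read-only, retrying only when open()
 * fails with ENOENT. Returns the file descriptor, or -1 once the retries
 * are exhausted or a different error occurs (errno is left set by the
 * last open()). *retries_used reports how many extra attempts were made. */
int open_with_retry(const char *path, int max_retries, int *retries_used)
{
    assert(path != NULL && retries_used != NULL && max_retries >= 0);

    *retries_used = 0;
    for (;;) {
        int fd = open(path, O_RDONLY);
        if (fd != -1)
            return fd;
        if (errno != ENOENT || *retries_used >= max_retries)
            return -1;
        (*retries_used)++; /* spurious ENOENT suspected: try again */
    }
}
```

A non-zero *retries_used on an eventually successful open would mark
exactly the transient failures described above, while a file that is
really missing still fails after max_retries attempts.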
From my perspective this definitely has something to do with how the
OS caches/invalidates the filesystem structures it has in memory (or
somewhere near that).
I still haven't been able to produce a small pack-and-ship test that
triggers the problem on every system. But please come with suggestions
about how to write a piece of code that tries to hit the problem based
on my descriptions.
My feeling is that the script below may reveal the bug on any "busy"
volume, where busy means lots of activity in the OS cache for the volume,
not on the actual drives.
All suggestions are welcome.
I have tried 4 different kernel versions:
2.6.20, 2.6.22, 2.6.24, 2.6.25.2
Jesper
Jesper Krogh wrote:
> Hi list.
>
> I have a "fairly" reproducible problem. When a program opens and closes
> the same file many times, it eventually ends up with a "no such file or
> directory". Test program that can reproduce the problem on my setup:
>
> root@hest:~# cat test-file-c.c
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[]) {
> unsigned long i=0;
> int fh;
> char *filename;
>
> filename=argv[1];
>
> while(1) {
> fh=open(filename, O_RDONLY);
> if (fh==-1) {
> printf("Failed to open %s\n", filename);
> printf("Open number: %ld\n",i);
> exit(10);
> }
> close(fh);
> i++;
> }
>
> exit(0);
> }
> root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 61785000
> root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 120929685
> (The problem is not isolate to a single file on the filesystem).
>
>
> strace on the program reviel that the system indeed return a "No such
> file or directory" to the program.
>
> This is run on an Ubuntu Gutsy (vendor kernel): 2.6.22-14-server on an
> 4.5TB ext3 filesystem on an LVM volume. The volume was created on a
> dapper (2 releases back) and has just followed with during upgrades.
> I cannot reproduce it on other disks attached to the same server or on
> other servers attached to similar disksystems.
>
> The filesystem was taken offline yesterday for a forced fsck and it was
> found to be clean.
>
> The diskarray is a quite old Fibrenetix FX1200 with 12xPATA disk
> in raid5 (with hotspare) exposed to the OS as 3 SCSI-disks of
> 2+2+0.5TB assembled with LVM afterwards. The SCSI-controller is a:
> 05:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
> Fusion-MPT Dual Ultra320 SCSI (rev c1)
>
> What suggestions do you have to solve this problem?
>
> I'm about to mkfs.ext3 the volume and spool it back in from the backup,
> but somehow I'm not convinced that it will solve the problem at all.
> It may just be a hardware problem, but dmesg doesnt tell anything.
>
> We actually got the problem from a perl-script, but this seems to be the
> minimal program that reproduces the problem.
>
> Jesper
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-09 5:22 ` Jesper Krogh
@ 2008-05-09 5:36 ` Andrew Morton
2008-05-09 6:09 ` Jesper Krogh
0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2008-05-09 5:36 UTC (permalink / raw)
To: Jesper Krogh; +Cc: linux-kernel
On Fri, 09 May 2008 07:22:46 +0200 Jesper Krogh <jesper@krogh.cc> wrote:
> Hi.
>
> A week has passed the problem still persist and I have done a fair
> amount of testing.
>
> I have now been able to reproduce it on 2 different servers, a
> dual-core,8 cpu Sun X4600 amd64 and a 2 cpu single-core Sun V20Z.
>
> It has been reproduced against 4 different diskarrays one of them being
> the IDE-SCSI(We have 2 of them) array mentioned before. The other one is
> an iSCSI SAN. And lately when trying to move of the "trouble systems" we
> hit the bug on a FC-SAN.
>
> I belive that we can rule out hardware now.
>
> I have both reproduced it on "old" ext3 volumes and freshly created
> ones.
>
> I havent been able to reproduce it on the internal system disks of
> the servers, but thinking a bit more about the setup it seems that
> the filesystem need to have a significant load on the
> filesystem-structures (not data) to exploit the bug. The typical
> environment where I found it was serving (the same) ~5GB of files of the
> NFS-server to 48 dual-core machines, this bacically never hits the
> actual disk (with 32GB of ram in the machine). In this setup
> a single process could hit the bug on the server (the problem could
> also be seen over NFS from the clients).
>
> It is worth keeping in mind that the problem is "rare". Thus all
> applications only opening and reading a single file every 10 minutes
> or so probably never hits it.
>
> Altering the test-script so it just tries again to pick up the file,
> it succeeds on the second pass.
>
> From my perspective this defininately has something to do with how the
> OS caches/invalidates the filesystem structures it has in memory (or
> somewhere near that).
>
> I still haven't been able to produce a small pack-and-shit test that
> knocks the problem out on every system. But please come with suggestions
> about how to write a piece of code that tries to hit the problem from
> my descriptions.
It's weird.
> My feeling is that the script below may reveal the bug on any "busy"
> volume, where busy is lots of activity in the OS-cache of the volume,
> not on the actual drives.
By this do you mean that there has to be a lot of other activity on the
system to reproduce it? Stuff which is turning over memory?
Because one possibility is that the cached dentry got reclaimed by memory
pressure and we have some race bug which causes us to think that the file
doesn't exist.
(That still shouldn't happen because the dentry should be marked
recently-accessed, but perhaps the underlying inode gets reclaimed or
something. Grasping at straws here.)
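One way to probe the reclaim theory, sketched here as an assumption and
not something proposed in the thread, is to force dentry and inode
reclaim while the open/close loop runs. Writing "2" to
/proc/sys/vm/drop_caches is the documented way to ask the kernel to free
reclaimable dentries and inodes; it needs root. The helper name
drop_dentry_cache is hypothetical, and the control file path is a
parameter so the function can be pointed at another file for a dry run:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write "2" to the given control file, which for
 * /proc/sys/vm/drop_caches asks the kernel to free reclaimable dentries
 * and inodes. Returns 0 on success, otherwise an errno value. */
int drop_dentry_cache(const char *ctl_path)
{
    assert(ctl_path != NULL);

    int fd = open(ctl_path, O_WRONLY);
    if (fd == -1)
        return errno;

    ssize_t n = write(fd, "2\n", 2);
    int err = (n == 2) ? 0 : (n == -1 ? errno : EIO);
    close(fd);
    return err;
}
```

Calling drop_dentry_cache("/proc/sys/vm/drop_caches") repeatedly from a
second process while the test program runs might make the failure more
frequent if reclaim is involved; if the failure rate does not change,
that would argue against the theory.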
> All suggestions are welcome.
>
> I have tried 4 different kernel versions:
> 2.6.20, 2.6.22, 2.6.24, 2.6.25.2
>
> Jesper
>
> Jesper Krogh wrote:
> > Hi list.
> >
> > I have a "fairly" reproducible problem. When a program opens and closes
> > the same file many times, it eventually ends up with a "no such file or
> > directory". Test program that can reproduce the problem on my setup:
> >
> > root@hest:~# cat test-file-c.c
> > #include <stdlib.h>
> > #include <stdio.h>
> > #include <fcntl.h>
> > #include <unistd.h>
> >
> > int main(int argc, char *argv[]) {
> >     unsigned long i=0;
> >     int fh;
> >     char *filename;
> >
> >     filename=argv[1];
> >
> >     while(1) {
> >         fh=open(filename, O_RDONLY);
> >         if (fh==-1) {
> >             printf("Failed to open %s\n", filename);
> >             printf("Open number: %lu\n",i);
> >             exit(10);
> >         }
> >         close(fh);
> >         i++;
> >     }
> >
> >     exit(0);
> > }
gee.
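The retry behaviour described above (a second open() succeeds) can be captured in a small helper. This is only a sketch: the name open_retry_enoent and its parameters are hypothetical and not part of the original test program. It retries only when errno is ENOENT and counts how often that happens, which separates transient lookup failures from other errors.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open path read-only, retrying up to max_retries times, but only when
 * the failure is ENOENT. Returns the file descriptor, or -1 on failure.
 * *enoent_seen is set to the number of ENOENT failures observed. */
int open_retry_enoent(const char *path, int max_retries, int *enoent_seen)
{
    int attempt;

    *enoent_seen = 0;
    for (attempt = 0; attempt <= max_retries; attempt++) {
        int fd = open(path, O_RDONLY);

        if (fd != -1)
            return fd;
        if (errno != ENOENT)
            return -1;          /* some other error: do not retry */
        (*enoent_seen)++;
    }
    errno = ENOENT;             /* every attempt failed with ENOENT */
    return -1;
}
```

Calling this inside the test loop with max_retries set to 1, and logging whenever *enoent_seen is non-zero, would record each transient failure without stopping the run.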
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-09 5:36 ` Andrew Morton
@ 2008-05-09 6:09 ` Jesper Krogh
2008-05-09 6:22 ` Andrew Morton
2008-05-12 1:53 ` Neil Brown
0 siblings, 2 replies; 28+ messages in thread
From: Jesper Krogh @ 2008-05-09 6:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Andrew Morton wrote:
>> My feeling is that the script below may reveal the bug on any "busy"
>> volume, where busy is lots of activity in the OS-cache of the volume,
>> not on the actual drives.
>
> By this do you mean that there has to be a lot of other activity on the
> system to reproduce it? Stuff which is turning over memory?
Yes, something like that (sorry for not being able to be more
concrete). The application has "high activity" on a few files, not
activity spread throughout the volume.
> Because one possibility is that the cached dentry got reclaimed by memory
> pressure and we have some race bug which causes us to think that the file
> doesn't exist.
What can I do to explore this theory? Can I disable caching of dentries
and see it go away? Does it fit the pattern that it is only the
"open"-syscall that is hit (not read for example)?
> (That still shouldn't happen because the dentry should be marked
> recently-accessed, but perhaps the underlying inode gets reclaimed or
> something. Grasping at straws here)
When I disabled the NFS server and ran my "real-world" program on a
single processor (make -j 1), it ran through fine. It basically
gets around 20 million chunks out of different files and assembles the
chunks in a few other files. It processes more or less 5 individual
sections, so make can run effectively with a concurrency of 5.
I don't know if there can be any technical reason for not seeing it on
internally attached disks (other than that I just haven't been able to
reproduce the same error conditions there)?
Jesper
--
Jesper
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-09 6:09 ` Jesper Krogh
@ 2008-05-09 6:22 ` Andrew Morton
2008-05-12 1:53 ` Neil Brown
1 sibling, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2008-05-09 6:22 UTC (permalink / raw)
To: Jesper Krogh; +Cc: linux-kernel
On Fri, 09 May 2008 08:09:34 +0200 Jesper Krogh <jesper@krogh.cc> wrote:
> Andrew Morton wrote:
> >> My feeling is that the script below may reveal the bug on any "busy"
> >> volume, where busy is lots of activity in the OS-cache of the volume,
> >> not on the actual drives.
> >
> > By this do you mean that there has to be a lot of other activity on the
> > system to reproduce it? Stuff which is turning over memory?
>
> Yes, something like that (sorry for not being able to be more
> concrete). The application has "high activity" on a few files, not
> activity spread throughout the volume.
>
> > Because one possibility is that the cached dentry got reclaimed by memory
> > pressure and we have some race bug which causes us to think that the file
> > doesn't exist.
>
> What can I do to explore this theory?
Watch /proc/vmstat while the test is running. If the "*steal*" numbers
are going up, the system is reclaiming memory.
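A minimal sketch of that check, in C to match the test program earlier in the thread. It sums every counter whose name contains "steal" in a vmstat-style stream of "name value" lines. The helper name sum_steal_counters is hypothetical, and since the exact counter names differ between kernel versions, the substring match is an assumption rather than a fixed interface.

```c
#include <stdio.h>
#include <string.h>

/* Sum all counters whose name contains "steal" in a stream formatted
 * like /proc/vmstat (one "name value" pair per line).
 * Returns the sum, or -1 if fp is NULL. */
long long sum_steal_counters(FILE *fp)
{
    char name[128];
    long long value;
    long long total = 0;

    if (fp == NULL)
        return -1;
    while (fscanf(fp, "%127s %lld", name, &value) == 2) {
        if (strstr(name, "steal") != NULL)
            total += value;
    }
    return total;
}
```

Calling it on fopen("/proc/vmstat", "r") twice, a few seconds apart, and comparing the two totals shows whether reclaim is active while the test runs: a growing total means memory is being reclaimed.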
> Can I disable caching of dentries
> and see it go away?
Nope.
> Does it fit the pattern that it is only the
> "open"-syscall that is hit (not read for example)?
Yes. open() will look up the filename in the dentry cache, read() will not.
If name lookup has a race against dentry cache reclaim, something like
this might happen.
But it'd be damned odd.
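The difference between the two calls can be illustrated from user space. This is only an analogy, not a reproduction of the suspected race: it makes the name disappear with unlink() instead of racing against dentry reclaim, and the helper names open_result and read_from_fd are hypothetical. Once a descriptor is open, reading through it needs no path lookup, so it keeps working even when a lookup of the same name fails.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if path can be opened read-only, otherwise the errno value
 * reported by open(). This is the call that performs a name lookup. */
int open_result(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Reads up to len bytes from offset 0 of an already-open descriptor.
 * No path name is involved, so no name lookup takes place. */
ssize_t read_from_fd(int fd, char *buf, size_t len)
{
    return pread(fd, buf, len, 0);
}
```

With a descriptor held open on a file whose name is then removed, read_from_fd() still returns the data while open_result() on the old name returns ENOENT.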
> > (That still shouldn't happen because the dentry should be marked
> > recently-accessed, but perhaps the underlying inode gets reclaimed or
> > something. Grasping at straws here)
>
> When I disabled the NFS server and ran my "real-world" program on a
> single processor (make -j 1), it ran through fine. It basically
> gets around 20 million chunks out of different files and assembles the
> chunks in a few other files. It processes more or less 5 individual
> sections, so make can run effectively with a concurrency of 5.
>
> I don't know if there can be any technical reason for not seeing it on
> internally attached disks (other than that I just haven't been able to
> reproduce the same error conditions there)?
I can't think of any reason.
I guess a suitable test would be to run your little test app and then read
large files as fast as you can from as many disks as you can, to force
memory reclaim. If that triggers the bug, and the bug is more likely to
trigger the faster you read those files, then we have a theory.
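One way to generate that read load is a helper that streams a file and throws the data away. This is a sketch under assumptions: the name read_and_discard and the 1 MiB chunk size are arbitrary choices, and whether it actually forces reclaim depends on the files being large enough relative to RAM.

```c
#include <fcntl.h>
#include <unistd.h>

/* Read path sequentially in 1 MiB chunks and discard the data.
 * Returns the number of bytes read, or -1 on error.
 * Uses a static buffer, so it is not safe to call from several threads. */
long long read_and_discard(const char *path)
{
    static char buf[1 << 20];
    long long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

Running one process per disk, each looping over large files with this helper while the open/close test runs, gives a load that can be sped up or slowed down to see whether the failure rate follows it.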
Damned odd though.
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-09 6:09 ` Jesper Krogh
2008-05-09 6:22 ` Andrew Morton
@ 2008-05-12 1:53 ` Neil Brown
2008-05-12 6:00 ` J. Bruce Fields
2008-05-12 6:41 ` Jesper Krogh
1 sibling, 2 replies; 28+ messages in thread
From: Neil Brown @ 2008-05-12 1:53 UTC (permalink / raw)
To: Jesper Krogh; +Cc: Andrew Morton, linux-kernel, linux-nfs
On Friday May 9, jesper@krogh.cc wrote:
>
> When I disabled the NFS server and ran my "real-world" program on a
> single processor (make -j 1), it ran through fine. It basically
> gets around 20 million chunks out of different files and assembles the
> chunks in a few other files. It processes more or less 5 individual
> sections, so make can run effectively with a concurrency of 5.
(For linux-nfs readers: the problem is that repeatedly opening a given
file sometimes returns ENOENT - http://lkml.org/lkml/2008/5/9/15).
The mention of an NFS-server made my ears prick up...
Do I understand correctly that the problem only occurs when you have
48 clients hammering away at the filesystem in question?
Could the clients be accessing the same file that you are experiencing
problems with? Or one of the directories in the path (if so, how
deep).
How many different files do these 20 million chunks come from? And
how does that number compare with the first number from
grep dentry /proc/slabinfo
??
The NFS server does some slightly strange things with the dcache if the
object being accessed is not in the cache.
Also, can you get a few instances of
grep '^fh' /proc/nfs/rpc/nfsd
while things are going strange. The numbers are:
* fh <stale> <total-lookups> <anonlookups> <dir-not-in-dcache> <nondir-not-in-dcache>
That will show us if it is looking for things that aren't in the
dcache.
Finally, is the filesystem exported with "subtree_check" or
"nosubtree_check"?
Does it make a difference if you switch the setting of this flag and
re-export?
NeilBrown
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-12 1:53 ` Neil Brown
@ 2008-05-12 6:00 ` J. Bruce Fields
2008-05-12 6:41 ` Jesper Krogh
1 sibling, 0 replies; 28+ messages in thread
From: J. Bruce Fields @ 2008-05-12 6:00 UTC (permalink / raw)
To: Neil Brown; +Cc: Jesper Krogh, Andrew Morton, linux-kernel, linux-nfs
On Mon, May 12, 2008 at 11:53:57AM +1000, Neil Brown wrote:
> On Friday May 9, jesper@krogh.cc wrote:
> >
> > When I disabled the NFS server and ran my "real-world" program on a
> > single processor (make -j 1), it ran through fine. It basically
> > gets around 20 million chunks out of different files and assembles the
> > chunks in a few other files. It processes more or less 5 individual
> > sections, so make can run effectively with a concurrency of 5.
>
> (For linux-nfs readers: the problem is that repeatedly opening a given
> file sometimes returns ENOENT - http://lkml.org/lkml/2008/5/9/15).
>
> The mention of an NFS-server made my ears prick up...
>
> Do I understand correctly that the problem only occurs when you have
> 48 clients hammering away at the filesystem in question?
>
> Could the clients be accessing the same file that you are experiencing
> problems with? Or one of the directories in the path (if so, how
> deep).
>
> How many different files do these 20 million chunks come from? And
> how does that number compare with the first number from
> grep dentry /proc/slabinfo
> ??
>
> The NFS server does some slightly strange things with the dcache if the
> object being accessed is not in the cache.
>
> Also, can you get a few instances of
> grep '^fh' /proc/nfs/rpc/nfsd
I think you meant /proc/net/rpc/nfsd.
--b.
>
> while things are going strange. The numbers are:
> * fh <stale> <total-lookups> <anonlookups> <dir-not-in-dcache> <nondir-not-in-dcache>
>
> That will show us if it is looking for things that aren't in the
> dcache.
>
> Finally, is the filesystem exported with "subtree_check" or
> "nosubtree_check"?
> Does it make a difference if you switch the setting of this flag and
> re-export?
>
> NeilBrown
* Re: Many open/close on same files yeilds "No such file or directory".
2008-05-12 1:53 ` Neil Brown
2008-05-12 6:00 ` J. Bruce Fields
@ 2008-05-12 6:41 ` Jesper Krogh
2008-05-12 6:51 ` Andrew Morton
1 sibling, 1 reply; 28+ messages in thread
From: Jesper Krogh @ 2008-05-12 6:41 UTC (permalink / raw)
To: Neil Brown; +Cc: Andrew Morton, linux-kernel, linux-nfs
Neil Brown wrote:
> On Friday May 9, jesper@krogh.cc wrote:
>> When I disabled the NFS server and ran my "real-world" program on a
>> single processor (make -j 1), it ran through fine. It basically
>> gets around 20 million chunks out of different files and assembles the
>> chunks in a few other files. It processes more or less 5 individual
>> sections, so make can run effectively with a concurrency of 5.
>
> (For linux-nfs readers: the problem is that repeatedly opening a given
> file sometimes returns ENOENT - http://lkml.org/lkml/2008/5/9/15).
This thing really, really irritated me, but I must admit that Andrew
Morton was quite right about this "not being very likely" to be a kernel
bug.
It seems that our central configuration handling system (slack) was being
way too aggressive about updating symlinks in the paths of the filesystems
that I was testing on. That explains why I couldn't reproduce it
on the internal volumes, or on any of the volumes I created only
for testing purposes. Sometimes you just get too blind.
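To illustrate how a symlink update can produce a brief ENOENT, here is a sketch in C. The thread does not say how slack actually replaces its symlinks, so the unlink-then-symlink sequence below is an assumption, and all three function names are hypothetical. The point is that removing and recreating a link leaves a window in which the name does not exist, while creating the new link under a temporary name and rename()-ing it over the old one does not.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to open path read-only; return 0 on success, else the errno value. */
int try_open(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Non-atomic update (assumed pattern): between unlink() and symlink()
 * the name does not exist, so a concurrent open() through linkpath
 * fails with ENOENT. Returns 0 on success, -1 on error. */
int replace_symlink_racy(const char *target, const char *linkpath)
{
    if (unlink(linkpath) == -1 && errno != ENOENT)
        return -1;
    /* window: linkpath is missing here */
    return symlink(target, linkpath);
}

/* Atomic update: create the new link under a temporary name, then
 * rename() it over the old one, so linkpath always resolves.
 * Returns 0 on success, -1 on error. */
int replace_symlink_atomic(const char *target, const char *linkpath)
{
    char tmp[4096];

    if (snprintf(tmp, sizeof tmp, "%s.tmp", linkpath) >= (int)sizeof tmp)
        return -1;
    unlink(tmp);                /* drop any stale leftover, ignore errors */
    if (symlink(target, tmp) == -1)
        return -1;
    return rename(tmp, linkpath);
}
```

A test loop that opens a file through such a link tens of millions of times will eventually land in the window left by the first variant, which would match the rare failures and the successful retries reported earlier in the thread.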
(I haven't been able to reproduce for 12 hours now)
Just to answer your questions: yes, the 48 clients do hammer on NFS, and
now it seems to work excellently.
Sorry for all the noise.
--
Jesper
[parent not found: <aoJcW-38V-37@gated-at.bofh.it>]
Thread overview: 28+ messages
2008-05-01 15:34 Many open/close on same files yeilds "No such file or directory" Jesper Krogh
2008-05-02 5:39 ` Andrew Morton
2008-05-02 8:20 ` Jesper Krogh
2008-05-01 12:15 ` Arjan van de Ven
2008-05-02 11:03 ` Many open/close on same files yeilds Jesper Krogh
2008-05-01 14:07 ` Arjan van de Ven
2008-05-02 15:19 ` Many open/close on same files yeilds "No such file or directory" Jesper Krogh
2008-05-02 15:47 ` Ray Lee
2008-05-02 15:55 ` Jesper Krogh
2008-05-02 16:45 ` Ray Lee
2008-05-02 19:53 ` Jesper Krogh
2008-05-02 19:52 ` Jesper Krogh
2008-05-05 17:43 ` Jesper Krogh
2008-05-05 17:51 ` Randy.Dunlap
2008-05-05 17:54 ` Jesper Krogh
[not found] ` <2c0942db0805051121r47cc97d2jb71cc8ab9eaa7981@mail.gmail.com>
2008-05-05 18:29 ` Jesper Krogh
[not found] ` <2c0942db0805051154q63a18bcfhce8a30d4a663ea3f@mail.gmail.com>
2008-05-07 20:51 ` Jesper Krogh
2008-05-07 22:27 ` Jesper Krogh
2008-05-02 15:21 ` Jesper Krogh
2008-05-09 5:22 ` Jesper Krogh
2008-05-09 5:36 ` Andrew Morton
2008-05-09 6:09 ` Jesper Krogh
2008-05-09 6:22 ` Andrew Morton
2008-05-12 1:53 ` Neil Brown
2008-05-12 6:00 ` J. Bruce Fields
2008-05-12 6:41 ` Jesper Krogh
2008-05-12 6:51 ` Andrew Morton
[not found] <aoJcW-38V-37@gated-at.bofh.it>
[not found] ` <aoWjI-1Br-5@gated-at.bofh.it>
[not found] ` <aoYOH-6RO-13@gated-at.bofh.it>
[not found] ` <ap5nc-3ZT-7@gated-at.bofh.it>
[not found] ` <ap5Gx-4vu-43@gated-at.bofh.it>
[not found] ` <ap9Ar-4Nn-21@gated-at.bofh.it>
[not found] ` <aqcZe-7Fg-23@gated-at.bofh.it>
[not found] ` <aqd98-7Vb-25@gated-at.bofh.it>
[not found] ` <aqd99-7Vb-27@gated-at.bofh.it>
2008-05-05 19:05 ` Henry Nestler