* ncpfs: Connection invalid / Input-/Output Errors
From: schönfeld / in-medias-res @ 2005-09-07 11:08 UTC
To: linux-kernel

Hi,

first of all: I'm unsure whether I'm writing to the right list, so if
I'm wrong please just correct me.

At one of our sites we run a Novell file server with some DOS clients
and a Linux server. The Linux server is running an older SuSE version
with a Linux 2.4.29 kernel, as well as various custom applications.
It has been running quite stably so far, without major problems.

As we want to migrate our servers to Debian, there is another system
running Debian, with a Linux 2.6.12 kernel built from debianized sources
and the same custom applications as on the SuSE system. But for a reason
we can't figure out, the Novell connection on that system fails at
random. It just "disappears", and the log files (syslog and kern.log)
state that the ncpfs connection is invalid. First we suspected a
hardware problem, but that does not seem to be the reason: we swapped
the NIC in question and the problem keeps happening. Then we thought it
might be a kernel bug that is fixed in a newer version, so we upgraded
the kernel, but the situation did not change. I thought one particular
application might be the point of failure, but it runs on the other
host, too, without any problem. Anyway, I straced the application to
see what happens when the connection breaks. Nothing that could help:
it is just normal operation until it gets into an "Input/Output Error"
loop.

At this point I don't know what to do. I don't see a way to track down
the problem, nor can I find any hints via Google or in this mailing
list, so I want to ask whether somebody can tell me how to track down
the problem, or give me some hints in any other way.

The ncpfs software running on the server is 2.2.6, while the server
without problems is running 2.2.0.18.

Thanks in advance

Greets

Patrick Schönfeld
IN MEDIAS RES
-=Operations=-
* Re: ncpfs: Connection invalid / Input-/Output Errors
From: Anton Altaparmakov @ 2005-09-07 12:11 UTC
To: schönfeld / in-medias-res; +Cc: linux-kernel

On Wed, 2005-09-07 at 13:08 +0200, schönfeld / in-medias-res wrote:
> At one of our sites we run a Novell file server with some DOS clients
> and a Linux server. The Linux server is running an older SuSE version
> with a Linux 2.4.29 kernel, as well as various custom applications.
> It has been running quite stably so far, without major problems.
>
> As we want to migrate our servers to Debian, there is another system
> running Debian, with a Linux 2.6.12 kernel built from debianized
> sources and the same custom applications as on the SuSE system. But
> for a reason we can't figure out, the Novell connection on that system
> fails at random. It just "disappears", and the log files (syslog and
> kern.log) state that the ncpfs connection is invalid. First we
> suspected a hardware problem, but that does not seem to be the reason:
> we swapped the NIC in question and the problem keeps happening. Then
> we thought it might be a kernel bug that is fixed in a newer version,
> so we upgraded the kernel, but the situation did not change. I thought
> one particular application might be the point of failure, but it runs
> on the other host, too, without any problem. Anyway, I straced the
> application to see what happens when the connection breaks. Nothing
> that could help: it is just normal operation until it gets into an
> "Input/Output Error" loop.
>
> At this point I don't know what to do. I don't see a way to track down
> the problem, nor can I find any hints via Google or in this mailing
> list, so I want to ask whether somebody can tell me how to track down
> the problem, or give me some hints in any other way.

Are you using IPX or TCP/IP or UDP? Are you using the same on both?

Are the two boxes in the same place and on the same connection/the same
speed? For example, if one box is sitting close to the NetWare server
and the other further away, on a congested network, it is much more
likely to lose the connection.

Also, IPX is much worse than UDP. Our connection loss problems decreased
a lot when we moved from IPX to UDP. We haven't had much experience with
TCP/IP yet. Also, so far we have not seen any connection loss problems
since we switched from 2.4 to 2.6 kernels (SuSE 9.3, i.e.
2.6.11.4-21.9).

One of the reasons for a connection disappearing is that the NCP
sequence numbers on the NetWare server and the Linux client get out of
sync. When the NetWare server detects this, it shuts down the
connection. Linux can't do reconnects, so you get exactly the errors you
see and the connection is gone. The fix is to umount and mount again
when this happens.

To see if this is your problem, insert some printk()s in the relevant
ncpfs code (where exactly depends on whether you are using IPX or
TCP/UDP) and see if they are triggered.

We have been trying to track this down for years and have failed so
far... We were hoping the problems had gone away with the 2.6 kernel,
but if you are seeing them, maybe we will start seeing them once term
starts and Linux is used more again... (We only switched to 2.6 this
summer.)

> The ncpfs software running on the server is 2.2.6, while the server
> without problems is running 2.2.0.18.

That is irrelevant. Only the kernel driver version matters.

Hope this is useful.

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
* Re: ncpfs: Connection invalid / Input-/Output Errors
From: schönfeld / in-medias-res @ 2005-09-07 14:14 UTC
To: linux-kernel

Hi,

thanks for your answer.

Anton Altaparmakov wrote:
> Are you using IPX or TCP/IP or UDP? Are you using the same on both?

Sorry, I forgot to point that out. We are using IPX. I don't think it
will be that easy to switch to anything else :/

> Are the two boxes in the same place and on the same connection/the same
> speed? For example, if one box is sitting close to the NetWare server
> and the other further away, on a congested network, it is much more
> likely to lose the connection.

Both systems are local, so that is not the difference between them.

> Also, IPX is much worse than UDP. Our connection loss
> problems decreased a lot when we moved from IPX to UDP.
> We haven't had much experience with TCP/IP yet. Also, so far we have
> not seen any connection loss problems since we switched from 2.4 to
> 2.6 kernels (SuSE 9.3, i.e. 2.6.11.4-21.9).

Well, I can imagine that IPX is much worse than UDP ("IPX just sucks").
Unfortunately it doesn't seem to be that easy to switch that system over
to UDP, because the Novell server is at the center of a whole system
that has to be highly available, so we don't want to touch it.

> One of the reasons for a connection disappearing is that the NCP
> sequence numbers on the NetWare server and the Linux client get out
> of sync. When the NetWare server detects this, it shuts down the
> connection. Linux can't do reconnects, so you get exactly the errors
> you see and the connection is gone. The fix is to umount and mount
> again when this happens.

Uhmm... that leaves the question: why should that happen on the first
machine but not on the second?

> To see if this is your problem, insert some printk()s in the relevant
> ncpfs code (where exactly depends on whether you are using IPX or
> TCP/UDP)

Well, I'm using IPX. So where do I insert the printk()s? And what kind
of printk()s should I insert? Please don't think of me as an idiot,
but I'm just not familiar with "kernel hacking".

> Hope this is useful.

A little bit. Thanks anyway.

Greets

Patrick
* Re: ncpfs: Connection invalid / Input-/Output Errors
From: Petr Vandrovec @ 2005-09-07 16:24 UTC
To: schönfeld / in-medias-res; +Cc: linux-kernel

schönfeld / in-medias-res wrote:
> Hi,
>
> thanks for your answer.
> Uhmm... that leaves the question: why should that happen on the first
> machine but not on the second?

Enable the display of connection watchdog logouts on the server. Do not
use the 'intr' mount option. Do not send a KILL signal to a process
that is waiting for a reply from the server. If you are not sure that
your network infrastructure is fine, use the 'hard' mount option to
disable timeouts altogether.

>>To see if this is your problem, insert some printk()s in the relevant
>>ncpfs code (where exactly depends on whether you are using IPX or
>>TCP/UDP)
>
> Well, I'm using IPX. So where do I insert the printk()s? And what kind
> of printk()s should I insert? Please don't think of me as an idiot,
> but I'm just not familiar with "kernel hacking".

Into 'ncp_invalidate_conn()', or better, into its callers. One is in
__abort_ncp_connection (invoked for IPX connections when
__ncpdgram_timeout_proc fires), the second is in ncp_do_request (if the
server reports some problem, or if a KILL signal is sent to the
process).
                                                Petr
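The idea of logging in the callers rather than in ncp_invalidate_conn()
itself can be sketched as follows. This is a userspace illustration
under stated assumptions, not a kernel patch: printk() and KERN_WARNING
are replaced by stand-ins that store the last message in a buffer, and
the struct and the three sketch_* functions are hypothetical
simplifications whose signatures do not match the real ncpfs code. The
point it shows is only that a distinct message per call path tells you
which path invalidated the connection.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins so the sketch compiles outside the kernel.
 * In a kernel tree, printk() and KERN_WARNING come from the kernel
 * headers; here the last message is kept in a buffer instead. */
#define KERN_WARNING "<4>"
static char last_log[160];
#define printk(...) snprintf(last_log, sizeof(last_log), __VA_ARGS__)

/* Hypothetical, heavily simplified connection state. */
struct sketch_server {
    int conn_valid;
};

/* Stand-in for ncp_invalidate_conn(): marks the connection unusable. */
static void sketch_invalidate_conn(struct sketch_server *server)
{
    assert(server != NULL);
    server->conn_valid = 0;
}

/* Stand-in for the timeout path (the thread names
 * __abort_ncp_connection, reached when the datagram timeout fires). */
static void sketch_abort_on_timeout(struct sketch_server *server, int err)
{
    printk(KERN_WARNING "ncpfs: invalidating conn: timeout path, err=%d\n",
           err);
    sketch_invalidate_conn(server);
}

/* Stand-in for the request path (the thread names ncp_do_request,
 * reached when the server reports a problem or the process is killed). */
static void sketch_fail_request(struct sketch_server *server,
                                int result, int killed)
{
    printk(KERN_WARNING
           "ncpfs: invalidating conn: request path, result=%d, killed=%d\n",
           result, killed);
    sketch_invalidate_conn(server);
}
```

With messages like these in kern.log, a "timeout path" line would point
at the network or the server, while a "request path" line with
killed=1 would point at a signal sent to the waiting process.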
* Re: ncpfs: Connection invalid / Input-/Output Errors
From: schönfeld / in-medias-res @ 2005-09-09 7:36 UTC
To: linux-kernel

Hi Petr,

Petr Vandrovec wrote:
> Enable the display of connection watchdog logouts on the server. Do
> not use the 'intr' mount option. Do not send a KILL signal to a
> process that is waiting for a reply from the server. If you are not
> sure that your network infrastructure is fine, use the 'hard' mount
> option to disable timeouts altogether.

well, the point about KILL signals is something I found after reading
your email. You wrote the same thing to another person a while ago. I
now realize that I missed something when I looked for differences
between the two servers, and I have a suspicion. The only real
difference between the two servers is that the one with the problems
runs a Nagios NRPE server and some plugins, e.g. to check disk space on
the Novell disk, while the other server does not. I also found that
heavy operations on the filesystem (e.g. stat'ing many small files in a
short time) are problematic if you want to do anything else on the
filesystem at the same time. The second process just hangs until the
first one accessing the ncp filesystem has finished its operation.
When Nagios wants to run a check, it sends a request to the NRPE
server, which starts a plugin to perform the check. The problem is that
the plugin will not return a result before the timeout expires (I'm
quite sure that one exists). The only question now is: does the NRPE
server send a SIGKILL to the plugin when the timeout expires? I'll test
that. Maybe that is where the real problem lies.

For now: thanks for your help. I'll try that first and then, if
necessary, the printk approach.

> Into 'ncp_invalidate_conn()', or better, into its callers. One is in
> __abort_ncp_connection (invoked for IPX connections when
> __ncpdgram_timeout_proc fires), the second is in ncp_do_request (if
> the server reports some problem, or if a KILL signal is sent to the
> process).

Ok, thanks.

Greets
Patrick
* Re: ncpfs: Connection invalid / Input-/Output Errors
From: Petr Vandrovec @ 2005-09-09 10:46 UTC
To: schönfeld / in-medias-res; +Cc: linux-kernel

schönfeld / in-medias-res wrote:
> Hi Petr,
>
> the two servers is that the one with the problems runs a Nagios NRPE
> server and some plugins, e.g. to check disk space on the Novell disk,
> while the other server does not. I also found that heavy operations on
> the filesystem (e.g. stat'ing many small files in a short time) are
> problematic if you want to do anything else on the filesystem
> at the same time. The second process just hangs until the first one
> accessing the ncp filesystem has finished its operation. Well if

You need either another CPU or a semaphore that does not suffer from
starvation. Or you have to rewrite ncpfs to use some queue instead of a
simple semaphore. What happens is that your copy process, in a loop,
acquires ncp_server's semaphore, sends a request to the server, waits
for the response, and releases the semaphore. It does that for every
request sent out. Now your process comes in, finds that ncp_server's
semaphore is locked, and starts waiting. The other process gets an
answer from the server and releases the semaphore; as both processes
were just waiting before this happened, they both have the same
priority, and so the one that just did up() continues to run. And
before the woken-up process gets a chance to do its task, the copy
process sends another request, and so your second process goes to sleep
again.
                                                Petr
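The starvation described above can be illustrated with a small
deterministic model. This is a toy sketch, not the kernel's scheduler
or the ncpfs locking code: it assumes one CPU, two processes that both
always want the lock, and one request per round. The names policy,
sim_result and simulate are made up for this illustration. Under
UNFAIR_RETRY the releasing process keeps the CPU and takes the lock
again before the woken process runs; under FIFO_HANDOFF the lock is
handed to the waiting process on release, which is one way to read
"use some queue instead of a simple semaphore".

```c
#include <assert.h>

/* Who may take the lock after a release (both names are hypothetical). */
enum policy {
    UNFAIR_RETRY,  /* releaser keeps running and re-acquires first */
    FIFO_HANDOFF   /* lock is passed to the process that was waiting */
};

enum proc { COPY_PROCESS, SECOND_PROCESS };

struct sim_result {
    int copy_requests;    /* requests completed by the copy process */
    int second_requests;  /* requests completed by the second process */
};

/* Run `rounds` request/reply cycles; one process holds the lock per round.
 * The copy process is assumed to hold the lock first. */
static struct sim_result simulate(enum policy policy, int rounds)
{
    struct sim_result result = { 0, 0 };
    enum proc holder = COPY_PROCESS;

    assert(rounds >= 0);

    for (int i = 0; i < rounds; i++) {
        /* The holder sends one request and waits for the reply. */
        if (holder == COPY_PROCESS)
            result.copy_requests++;
        else
            result.second_requests++;

        /* The holder releases the lock; decide who gets it next. */
        if (policy == FIFO_HANDOFF)
            holder = (holder == COPY_PROCESS) ? SECOND_PROCESS
                                              : COPY_PROCESS;
        /* UNFAIR_RETRY: the releaser is still on the CPU and locks
         * again before the woken process runs, so holder is unchanged. */
    }
    return result;
}
```

In this model, ten rounds under UNFAIR_RETRY give the second process no
requests at all, while FIFO_HANDOFF splits the ten rounds evenly.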
* Re: ncpfs: Connection invalid / Input-/Output Errors
From: schönfeld / in-medias-res @ 2005-09-09 11:16 UTC
To: linux-kernel

Petr Vandrovec wrote:
> schönfeld / in-medias-res wrote:
>
>> Hi Petr,
>>
>> the two servers is that the one with the problems runs a Nagios NRPE
>> server and some plugins, e.g. to check disk space on the Novell disk,
>> while the other server does not. I also found that heavy operations on
>> the filesystem (e.g. stat'ing many small files in a short time) are
>> problematic if you want to do anything else on the filesystem
>> at the same time. The second process just hangs until the first one
>> accessing the ncp filesystem has finished its operation. Well if
>
>
> You need either another CPU or a semaphore that does not suffer from
> starvation. Or you have to rewrite ncpfs to use some queue instead of
> a simple semaphore. What happens is that your copy process, in a loop,
> acquires ncp_server's semaphore, sends a request to the server, waits
> for the response, and releases the semaphore. It does that for every
> request sent out. Now your process comes in, finds that ncp_server's
> semaphore is locked, and starts waiting. The other process gets an
> answer from the server and releases the semaphore; as both processes
> were just waiting before this happened, they both have the same
> priority, and so the one that just did up() continues to run. And
> before the woken-up process gets a chance to do its task, the copy
> process sends another request, and so your second process goes to
> sleep again.

Ah, thanks. That makes things a lot clearer. I found out that my
assumption was right: the plugin really does get a KILL signal if it
exceeds the timeout. That means the Nagios check plugin is the source
of the problem (in combination with what you explained AND the process
that uses the ncpfs regularly and runs constantly).

Now we have found a solution for that. We simply start the always-running
process with a lower priority. That makes ncpfs access possible while
this process is running and producing load. If the always-running
process runs with low priority (nice +5) and the Nagios plugin tries to
do something on the ncpfs, the plugin is able to do so, runs fine, and
exits gracefully.

Problem solved, at least until we find a solution that does not look
like a workaround ;-)

Thanks for your help! You helped me very much.

Bye
Patrick
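The workaround above (running the always-running process at nice +5)
can also be applied from inside a launcher or from the process itself.
The following is a minimal sketch using the POSIX getpriority() and
setpriority() calls; the helper name lower_priority_by is made up for
this illustration, and the cap of 19 assumes the Linux nice range.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Raise the calling process's nice value by `increment`, i.e. lower its
 * scheduling priority. Returns 0 on success, -1 on failure (errno set). */
static int lower_priority_by(int increment)
{
    assert(increment >= 0);

    /* getpriority() may legitimately return -1, so errno is the only
     * reliable error indicator. */
    errno = 0;
    int current = getpriority(PRIO_PROCESS, 0);
    if (current == -1 && errno != 0)
        return -1;

    int target = current + increment;
    if (target > 19)
        target = 19;  /* assumed upper bound of the nice range on Linux */

    /* who == 0 means the calling process. */
    return setpriority(PRIO_PROCESS, 0, target);
}
```

A launcher would call lower_priority_by(5) and then exec the
long-running job; starting the job with "nice -n 5" from a shell script
has the same effect without any code.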