How to deal with such hanging processes?

* How to deal with such hanging processes?
@ 2012-01-27 20:33 Łukasz Maśko
       [not found] ` <201201272133.35986-D2Dg4Jie/XezyIjkdXusMg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Łukasz Maśko @ 2012-01-27 20:33 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

I have a Welland ME-752GNS NAS. It can serve files only over FTP or CIFS.
For quick data transfers I use FTP, but if I want to mount the disk, for
instance to browse my images or watch movies, I'm forced to use CIFS.

It seems to work, but not too well. I realise that my problems come mainly
from the poor CIFS implementation in the NAS firmware, but since it is the
only NAS I have and I cannot afford to replace it, I must somehow live with
it. The main problem is that quite often something goes wrong with the data
transfer. First, it results in entries like these in dmesg and the logs:

[ 5743.489573] CIFS VFS: ignoring corrupt resume name
[ 5743.553028] CIFS VFS: ignoring corrupt resume name
[ 5743.652823] CIFS VFS: ignoring corrupt resume name
[ 5744.822936] CIFS VFS: ignoring corrupt resume name
[ 5758.608685] CIFS VFS: ignoring corrupt resume name
[ 5770.010003] CIFS VFS: ignoring corrupt resume name
[ 5792.937939] CIFS VFS: Send error in read = -512
[ 5792.938948] CIFS VFS: No task to wake, unknown frame received! NumMids 2
[ 5792.938958] Received Data is: : dump of 37 bytes of data at 0xf4f4b6c0
[ 5792.938974]  60000000 424d53ff 0000a4a4 c0018000 . . . ` \xffffffff S M B ¤ ¤ . . . . . Ŕ
[ 5792.938988]  00000000 00000000 00000000 2e130006 . . . . . . . . . . . . . . . .
[ 5792.938996]  67950002 00000012 . . . g .

The "CIFS VFS: ignoring corrupt resume name" message in particular appears
very often, but it does not cause any major problems.
Then, though not always, the process performing the data transfer hangs and
I get the following errors:

[ 6120.569517] INFO: task kio_file:12029 blocked for more than 120 seconds.
[ 6120.569521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6120.569525] kio_file        D e417bc30     0 12029   6037 0x00000004
[ 6120.569533]  e417bcb4 00000086 e417bc30 e417bc30 e417bc38 31a0b404 00000559 00000000
[ 6120.569543]  c0724a80 e417bc58 c0724a80 f6707a80 f60ab180 f0c539c0 00000020 00000000
[ 6120.569552]  e427e3c0 e5602938 00000020 e560293c 000003b7 00000010 c0664940 e417bcb4
[ 6120.569561] Call Trace:
[ 6120.569575]  [<c0174f6c>] ? ktime_get_ts+0xdc/0x110
[ 6120.569583]  [<c04eadc0>] schedule+0x30/0x50
[ 6120.569588]  [<c04eae53>] io_schedule+0x73/0xb0
[ 6120.569594]  [<c01d93c8>] sleep_on_page+0x8/0x10
[ 6120.569599]  [<c04eb4d7>] __wait_on_bit_lock+0x47/0x90
[ 6120.569604]  [<c01d93c0>] ? __lock_page+0x80/0x80
[ 6120.569609]  [<c01d93b6>] __lock_page+0x76/0x80
[ 6120.569616]  [<c016c2e0>] ? autoremove_wake_function+0x40/0x40
[ 6120.569623]  [<c024ad6d>] __generic_file_splice_read+0x52d/0x550
[ 6120.569630]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
[ 6120.569636]  [<c03f94bb>] ? __alloc_skb+0x5b/0x210
[ 6120.569640]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
[ 6120.569647]  [<c03018fd>] ? _copy_from_user+0x3d/0x60
[ 6120.569652]  [<c03f8f27>] ? skb_queue_tail+0x37/0x50
[ 6120.569659]  [<c0484150>] ? unix_stream_sendmsg+0x3d0/0x420
[ 6120.569665]  [<c0249600>] ? page_cache_pipe_buf_release+0x20/0x20
[ 6120.569671]  [<c024ae24>] generic_file_splice_read+0x94/0x100
[ 6120.569677]  [<c024ad90>] ? __generic_file_splice_read+0x550/0x550
[ 6120.569682]  [<c02498f0>] do_splice_to+0x60/0x80
[ 6120.569687]  [<c0249b2e>] splice_direct_to_actor+0xae/0x1d0
[ 6120.569692]  [<c0249860>] ? do_splice_from+0x80/0x80
[ 6120.569698]  [<c024afcd>] do_splice_direct+0x4d/0x70
[ 6120.569705]  [<c02252e1>] do_sendfile+0x181/0x220
[ 6120.569710]  [<c0226053>] sys_sendfile64+0x53/0xc0
[ 6120.569716]  [<c04f391f>] sysenter_do_call+0x12/0x28

I'm unable to kill this process and it prevents the share from being 
unmounted:

$ ps ax | grep kio_file
12029 ?        D      0:00 kdeinit4: kio_file [kdeinit] file local:/home/users/ed/tmp/ksocket-ed/klauncherTi6038.slave-socket local:/home/users/ed/tmp/ksocket-ed/dolphinU11997.slave-socket

So far I've learned that the following combination works: first, I umount
the share with the -l (lazy) option, although the process in question still
exists. Second, I turn the NAS off, wait a moment and turn it on again (I'm
not 100% sure the NAS restart is required, but it works) and reload the
cifs.ko module. As a result, the process is gone and I can keep working.
Until the problem occurs again...

I'm using PLD Linux (which is probably not important). I have a vanilla
kernel, currently 3.2.2, but the same has happened since 2.6.x (the only
improvement after moving to 3.2 is a big performance jump). I have
cifs-utils-5.2 installed and I load the cifs.ko module with the following
parameters:

echo_retries=1 cifs_max_pending=2

cifs_max_pending=2 is the most important one: the higher the value, the more
often the problem occurs, and 2 is the smallest value allowed.
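
To check which values the loaded module actually picked up, the parameters can be read back from sysfs. This is only a minimal sketch: it assumes the running kernel exports both parameters under /sys/module/cifs/parameters/, and the helper name read_module_params is made up for illustration.

```python
from pathlib import Path


def read_module_params(module, names, sysfs_root="/sys/module"):
    """Return {name: value or None} for the requested module parameters.

    A parameter maps to None when its sysfs file is missing or unreadable
    (module not loaded, or parameter not exported by this kernel).
    """
    params_dir = Path(sysfs_root) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (params_dir / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    wanted = ["echo_retries", "cifs_max_pending"]
    for name, value in read_module_params("cifs", wanted).items():
        print(f"{name} = {value if value is not None else '(not available)'}")
```

If a value differs from what was passed at load time, the module was probably loaded earlier with other options and needs to be reloaded.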

Is there anything I can do in the side of my Linux box in such situation? I 
cannot upgrade the NAS firmware for I have the latest version and probably 
no newer will be released (it is closed-source). I cannot get rid of this 
NAS either. At least for some time. The best would be of course to make cifs 
work with my NAS anyway, but it's up to You, for I have not enough knowledge 
about it.
-- 
Łukasz Maśko                                                            _o)
Lukasz.Masko(at)ipipan.waw.pl                                           /\\
Registered Linux User #61028                                           _\_V


* Re: How to deal with such hanging processes?
       [not found] ` <201201272133.35986-D2Dg4Jie/XezyIjkdXusMg@public.gmane.org>
@ 2012-01-28 12:30   ` Jeff Layton
       [not found]     ` <20120128073021.4eca547e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2012-01-28 12:30 UTC (permalink / raw)
  To: Łukasz Maśko; +Cc: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Fri, 27 Jan 2012 21:33:35 +0100
Łukasz Maśko <masko-/XNhcdyn+gCn9IrpMBEE/Q@public.gmane.org> wrote:

> I have a Welland ME-752GNS NAS. It can serve files only over FTP or CIFS.
> For quick data transfers I use FTP, but if I want to mount the disk, for
> instance to browse my images or watch movies, I'm forced to use CIFS.
> 
> It seems to work, but not too well. I realise that my problems come mainly
> from the poor CIFS implementation in the NAS firmware, but since it is the
> only NAS I have and I cannot afford to replace it, I must somehow live with
> it. The main problem is that quite often something goes wrong with the data
> transfer. First, it results in entries like these in dmesg and the logs:
> 
> [ 5743.489573] CIFS VFS: ignoring corrupt resume name
> [ 5743.553028] CIFS VFS: ignoring corrupt resume name
> [ 5743.652823] CIFS VFS: ignoring corrupt resume name
> [ 5744.822936] CIFS VFS: ignoring corrupt resume name
> [ 5758.608685] CIFS VFS: ignoring corrupt resume name
> [ 5770.010003] CIFS VFS: ignoring corrupt resume name
> [ 5792.937939] CIFS VFS: Send error in read = -512
> [ 5792.938948] CIFS VFS: No task to wake, unknown frame received! NumMids 2
> [ 5792.938958] Received Data is: : dump of 37 bytes of data at 0xf4f4b6c0
> [ 5792.938974]  60000000 424d53ff 0000a4a4 c0018000 . . . ` \xffffffff S M B ¤ ¤ . . . . . Ŕ
> [ 5792.938988]  00000000 00000000 00000000 2e130006 . . . . . . . . . . . . . . . .
> [ 5792.938996]  67950002 00000012 . . . g .
> 
> The "CIFS VFS: ignoring corrupt resume name" message in particular appears
> very often, but it does not cause any major problems.
> Then, though not always, the process performing the data transfer hangs and
> I get the following errors:
> 
> [ 6120.569517] INFO: task kio_file:12029 blocked for more than 120 seconds.
> [ 6120.569521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 6120.569525] kio_file        D e417bc30     0 12029   6037 0x00000004
> [ 6120.569533]  e417bcb4 00000086 e417bc30 e417bc30 e417bc38 31a0b404 00000559 00000000
> [ 6120.569543]  c0724a80 e417bc58 c0724a80 f6707a80 f60ab180 f0c539c0 00000020 00000000
> [ 6120.569552]  e427e3c0 e5602938 00000020 e560293c 000003b7 00000010 c0664940 e417bcb4
> [ 6120.569561] Call Trace:
> [ 6120.569575]  [<c0174f6c>] ? ktime_get_ts+0xdc/0x110
> [ 6120.569583]  [<c04eadc0>] schedule+0x30/0x50
> [ 6120.569588]  [<c04eae53>] io_schedule+0x73/0xb0
> [ 6120.569594]  [<c01d93c8>] sleep_on_page+0x8/0x10
> [ 6120.569599]  [<c04eb4d7>] __wait_on_bit_lock+0x47/0x90
> [ 6120.569604]  [<c01d93c0>] ? __lock_page+0x80/0x80
> [ 6120.569609]  [<c01d93b6>] __lock_page+0x76/0x80
> [ 6120.569616]  [<c016c2e0>] ? autoremove_wake_function+0x40/0x40
> [ 6120.569623]  [<c024ad6d>] __generic_file_splice_read+0x52d/0x550
> [ 6120.569630]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
> [ 6120.569636]  [<c03f94bb>] ? __alloc_skb+0x5b/0x210
> [ 6120.569640]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
> [ 6120.569647]  [<c03018fd>] ? _copy_from_user+0x3d/0x60
> [ 6120.569652]  [<c03f8f27>] ? skb_queue_tail+0x37/0x50
> [ 6120.569659]  [<c0484150>] ? unix_stream_sendmsg+0x3d0/0x420
> [ 6120.569665]  [<c0249600>] ? page_cache_pipe_buf_release+0x20/0x20
> [ 6120.569671]  [<c024ae24>] generic_file_splice_read+0x94/0x100
> [ 6120.569677]  [<c024ad90>] ? __generic_file_splice_read+0x550/0x550
> [ 6120.569682]  [<c02498f0>] do_splice_to+0x60/0x80
> [ 6120.569687]  [<c0249b2e>] splice_direct_to_actor+0xae/0x1d0
> [ 6120.569692]  [<c0249860>] ? do_splice_from+0x80/0x80
> [ 6120.569698]  [<c024afcd>] do_splice_direct+0x4d/0x70
> [ 6120.569705]  [<c02252e1>] do_sendfile+0x181/0x220
> [ 6120.569710]  [<c0226053>] sys_sendfile64+0x53/0xc0
> [ 6120.569716]  [<c04f391f>] sysenter_do_call+0x12/0x28
> 

The process here is stuck waiting for the page lock on a page. Quite
possibly that page is part of a file on a cifs filesystem.
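
One way to read a trace like the one above: frames prefixed with "?" are addresses the unwinder found on the stack but could not confirm as part of the call chain, so the frames without "?" are the ones worth following. The sketch below filters a trace down to those frames. The regular expression is an assumption about this particular log layout, and reliable_frames is a made-up helper name.

```python
import re

# Matches e.g. "[<c01d93b6>] __lock_page+0x76/0x80", with an optional "? " marker.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def reliable_frames(trace_lines):
    """Return the function names of frames not marked '?', innermost first."""
    frames = []
    for line in trace_lines:
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            frames.append(match.group(2))
    return frames


sample = [
    "[ 6120.569575]  [<c0174f6c>] ? ktime_get_ts+0xdc/0x110",
    "[ 6120.569583]  [<c04eadc0>] schedule+0x30/0x50",
    "[ 6120.569588]  [<c04eae53>] io_schedule+0x73/0xb0",
    "[ 6120.569594]  [<c01d93c8>] sleep_on_page+0x8/0x10",
    "[ 6120.569599]  [<c04eb4d7>] __wait_on_bit_lock+0x47/0x90",
    "[ 6120.569604]  [<c01d93c0>] ? __lock_page+0x80/0x80",
    "[ 6120.569609]  [<c01d93b6>] __lock_page+0x76/0x80",
]
print(" <- ".join(reliable_frames(sample)))
```

Applied to the full trace, the confirmed chain runs from sys_sendfile64 through generic_file_splice_read down to __lock_page, which is where the task sleeps.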

> I'm unable to kill this process and it prevents the share from being 
> unmounted:
> 
> $ ps ax | grep kio_file
> 12029 ?        D      0:00 kdeinit4: kio_file [kdeinit] file local:/home/users/ed/tmp/ksocket-ed/klauncherTi6038.slave-socket local:/home/users/ed/tmp/ksocket-ed/dolphinU11997.slave-socket
> 

Right. D state is uninterruptible sleep, and you won't be able to kill
it until it wakes up and comes out of kernel space.
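
To see at a glance which tasks are currently in that state, the state field of /proc/<pid>/stat can be scanned. This is a rough sketch rather than a replacement for ps. The function names are invented for illustration, and it assumes the usual stat layout of "pid (comm) state ...".

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Yield (pid, comm) for every process currently in state D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking
        if state == "D":
            yield pid, comm


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A task that shows up here only briefly is normal. One that stays listed for minutes, like kio_file above, is the kind the hung-task detector reports.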

> So far I've learned that the following combination works: first, I umount
> the share with the -l (lazy) option, although the process in question still
> exists. Second, I turn the NAS off, wait a moment and turn it on again (I'm
> not 100% sure the NAS restart is required, but it works) and reload the
> cifs.ko module. As a result, the process is gone and I can keep working.
> Until the problem occurs again...
> 
> I'm using PLD Linux (which is probably not important). I have a vanilla
> kernel, currently 3.2.2, but the same has happened since 2.6.x (the only
> improvement after moving to 3.2 is a big performance jump). I have
> cifs-utils-5.2 installed and I load the cifs.ko module with the following
> parameters:
> 
> echo_retries=1 cifs_max_pending=2
> 
> cifs_max_pending=2 is the most important one: the higher the value, the more
> often the problem occurs, and 2 is the smallest value allowed.
> 
> Is there anything I can do on the Linux side in this situation? I cannot
> upgrade the NAS firmware because I already have the latest version and
> probably no newer one will be released (it is closed-source). I cannot get
> rid of this NAS either, at least for some time. The best outcome would of
> course be to make cifs work with my NAS anyway, but that is up to you, as I
> don't have enough knowledge about it.

The way to deal with them is to solve the problem that causes them to
hang in the first place. Once they're stuck like that, there's really
little you can do until the page lock is released. The messages from
the ring buffer suggest that the server is sending corrupt replies to
the requests. A network capture might help confirm that.
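
Short of a capture, the hex dump in the log can at least be turned back into bytes. The sketch below assumes the kernel printed the buffer as 32-bit words on a little-endian host, which the ASCII column of the dump seems to bear out. words_to_bytes is a made-up helper name.

```python
import struct


def words_to_bytes(hex_words):
    """Pack 32-bit words, as printed in the dump, back into wire order.

    Assumes each word was printed from a little-endian host.
    """
    return b"".join(struct.pack("<I", int(word, 16)) for word in hex_words)


# First line of the "Received Data is" dump from the log.
dump = "60000000 424d53ff 0000a4a4 c0018000".split()
raw = words_to_bytes(dump)

print(raw.hex(" "))
print("SMB signature at offset 4:", raw[4:8] == b"\xffSMB")
print("command byte: 0x%02x" % raw[8])
```

On that assumption the frame starts with a 4-byte length prefix followed by the usual \xffSMB signature, so the header itself looks like SMB. Whether the rest of the reply is valid still needs a real capture.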

Is this the same NAS that requests a maxmpx of 1? If so, the fact that
cifs sends more than one request at a time to this server might be the
ultimate cause.

Obviously the server should handle that situation without corrupting
its replies, but cifs is clearly broken in this regard and shouldn't be
sending more than one request at a time to such a server. I doubt
there's anything you can do until Steve fixes that bug.
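
As an illustration of the behaviour being asked for, and not of how cifs.ko is or will be implemented, a client could cap its in-flight requests at the smaller of its own limit and the maxmpx value the server advertised. The class and parameter names below are hypothetical.

```python
import threading


class RequestSlots:
    """Limit in-flight requests to what both client and server allow."""

    def __init__(self, client_max_pending, server_max_mpx):
        # Never go below one slot, or nothing could ever be sent.
        self.limit = max(1, min(client_max_pending, server_max_mpx))
        self._slots = threading.BoundedSemaphore(self.limit)

    def try_acquire(self):
        """Take a slot without blocking; return False if none is free."""
        return self._slots.acquire(blocking=False)

    def release(self):
        """Give a slot back once the reply for a request has arrived."""
        self._slots.release()


# A client limit of 2 against a server that advertises maxmpx=1.
slots = RequestSlots(client_max_pending=2, server_max_mpx=1)
first = slots.try_acquire()
second = slots.try_acquire()
print(slots.limit, first, second)
```

With these inputs only one request may be outstanding, so the second attempt is refused until the first slot is released.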

-- 
Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


* Re: How to deal with such hanging processes?
       [not found]     ` <20120128073021.4eca547e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2012-01-28 13:08       ` ralda-Mmb7MZpHnFY
       [not found]         ` <20120128140830.2b48a0c8.ralda-Mmb7MZpHnFY@public.gmane.org>
  2012-01-28 14:54       ` Łukasz Maśko
  1 sibling, 1 reply; 5+ messages in thread
From: ralda-Mmb7MZpHnFY @ 2012-01-28 13:08 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Łukasz Maśko, linux-cifs-u79uwXL29TY76Z2rM5mHXA

Hello Jeff!

> Is this the same NAS that requests a maxmpx of 1? If so, the fact that
> cifs sends more than one request at a time to this server might be the
> ultimate cause.
> 
> Obviously the server should handle that situation without corrupting
> its replies, but cifs is clearly broken in this regard and shouldn't be
> sending more than one request at a time to such a server. I doubt
> there's anything you can do until Steve fixes that bug.

That cifs_max_pending=2 module parameter helps a lot on my side. It not
only avoids many of the cases where the cifs process gets stuck, it even
leads to the interesting behavior that such processes vanish (accept the
kill) after a couple of minutes. In those cases the device even returns to
a working state without a reboot. After that you can unmount normally or
continue to access the NAS device. Those stuck periods are awful, but
better than having to reboot, especially as those collisions happen
much less often than without the module parameter.

--
Harald


* Re: How to deal with such hanging processes?
       [not found]         ` <20120128140830.2b48a0c8.ralda-Mmb7MZpHnFY@public.gmane.org>
@ 2012-01-28 14:50           ` Łukasz Maśko
  0 siblings, 0 replies; 5+ messages in thread
From: Łukasz Maśko @ 2012-01-28 14:50 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Saturday, 28 January 2012, ralda-Mmb7MZpHnFY@public.gmane.org wrote:
[...]
> That cifs_max_pending=2 module parameter helps a lot on my side. It not
> only avoids many of the cases where the cifs process gets stuck, it even
> leads to the interesting behavior that such processes vanish (accept the
> kill) after a couple of minutes. In those cases the device even returns to
> a working state without a reboot. After that you can unmount normally or
> continue to access the NAS device. Those stuck periods are awful, but
> better than having to reboot, especially as those collisions happen
> much less often than without the module parameter.

Exactly. Same for me.
-- 
Łukasz Maśko                                           GG:   2441498    _o)
Lukasz.Masko(at)ipipan.waw.pl                                           /\\
Registered Linux User #61028                                           _\_V
Ubuntu: an ancient African word meaning "I can't install Debian"


* Re: How to deal with such hanging processes?
       [not found]     ` <20120128073021.4eca547e-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2012-01-28 13:08       ` ralda-Mmb7MZpHnFY
@ 2012-01-28 14:54       ` Łukasz Maśko
  1 sibling, 0 replies; 5+ messages in thread
From: Łukasz Maśko @ 2012-01-28 14:54 UTC (permalink / raw)
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

On Saturday, 28 January 2012, Jeff Layton wrote:
[...]
> The way to deal with them is to solve the problem that causes them to
> hang in the first place. Once they're stuck like that, there's really
> little you can do until the page lock is released. The messages from
> the ring buffer suggest that the server is sending corrupt replies to
> the requests. A network capture might help confirm that.

It is a bit hard to capture, since the problem is nondeterministic :-/ I'll
try anyway.

> Is this the same NAS that requests a maxmpx of 1? If so, the fact that
> cifs sends more than one request at a time to this server might be the
> ultimate cause.

That was in another thread, about a different NAS, but it seems that the
same applies in my case.
 
> Obviously the server should handle that situation without corrupting
> its replies, but cifs is clearly broken in this regard and shouldn't be
> sending more than one request at a time to such a server. I doubt
> there's anything you can do until Steve fixes that bug.

So I'm waiting patiently, that's all I can do :-)

-- 
Łukasz Maśko                                                            _o)
Lukasz.Masko(at)ipipan.waw.pl                                           /\\
Registered Linux User #61028                                           _\_V
