From mboxrd@z Thu Jan 1 00:00:00 1970 From: David McBride Subject: Re: Hanging cifsiod / OpLock break problems Date: Fri, 2 Oct 2015 11:10:56 +0100 Message-ID: <560E5830.30201@cam.ac.uk> References: Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Cc: Tobias Doerffel , "linux-cifs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org" To: Shirish Pargaonkar Return-path: In-Reply-To: Sender: linux-cifs-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org List-ID: On 04/03/15 03:08, Shirish Pargaonkar wrote: > Steven Rostedt had posted a patch on 3/21/2014 on the cifs mailing list > that might be helpful here. > > Link: http://lkml.kernel.org/r/20140319151252.16ed3ac6-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org > > On Tue, Mar 3, 2015 at 3:25 AM, Tobias Doerffel wrote: >> Hi, >> >> we're having constant issues with CIFS mounts on our Linux terminal servers (Ubuntu 12.04 with updated kernel). Usually after 1 or 2 days we see the following in dmesg: Hello, I've got some servers operating in a similar configuration --- remote login servers, running Ubuntu 14.04 (rather than 12.04), with cifs automounts to a (Windows, not Samba) fileserver for home directories --- and I've been seeing similar issues under load. An example mount looks like: //filestore/d/dwm37 on /home/dwm37 type cifs (rw,relatime,vers=1.0,sec=ntlmssp,cache=strict,multiuser,domain=DOMAIN,uid=0,noforceuid,gid=0,noforcegid,addr=1.2.3.4,file_mode=0700,dir_mode=0700,nounix,mapchars,nobrl,mfsymlinks,noperm,rsize=61440,wsize=65536,actimeo=1) (The user's session keyring is automatically populated with their passphrase on login, allowing them to access mounts as themselves using strong-authentication.) An example trace for the deadlock I'm seeing is: > [389388.350684] INFO: task cifsiod:2810 blocked for more than 120 seconds. > [389388.351032] Tainted: G C 3.19.0-28-generic #30~14.04.1-Ubuntu > [389388.351221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [389388.351401] cifsiod D ffff88012496fbb8 0 2810 2 0x00000000 > [389388.351418] Workqueue: cifsiod cifs_oplock_break [cifs] > [389388.351420] ffff88012496fbb8 ffff880095916bf0 0000000000013e80 ffff88012496ffd8 > [389388.351423] 0000000000013e80 ffff8800351c13a0 ffff880095916bf0 ffff88012496fc60 > [389388.351425] ffff88013fd14778 ffff88012496fc60 ffff88013ffdc2e8 0000000000000002 > [389388.351427] Call Trace: > [389388.351434] [] ? bit_wait+0x50/0x50 > [389388.351437] [] io_schedule+0xa0/0x130 > [389388.351439] [] bit_wait_io+0x2c/0x50 > [389388.351442] [] __wait_on_bit+0x65/0x90 > [389388.351446] [] ? find_get_pages_tag+0xcd/0x170 > [389388.351449] [] wait_on_page_bit+0xc7/0xd0 > [389388.351453] [] ? autoremove_wake_function+0x40/0x40 > [389388.351456] [] filemap_fdatawait_range+0xf9/0x190 > [389388.351459] [] ? do_writepages+0x1e/0x40 > [389388.351462] [] ? __filemap_fdatawrite_range+0x59/0x60 > [389388.351464] [] filemap_fdatawait+0x27/0x30 > [389388.351471] [] cifs_oplock_break+0x301/0x340 [cifs] > [389388.351475] [] process_one_work+0x14f/0x400 > [389388.351478] [] rescuer_thread+0x19c/0x3d0 > [389388.351480] [] ? process_one_work+0x400/0x400 > [389388.351483] [] kthread+0xd2/0xf0 > [389388.351486] [] ? kthread_create_on_node+0x1c0/0x1c0 > [389388.351489] [] ret_from_fork+0x58/0x90 > [389388.351492] [] ? kthread_create_on_node+0x1c0/0x1c0 Which looks extremely similar to the one Tobias reported. To try to diagnose the problem further, I took the Ubuntu kernel source-package for the kernel we're running, enabled CONFIG_PROVE_LOCKING and related options, and rebuilt it using `make deb-pkg` --- and have been running it on one of our remote-login machines. (Despite concerns about the performance overhead of this option, it appears to be perfectly adequate in practice.) And lo, lockdep triggered overnight: > [73551.456398] ====================================================== > [73551.456426] [ INFO: possible circular locking dependency detected ] > [73551.456455] 3.19.8-ckt6 #1 Tainted: G C E > [73551.456477] ------------------------------------------------------- > [73551.456504] rsync/20306 is trying to acquire lock: > [73551.456527] ((&cfile->oplock_break)){+.+.+.}, at: [] flush_work+0x5/0x280 > [73551.456584] > [73551.456584] but task is already holding lock: > [73551.456610] (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [] chmod_common+0x6a/0x150 > [73551.456661] > [73551.456661] which lock already depends on the new lock. > [73551.456661] > [73551.457458] > [73551.457458] the existing dependency chain (in reverse order) is: > [73551.458879] > [73551.458879] -> #2 (&sb->s_type->i_mutex_key#19){+.+.+.}: > [73551.460298] [] lock_acquire+0xa0/0x120 > [73551.461016] [] mutex_lock_nested+0x56/0x4d0 > [73551.461737] [] cifs_strict_writev+0xfd/0x280 [cifs] > [73551.462465] [] new_sync_write+0x81/0xb0 > [73551.463178] [] vfs_write+0xba/0x1f0 > [73551.463872] [] SyS_write+0x49/0xb0 > [73551.464549] [] system_call_fastpath+0x16/0x1b > [73551.465224] > [73551.465224] -> #1 (&cifsi->lock_sem){++++++}: > [73551.466563] [] lock_acquire+0xa0/0x120 > [73551.467220] [] down_write+0x47/0xb0 > [73551.467869] [] cifs_oplock_break+0xfd/0x380 [cifs] > [73551.468524] [] process_one_work+0x1c2/0x4a0 > [73551.469167] [] worker_thread+0x121/0x450 > [73551.469806] [] kthread+0xf9/0x110 > [73551.470427] [] ret_from_fork+0x58/0x90 > [73551.471048] > [73551.471048] -> #0 ((&cfile->oplock_break)){+.+.+.}: > [73551.472224] [] __lock_acquire+0x192e/0x1a20 > [73551.472833] [] lock_acquire+0xa0/0x120 > [73551.473430] [] flush_work+0x4c/0x280 > [73551.474017] [] __cancel_work_timer+0xa0/0x1d0 > [73551.474660] [] cancel_work_sync+0x10/0x20 > [73551.475247] [] cifsFileInfo_put+0x142/0x3e0 [cifs] > [73551.475841] [] smb_set_file_info+0xbd/0x290 [cifs] > [73551.476413] [] cifs_set_file_info+0xaa/0x180 [cifs] > [73551.476979] [] cifs_setattr+0x475/0x980 [cifs] > [73551.477538] [] notify_change+0x231/0x390 > [73551.478076] [] chmod_common+0x11f/0x150 > [73551.478594] [] SyS_chmod+0x45/0xa0 > [73551.479115] [] system_call_fastpath+0x16/0x1b > [73551.479646] > [73551.479646] other info that might help us debug this: > [73551.479646] > [73551.481191] Chain exists of: > [73551.481191] (&cfile->oplock_break) --> &cifsi->lock_sem --> &sb->s_type->i_mutex_key#19 > [73551.481191] > [73551.482752] Possible unsafe locking scenario: > [73551.482752] > [73551.483797] CPU0 CPU1 > [73551.484311] ---- ---- > [73551.484819] lock(&sb->s_type->i_mutex_key#19); > [73551.485339] lock(&cifsi->lock_sem); > [73551.485874] lock(&sb->s_type->i_mutex_key#19); > [73551.486400] lock((&cfile->oplock_break)); > [73551.486930] > [73551.486930] *** DEADLOCK *** > [73551.486930] > [73551.488410] 2 locks held by rsync/20306: > [73551.488899] #0: (sb_writers#17){.+.+.+}, at: [] mnt_want_write+0x24/0x50 > [73551.489432] #1: (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [] chmod_common+0x6a/0x150 > [73551.490513] > [73551.490513] stack backtrace: > [73551.491547] CPU: 0 PID: 20306 Comm: rsync Tainted: G C E 3.19.8-ckt6 #1 > [73551.492079] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/14/2014 > [73551.493184] ffffffff82889790 ffff88000a9278a8 ffffffff817e77f1 0000000000000000 > [73551.493772] ffffffff82889f50 ffff88000a9278f8 ffffffff817e2d3f ffff88000a9278c8 > [73551.494358] ffff88000a927948 ffff880004921ea8 ffff8800049215f0 ffff880004921ea8 > [73551.494941] Call Trace: > [73551.495524] [] dump_stack+0x45/0x57 > [73551.496108] [] print_circular_bug+0x1fb/0x20c > [73551.496703] [] __lock_acquire+0x192e/0x1a20 > [73551.497290] [] lock_acquire+0xa0/0x120 > [73551.497874] [] ? flush_work+0x5/0x280 > [73551.498456] [] flush_work+0x4c/0x280 > [73551.499035] [] ? flush_work+0x5/0x280 > [73551.499613] [] ? mark_held_locks+0x75/0xa0 > [73551.500186] [] ? _raw_spin_unlock_irqrestore+0x36/0x60 > [73551.500766] [] ? mark_held_locks+0x75/0xa0 > [73551.501349] [] ? __cancel_work_timer+0x105/0x1d0 > [73551.501923] [] __cancel_work_timer+0xa0/0x1d0 > [73551.502486] [] cancel_work_sync+0x10/0x20 > [73551.503056] [] cifsFileInfo_put+0x142/0x3e0 [cifs] > [73551.503625] [] ? CIFSSMBSetFileInfo+0x188/0x210 [cifs] > [73551.504205] [] ? CIFSSMBSetFileInfo+0x188/0x210 [cifs] > [73551.504779] [] ? find_get_pages_tag+0x25/0x1c0 > [73551.505359] [] smb_set_file_info+0xbd/0x290 [cifs] > [73551.505938] [] ? pagevec_lookup_tag+0x25/0x40 > [73551.506519] [] ? filemap_fdatawait_range+0x13b/0x190 > [73551.507096] [] cifs_set_file_info+0xaa/0x180 [cifs] > [73551.507673] [] ? do_writepages+0x21/0x50 > [73551.508262] [] cifs_setattr+0x475/0x980 [cifs] > [73551.508857] [] ? evm_inode_setattr+0x21/0x70 > [73551.509452] [] ? current_fs_time+0x16/0x60 > [73551.510039] [] ? trace_hardirqs_on+0xd/0x10 > [73551.510607] [] notify_change+0x231/0x390 > [73551.511154] [] chmod_common+0x11f/0x150 > [73551.511709] [] SyS_chmod+0x45/0xa0 > [73551.512267] [] system_call_fastpath+0x16/0x1b I'm not sufficiently familiar with the cifs internals to be able to immediately see whether this is a real scenario, or whether the patch posted previously by Steven Rostedt will likely address it. Hopefully this information can advance understanding of this problem? Kind regards, David -- David McBride Unix Specialist, University Information Services