From: Miklos Szeredi
Date: Mon, 24 Oct 2016 14:57:17 +0200
Subject: Re: [4.9-rc1+] overlayfs lockdep
To: CAI Qian
Cc: Jan Kara, Al Viro, Linus Torvalds, linux-fsdevel@vger.kernel.org
In-Reply-To: <1268904273.1449131.1477064310378.JavaMail.zimbra@redhat.com>
References: <20161005153014.GC26977@htj.duckdns.org>
	<270577901.647921.1475682888765.JavaMail.zimbra@redhat.com>
	<874538236.682217.1475693824077.JavaMail.zimbra@redhat.com>
	<20161005200522.GE19539@ZenIV.linux.org.uk>
	<119370333.805584.1475756417736.JavaMail.zimbra@redhat.com>
	<1860793605.807021.1475756759147.JavaMail.zimbra@redhat.com>
	<20161007070838.GA16260@quack2.suse.cz>
	<1268904273.1449131.1477064310378.JavaMail.zimbra@redhat.com>

On Fri, Oct 21, 2016 at 5:38 PM, CAI Qian wrote:
>
> ----- Original Message -----
>> From: "Jan Kara"
>> Sent: Friday, October 7, 2016 3:08:38 AM
>> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
>>
>> So I believe this may be just a problem in overlayfs lockdep annotation
>> (see below). Added Miklos to CC.
>>
>> > Wait. There is also a lockdep splat that happened before the xfs
>> > internal error as well.
>> >
>> > [ 5839.452325] ======================================================
>> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
>> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
>> > [ 5839.471945] -------------------------------------------------------
>> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
>> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [] seq_read+0x4c/0x3e0
>> > [ 5839.492967]
>> > but task is already holding lock:
>> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [] __sb_start_write+0xd1/0xf0
>> > [ 5839.508560]
>> > which lock already depends on the new lock.
>> >
>> > [ 5839.517686]
>> > the existing dependency chain (in reverse order) is:
>> > [ 5839.526036]
>> > -> #3 (sb_writers#8){.+.+.+}:
>> > [ 5839.530751]   [] lock_acquire+0xd4/0x240
>> > [ 5839.537368]   [] percpu_down_read+0x4a/0x90
>> > [ 5839.544275]   [] __sb_start_write+0xd1/0xf0
>> > [ 5839.551181]   [] mnt_want_write+0x24/0x50
>> > [ 5839.557892]   [] ovl_want_write+0x1f/0x30 [overlay]
>> > [ 5839.565577]   [] ovl_do_remove+0x46/0x480 [overlay]
>> > [ 5839.573259]   [] ovl_unlink+0x13/0x20 [overlay]
>> > [ 5839.580555]   [] vfs_unlink+0xda/0x190
>> > [ 5839.586979]   [] do_unlinkat+0x268/0x2b0
>> > [ 5839.593599]   [] SyS_unlinkat+0x1b/0x30
>> > [ 5839.600120]   [] do_syscall_64+0x6c/0x1e0
>> > [ 5839.606836]   [] return_from_SYSCALL_64+0x0/0x7a
>> > [ 5839.614231]
>>
>> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
>> through mnt_want_write(), we also grab i_rwsem in do_unlinkat() in the
>> I_MUTEX_PARENT class a bit after that, and further down in vfs_unlink() we
>> grab i_rwsem for the unlinked inode itself in the default I_MUTEX class.
>> Then in ovl_want_write() we grab freeze protection again, but this time
>> for the upper filesystem.
>> That establishes the lock ordering sb_writers (overlay) -> I_MUTEX_PARENT
>> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) (we maintain locking
>> classes per fs type, which is why I'm showing the fs type in parentheses).
>>
>> Now this nesting is nasty because once you add locks that are not tracked
>> per fs type into the mix, you get cycles. In this case we've got
>> seq_file->lock and cred_guard_mutex in the mix - the splice path is
>> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
>> from a seq_file into the real filesystem). The exec path further
>> establishes cred_guard_mutex -> I_MUTEX (overlay), which closes the full
>> cycle:
>>
>>   sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex ->
>>   i_mutex (overlay) -> sb_writers (FS-A)
>>
>> If I analyzed the lockdep trace correctly, this looks like a real
>> (although remote) deadlock possibility. Miklos?

Yeah, you can leave out seq_file->lock; the chain can be made up from just
three parts (a rough syscall-level sketch follows below):

  unlink: i_mutex(ov) -> sb_writers(fs-a)
  splice: sb_writers(fs-a) -> cred_guard_mutex (through proc_tgid_io_accounting)
  exec:   cred_guard_mutex -> i_mutex(ov)

None of those are incorrect, but the cred_guard_mutex usage is also pretty
weird: it is taken outside path lookup as well as inside ->read() in proc.

This doesn't look like a serious worry in practice; I don't think anybody
would trigger the actual deadlock in a thousand years (an fs freeze is
needed at just the right moment on top of the above, already very unlikely,
chain).

Thanks,
Miklos
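
A rough userspace sketch (say, legs.c) of those three legs, for illustration
only: all paths are made up (assume /mnt/ovl is an overlay mount whose upper
layer lives on fs-a, which is also mounted at /mnt/upper), and it is not a
reproducer, since the real deadlock would additionally need a freeze of fs-a
to land at exactly the wrong moment.

/*
 * legs.c - illustration only, not a reproducer.  All paths are made up:
 * /mnt/ovl is assumed to be an overlay mount whose upper layer lives on
 * fs-a, which is also mounted at /mnt/upper.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Leg 1, unlink on the overlay: do_unlinkat()/vfs_unlink() take
	 * i_mutex(ov), then ovl_want_write() takes sb_writers on the
	 * upper filesystem (fs-a).
	 */
	unlinkat(AT_FDCWD, "/mnt/ovl/dir/victim", 0);

	/*
	 * Leg 2, splice from a proc seq_file into a file on fs-a (sendfile
	 * goes through the splice machinery): the write side takes
	 * sb_writers(fs-a), the proc read side takes cred_guard_mutex
	 * through proc_tgid_io_accounting.
	 */
	int in = open("/proc/self/io", O_RDONLY);
	int out = open("/mnt/upper/out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (in >= 0 && out >= 0)
		sendfile(out, in, NULL, 4096);

	/*
	 * Leg 3, exec of a binary that lives on the overlay: exec takes
	 * cred_guard_mutex and then i_mutex(ov) while opening the binary.
	 */
	execl("/mnt/ovl/bin/true", "true", (char *)NULL);
	perror("execl");
	return 1;
}

Building it is just "cc -o legs legs.c"; running it merely exercises the
three paths one after the other.  Nothing waits on a freeze here, so it will
not deadlock; it only shows where each edge of the cycle comes from.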