From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Eric Ren <zren@suse.com>,
	Joseph Qi <jiangqi903@gmail.com>,
	Mark Fasheh <mfasheh@versity.com>,
	Joel Becker <jlbec@evilplan.org>,
	Junxiao Bi <junxiao.bi@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4.4 06/48] ocfs2: fix crash caused by stale lvb with fsdlm plugin
Date: Wed, 18 Jan 2017 11:46:15 +0100
Message-ID: <20170118104625.826644157@linuxfoundation.org>
In-Reply-To: <20170118104625.550018627@linuxfoundation.org>

4.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Ren <zren@suse.com>

commit e7ee2c089e94067d68475990bdeed211c8852917 upstream.

The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.

The crash backtrace is below:

   dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
   dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
   ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
   ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
   ocfs2: Beginning quota recovery on device (253,18) for slot 2
   ocfs2: Finishing quota recovery on device (253,18) for slot 2
   (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
   (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
   ------------[ cut here ]------------
   kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
   invalid opcode: 0000 [#1] SMP
   Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
   Supported: No, Unsupported modules are loaded
   CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
   task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
   RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
   RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
   RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
   RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
   RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
   R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
   R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
   FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
   CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
   Call Trace:
     ocfs2_setattr+0x698/0xa90 [ocfs2]
     notify_change+0x1ae/0x380
     do_truncate+0x5e/0x90
     do_sys_ftruncate.constprop.11+0x108/0x160
     entry_SYSCALL_64_fastpath+0x12/0x6d
   Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
   RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]

This happens because ocfs2_inode_lock() hands us a stale LVB in which
the i_size does not match the on-disk i_size.  We mistakenly trust the
LVB because the underlying fsdlm dlm_lock() doesn't set
DLM_SBF_VALNOTVALID in lkb_sbflags for us.  But why?

The current code downconverts the lock without the DLM_LKF_VALBLK flag
to tell o2cb not to update the RSB's LVB when it's a PR->NULL
conversion, even if the lock resource type needs the LVB.  This is not
the right approach for fsdlm.

The fsdlm plugin treats DLM_LKF_VALBLK differently: it depends on
DLM_LKF_VALBLK to decide whether we care about the LVB in the LKB.  If
DLM_LKF_VALBLK is not set, fsdlm skips this lkb when recovering the
RSB's LVB, and so does not set DLM_SBF_VALNOTVALID as it should when a
node failure happens.
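
To make that rule concrete, here is a minimal user-space sketch of the
recovery decision.  It is not the fs/dlm code: the toy_ names, struct
layouts, flag value and mode constants are all made up for illustration.
It models only the behaviour described above, namely that an lkb
requested without DLM_LKF_VALBLK is ignored when deciding whether the
RSB's LVB must be marked invalid.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; names and values are illustrative, not the fs/dlm ones. */
enum toy_mode { TOY_NL = 0, TOY_PR = 3, TOY_EX = 5 };
#define TOY_LKF_VALBLK 0x1u

struct toy_lkb {
	enum toy_mode grmode;	/* granted mode of a surviving lock */
	unsigned int exflags;	/* flags from the last lock/convert request */
};

struct toy_rsb {
	bool valnotvalid;	/* models RSB_VALNOTVALID */
};

/*
 * Model of the recovery step: locks requested without VALBLK are skipped.
 * If no surviving lock carries VALBLK, the function returns early and the
 * LVB is never marked invalid.
 */
void toy_recover_lvb(struct toy_rsb *r, const struct toy_lkb *locks, size_t n)
{
	bool lvb_lock_seen = false;
	bool high_lock_seen = false;

	for (size_t i = 0; i < n; i++) {
		if (!(locks[i].exflags & TOY_LKF_VALBLK))
			continue;
		lvb_lock_seen = true;
		if (locks[i].grmode >= TOY_PR)
			high_lock_seen = true;
	}

	if (!lvb_lock_seen)
		return;			/* the missed chance to invalidate */
	if (!high_lock_seen)
		r->valnotvalid = true;	/* no lock >= PR is left */
}
```

In this model, a single surviving NL lock whose last request lacked the
flag leaves valnotvalid false, while the same lock with the flag set
causes it to become true.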

The following diagram briefly illustrates how this crash happens:

RSB1 is an inode metadata lock resource with LOCK_TYPE_USES_LVB.

The 1st round:

             Node1                                    Node2
RSB1: PR
                                                  RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
  ocfs2_dlm_lock(no DLM_LKF_VALBLK)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

dlm_lock(no DLM_LKF_VALBLK)
  convert_lock(overwrite lkb->lkb_exflags
               with no DLM_LKF_VALBLK)

RSB1: NULL                                        RSB1: EX
                                                  reset Node2
dlm_recover_rsbs()
  recover_lvb()

/* The LVB is not trustable if the node with EX fails and
 * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
 */

 if(!(lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
           return;                    * to invalidate the LVB here.
                                      */

The 2nd round:

         Node 1                                Node2
RSB1(become master from recovery)

ocfs2_setattr()
  ocfs2_inode_lock(NULL->EX)
    /* dlm_lock() returns the stale lvb without setting DLM_SBF_VALNOTVALID */
    ocfs2_meta_lvb_is_trustable() returns 1 /* so we don't refresh inode from disk */
  ocfs2_truncate_file()
      mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
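
The second round can be reduced to one decision, sketched below as a
small user-space model.  The toy_ names and the flag value are
hypothetical, and the real ocfs2_meta_lvb_is_trustable() is not
reproduced here; the sketch only captures the point that, without the
"not valid" status flag, the cached i_size from the LVB is used and the
on-disk value is never re-read.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_SBF_VALNOTVALID 0x2u	/* illustrative value only */

struct toy_lock_result {
	unsigned int sbflags;	/* status flags returned with the lock */
	uint64_t lvb_i_size;	/* i_size cached in the LVB */
};

/* Returns the i_size the in-memory inode ends up with after locking. */
uint64_t toy_inode_size_after_lock(const struct toy_lock_result *res,
				   uint64_t disk_i_size)
{
	bool trustable = !(res->sbflags & TOY_SBF_VALNOTVALID);

	/* A trusted LVB short-circuits the refresh from disk. */
	return trustable ? res->lvb_i_size : disk_i_size;
}
```

With the sizes from the log above (LVB i_size 732, disk i_size 937), an
unflagged result yields 732 and the mismatch that trips the BUG, while a
flagged result yields 937.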

The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK
flag for dlm_lock() if the lock resource type needs the LVB and the
fsdlm plugin is used.

Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/ocfs2/dlmglue.c   |   10 ++++++++++
 fs/ocfs2/stackglue.c |    6 ++++++
 fs/ocfs2/stackglue.h |    3 +++
 3 files changed, 19 insertions(+)

--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3321,6 +3321,16 @@ static int ocfs2_downconvert_lock(struct
 	mlog(ML_BASTS, "lockres %s, level %d => %d\n", lockres->l_name,
 	     lockres->l_level, new_level);
 
+	/*
+	 * On DLM_LKF_VALBLK, fsdlm behaves differently with o2cb. It always
+	 * expects DLM_LKF_VALBLK being set if the LKB has LVB, so that
+	 * we can recover correctly from node failure. Otherwise, we may get
+	 * invalid LVB in LKB, but without DLM_SBF_VALNOTVALID being set.
+	 */
+	if (!ocfs2_is_o2cb_active() &&
+	    lockres->l_ops->flags & LOCK_TYPE_USES_LVB)
+		lvb = 1;
+
 	if (lvb)
 		dlm_flags |= DLM_LKF_VALBLK;
 
--- a/fs/ocfs2/stackglue.c
+++ b/fs/ocfs2/stackglue.c
@@ -48,6 +48,12 @@ static char ocfs2_hb_ctl_path[OCFS2_MAX_
  */
 static struct ocfs2_stack_plugin *active_stack;
 
+inline int ocfs2_is_o2cb_active(void)
+{
+	return !strcmp(active_stack->sp_name, OCFS2_STACK_PLUGIN_O2CB);
+}
+EXPORT_SYMBOL_GPL(ocfs2_is_o2cb_active);
+
 static struct ocfs2_stack_plugin *ocfs2_stack_lookup(const char *name)
 {
 	struct ocfs2_stack_plugin *p;
--- a/fs/ocfs2/stackglue.h
+++ b/fs/ocfs2/stackglue.h
@@ -298,4 +298,7 @@ void ocfs2_stack_glue_set_max_proto_vers
 int ocfs2_stack_glue_register(struct ocfs2_stack_plugin *plugin);
 void ocfs2_stack_glue_unregister(struct ocfs2_stack_plugin *plugin);
 
+/* In ocfs2_downconvert_lock(), we need to know which stack we are using */
+int ocfs2_is_o2cb_active(void);
+
 #endif  /* STACKGLUE_H */


