[BUG] cgroup writeback list corruption

cgroups.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [BUG] cgroup writeback list corruption
@ 2016-03-02 22:26 Tahsin Erdogan
       [not found] ` <CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Tahsin Erdogan @ 2016-03-02 22:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse

Hi,

cgroup based writeback sometimes appears to manipulate inode->i_io_list
while holding the lock on the wrong bdi_writeback object.

Following is a crash I was able to produce by adding extra delays in
between list
pointer updates (repro patch attached).

[  116.595958] ------------[ cut here ]------------
[  116.596508] kernel BUG at lib/list_debug.c:74!
[  116.597000] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  116.597654] CPU: 3 PID: 940 Comm: kworker/u8:6 Not tainted 4.5.0-rc6+ #39
[  116.598397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  116.599290] Workqueue: writeback wb_workfn (flush-8:16)
[  116.599895] task: ffff88007c213700 ti: ffff88007c714000 task.ti:
ffff88007c714000
[  116.600006] RIP: 0010:[<ffffffff805a3393>]  [<ffffffff805a3393>]
__list_del_entry+0x53/0x60
[  116.600006] RSP: 0018:ffff88007c717c50  EFLAGS: 00010206
[  116.600006] RAX: dead000000000200 RBX: ffff88007c284818 RCX: ffff88007c717db0
[  116.600006] RDX: ffff8800765bbdd8 RSI: ffff88007c717c90 RDI: ffff8800768595d8
[  116.600006] RBP: ffff88007c717c60 R08: ffff8800768591d8 R09: 0000000000000000
[  116.600006] R10: 0000000000000004 R11: 0000000000000001 R12: ffff88007c284818
[  116.600006] R13: ffff88007c717c90 R14: ffff88007c284818 R15: ffff88007c717d28
[  116.600006] FS:  0000000000000000(0000) GS:ffff88007f980000(0000)
knlGS:0000000000000000
[  116.600006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  116.600006] CR2: 00007fff5822ff40 CR3: 0000000000e0a000 CR4: 00000000000006e0
[  116.600006] Stack:
[  116.600006]  ffff8800768595d8 ffff88007c46f000 ffff88007c717cc8
ffffffff8035e37f
[  116.600006]  ffff88007c284828 0000000000000082 000000000000026a
ffff88007c717cc0
[  116.600006]  ffff8800768111d8 ffff880076a309d8 ffff88007c284800
ffff88007c284828
[  116.600006] Call Trace:
[  116.600006]  [<ffffffff8035e37f>] move_expired_inodes+0x4f/0x180
[  116.600006]  [<ffffffff8035f0a1>] queue_io+0x61/0xb0
[  116.600006]  [<ffffffff80360813>] wb_writeback+0x1a3/0x1e0
[  116.600006]  [<ffffffff803609db>] wb_workfn+0x18b/0x280
[  116.600006]  [<ffffffff802795a8>] process_one_work+0x128/0x300
[  116.600006]  [<ffffffff802798a0>] worker_thread+0x120/0x480
[  116.600006]  [<ffffffff80279780>] ? process_one_work+0x300/0x300
[  116.600006]  [<ffffffff8027ea64>] kthread+0xc4/0xe0
[  116.600006]  [<ffffffff8032d1e8>] ? kfree+0xc8/0x100
[  116.600006]  [<ffffffff8027e9a0>] ? __kthread_parkme+0x70/0x70
[  116.600006]  [<ffffffff80970bdf>] ret_from_fork+0x3f/0x70
[  116.600006]  [<ffffffff8027e9a0>] ? __kthread_parkme+0x70/0x70
[  116.600006] Code: 39 c3 74 29 48 3b 3b 75 22 49 3b 7c 24 08 75 19
49 89 5c 24 08 e8 fe fd ff ff 4c 89 23 e8 f6 fd ff ff 5b 41 5c 5d c3
0f 0b 0f 0b <0f> 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48
89 fb
[  116.600006] RIP  [<ffffffff805a3393>] __list_del_entry+0x53/0x60
[  116.600006]  RSP <ffff88007c717c50>
[  116.623099] ---[ end trace ff07596b6e4f4928 ]---



I see a few places where a wrong bdi_writeback object may be locked while
manipulating inode->i_io_list links.

The first one is in writeback_single_inode():

1353        spin_lock(&wb->list_lock);
1354        spin_lock(&inode->i_lock);
1355        /*
1356         * If inode is clean, remove it from writeback lists.
Otherwise don't
1357         * touch it. See comment above for explanation.
1358         */
1359        if (!(inode->i_state & I_DIRTY_ALL))
1360                inode_io_list_del_locked(inode, wb);
1361        spin_unlock(&wb->list_lock);

The locked wb is passed in as a parameter and equals to inode_to_bdi(inode)->wb.
But this may not actually match inode->i_wb.


The second one  is in writeback_sb_inodes():

1499                __writeback_single_inode(inode, &wbc);
1500
1501                wbc_detach_inode(&wbc);
1502                work->nr_pages -= write_chunk - wbc.nr_to_write;
1503                wrote += write_chunk - wbc.nr_to_write;
1504
1505                if (need_resched()) {
1506                        /*
1507                         * We're trying to balance between
building up a nice
1508                         * long list of IOs to improve our merge rate, and
1509                         * getting those IOs out quickly for
anyone throttling
1510                         * in balance_dirty_pages().  cond_resched() doesn't
1511                         * unplug, so get our IOs out the door before we
1512                         * give up the CPU.
1513                         */
1514                        blk_flush_plug(current);
1515                        cond_resched();
1516                }
1517
1518
1519                spin_lock(&wb->list_lock);
1520                spin_lock(&inode->i_lock);
1521                if (!(inode->i_state & I_DIRTY_ALL))
1522                        wrote++;
1523                requeue_inode(inode, wb, &wbc);

After wbc_detach_inode() is called, inode's i_wb could have changed. So locking
the original wb seems wrong. The same issue exists in writeback_single_inode().


Repro patch:

---
 fs/fs-writeback.c    |  4 ++++
 include/linux/list.h |  6 ++++++
 lib/list_debug.c     | 29 ++++++++++++++++-------------
 repro.sh             | 31 +++++++++++++++++++++++++++++++
 4 files changed, 57 insertions(+), 13 deletions(-)
 create mode 100755 repro.sh

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1f76d89..bd1bd75 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1301,6 +1301,8 @@ __writeback_single_inode(struct inode *inode,
struct writeback_control *wbc)
  return ret;
 }

+atomic_t slow_down = ATOMIC_INIT(0);
+
 /*
  * Write out an inode's dirty pages. Either the caller has an active reference
  * on the inode or the inode has I_WILL_FREE set.
@@ -1315,6 +1317,7 @@ writeback_single_inode(struct inode *inode,
struct bdi_writeback *wb,
 {
  int ret = 0;

+ atomic_inc(&slow_down);
  spin_lock(&inode->i_lock);
  if (!atomic_read(&inode->i_count))
  WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
@@ -1362,6 +1365,7 @@ writeback_single_inode(struct inode *inode,
struct bdi_writeback *wb,
  inode_sync_complete(inode);
 out:
  spin_unlock(&inode->i_lock);
+ atomic_dec(&slow_down);
  return ret;
 }

diff --git a/include/linux/list.h b/include/linux/list.h
index 30cf420..e3b032b 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -6,6 +6,8 @@
 #include <linux/poison.h>
 #include <linux/const.h>
 #include <linux/kernel.h>
+#include <asm/atomic.h>
+#include <asm/processor.h>

 /*
  * Simple doubly linked list implementation.
@@ -77,6 +79,8 @@ static inline void list_add_tail(struct list_head
*new, struct list_head *head)
  __list_add(new, head->prev, head);
 }

+void do_delay(void);
+
 /*
  * Delete a list entry by making the prev/next entries
  * point to each other.
@@ -87,7 +91,9 @@ static inline void list_add_tail(struct list_head
*new, struct list_head *head)
 static inline void __list_del(struct list_head * prev, struct list_head * next)
 {
  next->prev = prev;
+ do_delay();
  WRITE_ONCE(prev->next, next);
+ do_delay();
 }

 /**
diff --git a/lib/list_debug.c b/lib/list_debug.c
index 3345a08..7195c4f 100644
--- a/lib/list_debug.c
+++ b/lib/list_debug.c
@@ -19,6 +19,14 @@ void list_force_poison(struct list_head *entry)
  entry->prev = &force_poison;
 }

+extern atomic_t slow_down;
+
+void do_delay() {
+ int i;
+ for (i=0; atomic_read(&slow_down) && i<100000; i++)
+ cpu_relax();
+}
+
 /*
  * Insert a new entry between two known consecutive entries.
  *
@@ -44,9 +52,13 @@ void __list_add(struct list_head *new,
      "list_add double add: new=%p, prev=%p, next=%p.\n",
      new, prev, next);
  next->prev = new;
+ do_delay();
  new->next = next;
+ do_delay();
  new->prev = prev;
+ do_delay();
  WRITE_ONCE(prev->next, new);
+ do_delay();
 }
 EXPORT_SYMBOL(__list_add);

@@ -57,19 +69,10 @@ void __list_del_entry(struct list_head *entry)
  prev = entry->prev;
  next = entry->next;

- if (WARN(next == LIST_POISON1,
- "list_del corruption, %p->next is LIST_POISON1 (%p)\n",
- entry, LIST_POISON1) ||
-    WARN(prev == LIST_POISON2,
- "list_del corruption, %p->prev is LIST_POISON2 (%p)\n",
- entry, LIST_POISON2) ||
-    WARN(prev->next != entry,
- "list_del corruption. prev->next should be %p, "
- "but was %p\n", entry, prev->next) ||
-    WARN(next->prev != entry,
- "list_del corruption. next->prev should be %p, "
- "but was %p\n", entry, next->prev))
- return;
+ BUG_ON(next == LIST_POISON1);
+ BUG_ON(prev == LIST_POISON2);
+ BUG_ON(prev->next != entry);
+ BUG_ON(next->prev != entry);

  __list_del(prev, next);
 }
diff --git a/repro.sh b/repro.sh
new file mode 100755
index 0000000..05de4c7
--- /dev/null
+++ b/repro.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+CGROUP_ROOT=/mnt-cgroup2
+
+mkdir -p $CGROUP_ROOT
+
+if ! mount | grep -qw cgroup2; then
+  mount -t cgroup2 none $CGROUP_ROOT
+fi
+
+mkdir -p $CGROUP_ROOT/mem1
+
+echo '+memory' > $CGROUP_ROOT/cgroup.subtree_control
+
+echo $$ > $CGROUP_ROOT/mem1/cgroup.procs
+
+if dumpe2fs -h /dev/sdb |grep -q 'Journal size'; then
+  echo 'This repro requires ext4 without journaling'
+  exit 1
+fi
+
+if ! mount | grep -qw /dev/sdb; then
+  mount /dev/sdb /mnt/sdb
+fi
+
+(for i in {1..10000}; do dd if=/dev/urandom of=/mnt/sdb/fsync1
bs=4096 count=1 conv=notrunc,fsync &> /dev/null; done)&
+
+(for i in {1..10000};do dd if=/dev/urandom of=/mnt/sdb/mark_dirty$i
bs=4096 count=1 conv=notrunc &> /dev/null;done)&
+
+wait
+
-- 
2.7.0.rc3.207.g0ac5344

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [BUG] cgroup writeback list corruption
       [not found] ` <CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-09 16:12   ` Tahsin Erdogan
       [not found]     ` <CAAeU0aPANgyD0dpQYwnuLioEymUzPgfU5JNG_QBz0fy_fn06ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-03-16 22:10   ` Tejun Heo
  2016-03-18 17:50   ` [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() Tejun Heo
  2 siblings, 1 reply; 10+ messages in thread
From: Tahsin Erdogan @ 2016-03-09 16:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse

Ping...

On Wed, Mar 2, 2016 at 2:26 PM, Tahsin Erdogan <tahsin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> Hi,
>
> cgroup based writeback sometimes appears to manipulate inode->i_io_list
> while holding the lock on the wrong bdi_writeback object.
>
> Following is a crash I was able to produce by adding extra delays in
> between list
> pointer updates (repro patch attached).
>
> [  116.595958] ------------[ cut here ]------------
> [  116.596508] kernel BUG at lib/list_debug.c:74!
> [  116.597000] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [  116.597654] CPU: 3 PID: 940 Comm: kworker/u8:6 Not tainted 4.5.0-rc6+ #39
> [  116.598397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  116.599290] Workqueue: writeback wb_workfn (flush-8:16)
> [  116.599895] task: ffff88007c213700 ti: ffff88007c714000 task.ti:
> ffff88007c714000
> [  116.600006] RIP: 0010:[<ffffffff805a3393>]  [<ffffffff805a3393>]
> __list_del_entry+0x53/0x60
> [  116.600006] RSP: 0018:ffff88007c717c50  EFLAGS: 00010206
> [  116.600006] RAX: dead000000000200 RBX: ffff88007c284818 RCX: ffff88007c717db0
> [  116.600006] RDX: ffff8800765bbdd8 RSI: ffff88007c717c90 RDI: ffff8800768595d8
> [  116.600006] RBP: ffff88007c717c60 R08: ffff8800768591d8 R09: 0000000000000000
> [  116.600006] R10: 0000000000000004 R11: 0000000000000001 R12: ffff88007c284818
> [  116.600006] R13: ffff88007c717c90 R14: ffff88007c284818 R15: ffff88007c717d28
> [  116.600006] FS:  0000000000000000(0000) GS:ffff88007f980000(0000)
> knlGS:0000000000000000
> [  116.600006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  116.600006] CR2: 00007fff5822ff40 CR3: 0000000000e0a000 CR4: 00000000000006e0
> [  116.600006] Stack:
> [  116.600006]  ffff8800768595d8 ffff88007c46f000 ffff88007c717cc8
> ffffffff8035e37f
> [  116.600006]  ffff88007c284828 0000000000000082 000000000000026a
> ffff88007c717cc0
> [  116.600006]  ffff8800768111d8 ffff880076a309d8 ffff88007c284800
> ffff88007c284828
> [  116.600006] Call Trace:
> [  116.600006]  [<ffffffff8035e37f>] move_expired_inodes+0x4f/0x180
> [  116.600006]  [<ffffffff8035f0a1>] queue_io+0x61/0xb0
> [  116.600006]  [<ffffffff80360813>] wb_writeback+0x1a3/0x1e0
> [  116.600006]  [<ffffffff803609db>] wb_workfn+0x18b/0x280
> [  116.600006]  [<ffffffff802795a8>] process_one_work+0x128/0x300
> [  116.600006]  [<ffffffff802798a0>] worker_thread+0x120/0x480
> [  116.600006]  [<ffffffff80279780>] ? process_one_work+0x300/0x300
> [  116.600006]  [<ffffffff8027ea64>] kthread+0xc4/0xe0
> [  116.600006]  [<ffffffff8032d1e8>] ? kfree+0xc8/0x100
> [  116.600006]  [<ffffffff8027e9a0>] ? __kthread_parkme+0x70/0x70
> [  116.600006]  [<ffffffff80970bdf>] ret_from_fork+0x3f/0x70
> [  116.600006]  [<ffffffff8027e9a0>] ? __kthread_parkme+0x70/0x70
> [  116.600006] Code: 39 c3 74 29 48 3b 3b 75 22 49 3b 7c 24 08 75 19
> 49 89 5c 24 08 e8 fe fd ff ff 4c 89 23 e8 f6 fd ff ff 5b 41 5c 5d c3
> 0f 0b 0f 0b <0f> 0b 0f 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48
> 89 fb
> [  116.600006] RIP  [<ffffffff805a3393>] __list_del_entry+0x53/0x60
> [  116.600006]  RSP <ffff88007c717c50>
> [  116.623099] ---[ end trace ff07596b6e4f4928 ]---
>
>
>
> I see a few places where a wrong bdi_writeback object may be locked while
> manipulating inode->i_io_list links.
>
> The first one is in writeback_single_inode():
>
> 1353        spin_lock(&wb->list_lock);
> 1354        spin_lock(&inode->i_lock);
> 1355        /*
> 1356         * If inode is clean, remove it from writeback lists.
> Otherwise don't
> 1357         * touch it. See comment above for explanation.
> 1358         */
> 1359        if (!(inode->i_state & I_DIRTY_ALL))
> 1360                inode_io_list_del_locked(inode, wb);
> 1361        spin_unlock(&wb->list_lock);
>
> The locked wb is passed in as a parameter and equals to inode_to_bdi(inode)->wb.
> But this may not actually match inode->i_wb.
>
>
> The second one  is in writeback_sb_inodes():
>
> 1499                __writeback_single_inode(inode, &wbc);
> 1500
> 1501                wbc_detach_inode(&wbc);
> 1502                work->nr_pages -= write_chunk - wbc.nr_to_write;
> 1503                wrote += write_chunk - wbc.nr_to_write;
> 1504
> 1505                if (need_resched()) {
> 1506                        /*
> 1507                         * We're trying to balance between
> building up a nice
> 1508                         * long list of IOs to improve our merge rate, and
> 1509                         * getting those IOs out quickly for
> anyone throttling
> 1510                         * in balance_dirty_pages().  cond_resched() doesn't
> 1511                         * unplug, so get our IOs out the door before we
> 1512                         * give up the CPU.
> 1513                         */
> 1514                        blk_flush_plug(current);
> 1515                        cond_resched();
> 1516                }
> 1517
> 1518
> 1519                spin_lock(&wb->list_lock);
> 1520                spin_lock(&inode->i_lock);
> 1521                if (!(inode->i_state & I_DIRTY_ALL))
> 1522                        wrote++;
> 1523                requeue_inode(inode, wb, &wbc);
>
> After wbc_detach_inode() is called, inode's i_wb could have changed. So locking
> the original wb seems wrong. The same issue exists in writeback_single_inode().
>
>
> Repro patch:
>
> ---
>  fs/fs-writeback.c    |  4 ++++
>  include/linux/list.h |  6 ++++++
>  lib/list_debug.c     | 29 ++++++++++++++++-------------
>  repro.sh             | 31 +++++++++++++++++++++++++++++++
>  4 files changed, 57 insertions(+), 13 deletions(-)
>  create mode 100755 repro.sh
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1f76d89..bd1bd75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1301,6 +1301,8 @@ __writeback_single_inode(struct inode *inode,
> struct writeback_control *wbc)
>   return ret;
>  }
>
> +atomic_t slow_down = ATOMIC_INIT(0);
> +
>  /*
>   * Write out an inode's dirty pages. Either the caller has an active reference
>   * on the inode or the inode has I_WILL_FREE set.
> @@ -1315,6 +1317,7 @@ writeback_single_inode(struct inode *inode,
> struct bdi_writeback *wb,
>  {
>   int ret = 0;
>
> + atomic_inc(&slow_down);
>   spin_lock(&inode->i_lock);
>   if (!atomic_read(&inode->i_count))
>   WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
> @@ -1362,6 +1365,7 @@ writeback_single_inode(struct inode *inode,
> struct bdi_writeback *wb,
>   inode_sync_complete(inode);
>  out:
>   spin_unlock(&inode->i_lock);
> + atomic_dec(&slow_down);
>   return ret;
>  }
>
> diff --git a/include/linux/list.h b/include/linux/list.h
> index 30cf420..e3b032b 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -6,6 +6,8 @@
>  #include <linux/poison.h>
>  #include <linux/const.h>
>  #include <linux/kernel.h>
> +#include <asm/atomic.h>
> +#include <asm/processor.h>
>
>  /*
>   * Simple doubly linked list implementation.
> @@ -77,6 +79,8 @@ static inline void list_add_tail(struct list_head
> *new, struct list_head *head)
>   __list_add(new, head->prev, head);
>  }
>
> +void do_delay(void);
> +
>  /*
>   * Delete a list entry by making the prev/next entries
>   * point to each other.
> @@ -87,7 +91,9 @@ static inline void list_add_tail(struct list_head
> *new, struct list_head *head)
>  static inline void __list_del(struct list_head * prev, struct list_head * next)
>  {
>   next->prev = prev;
> + do_delay();
>   WRITE_ONCE(prev->next, next);
> + do_delay();
>  }
>
>  /**
> diff --git a/lib/list_debug.c b/lib/list_debug.c
> index 3345a08..7195c4f 100644
> --- a/lib/list_debug.c
> +++ b/lib/list_debug.c
> @@ -19,6 +19,14 @@ void list_force_poison(struct list_head *entry)
>   entry->prev = &force_poison;
>  }
>
> +extern atomic_t slow_down;
> +
> +void do_delay() {
> + int i;
> + for (i=0; atomic_read(&slow_down) && i<100000; i++)
> + cpu_relax();
> +}
> +
>  /*
>   * Insert a new entry between two known consecutive entries.
>   *
> @@ -44,9 +52,13 @@ void __list_add(struct list_head *new,
>       "list_add double add: new=%p, prev=%p, next=%p.\n",
>       new, prev, next);
>   next->prev = new;
> + do_delay();
>   new->next = next;
> + do_delay();
>   new->prev = prev;
> + do_delay();
>   WRITE_ONCE(prev->next, new);
> + do_delay();
>  }
>  EXPORT_SYMBOL(__list_add);
>
> @@ -57,19 +69,10 @@ void __list_del_entry(struct list_head *entry)
>   prev = entry->prev;
>   next = entry->next;
>
> - if (WARN(next == LIST_POISON1,
> - "list_del corruption, %p->next is LIST_POISON1 (%p)\n",
> - entry, LIST_POISON1) ||
> -    WARN(prev == LIST_POISON2,
> - "list_del corruption, %p->prev is LIST_POISON2 (%p)\n",
> - entry, LIST_POISON2) ||
> -    WARN(prev->next != entry,
> - "list_del corruption. prev->next should be %p, "
> - "but was %p\n", entry, prev->next) ||
> -    WARN(next->prev != entry,
> - "list_del corruption. next->prev should be %p, "
> - "but was %p\n", entry, next->prev))
> - return;
> + BUG_ON(next == LIST_POISON1);
> + BUG_ON(prev == LIST_POISON2);
> + BUG_ON(prev->next != entry);
> + BUG_ON(next->prev != entry);
>
>   __list_del(prev, next);
>  }
> diff --git a/repro.sh b/repro.sh
> new file mode 100755
> index 0000000..05de4c7
> --- /dev/null
> +++ b/repro.sh
> @@ -0,0 +1,31 @@
> +#!/bin/bash
> +
> +CGROUP_ROOT=/mnt-cgroup2
> +
> +mkdir -p $CGROUP_ROOT
> +
> +if ! mount | grep -qw cgroup2; then
> +  mount -t cgroup2 none $CGROUP_ROOT
> +fi
> +
> +mkdir -p $CGROUP_ROOT/mem1
> +
> +echo '+memory' > $CGROUP_ROOT/cgroup.subtree_control
> +
> +echo $$ > $CGROUP_ROOT/mem1/cgroup.procs
> +
> +if dumpe2fs -h /dev/sdb |grep -q 'Journal size'; then
> +  echo 'This repro requires ext4 without journaling'
> +  exit 1
> +fi
> +
> +if ! mount | grep -qw /dev/sdb; then
> +  mount /dev/sdb /mnt/sdb
> +fi
> +
> +(for i in {1..10000}; do dd if=/dev/urandom of=/mnt/sdb/fsync1
> bs=4096 count=1 conv=notrunc,fsync &> /dev/null; done)&
> +
> +(for i in {1..10000};do dd if=/dev/urandom of=/mnt/sdb/mark_dirty$i
> bs=4096 count=1 conv=notrunc &> /dev/null;done)&
> +
> +wait
> +
> --
> 2.7.0.rc3.207.g0ac5344

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] cgroup writeback list corruption
       [not found]     ` <CAAeU0aPANgyD0dpQYwnuLioEymUzPgfU5JNG_QBz0fy_fn06ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-03-11 19:46       ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2016-03-11 19:46 UTC (permalink / raw)
  To: Tahsin Erdogan; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse

Hello, Tahsin.

On Wed, Mar 09, 2016 at 08:12:47AM -0800, Tahsin Erdogan wrote:
> Ping...

Sorry about the delay.  Will get to it early next week.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] cgroup writeback list corruption
       [not found] ` <CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-03-09 16:12   ` Tahsin Erdogan
@ 2016-03-16 22:10   ` Tejun Heo
       [not found]     ` <20160316221015.GL21104-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2016-03-18 17:50   ` [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() Tejun Heo
  2 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2016-03-16 22:10 UTC (permalink / raw)
  To: Tahsin Erdogan; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse

Hello, Tahsin.

On Wed, Mar 02, 2016 at 02:26:45PM -0800, Tahsin Erdogan wrote:
> I see a few places where a wrong bdi_writeback object may be locked while
> manipulating inode->i_io_list links.
> 
> The first one is in writeback_single_inode():
> 
> 1353        spin_lock(&wb->list_lock);
> 1354        spin_lock(&inode->i_lock);
> 1355        /*
> 1356         * If inode is clean, remove it from writeback lists.
> Otherwise don't
> 1357         * touch it. See comment above for explanation.
> 1358         */
> 1359        if (!(inode->i_state & I_DIRTY_ALL))
> 1360                inode_io_list_del_locked(inode, wb);
> 1361        spin_unlock(&wb->list_lock);
> 
> The locked wb is passed in as a parameter and equals to inode_to_bdi(inode)->wb.
> But this may not actually match inode->i_wb.

Oops, yeah, that's a stupid bug.  Thanks for spotting it.  We probably
should drop @wb parameter from the function and grab the associated wb
there.

> The second one  is in writeback_sb_inodes():
> 
> 1499                __writeback_single_inode(inode, &wbc);
> 1500
> 1501                wbc_detach_inode(&wbc);
> 1502                work->nr_pages -= write_chunk - wbc.nr_to_write;
> 1503                wrote += write_chunk - wbc.nr_to_write;
> 1504
> 1505                if (need_resched()) {
> 1506                        /*
> 1507                         * We're trying to balance between
> building up a nice
> 1508                         * long list of IOs to improve our merge rate, and
> 1509                         * getting those IOs out quickly for
> anyone throttling
> 1510                         * in balance_dirty_pages().  cond_resched() doesn't
> 1511                         * unplug, so get our IOs out the door before we
> 1512                         * give up the CPU.
> 1513                         */
> 1514                        blk_flush_plug(current);
> 1515                        cond_resched();
> 1516                }
> 1517
> 1518
> 1519                spin_lock(&wb->list_lock);
> 1520                spin_lock(&inode->i_lock);
> 1521                if (!(inode->i_state & I_DIRTY_ALL))
> 1522                        wrote++;
> 1523                requeue_inode(inode, wb, &wbc);
> 
> After wbc_detach_inode() is called, inode's i_wb could have changed. So locking
> the original wb seems wrong. The same issue exists in writeback_single_inode().

Yes, I'll add a helper to grab the current wb and convert these sites
to use it.  Thanks a lot for spotting these!

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] cgroup writeback list corruption
       [not found]     ` <20160316221015.GL21104-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-03-17  1:13       ` Tejun Heo
       [not found]         ` <20160317011350.GN21104-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2016-03-17  1:13 UTC (permalink / raw)
  To: Tahsin Erdogan; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse

Hello, Tahsin.

FYI, this is the preliminary patch I'm trying to test.  I'm running
the repro case w/o the patch but the bug hasn't triggered yet.

Thanks.

Index: work/include/linux/backing-dev.h
===================================================================
--- work.orig/include/linux/backing-dev.h
+++ work/include/linux/backing-dev.h
@@ -362,6 +362,26 @@ static inline struct bdi_writeback *inod
 	return inode->i_wb;
 }
 
+static inline struct bdi_writeback *
+unlocked_inode_to_wb_lock_list(struct inode *inode)
+{
+	struct bdi_writeback *wb;
+
+	rcu_read_lock();
+
+	while (true) {
+		wb = inode->i_wb;
+		spin_lock(&wb->list_lock);
+		if (likely(wb == inode->i_wb))
+			break;
+		spin_unlock(&wb->list_lock);
+		cpu_relax();
+	}
+
+	rcu_read_unlock();
+	return wb;
+}
+
 /**
  * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction
  * @inode: target inode
@@ -374,8 +394,7 @@ static inline struct bdi_writeback *inod
  * unlocked_inode_to_wb_end().
  *
  * The caller must call unlocked_inode_to_wb_end() with *@lockdep
- * afterwards and can't sleep during transaction.  IRQ may or may not be
- * disabled on return.
+ * afterwards and can't sleep during transaction.
  */
 static inline struct bdi_writeback *
 unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
@@ -388,14 +407,16 @@ unlocked_inode_to_wb_begin(struct inode
 	 */
 	*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
 
-	if (unlikely(*lockedp))
-		spin_lock_irq(&inode->i_mapping->tree_lock);
-
 	/*
-	 * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
+	 * Protected by either !I_WB_SWITCH + rcu_read_lock(),
 	 * inode_to_wb() will bark.  Deref directly.
 	 */
-	return inode->i_wb;
+	if (likely(!*lockedp))
+		return inode->i_wb;
+
+	rcu_read_unlock();
+
+	return unlocked_inode_to_wb_lock_list(inode);
 }
 
 /**
@@ -405,10 +426,10 @@ unlocked_inode_to_wb_begin(struct inode
  */
 static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
 {
-	if (unlikely(locked))
-		spin_unlock_irq(&inode->i_mapping->tree_lock);
-
-	rcu_read_unlock();
+	if (likely(!locked))
+		rcu_read_unlock();
+	else
+		spin_unlock(&inode->i_wb->list_lock);
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
@@ -453,6 +474,15 @@ static inline struct bdi_writeback *inod
 }
 
 static inline struct bdi_writeback *
+unlocked_inode_to_wb_lock_list(struct inode *inode)
+{
+	struct bdi_writeback *wb = inode_to_wb(inode);
+
+	spin_lock(&wb->list_lock);
+	return wb;
+}
+
+static inline struct bdi_writeback *
 unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
 {
 	return inode_to_wb(inode);
Index: work/fs/fs-writeback.c
===================================================================
--- work.orig/fs/fs-writeback.c
+++ work/fs/fs-writeback.c
@@ -1309,10 +1309,10 @@ __writeback_single_inode(struct inode *i
  * we go e.g. from filesystem. Flusher thread uses __writeback_single_inode()
  * and does more profound writeback list handling in writeback_sb_inodes().
  */
-static int
-writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
-		       struct writeback_control *wbc)
+static int writeback_single_inode(struct inode *inode,
+				  struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb;
 	int ret = 0;
 
 	spin_lock(&inode->i_lock);
@@ -1350,7 +1350,8 @@ writeback_single_inode(struct inode *ino
 	ret = __writeback_single_inode(inode, wbc);
 
 	wbc_detach_inode(wbc);
-	spin_lock(&wb->list_lock);
+
+	wb = unlocked_inode_to_wb_lock_list(inode);
 	spin_lock(&inode->i_lock);
 	/*
 	 * If inode is clean, remove it from writeback lists. Otherwise don't
@@ -1515,8 +1516,7 @@ static long writeback_sb_inodes(struct s
 			cond_resched();
 		}
 
-
-		spin_lock(&wb->list_lock);
+		wb = unlocked_inode_to_wb_lock_list(inode);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
 			wrote++;
@@ -2310,7 +2310,6 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
@@ -2322,7 +2321,7 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	return writeback_single_inode(inode, wb, &wbc);
+	return writeback_single_inode(inode, &wbc);
 }
 EXPORT_SYMBOL(write_inode_now);
 
@@ -2339,7 +2338,7 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return writeback_single_inode(inode, &inode_to_bdi(inode)->wb, wbc);
+	return writeback_single_inode(inode, wbc);
 }
 EXPORT_SYMBOL(sync_inode);
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] cgroup writeback list corruption
       [not found]         ` <20160317011350.GN21104-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-03-17  2:36           ` Tahsin Erdogan
  0 siblings, 0 replies; 10+ messages in thread
From: Tahsin Erdogan @ 2016-03-17  2:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse

On Wed, Mar 16, 2016 at 6:13 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Tahsin.
>
> FYI, this is the preliminary patch I'm trying to test.  I'm running
> the repro case w/o the patch but the bug hasn't triggered yet.

Please let me know if you want me to send the repro config. I don't have
it available at the moment but I can reconstruct it if you want.

thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
       [not found] ` <CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2016-03-09 16:12   ` Tahsin Erdogan
  2016-03-16 22:10   ` Tejun Heo
@ 2016-03-18 17:50   ` Tejun Heo
       [not found]     ` <20160318175003.GA20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2016-03-18 17:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse, Tahsin Erdogan

locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
the target inode, unlocks inode, locks the wb's list_lock and verifies
that the inode is still associated with the wb.  To prevent the wb
going away between dropping inode lock and acquiring list_lock, the wb
is pinned while inode lock is held.  The wb reference is put right
after acquiring list_lock citing that the wb won't be dereferenced
anymore.

This isn't true.  If the inode is still associated with the wb, the
inode has reference and it's safe to return the wb; however, if inode
has been switched, the wb still needs to be unlocked which is a
dereference and can lead to use-after-free if it it races with wb
destruction.

Fix it by putting the reference after releasing list_lock.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Fixes: 87e1d789bf55 ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # v4.2+
Cc: Tahsin Erdogan <tahsin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
Hello,

Jens, two fixes for tricky brekages in cgroup writeback which can lead
lead to oops.  Both shouldn't affect !cgroup case at all.

Thanks.

 fs/fs-writeback.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -278,13 +278,15 @@ locked_inode_to_wb_and_lock_list(struct
 		wb_get(wb);
 		spin_unlock(&inode->i_lock);
 		spin_lock(&wb->list_lock);
-		wb_put(wb);		/* not gonna deref it anymore */

 		/* i_wb may have changed inbetween, can't use inode_to_wb() */
-		if (likely(wb == inode->i_wb))
-			return wb;	/* @inode already has ref */
+		if (likely(wb == inode->i_wb)) {
+			wb_put(wb);	/* @inode already has ref */
+			return wb;
+		}

 		spin_unlock(&wb->list_lock);
+		wb_put(wb);
 		cpu_relax();
 		spin_lock(&inode->i_lock);
 	}

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH for-linus 2/2] writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
       [not found]     ` <20160318175003.GA20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-03-18 17:52       ` Tejun Heo
       [not found]         ` <20160318175204.GB20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2016-03-20 15:44       ` [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() Jens Axboe
  1 sibling, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2016-03-18 17:52 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse, Tahsin Erdogan

When cgroup writeback is in use, there can be multiple wb's
(bdi_writeback's) per bdi and an inode may switch among them
dynamically.  In a couple places, the wrong wb was used leading to
performing operations on the wrong list under the wrong lock
corrupting the io lists.

* writeback_single_inode() was taking @wb parameter and used it to
  remove the inode from io lists if it becomes clean after writeback.
  The callers of this function were always passing in the root wb
  regardless of the actual wb that the inode was associated with,
  which could also change while writeback is in progress.

  Fix it by dropping the @wb parameter and using
  inode_to_wb_and_lock_list() to determine and lock the associated wb.

* After writeback_sb_inodes() writes out an inode, it re-locks @wb and
  inode to remove it from or move it to the right io list.  It assumes
  that the inode is still associated with @wb; however, the inode may
  have switched to another wb while writeback was in progress.

  Fix it by using inode_to_wb_and_lock_list() to determine and lock
  the associated wb after writeback is complete.  As the function
  requires the original @wb->list_lock locked for the next iteration,
  in the unlikely case where the inode has changed association, switch
  the locks.

Kudos to Tahsin for pinpointing these subtle breakages.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org
Reported-and-diagnosed-by: Tahsin Erdogan <tahsin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # v4.2+
---
Hello,

Tahsin, I could reproduce the bug and this fixes it for me but it'd be
great if you can verify the fix too.

Thanks a lot!

 fs/fs-writeback.c |   29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1311,10 +1311,10 @@ __writeback_single_inode(struct inode *i
  * we go e.g. from filesystem. Flusher thread uses __writeback_single_inode()
  * and does more profound writeback list handling in writeback_sb_inodes().
  */
-static int
-writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
-		       struct writeback_control *wbc)
+static int writeback_single_inode(struct inode *inode,
+				  struct writeback_control *wbc)
 {
+	struct bdi_writeback *wb;
 	int ret = 0;
 
 	spin_lock(&inode->i_lock);
@@ -1352,7 +1352,8 @@ writeback_single_inode(struct inode *ino
 	ret = __writeback_single_inode(inode, wbc);
 
 	wbc_detach_inode(wbc);
-	spin_lock(&wb->list_lock);
+
+	wb = inode_to_wb_and_lock_list(inode);
 	spin_lock(&inode->i_lock);
 	/*
 	 * If inode is clean, remove it from writeback lists. Otherwise don't
@@ -1427,6 +1428,7 @@ static long writeback_sb_inodes(struct s
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
+		struct bdi_writeback *tmp_wb;
 
 		if (inode->i_sb != sb) {
 			if (work->sb) {
@@ -1517,15 +1519,23 @@ static long writeback_sb_inodes(struct s
 			cond_resched();
 		}
 
-
-		spin_lock(&wb->list_lock);
+		/*
+		 * Requeue @inode if still dirty.  Be careful as @inode may
+		 * have been switched to another wb in the meantime.
+		 */
+		tmp_wb = inode_to_wb_and_lock_list(inode);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
 			wrote++;
-		requeue_inode(inode, wb, &wbc);
+		requeue_inode(inode, tmp_wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
 
+		if (unlikely(tmp_wb != wb)) {
+			spin_unlock(&tmp_wb->list_lock);
+			spin_lock(&wb->list_lock);
+		}
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
@@ -2312,7 +2322,6 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
@@ -2324,7 +2333,7 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	return writeback_single_inode(inode, wb, &wbc);
+	return writeback_single_inode(inode, &wbc);
 }
 EXPORT_SYMBOL(write_inode_now);
 
@@ -2341,7 +2350,7 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return writeback_single_inode(inode, &inode_to_bdi(inode)->wb, wbc);
+	return writeback_single_inode(inode, wbc);
 }
 EXPORT_SYMBOL(sync_inode);
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-linus 2/2] writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
       [not found]         ` <20160318175204.GB20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2016-03-18 23:33           ` Tahsin Erdogan
  0 siblings, 0 replies; 10+ messages in thread
From: Tahsin Erdogan @ 2016-03-18 23:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Jan Kara, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Michel Lespinasse

On Fri, Mar 18, 2016 at 10:52 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Tahsin, I could reproduce the bug and this fixes it for me but it'd be
> great if you can verify the fix too.
>

Hi Tejun,
I verified that your patch fixes the problem.

Thanks a lot!

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
       [not found]     ` <20160318175003.GA20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  2016-03-18 17:52       ` [PATCH for-linus 2/2] writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode Tejun Heo
@ 2016-03-20 15:44       ` Jens Axboe
  1 sibling, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-03-20 15:44 UTC (permalink / raw)
  To: Tejun Heo, Jan Kara
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Michel Lespinasse, Tahsin Erdogan

On 03/18/2016 11:50 AM, Tejun Heo wrote:
> locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
> the target inode, unlocks inode, locks the wb's list_lock and verifies
> that the inode is still associated with the wb.  To prevent the wb
> going away between dropping inode lock and acquiring list_lock, the wb
> is pinned while inode lock is held.  The wb reference is put right
> after acquiring list_lock citing that the wb won't be dereferenced
> anymore.
>
> This isn't true.  If the inode is still associated with the wb, the
> inode has reference and it's safe to return the wb; however, if inode
> has been switched, the wb still needs to be unlocked which is a
> dereference and can lead to use-after-free if it it races with wb
> destruction.
>
> Fix it by putting the reference after releasing list_lock.

Applied for current series, thanks Tejun.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-03-20 15:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-02 22:26 [BUG] cgroup writeback list corruption Tahsin Erdogan
     [not found] ` <CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-09 16:12   ` Tahsin Erdogan
     [not found]     ` <CAAeU0aPANgyD0dpQYwnuLioEymUzPgfU5JNG_QBz0fy_fn06ng-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-03-11 19:46       ` Tejun Heo
2016-03-16 22:10   ` Tejun Heo
     [not found]     ` <20160316221015.GL21104-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-03-17  1:13       ` Tejun Heo
     [not found]         ` <20160317011350.GN21104-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-03-17  2:36           ` Tahsin Erdogan
2016-03-18 17:50   ` [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() Tejun Heo
     [not found]     ` <20160318175003.GA20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-03-18 17:52       ` [PATCH for-linus 2/2] writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode Tejun Heo
     [not found]         ` <20160318175204.GB20028-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2016-03-18 23:33           ` Tahsin Erdogan
2016-03-20 15:44       ` [PATCH for-linus 1/2] writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).