linux-mm.kvack.org archive mirror
* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
       [not found]                 ` <20200925011311.GJ482521@mit.edu>
@ 2020-09-25  7:31                   ` Ming Lei
  2020-09-25 16:19                     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2020-09-25  7:31 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jens Axboe, linux-ext4, linux-kernel@vger.kernel.org, linux-block,
	Linus Torvalds, linux-mm

On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > that one of the stack traces had virtio balloon in the trace, I
> > realized that when I switched the GCE VM type from e1-standard-2 to
> > n1-standard-2 (where e1 VM's are cheaper because they use
> > virtio-balloon to better manage host OS memory utilization), the problem
> > has become much, *much* rarer (and possibly has gone away, although
> > I'm going to want to run a lot more tests before I say that
> > conclusively) on my test setup.  At the very least, using an n1 VM
> > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > enough to unblock ext4 development.
> 
> .... and I spoke too soon.  A number of runs using -rc6 are now
> failing even with the n1-standard-2 VM, so virtio-balloon may not be an
> indicator.
> 
> This is why debugging this is frustrating; it is very much a heisenbug
> --- although 5.8 seems to work completely reliably, as do commits
> before 37f4a24c2469.  Anything after that point will show random
> failures.  :-(

It does not make sense to blame 37f4a24c2469, which was reverted in
4e2f62e566b5. The patch in 37f4a24c2469 was later fixed and re-committed
as 568f27006577.

However, I can _not_ reproduce the issue by running the same test on a
kernel built directly from 568f27006577.

Also, you have confirmed that the issue is still there after reverting
568f27006577 on top of v5.9-rc4.

So the real issue (the slab list corruption) must have been introduced
somewhere between 568f27006577 and v5.9-rc4.


thanks,
Ming



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25  7:31                   ` REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag Ming Lei
@ 2020-09-25 16:19                     ` Ming Lei
  2020-09-25 16:32                       ` Shakeel Butt
  2020-09-25 17:17                       ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Ming Lei @ 2020-09-25 16:19 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jens Axboe, linux-ext4, linux-kernel@vger.kernel.org, linux-block,
	Linus Torvalds, linux-mm, Roman Gushchin, Andrew Morton,
	Johannes Weiner, Vlastimil Babka, Shakeel Butt

On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > that one of the stack traces had virtio balloon in the trace, I
> > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > virtio-balloon to better manage host OS memory utilization), problem
> > > has become, much, *much* rarer (and possibly has gone away, although
> > > I'm going to want to run a lot more tests before I say that
> > > conclusively) on my test setup.  At the very least, using an n1 VM
> > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > enough to unblock ext4 development.
> > 
> > .... and I spoke too soon.  A number of runs using -rc6 are now
> > failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> > indicator.
> > 
> > This is why debugging this is frustrating; it is very much a heisenbug
> > --- although 5.8 seems to work completely reliably, as does commits
> > before 37f4a24c2469.  Anything after that point will show random
> > failures.  :-(
> 
> It does not make sense to mention 37f4a24c2469, which is reverted in
> 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited
> as 568f27006577.
> 
> However, I can _not_ reproduce the issue by running the same test on
> kernel built from 568f27006577 directly.
> 
> Also you have confirmed that the issue can't be fixed after reverting
> 568f27006577 against v5.9-rc4.
> 
> Looks the real issue(slab list corruption) should be introduced between
> 568f27006577 and v5.9-rc4.

git bisect shows the first bad commit:

	[10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
		kmem_caches for all allocations

And I have double-checked that the above commit really is the first bad
commit for the list corruption issue ('list_del corruption, ffffe1c241b00408->next
is LIST_POISON1 (dead000000000100)'); see the detailed stack trace and
kernel oops log in the following link:

	https://lore.kernel.org/lkml/20200916202026.GC38283@mit.edu/

And the kernel config is the one (without KASAN) used by Theodore in his GCE
VM; see the following link:

	https://lore.kernel.org/lkml/20200917143012.GF38283@mit.edu/

The reproducer is xfstests generic/038. In my setup, the test device is
virtio-scsi and the scratch device is virtio-blk.


[1] git bisect log:

git bisect start
# good: [568f2700657794b8258e72983ba508793a658942] blk-mq: centralise related handling into blk_mq_get_driver_tag
git bisect good 568f2700657794b8258e72983ba508793a658942
# bad: [f4d51dffc6c01a9e94650d95ce0104964f8ae822] Linux 5.9-rc4
git bisect bad f4d51dffc6c01a9e94650d95ce0104964f8ae822
# good: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
git bisect good 8186749621ed6b8fc42644c399e8c755a2b6f630
# good: [60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e] Merge tag 'm68knommu-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
git bisect good 60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e
# bad: [57b077939287835b9396a1c3b40d35609cf2fcb8] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
git bisect bad 57b077939287835b9396a1c3b40d35609cf2fcb8
# bad: [0f43283be7fec4a76cd4ed50dc37db30719bde05] Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect bad 0f43283be7fec4a76cd4ed50dc37db30719bde05
# good: [5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7] Merge tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect good 5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7
# good: [e3083c3f369700cd1eec3de93b8d8ec0918ff778] media: cafe-driver: use generic power management
git bisect good e3083c3f369700cd1eec3de93b8d8ec0918ff778
# bad: [42742d9bde2a8e11ec932cb5821f720a40a7c2a9] mm: thp: replace HTTP links with HTTPS ones
git bisect bad 42742d9bde2a8e11ec932cb5821f720a40a7c2a9
# bad: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations
git bisect bad 10befea91b61c4e2c2d1df06a2e978d182fcf792
# good: [cfbe1636c3585c1e032bfac512fb8be903fbc913] mm, kcsan: instrument SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS"
git bisect good cfbe1636c3585c1e032bfac512fb8be903fbc913
# good: [0f190a7ab78878f9e6c6930fc0f5f92c1250b57d] mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io
git bisect good 0f190a7ab78878f9e6c6930fc0f5f92c1250b57d
# good: [286e04b8ed7a04279ae277f0f024430246ea5eec] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
git bisect good 286e04b8ed7a04279ae277f0f024430246ea5eec
# good: [9855609bde03e2472b99a95e869d29ee1e78a751] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
git bisect good 9855609bde03e2472b99a95e869d29ee1e78a751
# good: [272911a4ad18c48f8bc449a5db945a54987dd687] mm: memcg/slab: remove memcg_kmem_get_cache()
git bisect good 272911a4ad18c48f8bc449a5db945a54987dd687
# good: [15999eef7f25e2ea6a1c33f026166f472c5714e9] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
git bisect good 15999eef7f25e2ea6a1c33f026166f472c5714e9
# first bad commit: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations



Thanks,
Ming



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 16:19                     ` Ming Lei
@ 2020-09-25 16:32                       ` Shakeel Butt
  2020-09-25 16:47                         ` Shakeel Butt
  2020-09-25 17:17                       ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 16:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Theodore Y. Ts'o, Jens Axboe, linux-ext4,
	linux-kernel@vger.kernel.org, open list:BLOCK LAYER,
	Linus Torvalds, Linux MM, Roman Gushchin, Andrew Morton,
	Johannes Weiner, Vlastimil Babka

On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > that one of the stack traces had virtio balloon in the trace, I
> > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > virtio-balloon to better manage host OS memory utilization), problem
> > > > has become, much, *much* rarer (and possibly has gone away, although
> > > > I'm going to want to run a lot more tests before I say that
> > > > conclusively) on my test setup.  At the very least, using an n1 VM
> > > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > > enough to unblock ext4 development.
> > >
> > > .... and I spoke too soon.  A number of runs using -rc6 are now
> > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> > > indicator.
> > >
> > > This is why debugging this is frustrating; it is very much a heisenbug
> > > --- although 5.8 seems to work completely reliably, as does commits
> > > before 37f4a24c2469.  Anything after that point will show random
> > > failures.  :-(
> >
> > It does not make sense to mention 37f4a24c2469, which is reverted in
> > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited
> > as 568f27006577.
> >
> > However, I can _not_ reproduce the issue by running the same test on
> > kernel built from 568f27006577 directly.
> >
> > Also you have confirmed that the issue can't be fixed after reverting
> > 568f27006577 against v5.9-rc4.
> >
> > Looks the real issue(slab list corruption) should be introduced between
> > 568f27006577 and v5.9-rc4.
>
> git bisect shows the first bad commit:
>
>         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
>                 kmem_caches for all allocations
>
> And I have double checked that the above commit is really the first bad
> commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> is LIST_POISON1 (dead000000000100)', see the detailed stack trace and
> kernel oops log in the following link:
>
>         https://lore.kernel.org/lkml/20200916202026.GC38283@mit.edu/

The failure signature is similar to
https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/

>
> And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see
> the following link:
>
>         https://lore.kernel.org/lkml/20200917143012.GF38283@mit.edu/
>
> The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and
> scratch device is virtio-blk.
>
>
> [1] git bisect log:
>
> git bisect start
> # good: [568f2700657794b8258e72983ba508793a658942] blk-mq: centralise related handling into blk_mq_get_driver_tag
> git bisect good 568f2700657794b8258e72983ba508793a658942
> # bad: [f4d51dffc6c01a9e94650d95ce0104964f8ae822] Linux 5.9-rc4
> git bisect bad f4d51dffc6c01a9e94650d95ce0104964f8ae822
> # good: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
> git bisect good 8186749621ed6b8fc42644c399e8c755a2b6f630
> # good: [60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e] Merge tag 'm68knommu-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
> git bisect good 60e76bb8a4e4c5398ea9053535e1fd0c9d6bb06e
> # bad: [57b077939287835b9396a1c3b40d35609cf2fcb8] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
> git bisect bad 57b077939287835b9396a1c3b40d35609cf2fcb8
> # bad: [0f43283be7fec4a76cd4ed50dc37db30719bde05] Merge branch 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
> git bisect bad 0f43283be7fec4a76cd4ed50dc37db30719bde05
> # good: [5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7] Merge tag 'xfs-5.9-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect good 5631c5e0eb9035d92ceb20fcd9cdb7779a3f5cc7
> # good: [e3083c3f369700cd1eec3de93b8d8ec0918ff778] media: cafe-driver: use generic power management
> git bisect good e3083c3f369700cd1eec3de93b8d8ec0918ff778
> # bad: [42742d9bde2a8e11ec932cb5821f720a40a7c2a9] mm: thp: replace HTTP links with HTTPS ones
> git bisect bad 42742d9bde2a8e11ec932cb5821f720a40a7c2a9
> # bad: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations
> git bisect bad 10befea91b61c4e2c2d1df06a2e978d182fcf792
> # good: [cfbe1636c3585c1e032bfac512fb8be903fbc913] mm, kcsan: instrument SLAB/SLUB free with "ASSERT_EXCLUSIVE_ACCESS"
> git bisect good cfbe1636c3585c1e032bfac512fb8be903fbc913
> # good: [0f190a7ab78878f9e6c6930fc0f5f92c1250b57d] mm/page_io.c: use blk_io_schedule() for avoiding task hung in sync io
> git bisect good 0f190a7ab78878f9e6c6930fc0f5f92c1250b57d
> # good: [286e04b8ed7a04279ae277f0f024430246ea5eec] mm: memcg/slab: allocate obj_cgroups for non-root slab pages
> git bisect good 286e04b8ed7a04279ae277f0f024430246ea5eec
> # good: [9855609bde03e2472b99a95e869d29ee1e78a751] mm: memcg/slab: use a single set of kmem_caches for all accounted allocations
> git bisect good 9855609bde03e2472b99a95e869d29ee1e78a751
> # good: [272911a4ad18c48f8bc449a5db945a54987dd687] mm: memcg/slab: remove memcg_kmem_get_cache()
> git bisect good 272911a4ad18c48f8bc449a5db945a54987dd687
> # good: [15999eef7f25e2ea6a1c33f026166f472c5714e9] mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
> git bisect good 15999eef7f25e2ea6a1c33f026166f472c5714e9
> # first bad commit: [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of kmem_caches for all allocations
>
>
>
> Thanks,
> Ming
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 16:32                       ` Shakeel Butt
@ 2020-09-25 16:47                         ` Shakeel Butt
  2020-09-25 17:22                           ` Roman Gushchin
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 16:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Theodore Y. Ts'o, Jens Axboe, linux-ext4,
	linux-kernel@vger.kernel.org, open list:BLOCK LAYER,
	Linus Torvalds, Linux MM, Roman Gushchin, Andrew Morton,
	Johannes Weiner, Vlastimil Babka

On Fri, Sep 25, 2020 at 9:32 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > > that one of the stack traces had virtio balloon in the trace, I
> > > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > > virtio-balloon to better manage host OS memory utilization), problem
> > > > > has become, much, *much* rarer (and possibly has gone away, although
> > > > > I'm going to want to run a lot more tests before I say that
> > > > > conclusively) on my test setup.  At the very least, using an n1 VM
> > > > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > > > enough to unblock ext4 development.
> > > >
> > > > .... and I spoke too soon.  A number of runs using -rc6 are now
> > > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> > > > indicator.
> > > >
> > > > This is why debugging this is frustrating; it is very much a heisenbug
> > > > --- although 5.8 seems to work completely reliably, as does commits
> > > > before 37f4a24c2469.  Anything after that point will show random
> > > > failures.  :-(
> > >
> > > It does not make sense to mention 37f4a24c2469, which is reverted in
> > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited
> > > as 568f27006577.
> > >
> > > However, I can _not_ reproduce the issue by running the same test on
> > > kernel built from 568f27006577 directly.
> > >
> > > Also you have confirmed that the issue can't be fixed after reverting
> > > 568f27006577 against v5.9-rc4.
> > >
> > > Looks the real issue(slab list corruption) should be introduced between
> > > 568f27006577 and v5.9-rc4.
> >
> > git bisect shows the first bad commit:
> >
> >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> >                 kmem_caches for all allocations
> >
> > And I have double checked that the above commit is really the first bad
> > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > is LIST_POISON1 (dead000000000100)', see the detailed stack trace and
> > kernel oops log in the following link:
> >
> >         https://lore.kernel.org/lkml/20200916202026.GC38283@mit.edu/
>
> The failure signature is similar to
> https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/
>
> >
> > And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see
> > the following link:
> >
> >         https://lore.kernel.org/lkml/20200917143012.GF38283@mit.edu/
> >
> > The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and
> > scratch device is virtio-blk.

Is it possible to test with SLUB as well, to confirm that the issue only
happens with SLAB?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 16:19                     ` Ming Lei
  2020-09-25 16:32                       ` Shakeel Butt
@ 2020-09-25 17:17                       ` Linus Torvalds
  2020-09-25 17:22                         ` Shakeel Butt
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-09-25 17:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Theodore Y. Ts'o, Jens Axboe, Ext4 Developers List,
	linux-kernel@vger.kernel.org, linux-block, Linux-MM,
	Roman Gushchin, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Shakeel Butt

On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> git bisect shows the first bad commit:
>
>         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
>                 kmem_caches for all allocations
>
> And I have double checked that the above commit is really the first bad
> commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> is LIST_POISON1 (dead000000000100)',

That commit doesn't revert cleanly, but I think that's purely because
we'd also need to revert

  849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
  74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
(un)account_slab_page()")

too.

Can you verify that a

    git revert 74d555bed5d0 849504809f86 10befea91b61

on top of current -git makes things work for you again?

I'm going to do an rc8 this release simply because we have another VM
issue that I hope to get fixed - but there we know what the problem
and the fix _are_, it just needs some care.

So if Roman (or somebody else) can see what's wrong and we can fix
this quickly, we don't need to go down the revert path, but ..

                  Linus


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 17:17                       ` Linus Torvalds
@ 2020-09-25 17:22                         ` Shakeel Butt
  2020-09-25 17:35                           ` Shakeel Butt
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 17:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Theodore Y. Ts'o, Jens Axboe, Ext4 Developers List,
	linux-kernel@vger.kernel.org, linux-block, Linux-MM,
	Roman Gushchin, Andrew Morton, Johannes Weiner, Vlastimil Babka

On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > git bisect shows the first bad commit:
> >
> >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> >                 kmem_caches for all allocations
> >
> > And I have double checked that the above commit is really the first bad
> > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > is LIST_POISON1 (dead000000000100)',
>
> Thet commit doesn't revert cleanly, but I think that's purely because
> we'd also need to revert
>
>   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
>   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> (un)account_slab_page()")
>
> too.
>
> Can you verify that a
>
>     git revert 74d555bed5d0 849504809f86 10befea91b61
>
> on top of current -git makes things work for you again?
>
> I'm going to do an rc8 this release simply because we have another VM
> issue that I hope to get fixed - but there we know what the problem
> and the fix _is_, it just needs some care.
>
> So if Roman (or somebody else) can see what's wrong and we can fix
> this quickly, we don't need to go down the revert path, but ..
>

I think I have a theory. The issue is happening due to a potential
infinite recursion:

[ 5060.124412]  ___cache_free+0x488/0x6b0
*****Second recursion
[ 5060.128666]  kfree+0xc9/0x1d0
[ 5060.131947]  kmem_freepages+0xa0/0xf0
[ 5060.135746]  slab_destroy+0x19/0x50
[ 5060.139577]  slabs_destroy+0x6d/0x90
[ 5060.143379]  ___cache_free+0x4a3/0x6b0
*****First recursion
[ 5060.147896]  kfree+0xc9/0x1d0
[ 5060.151082]  kmem_freepages+0xa0/0xf0
[ 5060.155121]  slab_destroy+0x19/0x50
[ 5060.159028]  slabs_destroy+0x6d/0x90
[ 5060.162920]  ___cache_free+0x4a3/0x6b0
[ 5060.167097]  kfree+0xc9/0x1d0

___cache_free() calls cache_flusharray() to flush the local cpu
array_cache if the cache has more elements than the limit (ac->avail
>= ac->limit).

cache_flusharray() removes batchcount elements from the local cpu
array_cache and passes them to slabs_destroy() (if the node's shared
cache is also full).

Note that the local cpu array_cache size has not been updated yet when
slabs_destroy() is called, and slabs_destroy() can call kfree() through
unaccount_slab_page().

That recursive kfree() runs on the same CPU, again sees (ac->avail >=
ac->limit), calls cache_flusharray() again, and so recurses
indefinitely.
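
To make that concrete, here is a rough sketch of the relevant part of
cache_flusharray() in mm/slab.c (hand-simplified, locking and the shared
array handling omitted, so not the exact source):

static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
{
        int batchcount = ac->batchcount;
        LIST_HEAD(list);

        /*
         * Return batchcount objects to the slab lists; completely free
         * slabs are collected on 'list'.
         */
        free_block(cachep, ac->entry, batchcount, numa_mem_id(), &list);

        /*
         * slabs_destroy() -> slab_destroy() -> kmem_freepages() now ends
         * up in kfree() via unaccount_slab_page().  That kfree() runs on
         * the same CPU and still sees the old ac->avail >= ac->limit, so
         * it can re-enter cache_flusharray(): the recursion in the trace
         * above.
         */
        slabs_destroy(cachep, &list);

        /* The local array_cache is only shrunk afterwards. */
        ac->avail -= batchcount;
        memmove(ac->entry, &ac->entry[batchcount], sizeof(void *) * ac->avail);
}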


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 16:47                         ` Shakeel Butt
@ 2020-09-25 17:22                           ` Roman Gushchin
  0 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-25 17:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Ming Lei, Theodore Y. Ts'o, Jens Axboe, linux-ext4,
	linux-kernel@vger.kernel.org, open list:BLOCK LAYER,
	Linus Torvalds, Linux MM, Andrew Morton, Johannes Weiner,
	Vlastimil Babka

On Fri, Sep 25, 2020 at 09:47:43AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 9:32 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 03:31:45PM +0800, Ming Lei wrote:
> > > > On Thu, Sep 24, 2020 at 09:13:11PM -0400, Theodore Y. Ts'o wrote:
> > > > > On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> > > > > > HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> > > > > > that one of the stack traces had virtio balloon in the trace, I
> > > > > > realized that when I switched the GCE VM type from e1-standard-2 to
> > > > > > n1-standard-2 (where e1 VM's are cheaper because they use
> > > > > > virtio-balloon to better manage host OS memory utilization), problem
> > > > > > has become, much, *much* rarer (and possibly has gone away, although
> > > > > > I'm going to want to run a lot more tests before I say that
> > > > > > conclusively) on my test setup.  At the very least, using an n1 VM
> > > > > > (which doesn't have virtio-balloon enabled in the hypervisor) is
> > > > > > enough to unblock ext4 development.
> > > > >
> > > > > .... and I spoke too soon.  A number of runs using -rc6 are now
> > > > > failing even with the n1-standard-2 VM, so virtio-ballon may not be an
> > > > > indicator.
> > > > >
> > > > > This is why debugging this is frustrating; it is very much a heisenbug
> > > > > --- although 5.8 seems to work completely reliably, as does commits
> > > > > before 37f4a24c2469.  Anything after that point will show random
> > > > > failures.  :-(
> > > >
> > > > It does not make sense to mention 37f4a24c2469, which is reverted in
> > > > 4e2f62e566b5. Later the patch in 37f4a24c2469 is fixed and re-commited
> > > > as 568f27006577.
> > > >
> > > > However, I can _not_ reproduce the issue by running the same test on
> > > > kernel built from 568f27006577 directly.
> > > >
> > > > Also you have confirmed that the issue can't be fixed after reverting
> > > > 568f27006577 against v5.9-rc4.
> > > >
> > > > Looks the real issue(slab list corruption) should be introduced between
> > > > 568f27006577 and v5.9-rc4.
> > >
> > > git bisect shows the first bad commit:
> > >
> > >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > >                 kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > is LIST_POISON1 (dead000000000100)', see the detailed stack trace and
> > > kernel oops log in the following link:
> > >
> > >         https://lore.kernel.org/lkml/20200916202026.GC38283@mit.edu/
> >
> > The failure signature is similar to
> > https://lore.kernel.org/lkml/20200901075321.GL4299@shao2-debian/
> >
> > >
> > > And the kernel config is the one(without KASAN) used by Theodore in GCE VM, see
> > > the following link:
> > >
> > >         https://lore.kernel.org/lkml/20200917143012.GF38283@mit.edu/
> > >
> > > The reproducer is xfstests generic/038. In my setting, test device is virtio-scsi, and
> > > scratch device is virtio-blk.
> 
> Is it possible to check SLUB as well to confirm that the issue is only
> happening on SLAB?

Can you also, please, check whether passing cgroup.memory=nokmem as a boot
argument fixes the issue?

Thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 17:22                         ` Shakeel Butt
@ 2020-09-25 17:35                           ` Shakeel Butt
  2020-09-25 17:47                             ` Roman Gushchin
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 17:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Theodore Y. Ts'o, Jens Axboe, Ext4 Developers List,
	linux-kernel@vger.kernel.org, linux-block, Linux-MM,
	Roman Gushchin, Andrew Morton, Johannes Weiner, Vlastimil Babka

On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > git bisect shows the first bad commit:
> > >
> > >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > >                 kmem_caches for all allocations
> > >
> > > And I have double checked that the above commit is really the first bad
> > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > is LIST_POISON1 (dead000000000100)',
> >
> > Thet commit doesn't revert cleanly, but I think that's purely because
> > we'd also need to revert
> >
> >   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> >   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> > (un)account_slab_page()")
> >
> > too.
> >
> > Can you verify that a
> >
> >     git revert 74d555bed5d0 849504809f86 10befea91b61
> >
> > on top of current -git makes things work for you again?
> >
> > I'm going to do an rc8 this release simply because we have another VM
> > issue that I hope to get fixed - but there we know what the problem
> > and the fix _is_, it just needs some care.
> >
> > So if Roman (or somebody else) can see what's wrong and we can fix
> > this quickly, we don't need to go down the revert path, but ..
> >
>
> I think I have a theory. The issue is happening due to the potential
> infinite recursion:
>
> [ 5060.124412]  ___cache_free+0x488/0x6b0
> *****Second recursion
> [ 5060.128666]  kfree+0xc9/0x1d0
> [ 5060.131947]  kmem_freepages+0xa0/0xf0
> [ 5060.135746]  slab_destroy+0x19/0x50
> [ 5060.139577]  slabs_destroy+0x6d/0x90
> [ 5060.143379]  ___cache_free+0x4a3/0x6b0
> *****First recursion
> [ 5060.147896]  kfree+0xc9/0x1d0
> [ 5060.151082]  kmem_freepages+0xa0/0xf0
> [ 5060.155121]  slab_destroy+0x19/0x50
> [ 5060.159028]  slabs_destroy+0x6d/0x90
> [ 5060.162920]  ___cache_free+0x4a3/0x6b0
> [ 5060.167097]  kfree+0xc9/0x1d0
>
> ___cache_free() is calling cache_flusharray() to flush the local cpu
> array_cache if the cache has more elements than the limit (ac->avail
> >= ac->limit).
>
> cache_flusharray() is removing batchcount number of element from local
> cpu array_cache and pass it slabs_destroy (if the node shared cache is
> also full).
>
> Note that we have not updated local cpu array_cache size yet and
> called slabs_destroy() which can call kfree() through
> unaccount_slab_page().
>
> We are on the same CPU and this recursive kfree again check the
> (ac->avail >= ac->limit) and call cache_flusharray() again and recurse
> indefinitely.

I can see two possible fixes. We can either do an async kfree of
page_obj_cgroups(page), or we can update the local cpu array_cache's
size before calling slabs_destroy().
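
For the second option, a minimal untested sketch would be to reorder the
tail of cache_flusharray() roughly like this:

        spin_unlock(&n->list_lock);
        ac->avail -= batchcount;
        memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
        /* any kfree() triggered from here sees a consistent ac->avail */
        slabs_destroy(cachep, &list);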

Shakeel


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 17:35                           ` Shakeel Butt
@ 2020-09-25 17:47                             ` Roman Gushchin
  2020-09-25 17:58                               ` Shakeel Butt
  0 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2020-09-25 17:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Linus Torvalds, Ming Lei, Theodore Y. Ts'o, Jens Axboe,
	Ext4 Developers List, linux-kernel@vger.kernel.org, linux-block,
	Linux-MM, Andrew Morton, Johannes Weiner, Vlastimil Babka

On Fri, Sep 25, 2020 at 10:35:03AM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > git bisect shows the first bad commit:
> > > >
> > > >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > > >                 kmem_caches for all allocations
> > > >
> > > > And I have double checked that the above commit is really the first bad
> > > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > > is LIST_POISON1 (dead000000000100)',
> > >
> > > Thet commit doesn't revert cleanly, but I think that's purely because
> > > we'd also need to revert
> > >
> > >   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > >   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> > > (un)account_slab_page()")
> > >
> > > too.
> > >
> > > Can you verify that a
> > >
> > >     git revert 74d555bed5d0 849504809f86 10befea91b61
> > >
> > > on top of current -git makes things work for you again?
> > >
> > > I'm going to do an rc8 this release simply because we have another VM
> > > issue that I hope to get fixed - but there we know what the problem
> > > and the fix _is_, it just needs some care.
> > >
> > > So if Roman (or somebody else) can see what's wrong and we can fix
> > > this quickly, we don't need to go down the revert path, but ..
> > >
> >
> > I think I have a theory. The issue is happening due to the potential
> > infinite recursion:
> >
> > [ 5060.124412]  ___cache_free+0x488/0x6b0
> > *****Second recursion
> > [ 5060.128666]  kfree+0xc9/0x1d0
> > [ 5060.131947]  kmem_freepages+0xa0/0xf0
> > [ 5060.135746]  slab_destroy+0x19/0x50
> > [ 5060.139577]  slabs_destroy+0x6d/0x90
> > [ 5060.143379]  ___cache_free+0x4a3/0x6b0
> > *****First recursion
> > [ 5060.147896]  kfree+0xc9/0x1d0
> > [ 5060.151082]  kmem_freepages+0xa0/0xf0
> > [ 5060.155121]  slab_destroy+0x19/0x50
> > [ 5060.159028]  slabs_destroy+0x6d/0x90
> > [ 5060.162920]  ___cache_free+0x4a3/0x6b0
> > [ 5060.167097]  kfree+0xc9/0x1d0
> >
> > ___cache_free() is calling cache_flusharray() to flush the local cpu
> > array_cache if the cache has more elements than the limit (ac->avail
> > >= ac->limit).
> >
> > cache_flusharray() is removing batchcount number of element from local
> > cpu array_cache and pass it slabs_destroy (if the node shared cache is
> > also full).
> >
> > Note that we have not updated local cpu array_cache size yet and
> > called slabs_destroy() which can call kfree() through
> > unaccount_slab_page().
> >
> > We are on the same CPU and this recursive kfree again check the
> > (ac->avail >= ac->limit) and call cache_flusharray() again and recurse
> > indefinitely.

It's a cool theory! And it explains why we haven't seen it with SLUB.

> 
> I can see two possible fixes. We can either do async kfree of
> page_obj_cgroups(page) or we can update the local cpu array_cache's
> size before slabs_destroy().

I wonder if something like this can fix the problem?
(completely untested).

--

diff --git a/mm/slab.c b/mm/slab.c
index 684ebe5b0c7a..c94b9ccfb803 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -186,6 +186,7 @@ struct array_cache {
        unsigned int limit;
        unsigned int batchcount;
        unsigned int touched;
+       bool flushing;
        void *entry[];  /*
                         * Must have this definition in here for the proper
                         * alignment of array_cache. Also simplifies accessing
@@ -526,6 +527,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
                ac->limit = limit;
                ac->batchcount = batch;
                ac->touched = 0;
+               ac->flushing = false;
        }
 }
 
@@ -3368,6 +3370,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
        int node = numa_mem_id();
        LIST_HEAD(list);
 
+       if (ac->flushing)
+               return;
+
+       ac->flushing = true;
+
        batchcount = ac->batchcount;
 
        check_irq_off();
@@ -3404,6 +3411,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
        spin_unlock(&n->list_lock);
        slabs_destroy(cachep, &list);
        ac->avail -= batchcount;
+       ac->flushing = false;
        memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
 }
 


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 17:47                             ` Roman Gushchin
@ 2020-09-25 17:58                               ` Shakeel Butt
  2020-09-25 19:19                                 ` Shakeel Butt
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 17:58 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Linus Torvalds, Ming Lei, Theodore Y. Ts'o, Jens Axboe,
	Ext4 Developers List, linux-kernel@vger.kernel.org, linux-block,
	Linux-MM, Andrew Morton, Johannes Weiner, Vlastimil Babka

On Fri, Sep 25, 2020 at 10:48 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Sep 25, 2020 at 10:35:03AM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:22 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 10:17 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Fri, Sep 25, 2020 at 9:19 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > git bisect shows the first bad commit:
> > > > >
> > > > >         [10befea91b61c4e2c2d1df06a2e978d182fcf792] mm: memcg/slab: use a single set of
> > > > >                 kmem_caches for all allocations
> > > > >
> > > > > And I have double checked that the above commit is really the first bad
> > > > > commit for the list corruption issue of 'list_del corruption, ffffe1c241b00408->next
> > > > > is LIST_POISON1 (dead000000000100)',
> > > >
> > > > Thet commit doesn't revert cleanly, but I think that's purely because
> > > > we'd also need to revert
> > > >
> > > >   849504809f86 ("mm: memcg/slab: remove unused argument by charge_slab_page()")
> > > >   74d555bed5d0 ("mm: slab: rename (un)charge_slab_page() to
> > > > (un)account_slab_page()")
> > > >
> > > > too.
> > > >
> > > > Can you verify that a
> > > >
> > > >     git revert 74d555bed5d0 849504809f86 10befea91b61
> > > >
> > > > on top of current -git makes things work for you again?
> > > >
> > > > I'm going to do an rc8 this release simply because we have another VM
> > > > issue that I hope to get fixed - but there we know what the problem
> > > > and the fix _is_, it just needs some care.
> > > >
> > > > So if Roman (or somebody else) can see what's wrong and we can fix
> > > > this quickly, we don't need to go down the revert path, but ..
> > > >
> > >
> > > I think I have a theory. The issue is happening due to the potential
> > > infinite recursion:
> > >
> > > [ 5060.124412]  ___cache_free+0x488/0x6b0
> > > *****Second recursion
> > > [ 5060.128666]  kfree+0xc9/0x1d0
> > > [ 5060.131947]  kmem_freepages+0xa0/0xf0
> > > [ 5060.135746]  slab_destroy+0x19/0x50
> > > [ 5060.139577]  slabs_destroy+0x6d/0x90
> > > [ 5060.143379]  ___cache_free+0x4a3/0x6b0
> > > *****First recursion
> > > [ 5060.147896]  kfree+0xc9/0x1d0
> > > [ 5060.151082]  kmem_freepages+0xa0/0xf0
> > > [ 5060.155121]  slab_destroy+0x19/0x50
> > > [ 5060.159028]  slabs_destroy+0x6d/0x90
> > > [ 5060.162920]  ___cache_free+0x4a3/0x6b0
> > > [ 5060.167097]  kfree+0xc9/0x1d0
> > >
> > > ___cache_free() is calling cache_flusharray() to flush the local cpu
> > > array_cache if the cache has more elements than the limit (ac->avail
> > > >= ac->limit).
> > >
> > > cache_flusharray() is removing batchcount number of element from local
> > > cpu array_cache and pass it slabs_destroy (if the node shared cache is
> > > also full).
> > >
> > > Note that we have not updated local cpu array_cache size yet and
> > > called slabs_destroy() which can call kfree() through
> > > unaccount_slab_page().
> > >
> > > We are on the same CPU and this recursive kfree again check the
> > > (ac->avail >= ac->limit) and call cache_flusharray() again and recurse
> > > indefinitely.
>
> It's a coll theory! And it explains why we haven't seen it with SLUB.
>
> >
> > I can see two possible fixes. We can either do async kfree of
> > page_obj_cgroups(page) or we can update the local cpu array_cache's
> > size before slabs_destroy().
>
> I wonder if something like this can fix the problem?
> (completely untested).
>
> --
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 684ebe5b0c7a..c94b9ccfb803 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -186,6 +186,7 @@ struct array_cache {
>         unsigned int limit;
>         unsigned int batchcount;
>         unsigned int touched;
> +       bool flushing;
>         void *entry[];  /*
>                          * Must have this definition in here for the proper
>                          * alignment of array_cache. Also simplifies accessing
> @@ -526,6 +527,7 @@ static void init_arraycache(struct array_cache *ac, int limit, int batch)
>                 ac->limit = limit;
>                 ac->batchcount = batch;
>                 ac->touched = 0;
> +               ac->flushing = false;
>         }
>  }
>
> @@ -3368,6 +3370,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>         int node = numa_mem_id();
>         LIST_HEAD(list);
>
> +       if (ac->flushing)
> +               return;
> +
> +       ac->flushing = true;
> +
>         batchcount = ac->batchcount;
>
>         check_irq_off();
> @@ -3404,6 +3411,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>         spin_unlock(&n->list_lock);
>         slabs_destroy(cachep, &list);
>         ac->avail -= batchcount;
> +       ac->flushing = false;
>         memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
>  }
>

I don't think you can just skip the flushing. __free_one() in
___cache_free() assumes there is space available.
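
Roughly (a simplified, from-memory sketch of ___cache_free(), not the
exact source):

void ___cache_free(struct kmem_cache *cachep, void *objp, unsigned long caller)
{
        struct array_cache *ac = cpu_cache_get(cachep);

        if (ac->avail >= ac->limit)
                cache_flusharray(cachep, ac);   /* expected to make room */

        /*
         * __free_one(): relies on the flush above having freed up a slot.
         * If cache_flusharray() bails out early (e.g. on an ac->flushing
         * flag), this store runs past the end of the entry[] array.
         */
        ac->entry[ac->avail++] = objp;
}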

BTW, do_drain() also has the same issue.

Why not move slabs_destroy() after we update ac->avail and memmove()?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 17:58                               ` Shakeel Butt
@ 2020-09-25 19:19                                 ` Shakeel Butt
  2020-09-25 20:56                                   ` Roman Gushchin
  2020-09-26  1:43                                   ` Ming Lei
  0 siblings, 2 replies; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 19:19 UTC (permalink / raw)
  To: Roman Gushchin, Ming Lei
  Cc: Johannes Weiner, Andrew Morton, Linus Torvalds,
	Theodore Y . Ts'o, Jens Axboe, Ext4 Developers List,
	linux-block, Vlastimil Babka, linux-mm, linux-kernel,
	Shakeel Butt

On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <shakeelb@google.com>
wrote:
>
[snip]
>
> I don't think you can ignore the flushing. The __free_once() in
> ___cache_free() assumes there is a space available.
>
> BTW do_drain() also have the same issue.
>
> Why not move slabs_destroy() after we update ac->avail and memmove()?

Ming, can you please try the following patch?


From: Shakeel Butt <shakeelb@google.com>

[PATCH] mm: slab: fix potential infinite recursion in ___cache_free

With commit 10befea91b61 ("mm: memcg/slab: use a single set of
kmem_caches for all allocations"), it became possible to call kfree()
from slabs_destroy(). However, if slabs_destroy() is called for the
array_cache of the local CPU, this opens up the potential for infinite
recursion, because the kfree() called from slabs_destroy() can call
slabs_destroy() again with the same array_cache of the local CPU. Since
the array_cache of the local CPU is not updated before calling
slabs_destroy(), it will try to free the same pages.

To fix the issue, simply update the cache before calling
slabs_destroy().

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/slab.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 3160dff6fd76..f658e86ec8ce 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
 		kmem_cache_free(cachep->freelist_cache, freelist);
 }
 
+/*
+ * Update the size of the caches before calling slabs_destroy as it may
+ * recursively call kfree.
+ */
 static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
 {
 	struct page *page, *n;
@@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
 	spin_lock(&n->list_lock);
 	free_block(cachep, ac->entry, ac->avail, node, &list);
 	spin_unlock(&n->list_lock);
-	slabs_destroy(cachep, &list);
 	ac->avail = 0;
+	slabs_destroy(cachep, &list);
 }
 
 static void drain_cpu_caches(struct kmem_cache *cachep)
@@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
 	}
 #endif
 	spin_unlock(&n->list_lock);
-	slabs_destroy(cachep, &list);
 	ac->avail -= batchcount;
 	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
+	slabs_destroy(cachep, &list);
 }
 
 /*
-- 
2.28.0.681.g6f77f65b4e-goog



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 19:19                                 ` Shakeel Butt
@ 2020-09-25 20:56                                   ` Roman Gushchin
  2020-09-25 21:18                                     ` Shakeel Butt
  2020-09-26  1:43                                   ` Ming Lei
  1 sibling, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2020-09-25 20:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Ming Lei, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Theodore Y . Ts'o, Jens Axboe, Ext4 Developers List,
	linux-block, Vlastimil Babka, linux-mm, linux-kernel

On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <shakeelb@google.com>
> wrote:
> >
> [snip]
> >
> > I don't think you can ignore the flushing. The __free_once() in
> > ___cache_free() assumes there is a space available.
> >
> > BTW do_drain() also have the same issue.
> >
> > Why not move slabs_destroy() after we update ac->avail and memmove()?
> 
> Ming, can you please try the following patch?
> 
> 
> From: Shakeel Butt <shakeelb@google.com>
> 
> [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> 
> With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations"), it becomes possible to call kfree()
> from the slabs_destroy(). However if slabs_destroy() is being called for
> the array_cache of the local CPU then this opens the potential scenario
> of infinite recursion because kfree() called from slabs_destroy() can
> call slabs_destroy() with the same array_cache of the local CPU. Since
> the array_cache of the local CPU is not updated before calling
> slabs_destroy(), it will try to free the same pages.
> 
> To fix the issue, simply update the cache before calling
> slabs_destroy().
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

I like the patch and I think it should fix the problem.

However, the description above should likely be adjusted a bit.
It seems the problem is not necessarily caused by an infinite recursion;
it can be even simpler.

In cache_flusharray() we rely on the state of ac, which is described
by ac->avail. In particular we rely on batchcount < ac->avail,
as we shift batchcount pointers with memmove().
But if slabs_destroy() is called before those updates and it changes
the ac state, we can end up with memory corruption.

Also, unconditionally resetting ac->avail to 0 in do_drain() after the
call to slabs_destroy() seems to be wrong.
That would explain the double-free BUGs we've seen in the stack traces.
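
To illustrate with the do_drain() case (a hand-simplified sketch of the
old ordering, as I read it, not the exact source):

static void do_drain(void *arg)
{
        struct kmem_cache *cachep = arg;
        struct array_cache *ac = cpu_cache_get(cachep);
        LIST_HEAD(list);

        /* frees every object currently in the per-cpu array: entry[0..avail-1] */
        free_block(cachep, ac->entry, ac->avail, numa_mem_id(), &list);

        /*
         * With ac->avail still unchanged, a kfree() reached from here
         * (slab_destroy -> kmem_freepages -> unaccount_slab_page -> kfree)
         * either appends into the same array (and that pointer is then
         * thrown away by the reset below), or, if ac->avail >= ac->limit,
         * runs free_block() on the same entries a second time: a double
         * free.
         */
        slabs_destroy(cachep, &list);

        ac->avail = 0;
}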

Thanks!

> ---
>  mm/slab.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..f658e86ec8ce 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
>  		kmem_cache_free(cachep->freelist_cache, freelist);
>  }
>  
> +/*
> + * Update the size of the caches before calling slabs_destroy as it may
> + * recursively call kfree.
> + */
>  static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
>  {
>  	struct page *page, *n;
> @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
>  	spin_lock(&n->list_lock);
>  	free_block(cachep, ac->entry, ac->avail, node, &list);
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail = 0;
> +	slabs_destroy(cachep, &list);
>  }
>  
>  static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>  	}
>  #endif
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail -= batchcount;
>  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> +	slabs_destroy(cachep, &list);
>  }
>  
>  /*
> -- 
> 2.28.0.681.g6f77f65b4e-goog
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 20:56                                   ` Roman Gushchin
@ 2020-09-25 21:18                                     ` Shakeel Butt
  2020-09-27 17:38                                       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2020-09-25 21:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Ming Lei, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Theodore Y . Ts'o, Jens Axboe, Ext4 Developers List,
	linux-block, Vlastimil Babka, Linux MM, LKML

On Fri, Sep 25, 2020 at 1:56 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <shakeelb@google.com>
> > wrote:
> > >
> > [snip]
> > >
> > > I don't think you can ignore the flushing. The __free_once() in
> > > ___cache_free() assumes there is a space available.
> > >
> > > BTW do_drain() also have the same issue.
> > >
> > > Why not move slabs_destroy() after we update ac->avail and memmove()?
> >
> > Ming, can you please try the following patch?
> >
> >
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> >
> > With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> > kmem_caches for all allocations"), it becomes possible to call kfree()
> > from the slabs_destroy(). However if slabs_destroy() is being called for
> > the array_cache of the local CPU then this opens the potential scenario
> > of infinite recursion because kfree() called from slabs_destroy() can
> > call slabs_destroy() with the same array_cache of the local CPU. Since
> > the array_cache of the local CPU is not updated before calling
> > slabs_destroy(), it will try to free the same pages.
> >
> > To fix the issue, simply update the cache before calling
> > slabs_destroy().
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
>
> I like the patch and I think it should fix the problem.
>
> However the description above should be likely asjusted a bit.
> It seems that the problem is not necessary caused by an infinite recursion,
> it can be even simpler.
>
> In cache_flusharray() we rely on the state of ac, which is described
> by ac->avail. In particular we rely on batchcount < ac->avail,
> as we shift the batchcount number of pointers by memmove.
> But if slabs_destroy() is called before and leaded to a change of the
> ac state, it can lead to a memory corruption.
>
> Also, unconditionally resetting ac->avail to 0 in do_drain() after calling
> to slab_destroy() seems to be wrong.
> It explains double free BUGs we've seen in stacktraces.
>

Yes, you are right. Let's first get this patch tested and after
confirmation we can update the commit message.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 19:19                                 ` Shakeel Butt
  2020-09-25 20:56                                   ` Roman Gushchin
@ 2020-09-26  1:43                                   ` Ming Lei
  2020-09-26  6:42                                     ` Roman Gushchin
  1 sibling, 1 reply; 16+ messages in thread
From: Ming Lei @ 2020-09-26  1:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Theodore Y . Ts'o, Jens Axboe, Ext4 Developers List,
	linux-block, Vlastimil Babka, linux-mm, linux-kernel

On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <shakeelb@google.com>
> wrote:
> >
> [snip]
> >
> > I don't think you can ignore the flushing. The __free_once() in
> > ___cache_free() assumes there is a space available.
> >
> > BTW do_drain() also have the same issue.
> >
> > Why not move slabs_destroy() after we update ac->avail and memmove()?
> 
> Ming, can you please try the following patch?
> 
> 
> From: Shakeel Butt <shakeelb@google.com>
> 
> [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> 
> With the commit 10befea91b61 ("mm: memcg/slab: use a single set of
> kmem_caches for all allocations"), it becomes possible to call kfree()
> from the slabs_destroy(). However if slabs_destroy() is being called for
> the array_cache of the local CPU then this opens the potential scenario
> of infinite recursion because kfree() called from slabs_destroy() can
> call slabs_destroy() with the same array_cache of the local CPU. Since
> the array_cache of the local CPU is not updated before calling
> slabs_destroy(), it will try to free the same pages.
> 
> To fix the issue, simply update the cache before calling
> slabs_destroy().
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/slab.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index 3160dff6fd76..f658e86ec8ce 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
>  		kmem_cache_free(cachep->freelist_cache, freelist);
>  }
>  
> +/*
> + * Update the size of the caches before calling slabs_destroy as it may
> + * recursively call kfree.
> + */
>  static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
>  {
>  	struct page *page, *n;
> @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
>  	spin_lock(&n->list_lock);
>  	free_block(cachep, ac->entry, ac->avail, node, &list);
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail = 0;
> +	slabs_destroy(cachep, &list);
>  }
>  
>  static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
>  	}
>  #endif
>  	spin_unlock(&n->list_lock);
> -	slabs_destroy(cachep, &list);
>  	ac->avail -= batchcount;
>  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> +	slabs_destroy(cachep, &list);
>  }

The issue can't be reproduced after applying this patch:

Tested-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-26  1:43                                   ` Ming Lei
@ 2020-09-26  6:42                                     ` Roman Gushchin
  0 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2020-09-26  6:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Shakeel Butt, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Theodore Y . Ts'o, Jens Axboe, Ext4 Developers List,
	linux-block, Vlastimil Babka, linux-mm, linux-kernel

On Sat, Sep 26, 2020 at 09:43:25AM +0800, Ming Lei wrote:
> On Fri, Sep 25, 2020 at 12:19:02PM -0700, Shakeel Butt wrote:
> > On Fri, Sep 25, 2020 at 10:58 AM Shakeel Butt <shakeelb@google.com>
> > wrote:
> > >
> > [snip]
> > >
> > > I don't think you can ignore the flushing. The __free_once() in
> > > ___cache_free() assumes there is space available.
> > >
> > > BTW, do_drain() also has the same issue.
> > >
> > > Why not move slabs_destroy() after we update ac->avail and memmove()?
> > 
> > Ming, can you please try the following patch?
> > 
> > 
> > From: Shakeel Butt <shakeelb@google.com>
> > 
> > [PATCH] mm: slab: fix potential infinite recursion in ___cache_free
> > 
> > With commit 10befea91b61 ("mm: memcg/slab: use a single set of
> > kmem_caches for all allocations"), it becomes possible to call kfree()
> > from slabs_destroy(). However, if slabs_destroy() is being called for
> > the array_cache of the local CPU, then this opens up the possibility
> > of infinite recursion, because the kfree() called from slabs_destroy()
> > can call slabs_destroy() again with the same array_cache of the local
> > CPU. Since the array_cache of the local CPU is not updated before
> > calling slabs_destroy(), it will try to free the same pages.
> > 
> > To fix the issue, simply update the cache before calling
> > slabs_destroy().
> > 
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  mm/slab.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 3160dff6fd76..f658e86ec8ce 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -1632,6 +1632,10 @@ static void slab_destroy(struct kmem_cache *cachep, struct page *page)
> >  		kmem_cache_free(cachep->freelist_cache, freelist);
> >  }
> >  
> > +/*
> > + * Update the size of the caches before calling slabs_destroy as it may
> > + * recursively call kfree.
> > + */
> >  static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
> >  {
> >  	struct page *page, *n;
> > @@ -2153,8 +2157,8 @@ static void do_drain(void *arg)
> >  	spin_lock(&n->list_lock);
> >  	free_block(cachep, ac->entry, ac->avail, node, &list);
> >  	spin_unlock(&n->list_lock);
> > -	slabs_destroy(cachep, &list);
> >  	ac->avail = 0;
> > +	slabs_destroy(cachep, &list);
> >  }
> >  
> >  static void drain_cpu_caches(struct kmem_cache *cachep)
> > @@ -3402,9 +3406,9 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> >  	}
> >  #endif
> >  	spin_unlock(&n->list_lock);
> > -	slabs_destroy(cachep, &list);
> >  	ac->avail -= batchcount;
> >  	memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
> > +	slabs_destroy(cachep, &list);
> >  }
> 
> The issue can't be reproduced after applying this patch:
> 
> Tested-by: Ming Lei <ming.lei@redhat.com>

Perfect, thank you very much for the confirmation!

Shakeel, can you please resend the patch with the proper Fixes tag
and an updated commit log? Please feel free to add
Reviewed-by: Roman Gushchin <guro@fb.com>.

Thank you!

Roman
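
(For reference, one possible form of the tag block being asked for here,
assuming the Fixes target is the commit named in the changelog above and
picking up the tags offered in this thread; this is an illustrative guess,
not the submitter's final tags:

Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
)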


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
  2020-09-25 21:18                                     ` Shakeel Butt
@ 2020-09-27 17:38                                       ` Theodore Y. Ts'o
  0 siblings, 0 replies; 16+ messages in thread
From: Theodore Y. Ts'o @ 2020-09-27 17:38 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Roman Gushchin, Ming Lei, Johannes Weiner, Andrew Morton,
	Linus Torvalds, Jens Axboe, Ext4 Developers List, linux-block,
	Vlastimil Babka, Linux MM, LKML

On Fri, Sep 25, 2020 at 02:18:48PM -0700, Shakeel Butt wrote:
> 
> Yes, you are right. Let's first get this patch tested and after
> confirmation we can update the commit message.

Thanks Shakeel!  I've tested your patch, as well as reverting the
three commits that Linus had suggested, and both seem to address the
problem for me.  With the "revert the three commits" kernel I did see
a small number of failures immediately after the VM booted, but this
appears to be a different failure, one which I had also been seeing
periodically during the bisect and which was no doubt introducing
noise into my testing:

[   28.545018] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
[   28.545018] Modules linked in:
[   28.545018] irq event stamp: 4517759
[   28.545018] hardirqs last  enabled at (4517758): [<ffffffffaa600b9e>] asm_common_interrupt+0x1e/0x40
[   28.545018] hardirqs last disabled at (4517759): [<ffffffffaa5ee55b>] sysvec_apic_timer_interrupt+0xb/0x90
[   28.545018] softirqs last  enabled at (10634): [<ffffffffa9ac00fd>] irq_enter_rcu+0x6d/0x70
[   28.545018] softirqs last disabled at (10635): [<ffffffffaa600f72>] asm_call_on_stack+0x12/0x20
[   28.545018] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc6-xfstests-00007-g3f3cb48a7d90 #1916
[   28.545018] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[   28.545018] RIP: 0010:__do_softirq+0xa3/0x435
[   28.545018] Code: 00 83 80 ac 07 00 00 01 48 89 44 24 08 c7 44 24 1c 0a 00 00 00 65 66 c7 05 a8 ae 9e 55 00 00 e8 d3 92 3b ff fb b8 ff ff ff ff <48> c7 c3 40 51 00 ab 41 0f bc c7 89 c6 83 c6 01 89 74 24 04 75 6a
[   28.545018] RSP: 0000:ffffb89f000e0f98 EFLAGS: 00000202
[   28.545018] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 000000000000298a
[   28.545018] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffaa80009d
[   28.545018] RBP: ffffb89f000abda0 R08: 0000000000000001 R09: 0000000000000000
[   28.545018] R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000001
[   28.545018] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000080
[   28.545018] FS:  0000000000000000(0000) GS:ffff998e59200000(0000) knlGS:0000000000000000
[   28.545018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.545018] CR2: 0000000000000000 CR3: 000000023e012001 CR4: 00000000001706e0
[   28.545018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.545018] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   28.545018] Call Trace:
[   28.545018]  <IRQ>
[   28.545018]  asm_call_on_stack+0x12/0x20
[   28.545018]  </IRQ>
[   28.545018]  do_softirq_own_stack+0x4e/0x60
[   28.545018]  irq_exit_rcu+0x9f/0xe0
[   28.545018]  sysvec_call_function_single+0x43/0x90
[   28.545018]  asm_sysvec_call_function_single+0x12/0x20
[   28.545018] RIP: 0010:acpi_idle_do_entry+0x54/0x70
[   28.545018] Code: ed c3 e9 cf fe ff ff 65 48 8b 04 25 00 6e 01 00 48 8b 00 a8 08 75 ea e8 ba c0 5b ff e9 07 00 00 00 0f 00 2d f8 3d 4e 00 fb f4 <9c> 58 fa f6 c4 02 74 cf e9 5f c2 5b ff cc cc cc cc cc cc cc cc cc
[   28.545018] RSP: 0000:ffffb89f000abe88 EFLAGS: 00000202
[   28.545018] RAX: 000000000000293b RBX: ffff998e55640000 RCX: 0000000000001a12
[   28.545018] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffaa5fd2b6
[   28.545018] RBP: ffffffffab163760 R08: 0000000000000001 R09: 00000000000e003c
[   28.545018] R10: ffff998e582e2340 R11: 0000000000000046 R12: 0000000000000001
[   28.545018] R13: 0000000000000001 R14: ffffffffab1637e0 R15: 0000000000000000
[   28.545018]  ? acpi_idle_do_entry+0x46/0x70
[   28.545018]  ? acpi_idle_do_entry+0x46/0x70
[   28.545018]  acpi_idle_enter+0x7d/0xb0
[   28.545018]  cpuidle_enter_state+0x84/0x2c0
[   28.545018]  cpuidle_enter+0x29/0x40
[   28.545018]  cpuidle_idle_call+0x111/0x180
[   28.545018]  do_idle+0x7b/0xd0
[   28.545018]  cpu_startup_entry+0x19/0x20
[   28.545018]  secondary_startup_64+0xb6/0xc0

I think this is an acpi_idle issue that others have reported, though I
thought it had been fixed before -rc6 was released.  In any case, this
kernel is post -rc6, so apparently something else is going on here, and
it is probably unrelated to the regression that Shakeel's patch
addresses.

					- Ted


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-09-27 17:38 UTC | newest]

Thread overview: 16+ messages
     [not found] <20200915044519.GA38283@mit.edu>
     [not found] ` <20200915073303.GA754106@T590>
     [not found]   ` <20200915224541.GB38283@mit.edu>
     [not found]     ` <20200915230941.GA791425@T590>
     [not found]       ` <20200916202026.GC38283@mit.edu>
     [not found]         ` <20200917022051.GA1004828@T590>
     [not found]           ` <20200917143012.GF38283@mit.edu>
     [not found]             ` <20200924005901.GB1806978@T590>
     [not found]               ` <20200924143345.GD482521@mit.edu>
     [not found]                 ` <20200925011311.GJ482521@mit.edu>
2020-09-25  7:31                   ` REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag Ming Lei
2020-09-25 16:19                     ` Ming Lei
2020-09-25 16:32                       ` Shakeel Butt
2020-09-25 16:47                         ` Shakeel Butt
2020-09-25 17:22                           ` Roman Gushchin
2020-09-25 17:17                       ` Linus Torvalds
2020-09-25 17:22                         ` Shakeel Butt
2020-09-25 17:35                           ` Shakeel Butt
2020-09-25 17:47                             ` Roman Gushchin
2020-09-25 17:58                               ` Shakeel Butt
2020-09-25 19:19                                 ` Shakeel Butt
2020-09-25 20:56                                   ` Roman Gushchin
2020-09-25 21:18                                     ` Shakeel Butt
2020-09-27 17:38                                       ` Theodore Y. Ts'o
2020-09-26  1:43                                   ` Ming Lei
2020-09-26  6:42                                     ` Roman Gushchin
