Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
To: Qian Cai <cai-J5quhbR+WMc@public.gmane.org>
Cc: Dan Schatzberg
	<schatzberg.dan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>,
	Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>,
	Michal Hocko <mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Vladimir Davydov
	<vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>,
	Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Chris Down <chris-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org>,
	Yang Shi
	<yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>,
	Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>,
	"open list:BLOCK LAYER"
	<linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	open list <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"open list:CONTROL GROUP (CGROUP)"
	<cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)"
	<linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>
Subject: Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker
Date: Thu, 27 Feb 2020 13:14:30 -0500	[thread overview]
Message-ID: <20200227181430.GA44024@cmpxchg.org> (raw)
In-Reply-To: <1582736558.7365.131.camel-J5quhbR+WMc@public.gmane.org>

On Wed, Feb 26, 2020 at 12:02:38PM -0500, Qian Cai wrote:
> On Mon, 2020-02-24 at 17:17 -0500, Dan Schatzberg wrote:
> > Existing uses of loop device may have multiple cgroups reading/writing
> > to the same device. Simply charging resources for I/O to the backing
> > file could result in priority inversion where one cgroup gets
> > synchronously blocked, holding up all other I/O to the loop device.
> > 
> > In order to avoid this priority inversion, we use a single workqueue
> > where each work item is a "struct loop_worker" which contains a queue of
> > struct loop_cmds to issue. The loop device maintains a tree mapping blk
> > css_id -> loop_worker. This allows each cgroup to independently make
> > forward progress issuing I/O to the backing file.
> > 
> > There is also a single queue for I/O associated with the rootcg which
> > can be used in cases of extreme memory shortage where we cannot allocate
> > a loop_worker.
> > 
> > The locking for the tree and queues is fairly heavy handed - we acquire
> > the per-loop-device spinlock any time either is accessed. The existing
> > implementation serializes all I/O through a single thread anyways, so I
> > don't believe this is any worse.
> > 
> > Signed-off-by: Dan Schatzberg <schatzberg.dan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> 
> The locking in loop_free_idle_workers() will trigger this with sysfs reading,
> 
> [ 7080.047167] LTP: starting read_all_sys (read_all -d /sys -q -r 10)
> [ 7239.842276] cpufreq transition table exceeds PAGE_SIZE. Disabling
> 
> [ 7247.054961] =====================================================
> [ 7247.054971] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> [ 7247.054983] 5.6.0-rc3-next-20200226 #2 Tainted: G           O     
> [ 7247.054992] -----------------------------------------------------
> [ 7247.055002] read_all/8513 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 7247.055014] c0000006844864c8 (&fs->seq){+.+.}, at: file_path+0x24/0x40
> [ 7247.055041] 
>                and this task is already holding:
> [ 7247.055061] c0002006bab8b928 (&(&lo->lo_lock)->rlock){..-.}, at:
> loop_attr_do_show_backing_file+0x3c/0x120 [loop]
> [ 7247.055078] which would create a new lock dependency:
> [ 7247.055105]  (&(&lo->lo_lock)->rlock){..-.} -> (&fs->seq){+.+.}
> [ 7247.055125] 
>                but this new dependency connects a SOFTIRQ-irq-safe lock:
> [ 7247.055155]  (&(&lo->lo_lock)->rlock){..-.}
> [ 7247.055156] 
>                ... which became SOFTIRQ-irq-safe at:
> [ 7247.055196]   lock_acquire+0x130/0x360
> [ 7247.055221]   _raw_spin_lock_irq+0x68/0x90
> [ 7247.055230]   loop_free_idle_workers+0x44/0x3f0 [loop]
> [ 7247.055242]   call_timer_fn+0x110/0x5f0
> [ 7247.055260]   run_timer_softirq+0x8f8/0x9f0
> [ 7247.055278]   __do_softirq+0x34c/0x8c8
> [ 7247.055288]   irq_exit+0x16c/0x1d0
> [ 7247.055298]   timer_interrupt+0x1f0/0x680
> [ 7247.055308]   decrementer_common+0x124/0x130
> [ 7247.055328]   arch_local_irq_restore.part.8+0x34/0x90
> [ 7247.055352]   cpuidle_enter_state+0x11c/0x8f0
> [ 7247.055361]   cpuidle_enter+0x50/0x70
> [ 7247.055389]   call_cpuidle+0x4c/0x90
> [ 7247.055398]   do_idle+0x378/0x470
> [ 7247.055414]   cpu_startup_entry+0x3c/0x40
> [ 7247.055442]   start_secondary+0x7a8/0xa80
> [ 7247.055461]   start_secondary_prolog+0x10/0x14

That's kind of hilarious.

So even though it's a spin_lock_irq(), suggesting it's used from both
process and irq context, Dan appears to be adding the first user that
actually runs from irq context. It looks like it should have been a
regular spinlock all along. Until now, anyway.

Fixing it should be straight-forward. Use get_file() under lock to pin
the file, drop the lock to do file_path(), release file with fput().

WARNING: multiple messages have this Message-ID (diff)

From: Johannes Weiner <hannes@cmpxchg.org>
To: Qian Cai <cai@lca.pw>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>,
	Jens Axboe <axboe@kernel.dk>, Tejun Heo <tj@kernel.org>,
	Li Zefan <lizefan@huawei.com>, Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>, Roman Gushchin <guro@fb.com>,
	Shakeel Butt <shakeelb@google.com>,
	Chris Down <chris@chrisdown.name>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"open list:BLOCK LAYER" <linux-block@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	"open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)" 
	<linux-mm@kvack.org>
Subject: Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker
Date: Thu, 27 Feb 2020 13:14:30 -0500	[thread overview]
Message-ID: <20200227181430.GA44024@cmpxchg.org> (raw)
In-Reply-To: <1582736558.7365.131.camel@lca.pw>

On Wed, Feb 26, 2020 at 12:02:38PM -0500, Qian Cai wrote:
> On Mon, 2020-02-24 at 17:17 -0500, Dan Schatzberg wrote:
> > Existing uses of loop device may have multiple cgroups reading/writing
> > to the same device. Simply charging resources for I/O to the backing
> > file could result in priority inversion where one cgroup gets
> > synchronously blocked, holding up all other I/O to the loop device.
> > 
> > In order to avoid this priority inversion, we use a single workqueue
> > where each work item is a "struct loop_worker" which contains a queue of
> > struct loop_cmds to issue. The loop device maintains a tree mapping blk
> > css_id -> loop_worker. This allows each cgroup to independently make
> > forward progress issuing I/O to the backing file.
> > 
> > There is also a single queue for I/O associated with the rootcg which
> > can be used in cases of extreme memory shortage where we cannot allocate
> > a loop_worker.
> > 
> > The locking for the tree and queues is fairly heavy handed - we acquire
> > the per-loop-device spinlock any time either is accessed. The existing
> > implementation serializes all I/O through a single thread anyways, so I
> > don't believe this is any worse.
> > 
> > Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> The locking in loop_free_idle_workers() will trigger this with sysfs reading,
> 
> [ 7080.047167] LTP: starting read_all_sys (read_all -d /sys -q -r 10)
> [ 7239.842276] cpufreq transition table exceeds PAGE_SIZE. Disabling
> 
> [ 7247.054961] =====================================================
> [ 7247.054971] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> [ 7247.054983] 5.6.0-rc3-next-20200226 #2 Tainted: G           O     
> [ 7247.054992] -----------------------------------------------------
> [ 7247.055002] read_all/8513 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 7247.055014] c0000006844864c8 (&fs->seq){+.+.}, at: file_path+0x24/0x40
> [ 7247.055041] 
>                and this task is already holding:
> [ 7247.055061] c0002006bab8b928 (&(&lo->lo_lock)->rlock){..-.}, at:
> loop_attr_do_show_backing_file+0x3c/0x120 [loop]
> [ 7247.055078] which would create a new lock dependency:
> [ 7247.055105]  (&(&lo->lo_lock)->rlock){..-.} -> (&fs->seq){+.+.}
> [ 7247.055125] 
>                but this new dependency connects a SOFTIRQ-irq-safe lock:
> [ 7247.055155]  (&(&lo->lo_lock)->rlock){..-.}
> [ 7247.055156] 
>                ... which became SOFTIRQ-irq-safe at:
> [ 7247.055196]   lock_acquire+0x130/0x360
> [ 7247.055221]   _raw_spin_lock_irq+0x68/0x90
> [ 7247.055230]   loop_free_idle_workers+0x44/0x3f0 [loop]
> [ 7247.055242]   call_timer_fn+0x110/0x5f0
> [ 7247.055260]   run_timer_softirq+0x8f8/0x9f0
> [ 7247.055278]   __do_softirq+0x34c/0x8c8
> [ 7247.055288]   irq_exit+0x16c/0x1d0
> [ 7247.055298]   timer_interrupt+0x1f0/0x680
> [ 7247.055308]   decrementer_common+0x124/0x130
> [ 7247.055328]   arch_local_irq_restore.part.8+0x34/0x90
> [ 7247.055352]   cpuidle_enter_state+0x11c/0x8f0
> [ 7247.055361]   cpuidle_enter+0x50/0x70
> [ 7247.055389]   call_cpuidle+0x4c/0x90
> [ 7247.055398]   do_idle+0x378/0x470
> [ 7247.055414]   cpu_startup_entry+0x3c/0x40
> [ 7247.055442]   start_secondary+0x7a8/0xa80
> [ 7247.055461]   start_secondary_prolog+0x10/0x14

That's kind of hilarious.

So even though it's a spin_lock_irq(), suggesting it's used from both
process and irq context, Dan appears to be adding the first user that
actually runs from irq context. It looks like it should have been a
regular spinlock all along. Until now, anyway.

Fixing it should be straight-forward. Use get_file() under lock to pin
the file, drop the lock to do file_path(), release file with fput().

next prev parent reply	other threads:[~2020-02-27 18:14 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-24 22:17 [PATCH v3 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
2020-02-24 22:17 ` Dan Schatzberg
2020-02-24 22:17 ` Dan Schatzberg
     [not found] ` <cover.1582581887.git.schatzberg.dan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-02-24 22:17   ` [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker Dan Schatzberg
2020-02-24 22:17     ` Dan Schatzberg
2020-02-24 22:17     ` Dan Schatzberg
2020-02-26 17:02     ` Qian Cai
     [not found]       ` <1582736558.7365.131.camel-J5quhbR+WMc@public.gmane.org>
2020-02-27 18:14         ` Johannes Weiner [this message]
2020-02-27 18:14           ` Johannes Weiner
2020-02-24 22:17 ` [PATCH v3 2/3] mm: Charge active memcg when no mm is set Dan Schatzberg
2020-02-24 22:17   ` Dan Schatzberg
2020-02-24 22:17   ` Dan Schatzberg
2020-02-24 22:26   ` Hugh Dickins
2020-02-24 22:17 ` [PATCH v3 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
2020-02-24 22:17   ` Dan Schatzberg
2020-02-24 22:17   ` Dan Schatzberg
  -- strict thread matches above, loose matches on Subject: below --
2020-02-20 16:51 [PATCH v3 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
     [not found] ` <cover.1582216294.git.schatzberg.dan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2020-02-20 16:51   ` [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker Dan Schatzberg
2020-02-20 16:51     ` Dan Schatzberg
2020-02-20 16:51     ` Dan Schatzberg
2020-02-20 17:50     ` Johannes Weiner
2020-02-20 22:00     ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200227181430.GA44024@cmpxchg.org \
    --to=hannes-druugvl0lcnafugrpc6u6w@public.gmane.org \
    --cc=akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org \
    --cc=axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org \
    --cc=cai-J5quhbR+WMc@public.gmane.org \
    --cc=cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=chris-6Bi1550iOqEnzZ6mRAm98g@public.gmane.org \
    --cc=guro-b10kYP2dOMg@public.gmane.org \
    --cc=hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=linux-block-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org \
    --cc=lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org \
    --cc=mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=schatzberg.dan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org \
    --cc=tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org \
    --cc=tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org \
    --cc=vdavydov.dev-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.