linux-mm.kvack.org archive mirror
From: Sha Zhengju <handai.szj@gmail.com>
To: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Vivek Goyal <vgoyal@redhat.com>, Tejun Heo <tj@kernel.org>,
	Jens Axboe <axboe@kernel.dk>,
	"devel@openvz.org" <devel@openvz.org>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>
Subject: Re: [PATCH RFC] fsio: filesystem io accounting cgroup
Date: Wed, 10 Jul 2013 16:37:33 +0800	[thread overview]
Message-ID: <CAFj3OHVPemFmrFdUtMSEzLf2=MaQd0ouS1RY79Lrut7FHC6A5g@mail.gmail.com> (raw)
In-Reply-To: <51DCF92F.6090409@openvz.org>

On Wed, Jul 10, 2013 at 2:03 PM, Konstantin Khlebnikov
<khlebnikov@openvz.org> wrote:
> Sha Zhengju wrote:
>>
>> Hi,
>>
>> On Mon, Jul 8, 2013 at 5:59 PM, Konstantin Khlebnikov
>> <khlebnikov@openvz.org>  wrote:
>>>
>>> >  This is a proof of concept, just basic functionality for an IO
>>> > controller.
>>> >  This cgroup will control filesystem usage at the vfs layer; its main
>>> > goal is bandwidth control. It's supposed to be much more lightweight
>>> > than memcg/blkio.
>>> >
>>> >  This patch shows an easy way of accounting pages in dirty/writeback
>>> > state in a per-inode manner. This is easier than doing it in memcg in a
>>> > per-page manner. The main idea is to keep on each inode a pointer
>>> > (->i_fsio) to the cgroup which owns the dirty data in that inode. It's
>>> > set by fsio_account_page_dirtied() when the first dirty tag appears in
>>> > the inode. Relying on the mapping tags gives us locking for free; this
>>> > patch doesn't add any new locks to hot paths.
>>
>> While referring to dirty/writeback numbers, what I care about is "how
>> many dirty pages out of how much memory", and later we may use that
>> proportion to decide on throttling or something else. So if you are
>> talking about the number of dirty pages without memcg's amount of
>> memory, I don't see the meaning of a single number.
>
>
> I'm planning to add some thresholds or limits to the fsio cgroup -- how
> many dirty pages this cgroup may have. memcg is a completely different
> thing: memcg controls data storage while fsio controls data flows. Memcg
> already handles too much; I just don't want to add yet more unrelated
> stuff into it. Otherwise we will end up with one single

I don't think handling dirty pages is unrelated to memcg. See the
problem others have met using memcg:
https://lists.linux-foundation.org/pipermail/containers/2013-June/032917.html
Memcg dirty/writeback accounting is also only an appetizer; the
long-term work is memcg-aware dirty throttling and possibly a per-memcg
flusher. Greg has already done some work here.


> controller which would handle all possible resources, because they are
> all related in some cases.
>
>
>>
>> What's more, counting dirty/writeback stats in a per-inode manner can
>> bring inaccuracy in some situations: consider two tasks from different
>> fsio cgroups dirtying one file concurrently but only being counted in
>> one fsio's stats, or a task being moved to another fsio cgroup after
>> dirtying a file. Speaking of task moving, it is the root cause of
>> adding the memcg locks in the page stat routines, since there is a
>> race window between 'modify cgroup owner' and 'update stats using the
>> cgroup pointer'. But if you are going to handle task moves or take
>> care of ->i_fsio for better accuracy in the future, I'm afraid you
>> will also need some synchronization mechanism in hot paths. Maybe a
>> new lock, or mapping->tree_lock (which is already hot enough) IMHO.
>
>
> Yes, per-inode accounting is less accurate. But this approach works
> really well in real life. I don't want to add new locks and lose
> performance just to fix accuracy for some artificial cases.
>
> to Tejun:
> BTW I don't like the volatility of a task's cgroup pointers. I'd like
> to forbid moving tasks between cgroups except from 'current'; the
> existing behavior can be kept with the help of task_work: instead of
> externally changing task->cgroups, we can schedule a task_work into the
> target task and change that pointer in its 'current' context. That will
> save us a lot of rcu_lock/unlock and atomic operations when grabbing
> temporary pointers to the current cgroup, because current->cgroups will
> be stable. I don't think that external cross-cgroup task migration is
> really performance-critical. Currently I don't know what to do with
> kernel threads and workqueues, but anyway this problem doesn't look
> unsolvable.
>



--
Thanks,
Sha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

Thread overview: 29+ messages
2013-07-08  9:59 [PATCH RFC] fsio: filesystem io accounting cgroup Konstantin Khlebnikov
2013-07-10  4:43 ` Sha Zhengju
2013-07-10  6:03   ` Konstantin Khlebnikov
2013-07-10  8:37     ` Sha Zhengju [this message]
  -- strict thread matches above, loose matches on Subject: below --
2013-07-08 10:01 Konstantin Khlebnikov
2013-07-08 17:00 ` Tejun Heo
2013-07-08 17:52   ` Vivek Goyal
2013-07-08 17:56     ` Tejun Heo
2013-07-09  8:28       ` Konstantin Khlebnikov
2013-07-09 12:57         ` Tejun Heo
2013-07-09 13:15           ` Konstantin Khlebnikov
2013-07-09 13:16             ` Tejun Heo
2013-07-09 13:16               ` Tejun Heo
2013-07-09 13:43                 ` Konstantin Khlebnikov
2013-07-09 13:45                   ` Tejun Heo
2013-07-09 14:18                     ` Vivek Goyal
2013-07-09 14:29                       ` Tejun Heo
2013-07-09 14:54                         ` Vivek Goyal
2013-07-09 15:08                           ` Tejun Heo
     [not found]                             ` <20130710030955.GA3569@redhat.com>
2013-07-10  3:50                               ` Tejun Heo
2013-07-09 14:35                     ` Konstantin Khlebnikov
2013-07-09 14:42                       ` Tejun Heo
2013-07-09 15:06                       ` Vivek Goyal
2013-07-09 17:42                         ` Konstantin Khlebnikov
2013-07-09 18:35                           ` Vivek Goyal
2013-07-09 20:54                             ` Konstantin Khlebnikov
2013-07-08 18:11 ` Vivek Goyal
2013-07-09 15:39 ` Theodore Ts'o
2013-07-09 17:12   ` Konstantin Khlebnikov
