From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vivek Goyal
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback
Date: Thu, 7 Apr 2011 15:24:24 -0400
Message-ID: <20110407192424.GE27778@redhat.com>
References: <20110331222756.GC2904@dastard> <20110401171838.GD20986@redhat.com> <20110401214947.GE6957@dastard> <20110405131359.GA14239@redhat.com> <20110405225639.GB31057@dastard> <20110406153954.GB18777@redhat.com> <20110406233602.GK31057@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
To: Dave Chinner
Content-Disposition: inline
In-Reply-To: <20110406233602.GK31057@dastard>

On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:

[..]
> > mark_inode_dirty(inode)
> >   If I_DIRTY is clear, set I_DIRTY and insert the inode into
> >   bdi_memcg->b_dirty, using current->memcg as a key to select the
> >   correct list.
> >     This will require memory allocation of a bdi_memcg, if this is
> >     the first inode within the bdi,memcg.  If the allocation fails
> >     (rare, but possible), then fall back to adding the memcg to the
> >     root cgroup dirty inode list.
> >   If I_DIRTY is already set, then do nothing.
>
> This is where it gets tricky. Page cache dirtiness is tracked via
> I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
> I_DIRTY_SYNC are for inode metadata changes, and a lot of
> filesystems track those themselves. Indeed, XFS doesn't mark inodes
> dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
> more, and there's no way that tracking can be made cgroup aware.
>
> Hence it can be the case that only I_DIRTY_PAGES is tracked in
> the VFS dirty lists, and that is the flag you need to care about
> here.
>
> Further, we are actually looking at formalising this - changing the
> .dirty_inode() operation to take the dirty flags and return a result
> that indicates whether the inode should be tracked in the VFS dirty
> list at all. This would stop double tracking of dirty inodes and go
> a long way to solving some of the behavioural issues we have now
> (e.g. the VFS tracking and trying to writeback inodes that the
> filesystem has already cleaned).
>
> Hence I think you need to be explicit that this tracking is
> specifically for I_DIRTY_PAGES state, though will handle other dirty
> inode states if desired by the filesystem.

Ok, that makes sense. We are interested primarily in the I_DIRTY_PAGES
state. IIUC, we first need to fix the existing code, which seems to move
an inode onto the bdi writeback list based on the I_DIRTY flag alone.

BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES?
To me both seem to mean that data needs to be written back and not the
inode itself.

>
> > When I_DIRTY is cleared, remove the inode from bdi_memcg->b_dirty.
> > Delete the bdi_memcg if the list is now empty.
> >
> > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >   if over bg limit, then
> >     set bdi_memcg->b_over_limit
> >       If there is no bdi_memcg (because all inodes of current's
> >       memcg dirty pages were first dirtied by another memcg), then
> >       scan the memcg lru to find an inode and call
> >       writeback_single_inode().  This is to handle uncommon sharing.
>
> We don't want to introduce any new IO sources into
> balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> flusher writeback, not try to write back inodes itself.

Won't we get more sequential IO traffic once we find an inode by
traversing the memcg->lru list? So isn't that better than pure LRU
based flushing?
>
> Alternatively, this problem won't exist if you transfer page cache
> state from one memcg to another when you move the inode from one
> memcg to another.

But in the case of a shared inode the problem still remains. The inode
is being written from two cgroups, and it can't be in both groups as
per the existing design.

>
> > reference memcg for bdi flusher
> > awake bdi flusher
> > if over fg limit
> >   IO-full: write bdi_memcg directly (if empty, use memcg lru to
> >   find an inode to write)
> >
> >   Clarification: In IO-less: queue a memcg-waiting description to
> >   the bdi flusher waiters (balance_list).
>
> I'd be looking at designing for IO-less throttling up front....

Agreed. Let's design it on top of the IO-less throttling patches. We
shall also have to modify IO-less throttling a bit so that page
completions are not uniformly distributed across all the waiting
threads: we need to account for groups first and then distribute
completions uniformly within each group.

[..]

> > use over_bground_thresh() to check the global background limit.
>
> the background flush needs to continue while over the global limit
> even if all the memcg's are under their limits. In which case, we
> need to consider if we need to be fair when writing back memcg's on
> a bdi i.e. do we cycle an inode at a time until b_io is empty, then
> cycle to the next memcg, and not come back to the first memcg with
> inodes queued on b_more_io until they all have empty b_io queues?
>

I think continuing to cycle through the memcg's even in this case makes
sense.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html