From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vivek Goyal
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback
Date: Thu, 7 Apr 2011 15:24:24 -0400
Message-ID: <20110407192424.GE27778@redhat.com>
References: <20110331222756.GC2904@dastard> <20110401171838.GD20986@redhat.com> <20110401214947.GE6957@dastard> <20110405131359.GA14239@redhat.com> <20110405225639.GB31057@dastard> <20110406153954.GB18777@redhat.com> <20110406233602.GK31057@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Cc: Greg Thelen, Curt Wohlgemuth, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org
To: Dave Chinner
Content-Disposition: inline
In-Reply-To: <20110406233602.GK31057@dastard>

On Thu, Apr 07, 2011 at 09:36:02AM +1000, Dave Chinner wrote:

[..]
> > mark_inode_dirty(inode)
> >   If I_DIRTY is clear, set I_DIRTY and insert the inode into
> >   bdi_memcg->b_dirty, using current->memcg as a key to select the
> >   correct list.
> >     This will require memory allocation of a bdi_memcg, if this is
> >     the first inode within the bdi,memcg.  If the allocation fails
> >     (rare, but possible), then fall back to adding the memcg to the
> >     root cgroup dirty inode list.
> >   If I_DIRTY is already set, then do nothing.
>
> This is where it gets tricky. Page cache dirtiness is tracked via
> I_DIRTY_PAGES, a subset of I_DIRTY. I_DIRTY_DATASYNC and
> I_DIRTY_SYNC are for inode metadata changes, and a lot of
> filesystems track those themselves. Indeed, XFS doesn't mark inodes
> dirty at the VFS for I_DIRTY_*SYNC for pure metadata operations any
> more, and there's no way that tracking can be made cgroup aware.
>
> Hence it can be the case that only I_DIRTY_PAGES is tracked in
> the VFS dirty lists, and that is the flag you need to care about
> here.
>
> Further, we are actually looking at formalising this - changing the
> .dirty_inode() operation to take the dirty flags and return a result
> that indicates whether the inode should be tracked in the VFS dirty
> list at all. This would stop double tracking of dirty inodes and go
> a long way to solving some of the behavioural issues we have now
> (e.g. the VFS tracking and trying to writeback inodes that the
> filesystem has already cleaned).
>
> Hence I think you need to be explicit that this tracking is
> specifically for I_DIRTY_PAGES state, though will handle other dirty
> inode states if desired by the filesystem.

Ok, that makes sense. We are interested primarily in the I_DIRTY_PAGES
state. IIUC, we first need to fix the existing code, which seems to move
an inode onto the bdi writeback list based on the I_DIRTY flag alone.

BTW, what's the difference between I_DIRTY_DATASYNC and I_DIRTY_PAGES?
To me both seem to mean that data needs to be written back and not the
inode itself.

>
> > When I_DIRTY is cleared, remove the inode from bdi_memcg->b_dirty.
> > Delete the bdi_memcg if the list is now empty.
> >
> > balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(memcg, bdi)
> >   if over bg limit, then
> >     set bdi_memcg->b_over_limit
> >       If there is no bdi_memcg (because all inodes of current's
> >       memcg dirty pages were first dirtied by another memcg), then
> >       scan the memcg lru to find an inode and call
> >       writeback_single_inode().  This is to handle uncommon sharing.
>
> We don't want to introduce any new IO sources into
> balance_dirty_pages(). This needs to trigger memcg-LRU based bdi
> flusher writeback, not try to write back inodes itself.

Won't we get more sequential IO traffic once we find an inode by
traversing the memcg->lru list? So isn't that better than pure LRU
based flushing?
>
> Alternatively, this problem won't exist if you transfer page cache
> state from one memcg to another when you move the inode from one
> memcg to another.

But in the case of a shared inode the problem still remains. The inode
is being written from two cgroups, and it can't be in both groups as
per the existing design.

>
> > reference memcg for bdi flusher
> > awake bdi flusher
> > if over fg limit
> >   IO-full: write bdi_memcg directly (if empty, use memcg lru to
> >   find an inode to write)
> >
> >   Clarification: In IO-less: queue a memcg-waiting description to
> >   the bdi flusher waiters (balance_list).
>
> I'd be looking at designing for IO-less throttling up front....

Agreed. Let's design it on top of the IO-less throttling patches. We
shall also have to modify IO-less throttling a bit so that page
completions are not uniformly distributed across all the waiting
threads: we need to account for groups first and then distribute
completions uniformly within each group.

[..]

> > use over_bground_thresh() to check the global background limit.
>
> the background flush needs to continue while over the global limit
> even if all the memcg's are under their limits. In which case, we
> need to consider if we need to be fair when writing back memcg's on
> a bdi i.e. do we cycle an inode at a time until b_io is empty, then
> cycle to the next memcg, and not come back to the first memcg with
> inodes queued on b_more_io until they all have empty b_io queues?
>

I think continuing to cycle through the memcg's even in this case makes
sense.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html