From: Curt Wohlgemuth
Subject: Re: [Lsf] IO less throttling and cgroup aware writeback (Was: Re: Preliminary Agenda and Activities for LSF)
Date: Wed, 6 Apr 2011 07:49:25 -0700
To: Dave Chinner
Cc: Vivek Goyal, Greg Thelen, James Bottomley, lsf@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org

On Tue, Apr 5, 2011 at 3:56 PM, Dave Chinner wrote:
> On Tue, Apr 05, 2011 at 09:13:59AM -0400, Vivek Goyal wrote:
>> On Sat, Apr 02, 2011 at 08:49:47AM +1100, Dave Chinner wrote:
>> > On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>> > > On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>> > > > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>> > > > > There is no context (memcg or otherwise) given to the bdi
>> > > > > flusher. After the bdi flusher checks system-wide background
>> > > > > limits, it uses the over_bg_limit list to find (and rotate)
>> > > > > an over-limit memcg. Using that memcg, the per-memcg per-bdi
>> > > > > dirty inode list is walked to find inode pages to write back.
>> > > > > Once the memcg dirty memory usage drops below the
>> > > > > memcg-thresh, the memcg is removed from the global
>> > > > > over_bg_limit list.
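
For concreteness, here is a rough sketch of the flusher pass described
above. All of the function names below are hypothetical, invented only
to illustrate the control flow; none of them are from mainline or from
any posted patch (only over_bground_thresh() resembles an existing
fs/fs-writeback.c helper):

    /*
     * Hypothetical sketch of the over-background-limit flush pass.
     * Every name below is illustrative; only the flow matters.
     */
    static void bdi_flush_over_bg_limit(struct backing_dev_info *bdi)
    {
            struct mem_cgroup *memcg;

            /* system-wide background limits are checked first */
            if (!over_bground_thresh())
                    return;

            /* find (and rotate) an over-limit memcg on the global list */
            while ((memcg = over_bg_limit_next()) != NULL) {
                    /* walk the per-memcg, per-bdi dirty inode list */
                    memcg_bdi_writeback_inodes(memcg, bdi);

                    /*
                     * Once usage drops below the memcg threshold, the
                     * memcg comes off the global over_bg_limit list.
                     */
                    if (memcg_dirty_usage(memcg) < memcg_dirty_thresh(memcg))
                            over_bg_limit_remove(memcg);
            }
    }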
>> > > >
>> > > > If you want controlled hand-off of writeback, you need to pass the
>> > > > memcg that triggered the throttling directly to the bdi. You already
>> > > > know what both the bdi and memcg that need writeback are. Yes, this
>> > > > needs concurrency at the BDI flush level to handle, but see my
>> > > > previous email in this thread for that....
>> > > >
>> > >
>> > > Even with the memcg being passed around, I don't think that we get
>> > > rid of the global list lock.
> .....
>> > > The reason is that inodes are not exclusive to memory cgroups.
>> > > Multiple memory cgroups might be writing to the same inode. So the
>> > > inode still remains on the global list, and memory cgroups will, in
>> > > effect, hold a pointer to it.
>> >
>> > So two dirty inode lists that have to be kept in sync? That doesn't
>> > sound particularly appealing. Nor does it scale to an inode being
>> > dirty in multiple cgroups.
>> >
>> > Besides, if you've got multiple memory cgroups dirtying the same
>> > inode, then you cannot expect isolation between groups. I'd consider
>> > this a broken configuration in that case - how often does it
>> > actually happen, and what is the use case for supporting it?
>> >
>> > Besides, the implication is that we'd have to break up contiguous
>> > IOs in the writeback path simply because two sequential pages are
>> > associated with different groups. That's really nasty, and exactly
>> > the opposite of all the write combining we try to do throughout the
>> > writeback path. Supporting this is also a mess, as we'd have to
>> > touch quite a lot of filesystem code (i.e. .writepage(s)
>> > implementations) to do this.
>>
>> We did not plan on breaking up contiguous IO even where it belonged to
>> different cgroups, for performance reasons. So we can probably live
>> with some inaccuracy and just trigger writeback for one inode, even if
>> that means writing back the pages of some other cgroups doing IO on
>> that inode.
>
> Which, to me, violates the principle of isolation that this
> functionality, as it's been described, is supposed to provide.
>
> It also means you will have to handle the case of a cgroup over its
> throttle limit with no inodes on its dirty list. It's not a case of
> "probably can live with" the resultant mess; the mess will occur, and
> handling it needs to be designed in from the start.
>
>> > > So to start writeback on an inode
>> > > you still shall have to take the global lock, IIUC.
>> >
>> > Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> > in cgroup, and go from there? I mean, really, all that cgroup-aware
>> > writeback needs is a new container for managing dirty inodes in the
>> > writeback path and a method for selecting that container for
>> > writeback, right?
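
In data-structure terms, that container hierarchy might look something
like the sketch below. Struct and field names are invented for
illustration; nothing here is from a posted patch:

    /*
     * Hypothetical bdi -> dirty cgroups -> dirty inodes layout.
     * All names are illustrative only.
     */
    struct bdi_memcg_dirty {
            struct list_head  bd_node;    /* on bdi->dirty_memcgs */
            struct mem_cgroup *bd_memcg;  /* the cgroup this entry tracks */
            struct list_head  bd_inodes;  /* inodes dirtied by bd_memcg */
    };

    struct backing_dev_info {
            /* ... existing fields elided ... */
            struct list_head  dirty_memcgs; /* dirty cgroups on this bdi */
    };

Selecting a container for writeback would then mean picking an entry
off bdi->dirty_memcgs (round-robin, or whichever cgroup is furthest
over its threshold) and walking its bd_inodes list, so the per-inode
path presumably never needs a global lock.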
>>
>> This was the initial design, where one inode is associated with one
>> cgroup even if processes from multiple cgroups are doing IO to the
>> same inode. Then somebody raised the concern that it is probably too
>> coarse.
>
> Got a pointer?
>
>> IMHO, as a first step, associating an inode with one cgroup
>> exclusively simplifies things considerably, and we can target that
>> first.
>>
>> So yes, I agree that bdi->list_of_dirty_cgroups->list_of_dirty_inodes
>> makes sense and is a relatively simple way of doing things, at the
>> expense of not being accurate for the shared inode case.
>
> Can someone describe a valid shared inode use case? If not, we
> should not even consider it as a requirement and explicitly document
> it as a "not supported" use case.

At the very least, when a task is moved from one cgroup to another,
we've got a shared inode case. This probably won't happen more than
once for most tasks, but it will likely be common.

Curt

>
> As it is, I'm hearing different ideas and requirements from the
> people working on the memcg side of this vs the IO controller side.
> Perhaps the first step is documenting a common set of functional
> requirements that demonstrates how everything will play well
> together?
>
> e.g. defining what isolation means, when and if it can be violated,
> how violations are handled, when inodes in multiple memcgs are
> acceptable and how they need to be accounted and handled by the
> writepage path, how memcgs over the dirty threshold with no dirty
> inodes are to be handled, how metadata IO is going to be handled by
> IO controllers, what kswapd is going to do for writeback when the
> pages it's trying to write back during a critical low-memory event
> belong to a cgroup that is throttled at the IO level, etc.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com