From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <489AA83F.1040306@gmail.com>
Date: Thu, 7 Aug 2008 09:46:07 +0200 (MEST)
From: Andrea Righi
Reply-To: righi.andrea@gmail.com
To: Fernando Luis Vázquez Cao
Cc: Dave Hansen, Ryo Tsuruta, yoshikawa.takuya@oss.ntt.co.jp,
 taka@valinux.co.jp, uchida@ap.jp.nec.com, ngupta@google.com,
 linux-kernel@vger.kernel.org, dm-devel@redhat.com,
 containers@lists.linux-foundation.org,
 virtualization@lists.linux-foundation.org,
 xen-devel@lists.xensource.com, agk@sourceware.org
Subject: Re: RFC: I/O bandwidth controller (was Re: Too many I/O controller patches)
References: <20080804.175126.193692178.ryov@valinux.co.jp>
 <1217870433.20260.101.camel@nimitz>
 <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>
In-Reply-To: <1217985189.3154.57.camel@sebastian.kern.oss.ntt.co.jp>
X-Mailing-List: linux-kernel@vger.kernel.org

Fernando Luis Vázquez Cao wrote:
> This RFC ended up being a bit longer than I had originally intended, but
> hopefully it will serve as the start of a fruitful discussion.

Thanks for posting this detailed RFC! A few comments below.

> As you pointed out, it seems that there is not much consensus building
> going on, but that does not mean there is a lack of interest. To get the
> ball rolling it is probably a good idea to clarify the state of things
> and try to establish what we are trying to accomplish.
>
> *** State of things in the mainstream kernel
> The kernel has had somewhat advanced I/O control capabilities for quite
> some time now: CFQ. But the current CFQ has some problems:
>   - I/O priority can be set by PID, PGRP, or UID, but...
>   - ...all the processes that fall within the same class/priority are
> scheduled together and arbitrary groupings are not possible.
>   - Buffered I/O is not handled properly.
>   - CFQ's IO priority is an attribute of a process that affects all
> devices it sends I/O requests to. In other words, with the current
> implementation it is not possible to assign per-device IO priorities to
> a task.
>
> *** Goals
>   1. Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity).
>   2. Being able to perform I/O bandwidth control independently on each
> device.
>   3. I/O bandwidth shaping.
>   4. Scheduler-independent I/O bandwidth control.
>   5. Usable with stacking devices (md, dm and other devices of that
> ilk).
>   6. I/O tracking (handle buffered and asynchronous I/O properly).

The same goals should also apply to IO operations/sec (bandwidth meant
not only in terms of bytes/sec), plus:

7. Optimal bandwidth usage: allow a cgroup to exceed its IO limits to
take advantage of free/unused IO resources (i.e. allow "bursts" when the
whole physical bandwidth of a block device is not fully used, and
"throttle" again when IO from unlimited cgroups comes into play)

8. "Fair throttling": avoid always throttling the same task within a
cgroup, and try to distribute the throttling among all the tasks
belonging to the throttled cgroup

> The list of goals above is not exhaustive and it is also likely to
> contain some not-so-nice-to-have features so your feedback would be
> appreciated.
>
> 1. & 2.- Cgroups-aware I/O scheduling (being able to define arbitrary
> groupings of processes and treat each group as a single scheduling
> entity)
>
> We obviously need this because our final goal is to be able to control
> the IO generated by a Linux container. The good news is that we already
> have the cgroups infrastructure so, regarding this problem, we would
> just have to transform our I/O bandwidth controller into a cgroup
> subsystem.
>
> This seems to be the easiest part, but the current cgroups
> infrastructure has some limitations when it comes to dealing with block
> devices: impossibility of creating/removing certain control structures
> dynamically and hardcoding of subsystems (i.e. resource controllers).
> This makes it difficult to handle block devices that can be hotplugged
> and go away at any time (this applies not only to usb storage but also
> to some SATA and SCSI devices). To cope with this situation properly we
> would need hotplug support in cgroups, but, as suggested before and
> discussed in the past (see (0) below), there are some limitations.
>
> Even in the non-hotplug case it would be nice if we could treat each
> block I/O device as an independent resource, which means we could do
> things like allocating I/O bandwidth on a per-device basis. As long as
> performance is not compromised too much, adding some kind of basic
> hotplug support to cgroups is probably worth it.
>
> (0) http://lkml.org/lkml/2008/5/21/12

What about using major,minor numbers to identify each device and account
IO statistics? If a device is unplugged we could reset the IO statistics
and/or remove the IO limits for that device from userspace (e.g. with a
daemon), but plugging/unplugging the device would not be blocked or
affected in any case. Or am I oversimplifying the problem?

> 3. & 4. & 5. - I/O bandwidth shaping & General design aspects
>
> The implementation of an I/O scheduling algorithm is to a certain extent
> influenced by what we are trying to achieve in terms of I/O bandwidth
> shaping, but, as discussed below, the required accuracy can determine
> the layer where the I/O controller has to reside. Off the top of my
> head, there are three basic operations we may want to perform:
>   - I/O nice prioritization: ionice-like approach.
>   - Proportional bandwidth scheduling: each process/group of processes
> has a weight that determines the share of bandwidth they receive.
>   - I/O limiting: set an upper limit to the bandwidth a group of tasks
> can use.

Using deadline-based IO scheduling could be an interesting path to
explore as well, IMHO, to try to guarantee per-cgroup minimum bandwidth
requirements.

>
> If we are pursuing an I/O prioritization model à la CFQ the temptation is
> to implement it at the elevator layer or extend any of the existing I/O
> schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device ignoring the fact that they are part of a stacking device. This
> lack of information at the elevator layer makes it pretty difficult to
> obtain accurate results when using stacking devices. It seems that unless
> we can make the elevator layer aware of the topology of stacking devices
> (possibly by extending the elevator API?) elevator-based approaches do
> not constitute a generic solution. Here onwards, for discussion
> purposes, I will refer to this type of I/O bandwidth controller as
> elevator-based I/O controllers.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block layer
> either at the pagecache level (when pages are dirtied) or at the entry
> point to the generic block layer (generic_make_request()). Andrea's I/O
> throttling patches stick to the former variant (see (4) below) and
> Tsuruta-san and Takahashi-san's dm-ioband (see (5) below) take the latter
> approach. The rationale is that by hooking into the source of I/O
> requests we can perform I/O control in a topology-agnostic and
> elevator-agnostic way. I will refer to this new type of I/O bandwidth
> controller as block layer I/O controller.
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an elevator-based
> I/O controller, so that the maximum throughput can be squeezed from the
> physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's i/o bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends so performing I/O control just in the synchronous
> I/O path would lead to large inaccuracy. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> imbricated tracking and accounting parts of the memory resource
> controller were split, the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However, controlling
> the rate at which a cgroup can generate dirty pages seems to be a task
> that belongs in the memory controller, not the I/O controller. As Dave
> and Paul suggested it's probably better to delegate this to the memory
> controller. In fact, it seems that Yamamoto-san is cooking some patches
> that implement just that: dirty balancing for cgroups (see (7) for
> details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and myself
> are working on this and hopefully will be sending patches soon).
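
Just to check that I read the tracking idea correctly, here is a toy
userspace sketch in C (not kernel code). Every name in it (page_owner,
charged, page_dirtied, page_written_back) is made up for illustration
and is not the bio-cgroup interface, and the "first dirtier owns the
page" policy is only my assumption. The point it tries to show is that
the owner is recorded at dirtying time and the writeback is charged to
that owner, no matter which thread does the actual write.

```c
#include <assert.h>

#define NR_PAGES   8     /* toy page cache size (assumption) */
#define NR_CGROUPS 4     /* toy number of cgroups (assumption) */
#define NO_OWNER   (-1)

/* Hypothetical tracking state: who dirtied each page, bytes charged per cgroup. */
static int page_owner[NR_PAGES];
static unsigned long charged[NR_CGROUPS];

static void tracking_init(void)
{
	for (int i = 0; i < NR_PAGES; i++)
		page_owner[i] = NO_OWNER;
	for (int i = 0; i < NR_CGROUPS; i++)
		charged[i] = 0;
}

/* Runs in the context of the task that dirties the page. */
static void page_dirtied(int page, int cgroup)
{
	assert(page >= 0 && page < NR_PAGES);
	assert(cgroup >= 0 && cgroup < NR_CGROUPS);
	/* Assumed policy: the first dirtier owns the page until writeback. */
	if (page_owner[page] == NO_OWNER)
		page_owner[page] = cgroup;
}

/*
 * Runs in the flusher context (pdflush-like). The identity of the caller
 * is deliberately ignored: the I/O is charged to the recorded owner.
 * Returns the charged cgroup, or NO_OWNER if the page was not dirty.
 */
static int page_written_back(int page, unsigned long bytes)
{
	assert(page >= 0 && page < NR_PAGES);
	int owner = page_owner[page];

	if (owner == NO_OWNER)
		return NO_OWNER;
	charged[owner] += bytes;
	page_owner[page] = NO_OWNER;   /* the page is clean again */
	return owner;
}
```

For example, if cgroup 1 dirties page 0 and cgroup 2 dirties page 1,
writing both pages back from any context charges each cgroup for its
own page. A real implementation would of course have to hang this
information off the existing page tracking structures instead of a
static array.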
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> cohabitate. As discussed before, all of them need I/O tracking
> capabilities so I would like to suggest the plan below to get things
> started:
>
>   - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
>   - Fix CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable this should suffice to
> convince the respective maintainer and get the I/O tracking patches
> merged.
>   - Implement a block layer resource controller. dm-ioband is a working
> solution and feature-rich but its dependency on the dm infrastructure is
> likely to find opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
>   - If the I/O tracking patches make it into the kernel we could move on
> and try to get the Cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
>   - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.

Very nice RFC.

-Andrea
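
P.S. To make the "bursts" in goal 7 a bit more concrete, below is a
minimal userspace sketch in C of a token bucket with debt. This is only
one possible way to do it, not a description of any of the posted
patches, and all the names (struct iot_bucket, iot_init, iot_refill,
iot_charge) are hypothetical. Note also that a token bucket only lets a
group spend credit it saved up while it was idle itself; it does not
detect spare capacity on the device, so it is a simpler stand-in for
what goal 7 really asks for.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup, per-device limit, modelled as a token bucket. */
struct iot_bucket {
	int64_t rate;     /* sustained limit, bytes per second */
	int64_t burst;    /* maximum credit that can be saved up, bytes */
	int64_t tokens;   /* current credit; negative means the group is in debt */
	int64_t last_ms;  /* time of the last refill, milliseconds */
};

static void iot_init(struct iot_bucket *b, int64_t rate, int64_t burst,
		     int64_t now_ms)
{
	assert(rate > 0 && burst > 0);
	b->rate = rate;
	b->burst = burst;
	b->tokens = burst;   /* start with a full bucket */
	b->last_ms = now_ms;
}

/*
 * Add the credit earned since the last refill, capped at the burst size.
 * The integer division drops fractions of a byte.
 */
static void iot_refill(struct iot_bucket *b, int64_t now_ms)
{
	if (now_ms <= b->last_ms)
		return;
	b->tokens += b->rate * (now_ms - b->last_ms) / 1000;
	if (b->tokens > b->burst)
		b->tokens = b->burst;
	b->last_ms = now_ms;
}

/*
 * Charge an I/O of 'bytes' bytes at time 'now_ms'. Returns 0 if the
 * request fits in the saved-up credit, otherwise the number of
 * milliseconds the submitter should sleep to pay back the debt.
 */
static int64_t iot_charge(struct iot_bucket *b, int64_t bytes, int64_t now_ms)
{
	assert(bytes >= 0);
	iot_refill(b, now_ms);
	b->tokens -= bytes;
	if (b->tokens >= 0)
		return 0;
	/* Round up so that the debt is fully repaid after the sleep. */
	return (-b->tokens * 1000 + b->rate - 1) / b->rate;
}
```

With rate = 1000 bytes/s and burst = 500 bytes, two back-to-back
400-byte requests at t = 0 return 0 and then 300: the first one is
absorbed by the burst credit, the second one has to wait 300 ms.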