* [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Vivek Goyal @ 2011-03-10 19:11 UTC
To: Justin TerAvest
Cc: KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel, ryov, taka,
    righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin,
    linux-fsdevel, Chris Mason

On Thu, Mar 10, 2011 at 10:57:52AM -0800, Justin TerAvest wrote:
> On Thu, Mar 10, 2011 at 10:15 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Mar 10, 2011 at 10:08:03AM -0800, Justin TerAvest wrote:
> >
> > [..]
> >> > I don't like to increase the size of page_cgroup, but I think you can
> >> > record the information without increasing the size of page_cgroup.
> >> >
> >> > A) As Andrea did, encode it in pc->flags.
> >> >    But I'm afraid there is a racy case, because memory cgroup uses
> >> >    some test_and_set() bits.
> >> > B) I wonder why the information cannot be recorded in page->private.
> >> >    When a page has buffers, you can record the information in the
> >> >    buffer struct. For swap IO (if you take care of it), you can
> >> >    record the information in the bio.
> >>
> >> Hi Kame,
> >>
> >> I'm concerned that by using something like buffer_heads stored in
> >> page->private, we will only be supported on some filesystems and not
> >> others. In addition, I'm not sure if all filesystems attach buffer
> >> heads at the same time; if page->private is modified in the flusher
> >> thread, we might not be able to determine the thread that dirtied the
> >> page in the first place.
> >
> > I think the thread that dirtied the page can store the information in
> > page->private (assuming buffer heads were not generated), and if the
> > flusher thread later ends up generating buffer heads and modifying
> > page->private, this information can be copied into the buffer heads?
>
> This scares me a bit.
>
> As I understand it, fs/ code expects total ownership of page->private.
> This adds a responsibility for every user to copy the data through and
> store it in the buffer head (or anything else). btrfs seems to do
> something entirely different in some cases and store a different kind
> of value.

If filesystems are using page->private for some other purpose also, then
I guess we have issues.

I am ccing linux-fsdevel to get some feedback on the idea of trying to
store the cgroup id of the page-dirtying thread in page->private and/or
the buffer head, for tracking which group originally dirtied the page in
the IO controller during writeback.

> I don't know that it's right to add the burden to copy the original
> value to everything that wants to use page->private.

How many such places are there?

Thanks
Vivek
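Illustrative sketch (not from the thread): a minimal rendering of the scheme
Vivek describes above, under the assumption that page->private is free when
no buffers are attached. The helper names and the b_blkcg_id field are
invented for illustration; they are not existing kernel code, and the rest
of the thread explains why the underlying assumption does not hold for all
filesystems.

	/*
	 * Sketch only. The dirtier stores its blkio cgroup id in
	 * page->private while the page has no buffers; if the flusher
	 * later attaches buffer heads, the id is copied into them first.
	 * b_blkcg_id is an assumed new field in struct buffer_head.
	 */
	static void page_record_dirtier(struct page *page,
					unsigned short blkcg_id)
	{
		if (!page_has_buffers(page) && !page_has_private(page))
			set_page_private(page, (unsigned long)blkcg_id);
	}

	static void copy_dirtier_to_buffers(struct page *page,
					    struct buffer_head *head)
	{
		unsigned short blkcg_id = (unsigned short)page_private(page);
		struct buffer_head *bh = head;

		do {
			bh->b_blkcg_id = blkcg_id;	/* assumed field */
			bh = bh->b_this_page;
		} while (bh != head);
	}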
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Vivek Goyal @ 2011-03-10 19:41 UTC
To: Justin TerAvest
Cc: KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel, ryov, taka,
    righi.andrea, guijianfeng, balbir, ctalbott, nauman, mrubin,
    linux-fsdevel, Chris Mason

On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
> On Thu, Mar 10, 2011 at 10:57:52AM -0800, Justin TerAvest wrote:
[..]
> > As I understand it, fs/ code expects total ownership of page->private.
> > This adds a responsibility for every user to copy the data through and
> > store it in the buffer head (or anything else). btrfs seems to do
> > something entirely different in some cases and store a different kind
> > of value.
>
> If filesystems are using page->private for some other purpose also, then
> I guess we have issues.
>
> I am ccing linux-fsdevel to get some feedback on the idea of trying to
> store the cgroup id of the page-dirtying thread in page->private and/or
> the buffer head, for tracking which group originally dirtied the page in
> the IO controller during writeback.

A quick "grep" showed that btrfs, ceph and logfs are using page->private
for other purposes also.

I was under the impression that page->private is either null or points to
buffer heads in the writeback case. So storing the info directly in the
buffer head, or first in page->private and then transferring it to the
buffer heads, would have helped.

Thanks
Vivek
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Chris Mason @ 2011-03-10 21:15 UTC
To: Vivek Goyal
Cc: Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe, linux-kernel,
    ryov, taka, righi.andrea, guijianfeng, balbir, ctalbott, nauman,
    mrubin, linux-fsdevel

Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
> On Thu, Mar 10, 2011 at 02:11:15PM -0500, Vivek Goyal wrote:
[..]
> > I am ccing linux-fsdevel to get some feedback on the idea of trying to
> > store the cgroup id of the page-dirtying thread in page->private
> > and/or the buffer head, for tracking which group originally dirtied
> > the page in the IO controller during writeback.
>
> A quick "grep" showed that btrfs, ceph and logfs are using page->private
> for other purposes also.
>
> I was under the impression that page->private is either null or points
> to buffer heads in the writeback case. So storing the info directly in
> the buffer head, or first in page->private and then transferring it to
> the buffer heads, would have helped.

Right, btrfs has its own uses for page->private, and we expect to own
it. With a proper callback, the FS could store the extra information you
need in our own structs.

-chris
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Andreas Dilger @ 2011-03-10 21:24 UTC
To: Chris Mason
Cc: Vivek Goyal, Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe,
    linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

On 2011-03-10, at 2:15 PM, Chris Mason wrote:
> Excerpts from Vivek Goyal's message of 2011-03-10 14:41:06 -0500:
[..]
>> A quick "grep" showed that btrfs, ceph and logfs are using
>> page->private for other purposes also.
>>
>> I was under the impression that page->private is either null or points
>> to buffer heads in the writeback case. So storing the info directly in
>> the buffer head, or first in page->private and then transferring it to
>> the buffer heads, would have helped.
>
> Right, btrfs has its own uses for page->private, and we expect to own
> it. With a proper callback, the FS could store the extra information
> you need in our own structs.

There is no requirement that page->private ever points to a buffer_head,
and Lustre clients use it for their own tracking structure (never
touching buffer_heads at all). Any assumption about what a filesystem is
storing in page->private in other parts of the code is just broken.

Cheers, Andreas
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Vivek Goyal @ 2011-03-10 21:38 UTC
To: Andreas Dilger
Cc: Chris Mason, Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe,
    linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
> On 2011-03-10, at 2:15 PM, Chris Mason wrote:
[..]
> > Right, btrfs has its own uses for page->private, and we expect to own
> > it. With a proper callback, the FS could store the extra information
> > you need in our own structs.
>
> There is no requirement that page->private ever points to a
> buffer_head, and Lustre clients use it for their own tracking structure
> (never touching buffer_heads at all). Any assumption about what a
> filesystem is storing in page->private in other parts of the code is
> just broken.

Andreas,

As Chris mentioned, would providing callbacks so that the filesystem can
save/restore page->private be reasonable?

Thanks
Vivek
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Chris Mason @ 2011-03-10 21:43 UTC
To: Vivek Goyal
Cc: Andreas Dilger, Justin TerAvest, KAMEZAWA Hiroyuki, m-ikeda, jaxboe,
    linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
> On Thu, Mar 10, 2011 at 02:24:07PM -0700, Andreas Dilger wrote:
[..]
> > There is no requirement that page->private ever points to a
> > buffer_head, and Lustre clients use it for their own tracking
> > structure (never touching buffer_heads at all). Any assumption about
> > what a filesystem is storing in page->private in other parts of the
> > code is just broken.
>
> Andreas,
>
> As Chris mentioned, would providing callbacks so that the filesystem
> can save/restore page->private be reasonable?

Just to clarify, I think saving/restoring page->private is going to be
hard. I'd rather just have a callback that says "here's a page, store
this for the block io controller please", and another one that returns
any previously stored info.

-chris
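Illustrative sketch (not from the thread): one way the interface Chris
describes could be shaped. The method names and their placement in
address_space_operations are assumptions for illustration; no such API
existed or was posted.

	/*
	 * Hypothetical callbacks, sketched for illustration. The
	 * filesystem owns the storage (e.g. btrfs could stash the id in
	 * the private structure it already hangs off page->private);
	 * the block IO controller owns the value.
	 */
	struct address_space_operations {
		/* ... existing methods ... */

		/* remember which blkio cgroup dirtied this page */
		void (*set_io_hint)(struct page *page,
				    unsigned short blkcg_id);

		/* return the previously stored id, or 0 if none */
		unsigned short (*get_io_hint)(struct page *page);
	};

This keeps the controller out of the business of interpreting
page->private entirely, which is the objection Andreas raised above.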
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: KAMEZAWA Hiroyuki @ 2011-03-11 1:20 UTC
To: Chris Mason
Cc: Vivek Goyal, Andreas Dilger, Justin TerAvest, m-ikeda, jaxboe,
    linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

On Thu, 10 Mar 2011 16:43:31 -0500
Chris Mason <chris.mason@oracle.com> wrote:

> Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
[..]
> > As Chris mentioned, would providing callbacks so that the filesystem
> > can save/restore page->private be reasonable?
>
> Just to clarify, I think saving/restoring page->private is going to be
> hard. I'd rather just have a callback that says "here's a page, store
> this for the block io controller please", and another one that returns
> any previously stored info.

Hmm, Vivek, for dynamic allocation of the io-record, how about this kind
of tagging? (Just an idea, not compiled at all.)

Pros.
  - much better than consuming 2 bytes for all pages, including pages
    other than file caches.
  - this will allow lockless lookup of the iotag.
  - setting the iotag can be done at the same time as
    PAGECACHE_TAG_DIRTY... no extra lock will be required.
  - at clearing, we can expect the lock for the radix-tree to be held
    already.
Cons.
  - makes the radix-tree node struct larger, which is not good for
    cachelines.
  - some special care? will be required at page migration.

==
@@ -51,6 +51,9 @@ struct radix_tree_node {
 	struct rcu_head	rcu_head;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_BLK_CGROUP
+	unsigned short	iotag[RADIX_TREE_MAP_SIZE];
+#endif
 };

 struct radix_tree_path {
@@ -487,6 +490,36 @@ void *radix_tree_tag_set(struct radix_tr
 }
 EXPORT_SYMBOL(radix_tree_tag_set);

+#ifdef CONFIG_BLK_CGROUP
+void radix_tree_iotag_set(struct radix_tree_root *root,
+			unsigned long index, unsigned short tag)
+{
+	unsigned int height, shift;
+	struct radix_tree_node *node;
+	int offset;
+
+	height = root->height;
+	BUG_ON(index > radix_tree_maxindex(height));
+
+	node = indirect_to_ptr(root->rnode);
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+
+	while (height > 1) {
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		node = node->slots[offset];
+		BUG_ON(node == NULL);
+		shift -= RADIX_TREE_MAP_SHIFT;
+		height--;
+	}
+	offset = index & RADIX_TREE_MAP_MASK;
+	node->iotag[offset] = tag;
+}
+EXPORT_SYMBOL(radix_tree_iotag_set);
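Illustrative sketch (not from the thread): the patch above adds a setter
but shows no caller. One plausible shape for the dirtying-time hook,
assuming the era's real cgroup APIs (task_subsys_state(), css_id(),
blkio_subsys_id); the wiring into the dirty path is an assumption.

	/*
	 * Sketch only: resolve the current task's blkio cgroup to a
	 * 16-bit id and tag the radix-tree slot when the page is
	 * dirtied. Assumes radix_tree_iotag_set() from the patch above;
	 * the caller is assumed to hold mapping->tree_lock, as the
	 * page-dirtying paths already do when setting
	 * PAGECACHE_TAG_DIRTY.
	 */
	static unsigned short current_blkio_css_id(void)
	{
		unsigned short id;

		rcu_read_lock();
		id = css_id(task_subsys_state(current, blkio_subsys_id));
		rcu_read_unlock();
		return id;
	}

	static void account_page_dirtier(struct address_space *mapping,
					 struct page *page)
	{
		radix_tree_iotag_set(&mapping->page_tree,
				     page_index(page),
				     current_blkio_css_id());
	}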
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Dave Chinner @ 2011-03-11 1:46 UTC
To: Chris Mason
Cc: Vivek Goyal, Andreas Dilger, Justin TerAvest, KAMEZAWA Hiroyuki,
    m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea, guijianfeng,
    balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Thu, Mar 10, 2011 at 04:43:31PM -0500, Chris Mason wrote:
> Excerpts from Vivek Goyal's message of 2011-03-10 16:38:32 -0500:
[..]
> > As Chris mentioned, would providing callbacks so that the filesystem
> > can save/restore page->private be reasonable?
>
> Just to clarify, I think saving/restoring page->private is going to be
> hard. I'd rather just have a callback that says "here's a page, store
> this for the block io controller please", and another one that returns
> any previously stored info.

Agreed - there is absolutely no guarantee that some other thread
doesn't grab the page while it is under writeback and dereference
page->private expecting to find buffer heads or some filesystem-specific
structure there. Hence swapping out the expected structure with
something different is problematic.

However, I think there are bigger issues. e.g. page->private might
point to multiple bufferheads that map to non-contiguous disk blocks
that were written by different threads - what happens if we get two
concurrent IOs to the one page, perhaps with different cgroup IDs?

Further, page->private might not even point to a per-page specific
structure - it might point to a structure shared by multiple pages
(e.g. an extent map). Adding a callback like this requires
filesystems to be able to store per-page or per-block information
for external users. Indeed, one of the areas of development in XFS
right now is to move away from storing internal per-block/per-page
information because of the memory overhead it causes.

IMO, if you really need some per-page information, then just put it
in the struct page - you can't hide the memory overhead just by
having the filesystem store it for you. That just adds unnecessary
complexity...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Vivek Goyal @ 2011-03-11 2:15 UTC
To: Dave Chinner
Cc: Chris Mason, Andreas Dilger, Justin TerAvest, KAMEZAWA Hiroyuki,
    m-ikeda, jaxboe, linux-kernel, ryov, taka, righi.andrea, guijianfeng,
    balbir, ctalbott, nauman, mrubin, linux-fsdevel

On Fri, Mar 11, 2011 at 12:46:18PM +1100, Dave Chinner wrote:
[..]
> However, I think there are bigger issues. e.g. page->private might
> point to multiple bufferheads that map to non-contiguous disk blocks
> that were written by different threads - what happens if we get two
> concurrent IOs to the one page, perhaps with different cgroup IDs?

I guess in such cases we can afford to lose some accuracy, and a simple
approximation could be to use the last writer's cgroup id for the whole
page.

> Further, page->private might not even point to a per-page specific
> structure - it might point to a structure shared by multiple pages
> (e.g. an extent map). Adding a callback like this requires
> filesystems to be able to store per-page or per-block information
> for external users. Indeed, one of the areas of development in XFS
> right now is to move away from storing internal per-block/per-page
> information because of the memory overhead it causes.

Ok, if filesystems are trying to move away from per-page information,
then this kind of callback becomes a burden.

> IMO, if you really need some per-page information, then just put it
> in the struct page - you can't hide the memory overhead just by
> having the filesystem store it for you. That just adds unnecessary
> complexity...

Ok. I guess adding anything to struct page is going to be hard, and we
might have to fall back to using page_cgroup for tracking page state. I
was trying to explore ways to avoid instantiating the whole page_cgroup
structure just to figure out who dirtied the page.

Thanks
Vivek
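Illustrative sketch (not from the thread): the page_cgroup fallback Vivek
mentions is essentially option A from the top of the thread ("as Andrea
did, encode it to pc->flags"). A rough rendering follows; the 16-bit
layout is an assumption, not what any posted patch used, and the cmpxchg
loop is one way to coexist with the atomic test_and_set() flag bits Kame
was worried about, since both operate atomically on the same word.

	/*
	 * Sketch: steal the upper 16 bits of page_cgroup->flags for a
	 * blkio cgroup id. The bit split is an assumed layout.
	 */
	#define PC_IOTRACK_SHIFT	16
	#define PC_IOTRACK_MASK		(~0UL << PC_IOTRACK_SHIFT)

	static void pc_set_blkcg_id(struct page_cgroup *pc,
				    unsigned short id)
	{
		unsigned long old, new;

		/* cmpxchg so we don't race with the flag bit operations */
		do {
			old = pc->flags;
			new = (old & ~PC_IOTRACK_MASK) |
			      ((unsigned long)id << PC_IOTRACK_SHIFT);
		} while (cmpxchg(&pc->flags, old, new) != old);
	}

	static unsigned short pc_get_blkcg_id(struct page_cgroup *pc)
	{
		return pc->flags >> PC_IOTRACK_SHIFT;
	}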
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: KAMEZAWA Hiroyuki @ 2011-03-11 2:52 UTC
To: Vivek Goyal
Cc: Dave Chinner, Chris Mason, Andreas Dilger, Justin TerAvest, m-ikeda,
    jaxboe, linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

On Thu, 10 Mar 2011 21:15:31 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> > IMO, if you really need some per-page information, then just put it
> > in the struct page - you can't hide the memory overhead just by
> > having the filesystem store it for you. That just adds unnecessary
> > complexity...
>
> Ok. I guess adding anything to struct page is going to be hard, and we
> might have to fall back to using page_cgroup for tracking page state.
> I was trying to explore ways to avoid instantiating the whole
> page_cgroup structure just to figure out who dirtied the page.

Is this bad?
==
At handling ASYNC I/O in the blkio layer, it's unknown who dirtied the
page. This lack of information makes it impossible to throttle async I/O
per cgroup in the blkio queue layer.

This patch records the information in the radix-tree and preserves it.
There is no 'clear' operation because all I/O starts when the page is
marked as DIRTY.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/radix-tree.h |    3 +++
 lib/radix-tree.c           |   42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)

Index: mmotm-Mar10/include/linux/radix-tree.h
===================================================================
--- mmotm-Mar10.orig/include/linux/radix-tree.h
+++ mmotm-Mar10/include/linux/radix-tree.h
@@ -58,12 +58,14 @@ struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	__rcu *rnode;
+	int			iohint;
 };

 #define RADIX_TREE_INIT(mask)	{		\
 	.height = 0,				\
 	.gfp_mask = (mask),			\
 	.rnode = NULL,				\
+	.iohint = 0,				\
 }

 #define RADIX_TREE(name, mask) \
@@ -74,6 +76,7 @@ do {						\
 	(root)->height = 0;			\
 	(root)->gfp_mask = (mask);		\
 	(root)->rnode = NULL;			\
+	(root)->iohint = 0;			\
 } while (0)

 /**
Index: mmotm-Mar10/lib/radix-tree.c
===================================================================
--- mmotm-Mar10.orig/lib/radix-tree.c
+++ mmotm-Mar10/lib/radix-tree.c
@@ -31,6 +31,7 @@
 #include <linux/string.h>
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
+#include <linux/blkdev.h>


 #ifdef __KERNEL__
@@ -51,6 +52,9 @@ struct radix_tree_node {
 	struct rcu_head	rcu_head;
 	void __rcu	*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_BLK_CGROUP
+	unsigned short	iohint[RADIX_TREE_MAP_SIZE];
+#endif
 };

 struct radix_tree_path {
@@ -473,6 +477,8 @@ void *radix_tree_tag_set(struct radix_tr
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
+		if (height == 1 && slot && tag == PAGECACHE_TAG_DIRTY)
+			blkio_record_hint(&slot->iohint[offset]);
 		slot = slot->slots[offset];
 		BUG_ON(slot == NULL);
 		shift -= RADIX_TREE_MAP_SHIFT;
@@ -1418,3 +1425,38 @@ void __init radix_tree_init(void)
 	radix_tree_init_maxindex();
 	hotcpu_notifier(radix_tree_callback, 0);
 }
+
+#ifdef CONFIG_BLK_CGROUP
+
+unsigned short radix_tree_lookup_iohint(struct radix_tree_root *root,
+				unsigned long index)
+{
+	unsigned int height, shift;
+	struct radix_tree_node *node;
+
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return 0;
+	if (!radix_tree_is_indirect_ptr(node))
+		return root->iohint;
+	node = radix_tree_indirect_to_ptr(node);
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return 0;
+	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+	for ( ; ; ) {
+		int offset;
+
+		if (node == NULL)
+			return 0;
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if (height == 1)
+			return node->iohint[offset];
+		node = rcu_dereference(node->slots[offset]);
+		shift -= RADIX_TREE_MAP_SHIFT;
+		height--;
+	}
+}
+EXPORT_SYMBOL(radix_tree_lookup_iohint);
+#endif
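Illustrative sketch (not from the thread): how the lookup side of the
patch above might be consumed when writeback builds a bio. The function
and the bi_blkcg_id field are assumptions for illustration; the real
proposals in this patch series attached a cgroup reference to the bio
rather than a raw id.

	/*
	 * Sketch only: at writeback time, recover the dirtier's cgroup
	 * id from the mapping's radix tree and attach it to the bio so
	 * the blkio layer can charge the right group.
	 */
	static void bio_attach_io_hint(struct bio *bio, struct page *page)
	{
		struct address_space *mapping = page_mapping(page);
		unsigned short hint;

		if (!mapping)
			return;

		rcu_read_lock();
		hint = radix_tree_lookup_iohint(&mapping->page_tree,
						page->index);
		rcu_read_unlock();

		bio->bi_blkcg_id = hint;	/* assumed field */
	}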
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: Vivek Goyal @ 2011-03-11 3:15 UTC
To: KAMEZAWA Hiroyuki
Cc: Dave Chinner, Chris Mason, Andreas Dilger, Justin TerAvest, m-ikeda,
    jaxboe, linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

On Fri, Mar 11, 2011 at 11:52:35AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 10 Mar 2011 21:15:31 -0500
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
> > Ok. I guess adding anything to struct page is going to be hard, and
> > we might have to fall back to using page_cgroup for tracking page
> > state. I was trying to explore ways to avoid instantiating the whole
> > page_cgroup structure just to figure out who dirtied the page.
>
> Is this bad?
> ==

Sounds like an interesting idea. I am primarily concerned about the radix
tree node size increase. Not sure how big a concern this is.

Also, tracking is useful for two things:

- Proportional IO
- IO throttling

For proportional IO, we have to use it with the memory controller anyway
to control the per-cgroup dirty ratio, so storing the info in page_cgroup
should not hurt.

The only other case where dependence on page_cgroup hurts is IO
throttling, where the IO controller does not really need the memory
cgroup controller (I hope so). But we are still not sure whether
throttling IO at the device level is a good idea, and how to resolve the
issues related to priority inversion.

But this definitely sounds better than adding a new field to struct page,
as I am assuming that it will consume less memory overall.

Thanks
Vivek

> At handling ASYNC I/O in the blkio layer, it's unknown who dirtied the
> page. This lack of information makes it impossible to throttle async
> I/O per cgroup in the blkio queue layer.
>
> This patch records the information in the radix-tree and preserves it.
> There is no 'clear' operation because all I/O starts when the page is
> marked as DIRTY.
[..]
* Re: [RFC] Storing cgroup id in page->private (Was: Re: [RFC] [PATCH 0/6] Provide cgroup isolation for buffered writes.)

From: KAMEZAWA Hiroyuki @ 2011-03-11 3:13 UTC
To: Vivek Goyal
Cc: Dave Chinner, Chris Mason, Andreas Dilger, Justin TerAvest, m-ikeda,
    jaxboe, linux-kernel, ryov, taka, righi.andrea, guijianfeng, balbir,
    ctalbott, nauman, mrubin, linux-fsdevel

On Thu, 10 Mar 2011 22:15:04 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> Sounds like an interesting idea. I am primarily concerned about the
> radix tree node size increase. Not sure how big a concern this is.
>
> Also, tracking is useful for two things:
>
> - Proportional IO
> - IO throttling
>
> For proportional IO, we have to use it with the memory controller
> anyway to control the per-cgroup dirty ratio, so storing the info in
> page_cgroup should not hurt.

A dirty-ratio for memcg will be implemented. It's definitely necessary.

> The only other case where dependence on page_cgroup hurts is IO
> throttling, where the IO controller does not really need the memory
> cgroup controller (I hope so). But we are still not sure whether
> throttling IO at the device level is a good idea, and how to resolve
> the issues related to priority inversion.

Yes, that priority inversion is my concern.

Thanks,
-Kame