From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx157.postini.com [74.125.245.157])
    by kanga.kvack.org (Postfix) with SMTP id 0DEC76B002C
    for ; Thu, 8 Mar 2012 02:28:57 -0500 (EST)
Received: by dadv6 with SMTP id v6so211080dad.14
    for ; Wed, 07 Mar 2012 23:28:57 -0800 (PST)
Date: Thu, 8 Mar 2012 15:34:13 +0800
From: Zheng Liu
Subject: Control page reclaim granularity
Message-ID: <20120308073412.GA6975@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm , linux-kernel
Cc: Konstantin Khlebnikov

Hi list,

Recently we encountered a problem with page reclaim, which I summarize
here. There are two different types of file: one is a small index file,
and the other is a large data file. The index file is mmaped into
memory, and the application hopes that its pages can be kept in memory
and not be reclaimed too frequently. The data file is manipulated with
read/write, and its pages should be reclaimed more frequently than those
of the index file.

In a previous discussion [1], Konstantin suggested that I mmap the index
file with the PROT_EXEC flag. Meanwhile, he provided a patch that sets a
flag in mm_flags to increase the priority of mmaped file pages. However,
these solutions are not perfect. I reviewed the related patches
(8cab4754 and c909e993), and I think that mmaping the index file with
PROT_EXEC is too tricky. From an application programmer's point of view,
an index file is a regular file that stores some data, so it should be
mmaped with PROT_READ | PROT_WRITE rather than with PROT_EXEC. As the
commit log of 8cab4754 says, the purpose of that patch is to keep
executable code in memory to improve application responsiveness. In
addition, Konstantin's patch requires changes to the application
program, so in cases where we cannot touch the application's code, the
patch is useless.
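
For reference, the two mmap(2) variants being compared can be sketched
from userspace as follows. This is only a minimal illustration under
stated assumptions, not code from the thread: the helper map_index(),
its exec_hint parameter and the temporary file are all made up, and
whether PROT_EXEC changes reclaim behaviour at all depends on the kernel
version. A PROT_EXEC request may also be refused, for example on a file
system mounted noexec, so the demo only exercises the plain mapping.

```python
import mmap
import os
import tempfile


def map_index(path, exec_hint=False):
    """Map a whole index file read/write (hypothetical helper).

    With exec_hint=True the mapping also requests PROT_EXEC, which is the
    workaround discussed above; it is an abuse of the flag, not a documented
    way to set reclaim priority.
    """
    prot = mmap.PROT_READ | mmap.PROT_WRITE  # natural protection for a data file
    if exec_hint:
        prot |= mmap.PROT_EXEC               # may fail with EPERM on noexec mounts
    fd = os.open(path, os.O_RDWR)
    try:
        # Length 0 maps the whole file; Python keeps its own duplicate of fd.
        return mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=prot)
    finally:
        os.close(fd)


def demo():
    """Create a small stand-in index file, map it, and read the first record."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"key=value\n" * 4)
        path = f.name
    try:
        index = map_index(path)              # PROT_READ | PROT_WRITE only
        first = index[:9]
        index.close()
        return first
    finally:
        os.unlink(path)
```

The point of the objection is visible in the flag set itself: nothing in
an index file is executable, so PROT_EXEC only makes sense here as a side
channel to the reclaim code.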

I have discussed this problem with Konstantin, and we think that maybe
the kernel should provide some mechanism. For example, the user could
set memory-pressure priorities per vma or inode, or mmaped pages and
file pages could be reclaimed separately. If someone has thought about
this, please let me know. Any feedback is welcome. Thank you.

Previous discussion:
1. http://marc.info/?l=linux-mm&m=132947026019538&w=2

Regards,
Zheng

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118])
    by kanga.kvack.org (Postfix) with SMTP id 326B06B002C
    for ; Thu, 8 Mar 2012 03:39:52 -0500 (EST)
Received: by wibhq7 with SMTP id hq7so8138wib.2
    for ; Thu, 08 Mar 2012 00:39:50 -0800 (PST)
From: Greg Thelen
Subject: Re: Control page reclaim granularity
References: <20120308073412.GA6975@gmail.com>
Date: Thu, 08 Mar 2012 00:39:49 -0800
In-Reply-To: <20120308073412.GA6975@gmail.com> (Zheng Liu's message of "Thu, 8 Mar 2012 15:34:13 +0800")
Message-ID:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm
Cc: linux-kernel , Konstantin Khlebnikov

Zheng Liu writes:

> Hi list,
>
> Recently we encountered a problem with page reclaim, which I summarize
> here. There are two different types of file: one is a small index file,
> and the other is a large data file. The index file is mmaped into
> memory, and the application hopes that its pages can be kept in memory
> and not be reclaimed too frequently. The data file is manipulated with
> read/write, and its pages should be reclaimed more frequently than those
> of the index file.
>
> In a previous discussion [1], Konstantin suggested that I mmap the index
> file with the PROT_EXEC flag.
> Meanwhile, he provided a patch that sets a flag in mm_flags to increase
> the priority of mmaped file pages. However, these solutions are not
> perfect. I reviewed the related patches (8cab4754 and c909e993), and I
> think that mmaping the index file with PROT_EXEC is too tricky. From an
> application programmer's point of view, an index file is a regular file
> that stores some data, so it should be mmaped with PROT_READ |
> PROT_WRITE rather than with PROT_EXEC. As the commit log of 8cab4754
> says, the purpose of that patch is to keep executable code in memory to
> improve application responsiveness. In addition, Konstantin's patch
> requires changes to the application program, so in cases where we cannot
> touch the application's code, the patch is useless.
>
> I have discussed this problem with Konstantin, and we think that maybe
> the kernel should provide some mechanism. For example, the user could
> set memory-pressure priorities per vma or inode, or mmaped pages and
> file pages could be reclaimed separately. If someone has thought about
> this, please let me know. Any feedback is welcome. Thank you.
>
> Previous discussion:
> 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2
>
> Regards,
> Zheng

It's not exactly the same approach, but we have toyed with the idea of
charging different inodes to different cgroups. Each cgroup would have
different soft/hard limits to allow for different cache behavior.

http://www.spinics.net/lists/linux-mm/msg06006.html
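
The effect of giving each file its own limit can be seen in a toy model.
The sketch below is not memcg code and does not touch any cgroup
interface: LruCache, surviving_index_pages() and all the sizes are
hypothetical, and each cache simply stands in for one group with a hard
limit counted in pages.

```python
from collections import OrderedDict


class LruCache:
    """A tiny LRU page cache with a hard limit, standing in for one cgroup."""

    def __init__(self, limit):
        self.limit = limit
        self.pages = OrderedDict()  # least recently used entry first

    def touch(self, page):
        """Reference a page, evicting the least recently used one if over limit."""
        if page in self.pages:
            self.pages.move_to_end(page)
        else:
            self.pages[page] = True
            if len(self.pages) > self.limit:
                self.pages.popitem(last=False)


def surviving_index_pages(index_cache, data_cache, index_pages=4, data_pages=100):
    """Touch the index once, stream the data file, count index pages still cached."""
    for i in range(index_pages):
        index_cache.touch(("index", i))
    for i in range(data_pages):
        data_cache.touch(("data", i))
    return sum(1 for i in range(index_pages) if ("index", i) in index_cache.pages)


# One shared pool of 16 pages: streaming the data file pushes the index out.
shared = LruCache(limit=16)
shared_result = surviving_index_pages(shared, shared)

# Same 16 pages split into two groups: the index keeps its 4 pages.
separate_result = surviving_index_pages(LruCache(limit=4), LruCache(limit=12))
```

With one shared pool no index page survives the streaming read; with the
split limits all four do, which is the different cache behavior the
per-inode charging idea is after.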

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168])
    by kanga.kvack.org (Postfix) with SMTP id E6A6D6B002C
    for ; Thu, 8 Mar 2012 04:35:23 -0500 (EST)
Received: by pbcup15 with SMTP id up15so1564043pbc.14
    for ; Thu, 08 Mar 2012 01:35:23 -0800 (PST)
Date: Thu, 8 Mar 2012 18:35:14 +0900
From: Minchan Kim
Subject: Re: Control page reclaim granularity
Message-ID: <20120308093514.GA28856@barrios>
References: <20120308073412.GA6975@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120308073412.GA6975@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm , linux-kernel , Konstantin Khlebnikov
Cc: riel@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote:
> Hi list,
>
> Recently we encountered a problem with page reclaim, which I summarize
> here. There are two different types of file: one is a small index file,
> and the other is a large data file. The index file is mmaped into
> memory, and the application hopes that its pages can be kept in memory
> and not be reclaimed too frequently. The data file is manipulated with
> read/write, and its pages should be reclaimed more frequently than those
> of the index file.
>
> In a previous discussion [1], Konstantin suggested that I mmap the index
> file with the PROT_EXEC flag. Meanwhile, he provided a patch that sets a
> flag in mm_flags to increase the priority of mmaped file pages. However,
> these solutions are not perfect. I reviewed the related patches
> (8cab4754 and c909e993), and I think that mmaping the index file with
> PROT_EXEC is too tricky. From an application programmer's point of view,
> an index file is a regular file that stores some data, so it should be
> mmaped with PROT_READ | PROT_WRITE rather than with PROT_EXEC.
> As the commit log of 8cab4754 says, the purpose of that patch is to keep
> executable code in memory to improve application responsiveness. In
> addition, Konstantin's patch requires changes to the application
> program, so in cases where we cannot touch the application's code, the
> patch is useless.
>
> I have discussed this problem with Konstantin, and we think that maybe
> the kernel should provide some mechanism. For example, the user could
> set memory-pressure priorities per vma or inode, or mmaped pages and
> file pages could be reclaimed separately. If someone has thought about
> this, please let me know. Any feedback is welcome. Thank you.
>
> Previous discussion:
> 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2
>
> Regards,
> Zheng

I think it's a regression since 2.6.28. Before that, we tried to keep
mapped pages in memory (see calc_reclaim_mapped), but we removed that
routine when we applied the split-LRU page replacement.
Rik, KOSAKI: what was the rationale?

We have to decide whether to restore that routine or to create new
logic to keep mapped pages in memory.
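
As background on the routine mentioned here, the sketch below models the
kind of decision calc_reclaim_mapped made. It is a simplified
recollection of the pre-2.6.28 heuristic, not a copy of the kernel
source: the function name reclaim_mapped() and its parameters are chosen
for this example, and the near-OOM shortcut and the active/inactive
imbalance adjustment are assumed away.

```python
def reclaim_mapped(mapped_pages, total_pages, priority, prev_priority, swappiness=60):
    """Return True when the simplified heuristic would start reclaiming mapped pages.

    priority counts down from 12 (no pressure) towards 0 (severe pressure).
    """
    # 0 means reclaim is going fine; the value approaches 100 as priority drops.
    distress = 100 >> min(priority, prev_priority)
    # Percentage of memory that is mapped into some address space.
    mapped_ratio = mapped_pages * 100 // total_pages
    swap_tendency = mapped_ratio // 2 + distress + swappiness
    return swap_tendency >= 100


# Little mapped memory and no pressure: mapped pages are left alone.
calm = reclaim_mapped(20, 100, priority=12, prev_priority=12)
# Same memory mix under heavy pressure: mapped pages become fair game.
pressured = reclaim_mapped(20, 100, priority=1, prev_priority=1)
```

Under these assumptions, unmapped page cache is reclaimed first and
mapped pages are only scanned once pressure, mapped share, or swappiness
pushes the sum to 100, which is the protection that went away with the
split-LRU rework.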

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168])
    by kanga.kvack.org (Postfix) with SMTP id B066A6B0083
    for ; Thu, 8 Mar 2012 11:08:00 -0500 (EST)
Received: by pbcup15 with SMTP id up15so1929263pbc.14
    for ; Thu, 08 Mar 2012 08:08:00 -0800 (PST)
Date: Fri, 9 Mar 2012 00:13:18 +0800
From: Zheng Liu
Subject: Re: Control page reclaim granularity
Message-ID: <20120308161318.GA9904@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Greg Thelen
Cc: linux-mm , linux-kernel , Konstantin Khlebnikov

Hi Greg,

Sorry, I forgot to say that I am not subscribed to the linux-mm and
linux-kernel mailing lists, so please Cc me.

I am glad to receive your reply and I am very interested in your
approach. Actually, I am not very familiar with cgroups, so would you
please send me your patch if you can? Thank you all the same.

Regards,
Zheng

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx205.postini.com [74.125.245.205])
    by kanga.kvack.org (Postfix) with SMTP id 747C86B007E
    for ; Thu, 8 Mar 2012 11:32:55 -0500 (EST)
Received: by obbta14 with SMTP id ta14so1128011obb.14
    for ; Thu, 08 Mar 2012 08:32:54 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20120308161318.GA9904@gmail.com>
References: <20120308161318.GA9904@gmail.com>
Date: Fri, 9 Mar 2012 00:32:54 +0800
Message-ID:
Subject: Re: Control page reclaim granularity
From: Zhu Yanhai
Content-Type: multipart/alternative; boundary=14dae93b590ef6607704babdd389
Sender: owner-linux-mm@kvack.org
List-ID:
To: Greg Thelen , linux-mm , linux-kernel , Konstantin Khlebnikov , gnehzuil gnehzuil

--14dae93b590ef6607704babdd389
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

How about we provide an interface that lets you specify which pages
should be charged to which cgroup? Then you could create two cgroups for
the two files and set each of them to the size you want that file's
cache to be. That way you can control them as exactly as you want.
The current memcg allows a group that has no processes but still has
pages in the soft limit tree (I hope I'm not wrong), so you don't have
to put any of your worker processes into the groups either.

-zyh

On Friday, March 9, 2012, Zheng Liu wrote:

> Hi Greg,
>
> Sorry, I forgot to say that I am not subscribed to the linux-mm and
> linux-kernel mailing lists, so please Cc me.
>
> I am glad to receive your reply and I am very interested in your
> approach. Actually, I am not very familiar with cgroups, so would you
> please send me your patch if you can? Thank you all the same.
>
> Regards,
> Zheng
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Sent from Gmail Mobile

--14dae93b590ef6607704babdd389--

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx135.postini.com [74.125.245.135])
    by kanga.kvack.org (Postfix) with SMTP id 8A3796B007E
    for ; Thu, 8 Mar 2012 11:48:44 -0500 (EST)
Received: by pbcup15 with SMTP id up15so1971185pbc.14
    for ; Thu, 08 Mar 2012 08:48:43 -0800 (PST)
Date: Fri, 9 Mar 2012 00:54:03 +0800
From: Zheng Liu
Subject: Re: Control page reclaim granularity
Message-ID: <20120308165403.GA10005@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120308093514.GA28856@barrios>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com

Hi Minchan,

Sorry, I forgot to say that I am not subscribed to the linux-mm and
linux-kernel mailing lists, so please Cc me.

IMHO, maybe we should rethink how users use mmap(2). I will describe
the cases I know of in our product system. They fall into two
categories. In the first, the application mmaps all data files into
memory and sometimes uses write(2) to append some data; in the second,
it uses mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.
In the second case, the application wants to keep mmaped pages in memory
and let file pages be reclaimed first. So, IMO, when an application uses
mmap(2) to manipulate files, that can be taken to imply that it wants
these mmaped pages kept in memory and not reclaimed. At least, these
pages should not be reclaimed earlier than file pages. I think that
maybe we can restore that routine and provide a sysctl parameter to let
the user set the ratio between mmaped pages and file pages.

Regards,
Zheng

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105])
    by kanga.kvack.org (Postfix) with SMTP id 412ED6B0044
    for ; Sun, 11 Mar 2012 20:28:14 -0400 (EDT)
Received: by dadv6 with SMTP id v6so4804002dad.14
    for ; Sun, 11 Mar 2012 17:28:13 -0700 (PDT)
Date: Mon, 12 Mar 2012 09:28:06 +0900
From: Minchan Kim
Subject: Re: Control page reclaim granularity
Message-ID: <20120312002806.GA2436@barrios>
References: <20120308093514.GA28856@barrios> <20120308165403.GA10005@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120308165403.GA10005@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> Hi Minchan,
>
> Sorry, I forgot to say that I am not subscribed to the linux-mm and
> linux-kernel mailing lists, so please Cc me.
>
> IMHO, maybe we should rethink how users use mmap(2). I will describe
> the cases I know of in our product system. They fall into two
> categories. In the first, the application mmaps all data files into
> memory and sometimes uses write(2) to append some data; in the second,
> it uses mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.
> In the second case, the application wants to keep mmaped pages in memory
> and let file pages be reclaimed first. So, IMO, when an application uses
> mmap(2) to manipulate files, that can be taken to imply that it wants
> these mmaped pages kept in memory and not reclaimed. At least, these
> pages should not be reclaimed earlier than file pages.
> I think that maybe we can restore that routine and provide a sysctl
> parameter to let the user set the ratio between mmaped pages and file
> pages.

I am not convinced that we should handle mapped pages specially.
Sometimes, someone may use mmap just to reduce buffer copies compared to
the read system call. So I think we can't be sure that mmaped pages
should always win.

My suggestion is that it would be better for the user to declare it
explicitly. I think we can implement it with the WILLNEED option of
madvise and fadvise. The current implementation just does readahead if
a page isn't in memory, but I think we can promote the page from
inactive to active if it is already in memory.

It's clearer, and it wouldn't be affected by a change in the kernel's
page reclaim algorithm like this one.

>
> Regards,
> Zheng

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126])
    by kanga.kvack.org (Postfix) with SMTP id 92AFA6B0044
    for ; Sun, 11 Mar 2012 22:00:44 -0400 (EDT)
Received: by dadv6 with SMTP id v6so4913440dad.14
    for ; Sun, 11 Mar 2012 19:00:43 -0700 (PDT)
Date: Mon, 12 Mar 2012 10:06:09 +0800
From: Zheng Liu
Subject: Re: Fwd: Control page reclaim granularity
Message-ID: <20120312020609.GA9610@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120312002806.GA2436@barrios>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> I forgot to Cc you.
> Sorry.
>
> ---------- Forwarded message ----------
> From: Minchan Kim
> Date: Mon, Mar 12, 2012 at 9:28 AM
> Subject: Re: Control page reclaim granularity
> To: Minchan Kim , linux-mm , linux-kernel ,
> Konstantin Khlebnikov <khlebnikov@openvz.org>, riel@redhat.com,
> kosaki.motohiro@jp.fujitsu.com
>
>
> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > Hi Minchan,
> >
> > Sorry, I forgot to say that I am not subscribed to the linux-mm and
> > linux-kernel mailing lists, so please Cc me.
> >
> > IMHO, maybe we should rethink how users use mmap(2). I will describe
> > the cases I know of in our product system. They fall into two
> > categories. In the first, the application mmaps all data files into
> > memory and sometimes uses write(2) to append some data; in the second,
> > it uses mmap(2)/munmap(2) and read(2)/write(2) to manipulate the
> > files. In the second case, the application wants to keep mmaped pages
> > in memory and let file pages be reclaimed first. So, IMO, when an
> > application uses mmap(2) to manipulate files, that can be taken to
> > imply that it wants these mmaped pages kept in memory and not
> > reclaimed. At least, these pages should not be reclaimed earlier than
> > file pages. I think that maybe we can restore that routine and provide
> > a sysctl parameter to let the user set the ratio between mmaped pages
> > and file pages.
>
> I am not convinced that we should handle mapped pages specially.
> Sometimes, someone may use mmap just to reduce buffer copies compared to
> the read system call. So I think we can't be sure that mmaped pages
> should always win.
>
> My suggestion is that it would be better for the user to declare it
> explicitly. I think we can implement it with the WILLNEED option of
> madvise and fadvise. The current implementation just does readahead if
> a page isn't in memory, but I think we can promote the page from
> inactive to active if it is already in memory.
>
> It's clearer, and it wouldn't be affected by a change in the kernel's
> page reclaim algorithm like this one.

Thank you for your advice. But I still have a question about this
solution. If we extend the WILLNEED option of madvise(2) and fadvise(2),
it will cause an inconsistent state for the pages that are manipulated
by madvise(2) and/or fadvise(2). For example, when I call madvise with
the WILLNEED flag, some pages will be moved to the active list if they
are already in memory, and other pages will be read into memory and put
on the inactive list if they are not in memory. Then the pages on the
inactive list may be reclaimed. So from the user's point of view, it is
inconsistent, because some pages are in memory and some pages are
reclaimed. But the user actually hopes that all of the pages can be kept
in memory. IMHO, this inconsistency is weird and will puzzle users.

Regards,
Zheng
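
For readers who have not used these hints, the two WILLNEED calls under
discussion look roughly like this from userspace. The sketch only shows
the existing interfaces; the promotion from the inactive to the active
list is a proposal in this thread, not something these calls are known
to do. The helper prefetch_hints() and the temporary file are invented
for the example, and mmap.madvise() needs Python 3.8 or later.

```python
import mmap
import os
import tempfile


def prefetch_hints(path):
    """Issue WILLNEED hints for a whole file via fadvise and via madvise."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # posix_fadvise(fd, offset, len, advice); len 0 means "to end of file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        except OSError:
            pass  # only a hint: some file systems may not support it
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) as m:
            # With no range given, madvise() applies to the whole mapping.
            m.madvise(mmap.MADV_WILLNEED)
            return len(m)
    finally:
        os.close(fd)


def demo():
    """Create an 8 KiB stand-in file and hint that all of it will be needed."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * 8192)
        path = f.name
    try:
        return prefetch_hints(path)
    finally:
        os.unlink(path)
```

Neither call reports which pages were already resident and which had to
be read in, so an application has no way to observe the active/inactive
split that the objection above is about.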

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194])
    by kanga.kvack.org (Postfix) with SMTP id 536006B0044
    for ; Mon, 12 Mar 2012 01:19:12 -0400 (EDT)
Received: by dadv6 with SMTP id v6so5147604dad.14
    for ; Sun, 11 Mar 2012 22:19:11 -0700 (PDT)
Date: Mon, 12 Mar 2012 14:19:04 +0900
From: Minchan Kim
Subject: Re: Fwd: Control page reclaim granularity
Message-ID: <20120312051904.GA3831@barrios>
References: <20120312002806.GA2436@barrios> <20120312020609.GA9610@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120312020609.GA9610@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com

On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > I forgot to Cc you.
> > Sorry.
> >
> > ---------- Forwarded message ----------
> > From: Minchan Kim
> > Date: Mon, Mar 12, 2012 at 9:28 AM
> > Subject: Re: Control page reclaim granularity
> > To: Minchan Kim , linux-mm , linux-kernel ,
> > Konstantin Khlebnikov <khlebnikov@openvz.org>, riel@redhat.com,
> > kosaki.motohiro@jp.fujitsu.com
> >
> >
> > On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > > Hi Minchan,
> > >
> > > Sorry, I forgot to say that I am not subscribed to the linux-mm and
> > > linux-kernel mailing lists, so please Cc me.
> > >
> > > IMHO, maybe we should rethink how users use mmap(2). I will describe
> > > the cases I know of in our product system. They fall into two
> > > categories. In the first, the application mmaps all data files into
> > > memory and sometimes uses write(2) to append some data; in the
> > > second, it uses mmap(2)/munmap(2) and read(2)/write(2) to manipulate
> > > the files.
> > > In the second case, the application wants to keep mmaped pages in
> > > memory and let file pages be reclaimed first. So, IMO, when an
> > > application uses mmap(2) to manipulate files, that can be taken to
> > > imply that it wants these mmaped pages kept in memory and not
> > > reclaimed. At least, these pages should not be reclaimed earlier
> > > than file pages. I think that maybe we can restore that routine and
> > > provide a sysctl parameter to let the user set the ratio between
> > > mmaped pages and file pages.
> >
> > I am not convinced that we should handle mapped pages specially.
> > Sometimes, someone may use mmap just to reduce buffer copies compared
> > to the read system call. So I think we can't be sure that mmaped pages
> > should always win.
> >
> > My suggestion is that it would be better for the user to declare it
> > explicitly. I think we can implement it with the WILLNEED option of
> > madvise and fadvise. The current implementation just does readahead if
> > a page isn't in memory, but I think we can promote the page from
> > inactive to active if it is already in memory.
> >
> > It's clearer, and it wouldn't be affected by a change in the kernel's
> > page reclaim algorithm like this one.
>
> Thank you for your advice. But I still have a question about this
> solution. If we extend the WILLNEED option of madvise(2) and fadvise(2),
> it will cause an inconsistent state for the pages that are manipulated
> by madvise(2) and/or fadvise(2). For example, when I call madvise with
> the WILLNEED flag, some pages will be moved to the active list if they
> are already in memory, and other pages will be read into memory and put
> on the inactive list if they are not in memory. Then the pages on the
> inactive list may be reclaimed. So from the user's point of view, it is
> inconsistent, because some pages are in memory and some pages are
> reclaimed. But the user actually hopes that all of the pages can be kept
> in memory.
> IMHO, this inconsistency is weird and will puzzle users.

Now the problem is that

1. The user wants to keep in memory pages that are used only once in a
   while.
2. The kernel wants to reclaim them because, from the LRU's point of
   view, they are clearly reclaim targets.

The most desirable approach is for the user to use mlock to guarantee
that they stay in memory. But mlock has too much overhead, and the user
doesn't want to keep all of the pages in memory all at once (i.e., he
wants demand paging when he needs a page). Right?

madvise is just a hint for the kernel, and the kernel doesn't need to
guarantee madvise's behavior. From that point of view, such
inconsistency might not be a big problem.

The big problem, I think, is that the user would have to call
madvise(WILLNEED) periodically, because the activation happens only
once, when the user calls madvise. If the user doesn't use the page
frequently after calling it, the page ends up moving to the inactive
list and could even be reclaimed. That's not good. :-(

Okay. How about adding a new VM_WORKINGSET? The reclaimer would then
give the page one more round trip through the active/inactive lists when
reclaim happens, if the page is referenced.

Sigh. We have no room for a new VM_FLAG on 32-bit.

>
> Regards,
> Zheng
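
The "one more round trip" idea can be illustrated with a toy model. The
sketch below is not kernel code and collapses the active/inactive pair
into a single list scanned from its cold end; reclaim(), make_page() and
the page dictionaries are hypothetical, and VM_WORKINGSET itself is only
a proposal in this thread.

```python
from collections import deque


def make_page(name, workingset=False):
    """Build a toy page record; every page starts out referenced."""
    return {"name": name, "referenced": True, "workingset": workingset}


def reclaim(inactive, nr_to_reclaim):
    """Scan the list from its cold end and return the names of reclaimed pages."""
    reclaimed = []
    scans_left = len(inactive)  # look at each page at most once per call
    while inactive and scans_left and len(reclaimed) < nr_to_reclaim:
        scans_left -= 1
        page = inactive.popleft()  # coldest page
        if page["workingset"] and page["referenced"]:
            # Proposed rule: a referenced working-set page gets one more
            # round trip instead of being evicted.
            page["referenced"] = False
            inactive.append(page)
        else:
            reclaimed.append(page["name"])
    return reclaimed


inactive = deque([make_page("index-0", workingset=True),
                  make_page("data-0"),
                  make_page("data-1")])
first_pass = reclaim(inactive, 2)   # index-0 is spared, the data pages go
second_pass = reclaim(inactive, 1)  # not referenced again, so index-0 goes too
```

In this model the flag buys the index page exactly one extra pass; if it
is not referenced again before the next scan it is reclaimed like any
other page, which matches the demand-paging behavior asked for above.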

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx173.postini.com [74.125.245.173])
    by kanga.kvack.org (Postfix) with SMTP id 8C2626B0044
    for ; Mon, 12 Mar 2012 02:20:39 -0400 (EDT)
Received: by bkwq16 with SMTP id q16so3234440bkw.14
    for ; Sun, 11 Mar 2012 23:20:37 -0700 (PDT)
Message-ID: <4F5D95AF.1020108@openvz.org>
Date: Mon, 12 Mar 2012 10:20:31 +0400
From: Konstantin Khlebnikov
MIME-Version: 1.0
Subject: Re: Fwd: Control page reclaim granularity
References: <20120312002806.GA2436@barrios> <20120312020609.GA9610@gmail.com> <20120312051904.GA3831@barrios>
In-Reply-To: <20120312051904.GA3831@barrios>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com"

Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
>>> I forgot to Cc you.
>>> Sorry.
>>>
>>> ---------- Forwarded message ----------
>>> From: Minchan Kim
>>> Date: Mon, Mar 12, 2012 at 9:28 AM
>>> Subject: Re: Control page reclaim granularity
>>> To: Minchan Kim, linux-mm, linux-kernel,
>>> Konstantin Khlebnikov <khlebnikov@openvz.org>, riel@redhat.com,
>>> kosaki.motohiro@jp.fujitsu.com
>>>
>>>
>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
>>>> Hi Minchan,
>>>>
>>>> Sorry, I forgot to say that I am not subscribed to the linux-mm and
>>>> linux-kernel mailing lists, so please Cc me.
>>>>
>>>> IMHO, maybe we should rethink how users use mmap(2). I will describe
>>>> the cases I know of in our product system. They fall into two
>>>> categories.
>>>> In the first, the application mmaps all data files into memory and
>>>> sometimes uses write(2) to append some data; in the second, it uses
>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In
>>>> the second case, the application wants to keep mmaped pages in memory
>>>> and let file pages be reclaimed first. So, IMO, when an application
>>>> uses mmap(2) to manipulate files, that can be taken to imply that it
>>>> wants these mmaped pages kept in memory and not reclaimed. At least,
>>>> these pages should not be reclaimed earlier than file pages. I think
>>>> that maybe we can restore that routine and provide a sysctl parameter
>>>> to let the user set the ratio between mmaped pages and file pages.
>>>
>>> I am not convinced that we should handle mapped pages specially.
>>> Sometimes, someone may use mmap just to reduce buffer copies compared
>>> to the read system call. So I think we can't be sure that mmaped pages
>>> should always win.
>>>
>>> My suggestion is that it would be better for the user to declare it
>>> explicitly. I think we can implement it with the WILLNEED option of
>>> madvise and fadvise. The current implementation just does readahead if
>>> a page isn't in memory, but I think we can promote the page from
>>> inactive to active if it is already in memory.
>>>
>>> It's clearer, and it wouldn't be affected by a change in the kernel's
>>> page reclaim algorithm like this one.
>>
>> Thank you for your advice. But I still have a question about this
>> solution. If we extend the WILLNEED option of madvise(2) and
>> fadvise(2), it will cause an inconsistent state for the pages that are
>> manipulated by madvise(2) and/or fadvise(2). For example, when I call
>> madvise with the WILLNEED flag, some pages will be moved to the active
>> list if they are already in memory, and other pages will be read into
>> memory and put on the inactive list if they are not in memory. Then
>> the pages on the inactive list may be reclaimed.
>> So from the user's point of view, it is inconsistent, because some
>> pages are in memory and some pages are reclaimed. But the user actually
>> hopes that all of the pages can be kept in memory. IMHO, this
>> inconsistency is weird and will puzzle users.
>
> Now the problem is that
>
> 1. The user wants to keep in memory pages that are used only once in a
>    while.
> 2. The kernel wants to reclaim them because, from the LRU's point of
>    view, they are clearly reclaim targets.
>
> The most desirable approach is for the user to use mlock to guarantee
> that they stay in memory. But mlock has too much overhead, and the user
> doesn't want to keep all of the pages in memory all at once (i.e., he
> wants demand paging when he needs a page). Right?
>
> madvise is just a hint for the kernel, and the kernel doesn't need to
> guarantee madvise's behavior. From that point of view, such
> inconsistency might not be a big problem.
>
> The big problem, I think, is that the user would have to call
> madvise(WILLNEED) periodically, because the activation happens only
> once, when the user calls madvise. If the user doesn't use the page
> frequently after calling it, the page ends up moving to the inactive
> list and could even be reclaimed. That's not good. :-(
>
> Okay. How about adding a new VM_WORKINGSET? The reclaimer would then
> give the page one more round trip through the active/inactive lists when
> reclaim happens, if the page is referenced.
>
> Sigh. We have no room for a new VM_FLAG on 32-bit.

It would be nice to mark struct address_space with this flag and export
AS_UNEVICTABLE somehow. Maybe we can reuse the file-locking engine to
manage these bits =)
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx149.postini.com [74.125.245.149]) by kanga.kvack.org (Postfix) with SMTP id 188DE6B0044 for ; Mon, 12 Mar 2012 04:08:48 -0400 (EDT) Received: by dadv6 with SMTP id v6so5376515dad.14 for ; Mon, 12 Mar 2012 01:08:47 -0700 (PDT) Date: Mon, 12 Mar 2012 16:14:14 +0800 From: Zheng Liu Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312081413.GA10923@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F5D95AF.1020108@openvz.org> Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov Cc: linux-mm , linux-kernel , Minchan Kim , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > Minchan Kim wrote: >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: >>>> I forgot to Ccing you. >>>> Sorry. >>>> >>>> ---------- Forwarded message ---------- >>>> From: Minchan Kim >>>> Date: Mon, Mar 12, 2012 at 9:28 AM >>>> Subject: Re: Control page reclaim granularity >>>> To: Minchan Kim, linux-mm, >>>> linux-kernel, Konstantin Khlebnikov< >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com >>>> >>>> >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: >>>>> Hi Minchan, >>>>> >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and >>>>> linux-kernel >>>>> mailing list. So please Cc me. >>>>> >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I >>>>> describe the cases I known in our product system. They can be >>>>> categorized into two cases. 
One is mmaped all data files into memory >>>>> and sometime it uses write(2) to append some data, and another uses >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In >>>>> the >>>>> second case, the application wants to keep mmaped page into memory >>>>> and >>>>> let file pages to be reclaimed firstly. So, IMO, when application >>>>> uses >>>>> mmap(2) to manipulate files, it is possible to imply that it wants >>>>> keep >>>>> these mmaped pages into memory and do not be reclaimed. At least >>>>> these >>>>> pages do not be reclaimed early than file pages. I think that >>>>> maybe we >>>>> can recover that routine and provide a sysctl parameter to let the >>>>> user >>>>> to set this ratio between mmaped pages and file pages. >>>> >>>> I am not convinced why we should handle mapped page specially. >>>> Sometimem, someone may use mmap by reducing buffer copy compared to >>>> read >>>> system call. >>>> So I think we can't make sure mmaped pages are always win. >>>> >>>> My suggestion is that it would be better to declare by user explicitly. >>>> I think we can implement it by madvise and fadvise's WILLNEED option. >>>> Current implementation is just readahead if there isn't a page in >>>> memory >>>> but I think >>>> we can promote from inactive to active if there is already a page in >>>> memory. >>>> >>>> It's more clear and it couldn't be affected by kernel page reclaim >>>> algorithm change >>>> like this. >>> >>> Thank you for your advice. But I still have question about this >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED >>> option, it will cause an inconsistently status for pages that be >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call >>> madvise with WILLNEED flag, some pages will be moved into active list if >>> they already have been in memory, and other pages will be read into >>> memory and be saved in inactive list if they don't be in memory. 
Then >>> pages that are in inactive list are possible to be reclaim. So from the >>> view of users, it is inconsistent because some pages are in memory and >>> some pages are reclaimed. But actually the user hopes that all of pages >>> can be kept in memory. IMHO, this inconsistency is weird and makes >>> users >>> puzzled. >> >> Now problem is that >> >> 1. User want to keep pages which are used once in a while in memory. >> 2. Kernel want to reclaim them because they are surely reclaim target >> pages in point of view by LRU. >> >> The most desriable approach is that user should use mlock to guarantee >> them in memory. But mlock is too big overhead and user doesn't want to >> keep >> memory all pages all at once.(Ie, he want demand paging when he need >> the page) >> Right? >> >> madvise, it's a just hint for kernel and kernel doesn't need to make >> sure madvise's behavior. >> In point of view, such inconsistency might not be a big problem. >> >> Big problem I think now is that user should use madvise(WILLNEED) >> periodically because such >> activation happens once when user calls madvise. If user doesn't use >> page frequently after >> user calls it, it ends up moving into inactive list and even could be >> reclaimed. >> It's not good. :-( >> >> Okay. How about adding new VM_WORKINGSET? >> And reclaimer would give one more round trip in active/inactive list >> when reclaim happens >> if the page is referenced. >> >> Sigh. We have no room for new VM_FLAG in 32 bit. > > It would be nice to mark struct address_space with this flag and export > AS_UNEVICTABLE somehow. > Maybe we can reuse file-locking engine for managing these bits =) Makes sense to me. We can mark this flag in struct address_space and check it in page_referenced_file(). If this flag is set, it will be cleared and the function returns referenced > 1. Then this page can be promoted into the active list. But I prefer to set/clear this flag in madvise. PS, I have subscribed to the linux-mm mailing list.
:-) Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx179.postini.com [74.125.245.179]) by kanga.kvack.org (Postfix) with SMTP id B4AFA6B0044 for ; Mon, 12 Mar 2012 09:42:35 -0400 (EDT) Received: by dadv6 with SMTP id v6so5843015dad.14 for ; Mon, 12 Mar 2012 06:42:34 -0700 (PDT) Date: Mon, 12 Mar 2012 22:42:26 +0900 From: Minchan Kim Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312134226.GA5120@barrios> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312081413.GA10923@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov , linux-mm , linux-kernel , Minchan Kim , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > > Minchan Kim wrote: > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > >>>> I forgot to Ccing you. > >>>> Sorry. > >>>> > >>>> ---------- Forwarded message ---------- > >>>> From: Minchan Kim > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM > >>>> Subject: Re: Control page reclaim granularity > >>>> To: Minchan Kim, linux-mm, > >>>> linux-kernel, Konstantin Khlebnikov< > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > >>>> > >>>> > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > >>>>> Hi Minchan, > >>>>> > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and > >>>>> linux-kernel > >>>>> mailing list. So please Cc me.
> >>>>> > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I > >>>>> describe the cases I known in our product system. They can be > >>>>> categorized into two cases. One is mmaped all data files into memory > >>>>> and sometime it uses write(2) to append some data, and another uses > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > >>>>> the > >>>>> second case, the application wants to keep mmaped page into memory > >>>>> and > >>>>> let file pages to be reclaimed firstly. So, IMO, when application > >>>>> uses > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants > >>>>> keep > >>>>> these mmaped pages into memory and do not be reclaimed. At least > >>>>> these > >>>>> pages do not be reclaimed early than file pages. I think that > >>>>> maybe we > >>>>> can recover that routine and provide a sysctl parameter to let the > >>>>> user > >>>>> to set this ratio between mmaped pages and file pages. > >>>> > >>>> I am not convinced why we should handle mapped page specially. > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to > >>>> read > >>>> system call. > >>>> So I think we can't make sure mmaped pages are always win. > >>>> > >>>> My suggestion is that it would be better to declare by user explicitly. > >>>> I think we can implement it by madvise and fadvise's WILLNEED option. > >>>> Current implementation is just readahead if there isn't a page in > >>>> memory > >>>> but I think > >>>> we can promote from inactive to active if there is already a page in > >>>> memory. > >>>> > >>>> It's more clear and it couldn't be affected by kernel page reclaim > >>>> algorithm change > >>>> like this. > >>> > >>> Thank you for your advice. But I still have question about this > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > >>> option, it will cause an inconsistently status for pages that be > >>> manipulated by madvise(2) and/or fadvise(2). 
For example, when I call > >>> madvise with WILLNEED flag, some pages will be moved into active list if > >>> they already have been in memory, and other pages will be read into > >>> memory and be saved in inactive list if they don't be in memory. Then > >>> pages that are in inactive list are possible to be reclaim. So from the > >>> view of users, it is inconsistent because some pages are in memory and > >>> some pages are reclaimed. But actually the user hopes that all of pages > >>> can be kept in memory. IMHO, this inconsistency is weird and makes > >>> users > >>> puzzled. > >> > >> Now problem is that > >> > >> 1. User want to keep pages which are used once in a while in memory. > >> 2. Kernel want to reclaim them because they are surely reclaim target > >> pages in point of view by LRU. > >> > >> The most desriable approach is that user should use mlock to guarantee > >> them in memory. But mlock is too big overhead and user doesn't want to > >> keep > >> memory all pages all at once.(Ie, he want demand paging when he need > >> the page) > >> Right? > >> > >> madvise, it's a just hint for kernel and kernel doesn't need to make > >> sure madvise's behavior. > >> In point of view, such inconsistency might not be a big problem. > >> > >> Big problem I think now is that user should use madvise(WILLNEED) > >> periodically because such > >> activation happens once when user calls madvise. If user doesn't use > >> page frequently after > >> user calls it, it ends up moving into inactive list and even could be > >> reclaimed. > >> It's not good. :-( > >> > >> Okay. How about adding new VM_WORKINGSET? > >> And reclaimer would give one more round trip in active/inactive list > >> erwhen reclaim happens > >> if the page is referenced. > >> > >> Sigh. We have no room for new VM_FLAG in 32 bit. > > p > > It would be nice to mark struct address_space with this flag and export > > AS_UNEVICTABLE somehow. 
> > Maybe we can reuse file-locking engine for managing these bits =) > > Make sense to me. We can mark this flag in struct address_space and check > it in page_refereneced_file(). If this flag is set, it will be cleard and The disadvantage is that this sets reclaim granularity per inode. I want to set it per VMA, not per inode. > the function returns referenced > 1. Then this page can be promoted into > activate list. But I prefer to set/clear this flag in madvise. Hmm, my idea is as follows: if we can set a new VM flag in the VMA or something similar, the reclaimer can check it in shrink_[in]active_list and avoid deactivating/reclaiming a page if it sees that the page is in a VMA marked with the new VM flag and the page has been referenced at least once recently. That gives the page one more round trip in its current list (i.e., the active or inactive list) rather than activating it, so the page becomes less reclaimable. > > PS, I have subscribed linux-mm mailing list. :-) Congratulations! :) > > Regards, > Zheng
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id 073016B0044 for ; Mon, 12 Mar 2012 10:18:25 -0400 (EDT) Received: by bkwq16 with SMTP id q16so3695536bkw.14 for ; Mon, 12 Mar 2012 07:18:24 -0700 (PDT) Message-ID: <4F5E05AD.20200@openvz.org> Date: Mon, 12 Mar 2012 18:18:21 +0400 From: Konstantin Khlebnikov MIME-Version: 1.0 Subject: Re: Fwd: Control page reclaim granularity References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> In-Reply-To: <20120312134226.GA5120@barrios> Content-Type: multipart/mixed; boundary="------------040505090707040106050504" Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" This is a multi-part message in MIME format. --------------040505090707040106050504 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Minchan Kim wrote: > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: >> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: >>> Minchan Kim wrote: >>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >>>>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: >>>>>> I forgot to Ccing you. >>>>>> Sorry.
>>>>>> >>>>>> ---------- Forwarded message ---------- >>>>>> From: Minchan Kim >>>>>> Date: Mon, Mar 12, 2012 at 9:28 AM >>>>>> Subject: Re: Control page reclaim granularity >>>>>> To: Minchan Kim, linux-mm, >>>>>> linux-kernel, Konstantin Khlebnikov< >>>>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com >>>>>> >>>>>> >>>>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: >>>>>>> Hi Minchan, >>>>>>> >>>>>>> Sorry, I forgot to say that I don't subscribe linux-mm and >>>>>>> linux-kernel >>>>>>> mailing list. So please Cc me. >>>>>>> >>>>>>> IMHO, maybe we should re-think about how does user use mmap(2). I >>>>>>> describe the cases I known in our product system. They can be >>>>>>> categorized into two cases. One is mmaped all data files into memory >>>>>>> and sometime it uses write(2) to append some data, and another uses >>>>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In >>>>>>> the >>>>>>> second case, the application wants to keep mmaped page into memory >>>>>>> and >>>>>>> let file pages to be reclaimed firstly. So, IMO, when application >>>>>>> uses >>>>>>> mmap(2) to manipulate files, it is possible to imply that it wants >>>>>>> keep >>>>>>> these mmaped pages into memory and do not be reclaimed. At least >>>>>>> these >>>>>>> pages do not be reclaimed early than file pages. I think that >>>>>>> maybe we >>>>>>> can recover that routine and provide a sysctl parameter to let the >>>>>>> user >>>>>>> to set this ratio between mmaped pages and file pages. >>>>>> >>>>>> I am not convinced why we should handle mapped page specially. >>>>>> Sometimem, someone may use mmap by reducing buffer copy compared to >>>>>> read >>>>>> system call. >>>>>> So I think we can't make sure mmaped pages are always win. >>>>>> >>>>>> My suggestion is that it would be better to declare by user explicitly. >>>>>> I think we can implement it by madvise and fadvise's WILLNEED option. 
>>>>>> Current implementation is just readahead if there isn't a page in >>>>>> memory >>>>>> but I think >>>>>> we can promote from inactive to active if there is already a page in >>>>>> memory. >>>>>> >>>>>> It's more clear and it couldn't be affected by kernel page reclaim >>>>>> algorithm change >>>>>> like this. >>>>> >>>>> Thank you for your advice. But I still have question about this >>>>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED >>>>> option, it will cause an inconsistently status for pages that be >>>>> manipulated by madvise(2) and/or fadvise(2). For example, when I call >>>>> madvise with WILLNEED flag, some pages will be moved into active list if >>>>> they already have been in memory, and other pages will be read into >>>>> memory and be saved in inactive list if they don't be in memory. Then >>>>> pages that are in inactive list are possible to be reclaim. So from the >>>>> view of users, it is inconsistent because some pages are in memory and >>>>> some pages are reclaimed. But actually the user hopes that all of pages >>>>> can be kept in memory. IMHO, this inconsistency is weird and makes >>>>> users >>>>> puzzled. >>>> >>>> Now problem is that >>>> >>>> 1. User want to keep pages which are used once in a while in memory. >>>> 2. Kernel want to reclaim them because they are surely reclaim target >>>> pages in point of view by LRU. >>>> >>>> The most desriable approach is that user should use mlock to guarantee >>>> them in memory. But mlock is too big overhead and user doesn't want to >>>> keep >>>> memory all pages all at once.(Ie, he want demand paging when he need >>>> the page) >>>> Right? >>>> >>>> madvise, it's a just hint for kernel and kernel doesn't need to make >>>> sure madvise's behavior. >>>> In point of view, such inconsistency might not be a big problem. 
>>>> >>>> Big problem I think now is that user should use madvise(WILLNEED) >>>> periodically because such >>>> activation happens once when user calls madvise. If user doesn't use >>>> page frequently after >>>> user calls it, it ends up moving into inactive list and even could be >>>> reclaimed. >>>> It's not good. :-( >>>> >>>> Okay. How about adding new VM_WORKINGSET? >>>> And reclaimer would give one more round trip in active/inactive list >>>> erwhen reclaim happens >>>> if the page is referenced. >>>> >>>> Sigh. We have no room for new VM_FLAG in 32 bit. >>> p >>> It would be nice to mark struct address_space with this flag and export >>> AS_UNEVICTABLE somehow. >>> Maybe we can reuse file-locking engine for managing these bits =) >> >> Make sense to me. We can mark this flag in struct address_space and check >> it in page_refereneced_file(). If this flag is set, it will be cleard and > > Disadvantage is that we could set reclaim granularity as per-inode. > I want to set it as per-vma, not per-inode. But with per-inode flag we can tune all files, not only memory-mapped. See, attached patch. Currently I thinking about managing code, file-locking engine really fits perfectly =) > >> the function returns referenced> 1. Then this page can be promoted into >> activate list. But I prefer to set/clear this flag in madvise. > > Hmm, My idea is following as, > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which > are set by new VM flag and the page is referenced recently at least once. > It means it gives one more round trip in his list(ie, active/inactive list) > rather than activation so that the page would become less reclaimable. > >> >> PS, I have subscribed linux-mm mailing list. :-) > > Congratulations! 
:) > >> >> Regards, >> Zheng > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: email@kvack.org --------------040505090707040106050504 Content-Type: text/plain; name="mm-introduce-mapping-as_workingset-flag" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="mm-introduce-mapping-as_workingset-flag" mm: introduce mapping AS_WORKINGSET flag From: Konstantin Khlebnikov This patch introduces new flag AS_WORKINGSET in mapping->flags. If it set reclaimer will activates all pages for this inode after first usage. Signed-off-by: Konstantin Khlebnikov --- include/linux/pagemap.h | 16 ++++++++++++++++ mm/vmscan.c | 15 ++++++++++++--- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index cfaaa69..c15fc17 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -24,6 +24,7 @@ enum mapping_flags { AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ + AS_WORKINGSET = __GFP_BITS_SHIFT + 4, /* promote pages activation */ }; static inline void mapping_set_error(struct address_space *mapping, int error) @@ -53,6 +54,21 @@ static inline int mapping_unevictable(struct address_space *mapping) return !!mapping; } +static inline void mapping_set_workingset(struct address_space *mapping) +{ + set_bit(AS_WORKINGSET, &mapping->flags); +} + +static inline void mapping_clear_workingset(struct address_space *mapping) +{ + clear_bit(AS_WORKINGSET, &mapping->flags); +} + +static inline int mapping_test_workingset(struct address_space *mapping) +{ + return mapping && test_bit(AS_WORKINGSET, &mapping->flags); +} + static inline gfp_t 
mapping_gfp_mask(struct address_space * mapping) { return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; diff --git a/mm/vmscan.c b/mm/vmscan.c index 57b9658..5ccbe8c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -701,6 +701,7 @@ enum page_references { }; static enum page_references page_check_references(struct page *page, + struct address_space *mapping, struct mem_cgroup_zone *mz, struct scan_control *sc) { @@ -721,6 +722,13 @@ static enum page_references page_check_references(struct page *page, if (vm_flags & VM_LOCKED) return PAGEREF_RECLAIM; + /* + * Activate workingset page if referenced at least once. + */ + if (mapping_test_workingset(mapping) && + (referenced_ptes || referenced_page)) + return PAGEREF_ACTIVATE; + if (referenced_ptes) { if (PageAnon(page)) return PAGEREF_ACTIVATE; @@ -828,7 +836,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, } } - references = page_check_references(page, mz, sc); + mapping = page_mapping(page); + + references = page_check_references(page, mapping, mz, sc); switch (references) { case PAGEREF_ACTIVATE: goto activate_locked; @@ -848,11 +858,10 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto keep_locked; if (!add_to_swap(page)) goto activate_locked; + mapping = &swapper_space; may_enter_fs = 1; } - mapping = page_mapping(page); - /* * The page is mapped into the page tables of one or more * processes. Try to unmap it here. --------------040505090707040106050504-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id D05266B0044 for ; Mon, 12 Mar 2012 10:55:34 -0400 (EDT) Message-ID: <4F5E0E5C.8040508@redhat.com> Date: Mon, 12 Mar 2012 10:55:24 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: Control page reclaim granularity References: <20120308073412.GA6975@gmail.com> <20120308093514.GA28856@barrios> In-Reply-To: <20120308093514.GA28856@barrios> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , linux-kernel , Konstantin Khlebnikov , kosaki.motohiro@jp.fujitsu.com On 03/08/2012 04:35 AM, Minchan Kim wrote: > On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote: >> Hi list, >> >> Recently we encounter a problem about page reclaim. I abstract it in here. >> The problem is that there are two different file types. One is small index >> file, and another is large data file. The index file is mmaped into memory, >> and application hope that they can be kept in memory and don't be reclaimed >> too frequently. The data file is manipulted by read/write, and they should >> be reclaimed more frequently than the index file. They should indeed be. The data pages should not get promoted to the active list unless they get referenced twice while on the inactive list. Mmaped pages, on the other hand, get promoted to the active list after just one reference. Also, as long as the inactive file list is larger than the active file list, we do not reclaim active file pages at all. > I think it's a regression since 2.6.28. > Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped). > But we removed that routine when we applied split lru page replacement. > Rik, KOSAKI. What's the rationale? 
One main reason is scalability. We have to treat pages in such a way that we do not have to search through gigabytes of memory to find a few eviction candidates to place on the inactive list - where they could get reused and stopped from eviction again. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx188.postini.com [74.125.245.188]) by kanga.kvack.org (Postfix) with SMTP id 9247E6B004A for ; Mon, 12 Mar 2012 11:10:16 -0400 (EDT) Received: by yenm8 with SMTP id m8so3400343yen.14 for ; Mon, 12 Mar 2012 08:10:15 -0700 (PDT) Date: Mon, 12 Mar 2012 23:15:43 +0800 From: Zheng Liu Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312151542.GA16253@gmail.com> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312134226.GA5120@barrios> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: Konstantin Khlebnikov , linux-mm , linux-kernel , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote: > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > > > Minchan Kim wrote: > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > > >>>> I forgot to Ccing you. > > >>>> Sorry.
> > >>>> > > >>>> ---------- Forwarded message ---------- > > >>>> From: Minchan Kim > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM > > >>>> Subject: Re: Control page reclaim granularity > > >>>> To: Minchan Kim, linux-mm, > > >>>> linux-kernel, Konstantin Khlebnikov< > > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > > >>>> > > >>>> > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > > >>>>> Hi Minchan, > > >>>>> > > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and > > >>>>> linux-kernel > > >>>>> mailing list. So please Cc me. > > >>>>> > > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I > > >>>>> describe the cases I known in our product system. They can be > > >>>>> categorized into two cases. One is mmaped all data files into memory > > >>>>> and sometime it uses write(2) to append some data, and another uses > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > > >>>>> the > > >>>>> second case, the application wants to keep mmaped page into memory > > >>>>> and > > >>>>> let file pages to be reclaimed firstly. So, IMO, when application > > >>>>> uses > > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants > > >>>>> keep > > >>>>> these mmaped pages into memory and do not be reclaimed. At least > > >>>>> these > > >>>>> pages do not be reclaimed early than file pages. I think that > > >>>>> maybe we > > >>>>> can recover that routine and provide a sysctl parameter to let the > > >>>>> user > > >>>>> to set this ratio between mmaped pages and file pages. > > >>>> > > >>>> I am not convinced why we should handle mapped page specially. > > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to > > >>>> read > > >>>> system call. > > >>>> So I think we can't make sure mmaped pages are always win. > > >>>> > > >>>> My suggestion is that it would be better to declare by user explicitly. 
> > >>>> I think we can implement it by madvise and fadvise's WILLNEED option. > > >>>> Current implementation is just readahead if there isn't a page in > > >>>> memory > > >>>> but I think > > >>>> we can promote from inactive to active if there is already a page in > > >>>> memory. > > >>>> > > >>>> It's more clear and it couldn't be affected by kernel page reclaim > > >>>> algorithm change > > >>>> like this. > > >>> > > >>> Thank you for your advice. But I still have question about this > > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > > >>> option, it will cause an inconsistently status for pages that be > > >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call > > >>> madvise with WILLNEED flag, some pages will be moved into active list if > > >>> they already have been in memory, and other pages will be read into > > >>> memory and be saved in inactive list if they don't be in memory. Then > > >>> pages that are in inactive list are possible to be reclaim. So from the > > >>> view of users, it is inconsistent because some pages are in memory and > > >>> some pages are reclaimed. But actually the user hopes that all of pages > > >>> can be kept in memory. IMHO, this inconsistency is weird and makes > > >>> users > > >>> puzzled. > > >> > > >> Now problem is that > > >> > > >> 1. User want to keep pages which are used once in a while in memory. > > >> 2. Kernel want to reclaim them because they are surely reclaim target > > >> pages in point of view by LRU. > > >> > > >> The most desriable approach is that user should use mlock to guarantee > > >> them in memory. But mlock is too big overhead and user doesn't want to > > >> keep > > >> memory all pages all at once.(Ie, he want demand paging when he need > > >> the page) > > >> Right? > > >> > > >> madvise, it's a just hint for kernel and kernel doesn't need to make > > >> sure madvise's behavior. 
> > >> In point of view, such inconsistency might not be a big problem. > > >> > > >> Big problem I think now is that user should use madvise(WILLNEED) > > >> periodically because such > > >> activation happens once when user calls madvise. If user doesn't use > > >> page frequently after > > >> user calls it, it ends up moving into inactive list and even could be > > >> reclaimed. > > >> It's not good. :-( > > >> > > >> Okay. How about adding new VM_WORKINGSET? > > >> And reclaimer would give one more round trip in active/inactive list > > >> erwhen reclaim happens > > >> if the page is referenced. > > >> > > >> Sigh. We have no room for new VM_FLAG in 32 bit. > > > p > > > It would be nice to mark struct address_space with this flag and export > > > AS_UNEVICTABLE somehow. > > > Maybe we can reuse file-locking engine for managing these bits =) > > > > Make sense to me. We can mark this flag in struct address_space and check > > it in page_refereneced_file(). If this flag is set, it will be cleard and > > Disadvantage is that we could set reclaim granularity as per-inode. > I want to set it as per-vma, not per-inode. I don't think this is a disadvantage. This per-inode reclaim granularity is useful for us. Actually I have thought to implement a per-inode memcg to let different file sets to be reclaimed separately. So maybe we can provide two mechanisms to let the user to choose how to use them. > > > the function returns referenced > 1. Then this page can be promoted into > > activate list. But I prefer to set/clear this flag in madvise. > > Hmm, My idea is following as, > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which > are set by new VM flag and the page is referenced recently at least once. 
> It means it gives one more round trip in his list(ie, active/inactive list) > rather than activation so that the page would become less reclaimable. No matter what the page is given one more round trip or is promoted into active list, it can satisfy our current requirement. So now the question is which is better. If we add a new VM flag, as you said before, vma->vm_flags has no room for it in 32 bit. I have noticed that this topic has been discussed [1] and the result is that vm_flags is still a unsigned long type. So we need to use a tricky technique to solve it. If we add a new flag in struct addpress_space, it might be easy to implement it. 1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html Regards, Zheng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx146.postini.com [74.125.245.146]) by kanga.kvack.org (Postfix) with SMTP id DDB9B6B004A for ; Mon, 12 Mar 2012 22:48:27 -0400 (EDT) Received: by dadv6 with SMTP id v6so152627dad.14 for ; Mon, 12 Mar 2012 19:48:27 -0700 (PDT) Date: Tue, 13 Mar 2012 11:48:18 +0900 From: Minchan Kim Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313024818.GA7125@barrios> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <4F5E05AD.20200@openvz.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F5E05AD.20200@openvz.org> Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov Cc: Minchan Kim , linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote: > Minchan Kim wrote: > >On 
Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > >>On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > >>>Minchan Kim wrote: > >>>>On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > >>>>>On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > >>>>>>I forgot to Ccing you. > >>>>>>Sorry. > >>>>>> > >>>>>>---------- Forwarded message ---------- > >>>>>>From: Minchan Kim > >>>>>>Date: Mon, Mar 12, 2012 at 9:28 AM > >>>>>>Subject: Re: Control page reclaim granularity > >>>>>>To: Minchan Kim, linux-mm, > >>>>>>linux-kernel, Konstantin Khlebnikov< > >>>>>>khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > >>>>>> > >>>>>> > >>>>>>On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > >>>>>>>Hi Minchan, > >>>>>>> > >>>>>>>Sorry, I forgot to say that I don't subscribe linux-mm and > >>>>>>>linux-kernel > >>>>>>>mailing list. So please Cc me. > >>>>>>> > >>>>>>>IMHO, maybe we should re-think about how does user use mmap(2). I > >>>>>>>describe the cases I known in our product system. They can be > >>>>>>>categorized into two cases. One is mmaped all data files into memory > >>>>>>>and sometime it uses write(2) to append some data, and another uses > >>>>>>>mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > >>>>>>>the > >>>>>>>second case, the application wants to keep mmaped page into memory > >>>>>>>and > >>>>>>>let file pages to be reclaimed firstly. So, IMO, when application > >>>>>>>uses > >>>>>>>mmap(2) to manipulate files, it is possible to imply that it wants > >>>>>>>keep > >>>>>>>these mmaped pages into memory and do not be reclaimed. At least > >>>>>>>these > >>>>>>>pages do not be reclaimed early than file pages. I think that > >>>>>>>maybe we > >>>>>>>can recover that routine and provide a sysctl parameter to let the > >>>>>>>user > >>>>>>>to set this ratio between mmaped pages and file pages. > >>>>>> > >>>>>>I am not convinced why we should handle mapped page specially. 
> >>>>>>Sometimem, someone may use mmap by reducing buffer copy compared to > >>>>>>read > >>>>>>system call. > >>>>>>So I think we can't make sure mmaped pages are always win. > >>>>>> > >>>>>>My suggestion is that it would be better to declare by user explicitly. > >>>>>>I think we can implement it by madvise and fadvise's WILLNEED option. > >>>>>>Current implementation is just readahead if there isn't a page in > >>>>>>memory > >>>>>>but I think > >>>>>>we can promote from inactive to active if there is already a page in > >>>>>>memory. > >>>>>> > >>>>>>It's more clear and it couldn't be affected by kernel page reclaim > >>>>>>algorithm change > >>>>>>like this. > >>>>> > >>>>>Thank you for your advice. But I still have question about this > >>>>>solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > >>>>>option, it will cause an inconsistently status for pages that be > >>>>>manipulated by madvise(2) and/or fadvise(2). For example, when I call > >>>>>madvise with WILLNEED flag, some pages will be moved into active list if > >>>>>they already have been in memory, and other pages will be read into > >>>>>memory and be saved in inactive list if they don't be in memory. Then > >>>>>pages that are in inactive list are possible to be reclaim. So from the > >>>>>view of users, it is inconsistent because some pages are in memory and > >>>>>some pages are reclaimed. But actually the user hopes that all of pages > >>>>>can be kept in memory. IMHO, this inconsistency is weird and makes > >>>>>users > >>>>>puzzled. > >>>> > >>>>Now problem is that > >>>> > >>>>1. User want to keep pages which are used once in a while in memory. > >>>>2. Kernel want to reclaim them because they are surely reclaim target > >>>> pages in point of view by LRU. > >>>> > >>>>The most desriable approach is that user should use mlock to guarantee > >>>>them in memory. 
But mlock is too big overhead and user doesn't want to > >>>>keep > >>>>memory all pages all at once.(Ie, he want demand paging when he need > >>>>the page) > >>>>Right? > >>>> > >>>>madvise, it's a just hint for kernel and kernel doesn't need to make > >>>>sure madvise's behavior. > >>>>In point of view, such inconsistency might not be a big problem. > >>>> > >>>>Big problem I think now is that user should use madvise(WILLNEED) > >>>>periodically because such > >>>>activation happens once when user calls madvise. If user doesn't use > >>>>page frequently after > >>>>user calls it, it ends up moving into inactive list and even could be > >>>>reclaimed. > >>>>It's not good. :-( > >>>> > >>>>Okay. How about adding new VM_WORKINGSET? > >>>>And reclaimer would give one more round trip in active/inactive list > >>>>erwhen reclaim happens > >>>>if the page is referenced. > >>>> > >>>>Sigh. We have no room for new VM_FLAG in 32 bit. > >>>p > >>>It would be nice to mark struct address_space with this flag and export > >>>AS_UNEVICTABLE somehow. > >>>Maybe we can reuse file-locking engine for managing these bits =) > >> > >>Make sense to me. We can mark this flag in struct address_space and check > >>it in page_refereneced_file(). If this flag is set, it will be cleard and > > > >Disadvantage is that we could set reclaim granularity as per-inode. > >I want to set it as per-vma, not per-inode. > > But with per-inode flag we can tune all files, not only memory-mapped. I don't oppose a per-inode setting, but I believe we still need a per-file-range or per-mmapped-vma setting. One file may have parts with different characteristics: some part is the working set and some part is streamed. > See, attached patch. Currently I thinking about managing code, > file-locking engine really fits perfectly =) File-locking engine? Are you considering fcntl as the interface for it? What do you mean? > > > > >>the function returns referenced> 1. Then this page can be promoted into > >>activate list.
But I prefer to set/clear this flag in madvise. > > > >Hmm, My idea is following as, > >If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list > >and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which > >are set by new VM flag and the page is referenced recently at least once. > >It means it gives one more round trip in his list(ie, active/inactive list) > >rather than activation so that the page would become less reclaimable. > > > >> > >>PS, I have subscribed linux-mm mailing list. :-) > > > >Congratulations! :) > > > >> > >>Regards, > >>Zheng > > > >-- > >To unsubscribe, send a message with 'unsubscribe linux-mm' in > >the body to majordomo@kvack.org. For more info on Linux MM, > >see: http://www.linux-mm.org/ . > >Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > >Don't email: email@kvack.org > > mm: introduce mapping AS_WORKINGSET flag > > From: Konstantin Khlebnikov > > This patch introduces new flag AS_WORKINGSET in mapping->flags. > If it set reclaimer will activates all pages for this inode after first usage. 
> > Signed-off-by: Konstantin Khlebnikov > --- > include/linux/pagemap.h | 16 ++++++++++++++++ > mm/vmscan.c | 15 ++++++++++++--- > 2 files changed, 28 insertions(+), 3 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index cfaaa69..c15fc17 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -24,6 +24,7 @@ enum mapping_flags { > AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ > AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ > AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ > + AS_WORKINGSET = __GFP_BITS_SHIFT + 4, /* promote pages activation */ > }; > > static inline void mapping_set_error(struct address_space *mapping, int error) > @@ -53,6 +54,21 @@ static inline int mapping_unevictable(struct address_space *mapping) > return !!mapping; > } > > +static inline void mapping_set_workingset(struct address_space *mapping) > +{ > + set_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline void mapping_clear_workingset(struct address_space *mapping) > +{ > + clear_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline int mapping_test_workingset(struct address_space *mapping) > +{ > + return mapping && test_bit(AS_WORKINGSET, &mapping->flags); > +} > + > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > { > return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 57b9658..5ccbe8c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -701,6 +701,7 @@ enum page_references { > }; > > static enum page_references page_check_references(struct page *page, > + struct address_space *mapping, > struct mem_cgroup_zone *mz, > struct scan_control *sc) > { > @@ -721,6 +722,13 @@ static enum page_references page_check_references(struct page *page, > if (vm_flags & VM_LOCKED) > return PAGEREF_RECLAIM; > > + /* > + * Activate workingset page if referenced at least once. 
> + */ > + if (mapping_test_workingset(mapping) && > + (referenced_ptes || referenced_page)) > + return PAGEREF_ACTIVATE; > + > if (referenced_ptes) { > if (PageAnon(page)) > return PAGEREF_ACTIVATE; > @@ -828,7 +836,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > } > > - references = page_check_references(page, mz, sc); > + mapping = page_mapping(page); > + > + references = page_check_references(page, mapping, mz, sc); > switch (references) { > case PAGEREF_ACTIVATE: > goto activate_locked; > @@ -848,11 +858,10 @@ static unsigned long shrink_page_list(struct list_head *page_list, > goto keep_locked; > if (!add_to_swap(page)) > goto activate_locked; > + mapping = &swapper_space; > may_enter_fs = 1; > } > > - mapping = page_mapping(page); > - > /* > * The page is mapped into the page tables of one or more > * processes. Try to unmap it here. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 8FDEC6B004D for ; Mon, 12 Mar 2012 22:51:15 -0400 (EDT) Received: by dadv6 with SMTP id v6so156637dad.14 for ; Mon, 12 Mar 2012 19:51:14 -0700 (PDT) Date: Tue, 13 Mar 2012 11:51:08 +0900 From: Minchan Kim Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313025108.GB7125@barrios> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <20120312151542.GA16253@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312151542.GA16253@gmail.com> Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim , Konstantin Khlebnikov , linux-mm , linux-kernel , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com On Mon, Mar 12, 2012 at 11:15:43PM +0800, Zheng Liu wrote: > On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote: > > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > > > > Minchan Kim wrote: > > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > > > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > > > >>>> I forgot to Ccing you. > > > >>>> Sorry. 
> > > >>>> > > > >>>> ---------- Forwarded message ---------- > > > >>>> From: Minchan Kim > > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM > > > >>>> Subject: Re: Control page reclaim granularity > > > >>>> To: Minchan Kim, linux-mm, > > > >>>> linux-kernel, Konstantin Khlebnikov< > > > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > > > >>>> > > > >>>> > > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > > > >>>>> Hi Minchan, > > > >>>>> > > > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and > > > >>>>> linux-kernel > > > >>>>> mailing list. So please Cc me. > > > >>>>> > > > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I > > > >>>>> describe the cases I known in our product system. They can be > > > >>>>> categorized into two cases. One is mmaped all data files into memory > > > >>>>> and sometime it uses write(2) to append some data, and another uses > > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > > > >>>>> the > > > >>>>> second case, the application wants to keep mmaped page into memory > > > >>>>> and > > > >>>>> let file pages to be reclaimed firstly. So, IMO, when application > > > >>>>> uses > > > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants > > > >>>>> keep > > > >>>>> these mmaped pages into memory and do not be reclaimed. At least > > > >>>>> these > > > >>>>> pages do not be reclaimed early than file pages. I think that > > > >>>>> maybe we > > > >>>>> can recover that routine and provide a sysctl parameter to let the > > > >>>>> user > > > >>>>> to set this ratio between mmaped pages and file pages. > > > >>>> > > > >>>> I am not convinced why we should handle mapped page specially. > > > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to > > > >>>> read > > > >>>> system call. > > > >>>> So I think we can't make sure mmaped pages are always win. 
> > > >>>> > > > >>>> My suggestion is that it would be better to declare by user explicitly. > > > >>>> I think we can implement it by madvise and fadvise's WILLNEED option. > > > >>>> Current implementation is just readahead if there isn't a page in > > > >>>> memory > > > >>>> but I think > > > >>>> we can promote from inactive to active if there is already a page in > > > >>>> memory. > > > >>>> > > > >>>> It's more clear and it couldn't be affected by kernel page reclaim > > > >>>> algorithm change > > > >>>> like this. > > > >>> > > > >>> Thank you for your advice. But I still have question about this > > > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > > > >>> option, it will cause an inconsistently status for pages that be > > > >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call > > > >>> madvise with WILLNEED flag, some pages will be moved into active list if > > > >>> they already have been in memory, and other pages will be read into > > > >>> memory and be saved in inactive list if they don't be in memory. Then > > > >>> pages that are in inactive list are possible to be reclaim. So from the > > > >>> view of users, it is inconsistent because some pages are in memory and > > > >>> some pages are reclaimed. But actually the user hopes that all of pages > > > >>> can be kept in memory. IMHO, this inconsistency is weird and makes > > > >>> users > > > >>> puzzled. > > > >> > > > >> Now problem is that > > > >> > > > >> 1. User want to keep pages which are used once in a while in memory. > > > >> 2. Kernel want to reclaim them because they are surely reclaim target > > > >> pages in point of view by LRU. > > > >> > > > >> The most desriable approach is that user should use mlock to guarantee > > > >> them in memory. But mlock is too big overhead and user doesn't want to > > > >> keep > > > >> memory all pages all at once.(Ie, he want demand paging when he need > > > >> the page) > > > >> Right? 
> > > >> > > > >> madvise, it's a just hint for kernel and kernel doesn't need to make > > > >> sure madvise's behavior. > > > >> In point of view, such inconsistency might not be a big problem. > > > >> > > > >> Big problem I think now is that user should use madvise(WILLNEED) > > > >> periodically because such > > > >> activation happens once when user calls madvise. If user doesn't use > > > >> page frequently after > > > >> user calls it, it ends up moving into inactive list and even could be > > > >> reclaimed. > > > >> It's not good. :-( > > > >> > > > >> Okay. How about adding new VM_WORKINGSET? > > > >> And reclaimer would give one more round trip in active/inactive list > > > >> erwhen reclaim happens > > > >> if the page is referenced. > > > >> > > > >> Sigh. We have no room for new VM_FLAG in 32 bit. > > > > p > > > > It would be nice to mark struct address_space with this flag and export > > > > AS_UNEVICTABLE somehow. > > > > Maybe we can reuse file-locking engine for managing these bits =) > > > > > > Make sense to me. We can mark this flag in struct address_space and check > > > it in page_refereneced_file(). If this flag is set, it will be cleard and > > > > Disadvantage is that we could set reclaim granularity as per-inode. > > I want to set it as per-vma, not per-inode. > > I don't think this is a disadvantage. This per-inode reclaim > granularity is useful for us. Actually I have thought to implement a > per-inode memcg to let different file sets to be reclaimed separately. > So maybe we can provide two mechanisms to let the user to choose how to > use them. I don't oppose supporting both mechanism but I don't want to give only per-inode approach. > > > > > > the function returns referenced > 1. Then this page can be promoted into > > > activate list. But I prefer to set/clear this flag in madvise. 
> > > > Hmm, My idea is following as, > > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list > > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which > > are set by new VM flag and the page is referenced recently at least once. > > It means it gives one more round trip in his list(ie, active/inactive list) > > rather than activation so that the page would become less reclaimable. > > No matter what the page is given one more round trip or is promoted into > active list, it can satisfy our current requirement. So now the > question is which is better. If we add a new VM flag, as you said > before, vma->vm_flags has no room for it in 32 bit. I have noticed that > this topic has been discussed [1] and the result is that vm_flags is > still a unsigned long type. So we need to use a tricky technique to solve > it. If we add a new flag in struct addpress_space, it might be easy to > implement it. In case of per-inode, it's good but it doesn't work for per-vma and file-range. > > 1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html > > Regards, > Zheng -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx120.postini.com [74.125.245.120]) by kanga.kvack.org (Postfix) with SMTP id 3FC8F6B004A for ; Mon, 12 Mar 2012 22:58:04 -0400 (EDT) Received: by dadv6 with SMTP id v6so167143dad.14 for ; Mon, 12 Mar 2012 19:58:03 -0700 (PDT) Date: Tue, 13 Mar 2012 11:57:57 +0900 From: Minchan Kim Subject: Re: Control page reclaim granularity Message-ID: <20120313025756.GC7125@barrios> References: <20120308073412.GA6975@gmail.com> <20120308093514.GA28856@barrios> <4F5E0E5C.8040508@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F5E0E5C.8040508@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , kosaki.motohiro@jp.fujitsu.com On Mon, Mar 12, 2012 at 10:55:24AM -0400, Rik van Riel wrote: > On 03/08/2012 04:35 AM, Minchan Kim wrote: > >On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote: > >>Hi list, > >> > >>Recently we encounter a problem about page reclaim. I abstract it in here. > >>The problem is that there are two different file types. One is small index > >>file, and another is large data file. The index file is mmaped into memory, > >>and application hope that they can be kept in memory and don't be reclaimed > >>too frequently. The data file is manipulted by read/write, and they should > >>be reclaimed more frequently than the index file. > > They should indeed be. The data pages should not get promoted > to the active list unless they get referenced twice while on > the inactive list. > > Mmaped pages, on the other hand, get promoted to the active > list after just one reference. As I read the code, an mmaped page doesn't get promoted by a single reference.
It gets promoted on its second round trip, or on the first round trip if it is touched through several mappings: if (referenced_page || referenced_ptes > 1) return PAGEREF_ACTIVATE; > > Also, as long as the inactive file list is larger than the > active file list, we do not reclaim active file pages at > all. True. > > >I think it's a regression since 2.6.28. > >Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped). > >But we removed that routine when we applied split lru page replacement. > >Rik, KOSAKI. What's the rationale? > > One main reason is scalability. We have to treat pages > in such a way that we do not have to search through > gigabytes of memory to find a few eviction candidates > to place on the inactive list - where they could get > reused and stopped from eviction again. Okay. Thanks, Rik. Then another question: why did we handle mmaped pages specially at that time? Just out of curiosity. > > -- > All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx113.postini.com [74.125.245.113]) by kanga.kvack.org (Postfix) with SMTP id BDEBF6B004A for ; Tue, 13 Mar 2012 00:37:26 -0400 (EDT) Received: by bkwq16 with SMTP id q16so110337bkw.14 for ; Mon, 12 Mar 2012 21:37:24 -0700 (PDT) Message-ID: <4F5ECF01.2000402@openvz.org> Date: Tue, 13 Mar 2012 08:37:21 +0400 From: Konstantin Khlebnikov MIME-Version: 1.0 Subject: Re: Fwd: Control page reclaim granularity References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <4F5E05AD.20200@openvz.org> <20120313024818.GA7125@barrios> In-Reply-To: <20120313024818.GA7125@barrios> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" Minchan Kim wrote: > On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote: >> Minchan Kim wrote: >>> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: >>>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: >>>>> Minchan Kim wrote: >>>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >>>>>> >>>>>> Now problem is that >>>>>> >>>>>> 1. User want to keep pages which are used once in a while in memory. >>>>>> 2. Kernel want to reclaim them because they are surely reclaim target >>>>>> pages in point of view by LRU. >>>>>> >>>>>> The most desriable approach is that user should use mlock to guarantee >>>>>> them in memory. But mlock is too big overhead and user doesn't want to >>>>>> keep >>>>>> memory all pages all at once.(Ie, he want demand paging when he need >>>>>> the page) >>>>>> Right? 
>>>>>> >>>>>> madvise, it's a just hint for kernel and kernel doesn't need to make >>>>>> sure madvise's behavior. >>>>>> In point of view, such inconsistency might not be a big problem. >>>>>> >>>>>> Big problem I think now is that user should use madvise(WILLNEED) >>>>>> periodically because such >>>>>> activation happens once when user calls madvise. If user doesn't use >>>>>> page frequently after >>>>>> user calls it, it ends up moving into inactive list and even could be >>>>>> reclaimed. >>>>>> It's not good. :-( >>>>>> >>>>>> Okay. How about adding new VM_WORKINGSET? >>>>>> And reclaimer would give one more round trip in active/inactive list >>>>>> erwhen reclaim happens >>>>>> if the page is referenced. >>>>>> >>>>>> Sigh. We have no room for new VM_FLAG in 32 bit. >>>>> p >>>>> It would be nice to mark struct address_space with this flag and export >>>>> AS_UNEVICTABLE somehow. >>>>> Maybe we can reuse file-locking engine for managing these bits =) >>>> >>>> Make sense to me. We can mark this flag in struct address_space and check >>>> it in page_refereneced_file(). If this flag is set, it will be cleard and >>> >>> Disadvantage is that we could set reclaim granularity as per-inode. >>> I want to set it as per-vma, not per-inode. >> >> But with per-inode flag we can tune all files, not only memory-mapped. > > I don't oppose per-inode setting but I believe we need file range or mmapped vma, > still. One file may have different characteristic part, something is working set > something is streaming part. > >> See, attached patch. Currently I thinking about managing code, >> file-locking engine really fits perfectly =) > > file-locking engine? > You consider fcntl as interface for it? > What do you mean? > If we set bits on inode we somehow account its users and clear AS_WORKINGSET and AS_UNEVICTABLE at last file close. We can use file-locking engine for locking inodes in memory -- file lock automatically release inode at last fput(). 
Maybe that's too tricky and we should add a couple of simple atomic counters to the generic struct inode (like i_writecount/i_readcount), but in that case we would add new code on the fast path. So it looks like inventing a new kind of struct file_lock is the best approach. I don't want to implement range locking for now, but I can do it if somebody really wants it. Yes, we can use fcntl(), but fadvise() is much better. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx176.postini.com [74.125.245.176]) by kanga.kvack.org (Postfix) with SMTP id AE4C26B004A for ; Tue, 13 Mar 2012 01:00:12 -0400 (EDT) Received: by bkwq16 with SMTP id q16so120078bkw.14 for ; Mon, 12 Mar 2012 22:00:11 -0700 (PDT) Message-ID: <4F5ED458.5070301@openvz.org> Date: Tue, 13 Mar 2012 09:00:08 +0400 From: Konstantin Khlebnikov MIME-Version: 1.0 Subject: Re: Fwd: Control page reclaim granularity References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <4F5E05AD.20200@openvz.org> <20120313024818.GA7125@barrios> <4F5ECF01.2000402@openvz.org> In-Reply-To: <4F5ECF01.2000402@openvz.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Minchan Kim Cc: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" Konstantin Khlebnikov wrote: > Minchan Kim wrote: >> On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote: >>> Minchan Kim wrote: >>>> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: >>>>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: >>>>>> Minchan Kim wrote: >>>>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >
>>>>>>> >>>>>>> Now problem is that >>>>>>> >>>>>>> 1. User want to keep pages which are used once in a while in memory. >>>>>>> 2. Kernel want to reclaim them because they are surely reclaim target >>>>>>> pages in point of view by LRU. >>>>>>> >>>>>>> The most desriable approach is that user should use mlock to guarantee >>>>>>> them in memory. But mlock is too big overhead and user doesn't want to >>>>>>> keep >>>>>>> memory all pages all at once.(Ie, he want demand paging when he need >>>>>>> the page) >>>>>>> Right? >>>>>>> >>>>>>> madvise, it's a just hint for kernel and kernel doesn't need to make >>>>>>> sure madvise's behavior. >>>>>>> In point of view, such inconsistency might not be a big problem. >>>>>>> >>>>>>> Big problem I think now is that user should use madvise(WILLNEED) >>>>>>> periodically because such >>>>>>> activation happens once when user calls madvise. If user doesn't use >>>>>>> page frequently after >>>>>>> user calls it, it ends up moving into inactive list and even could be >>>>>>> reclaimed. >>>>>>> It's not good. :-( >>>>>>> >>>>>>> Okay. How about adding new VM_WORKINGSET? >>>>>>> And reclaimer would give one more round trip in active/inactive list >>>>>>> erwhen reclaim happens >>>>>>> if the page is referenced. >>>>>>> >>>>>>> Sigh. We have no room for new VM_FLAG in 32 bit. >>>>>> p >>>>>> It would be nice to mark struct address_space with this flag and export >>>>>> AS_UNEVICTABLE somehow. >>>>>> Maybe we can reuse file-locking engine for managing these bits =) >>>>> >>>>> Make sense to me. We can mark this flag in struct address_space and check >>>>> it in page_refereneced_file(). If this flag is set, it will be cleard and >>>> >>>> Disadvantage is that we could set reclaim granularity as per-inode. >>>> I want to set it as per-vma, not per-inode. >>> >>> But with per-inode flag we can tune all files, not only memory-mapped. >> >> I don't oppose per-inode setting but I believe we need file range or mmapped vma, >> still. 
One file may have different characteristic part, something is working set >> something is streaming part. >> >>> See, attached patch. Currently I thinking about managing code, >>> file-locking engine really fits perfectly =) >> >> file-locking engine? >> You consider fcntl as interface for it? >> What do you mean? >> > > If we set bits on inode we somehow account its users and clear AS_WORKINGSET and AS_UNEVICTABLE > at last file close. We can use file-locking engine for locking inodes in memory -- file lock automatically > release inode at last fput(). Maybe it's too tricky and we should add couple simple atomic counters to > generic strict inode (like i_writecount/i_readcount) but in this case we will add new code on fast-path. > So, looks like invention new kind of struct file_lock is best approach. > I don't want implement range-locking for now, but I can do it if somebody really wants this. > > Yes, we can use fcntl(), but fadvise() is much better. Another mad idea: if we mark the vma, we can add a fake vma (belonging to init_mm, for example) to the inode's rmap to lock a range of the inode's pages in memory without actually mapping the file. page_referenced_one() would have to handle this fake vma differently, because page_check_address() will always fail for it. That way we can effectively implement AS_WORKINGSET and AS_UNEVICTABLE for arbitrary page ranges. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx128.postini.com [74.125.245.128]) by kanga.kvack.org (Postfix) with SMTP id D824C6B004A for ; Tue, 13 Mar 2012 02:24:58 -0400 (EDT) Received: by iajr24 with SMTP id r24so494028iaj.14 for ; Mon, 12 Mar 2012 23:24:58 -0700 (PDT) From: Zheng Liu Subject: Re: Fwd: Control page reclaim granularity Date: Tue, 13 Mar 2012 14:30:14 +0800 Message-Id: <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> In-Reply-To: <20120313024818.GA7125@barrios> References: <20120313024818.GA7125@barrios> Sender: owner-linux-mm@kvack.org List-ID: To: minchan@kernel.org Cc: khlebnikov@openvz.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu This only a first trivial try. If this flag is set, reclaimer just give this page one more round trip rather than promote it into active list. Any comments or advices are welcomed. Regards, Zheng [PATCH] mm: per-inode mmaped page reclaim From: Zheng Liu In some cases, user wants to control mmaped page reclaim granularity. A new flag is added into struct address_space to give the page one more round trip. AS_WORKINGSET flag cannot be added in vma->vm_flags because this flag has no room for a new flag in 32 bit. Now user can call madvise(2) to set this flag for a file. If this flag is set, all pages will be given one more round trip when reclaimer tries to shrink pages. 
Signed-off-by: Zheng Liu --- include/asm-generic/mman-common.h | 2 ++ include/linux/pagemap.h | 16 ++++++++++++++++ mm/madvise.c | 8 ++++++++ mm/vmscan.c | 15 +++++++++++++++ 4 files changed, 41 insertions(+), 0 deletions(-) diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h index 787abbb..7d26c9b 100644 --- a/include/asm-generic/mman-common.h +++ b/include/asm-generic/mman-common.h @@ -48,6 +48,8 @@ #define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ #define MADV_NOHUGEPAGE 15 /* Not worth backing with hugepages */ +#define MADV_WORKINGSET 16 /* give one more round trip */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index cfaaa69..80532a0 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -24,6 +24,7 @@ enum mapping_flags { AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ + AS_WORKINGSET = __GFP_BITS_SHIFT + 4, /* give one more round trip */ }; static inline void mapping_set_error(struct address_space *mapping, int error) @@ -36,6 +37,21 @@ static inline void mapping_set_error(struct address_space *mapping, int error) } } +static inline void mapping_set_workingset(struct address_space *mapping) +{ + set_bit(AS_WORKINGSET, &mapping->flags); +} + +static inline void mapping_clear_workingset(struct address_space *mapping) +{ + clear_bit(AS_WORKINGSET, &mapping->flags); +} + +static inline int mapping_test_workingset(struct address_space *mapping) +{ + return mapping && test_bit(AS_WORKINGSET, &mapping->flags); +} + static inline void mapping_set_unevictable(struct address_space *mapping) { set_bit(AS_UNEVICTABLE, &mapping->flags); diff --git a/mm/madvise.c b/mm/madvise.c index 74bf193..8ca6c9b 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -77,6 +77,14 @@ static long madvise_behavior(struct 
vm_area_struct * vma, if (error) goto out; break; + case MADV_WORKINGSET: + if (vma->vm_file && vma->vm_file->f_mapping) { + mapping_set_workingset(vma->vm_file->f_mapping); + } else { + error = -EPERM; + goto out; + } + break; } if (new_flags == vma->vm_flags) { diff --git a/mm/vmscan.c b/mm/vmscan.c index c52b235..51f745b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -721,6 +721,15 @@ static enum page_references page_check_references(struct page *page, if (vm_flags & VM_LOCKED) return PAGEREF_RECLAIM; + /* + * give this page one more round trip because workingset + * flag is set. + */ + if (mapping_test_workingset(page_mapping(page))) { + mapping_clear_workingset(page_mapping(page)); + return PAGEREF_KEEP; + } + if (referenced_ptes) { if (PageAnon(page)) return PAGEREF_ACTIVATE; @@ -1737,6 +1746,12 @@ static void shrink_active_list(unsigned long nr_to_scan, continue; } + if (mapping_test_workingset(page_mapping(page))) { + mapping_clear_workingset(page_mapping(page)); + list_add(&page->lru, &l_active); + continue; + } + if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) { nr_rotated += hpage_nr_pages(page); /* -- 1.7.4.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx102.postini.com [74.125.245.102]) by kanga.kvack.org (Postfix) with SMTP id 3EED36B004D for ; Tue, 13 Mar 2012 02:43:11 -0400 (EDT)
Received: by dadv6 with SMTP id v6so510004dad.14 for ; Mon, 12 Mar 2012 23:43:10 -0700 (PDT)
Date: Tue, 13 Mar 2012 14:48:32 +0800
From: Zheng Liu
Subject: Re: Fwd: Control page reclaim granularity
Message-ID: <20120313064832.GA4968@gmail.com>
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: minchan@kernel.org
Cc: khlebnikov@openvz.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu

Sorry, please forgive me. This patch has a defect. AS_WORKINGSET is a
per-mapping flag, so when one page is scanned and the flag is cleared, it
is cleared for all other pages of that file too.

Regards,
Zheng

On Tue, Mar 13, 2012 at 02:30:14PM +0800, Zheng Liu wrote:
> This only a first trivial try. If this flag is set, reclaimer just give this
> page one more round trip rather than promote it into active list. Any comments
> or advices are welcomed.
>
> Regards,
> Zheng
>
> [PATCH] mm: per-inode mmaped page reclaim
>
> From: Zheng Liu
>
> In some cases, user wants to control mmaped page reclaim granularity. A new
> flag is added into struct address_space to give the page one more round trip.
> AS_WORKINGSET flag cannot be added in vma->vm_flags because this flag has no
> room for a new flag in 32 bit. Now user can call madvise(2) to set this flag
> for a file. If this flag is set, all pages will be given one more round trip
> when reclaimer tries to shrink pages.
> > Signed-off-by: Zheng Liu > --- > include/asm-generic/mman-common.h | 2 ++ > include/linux/pagemap.h | 16 ++++++++++++++++ > mm/madvise.c | 8 ++++++++ > mm/vmscan.c | 15 +++++++++++++++ > 4 files changed, 41 insertions(+), 0 deletions(-) > > diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h > index 787abbb..7d26c9b 100644 > --- a/include/asm-generic/mman-common.h > +++ b/include/asm-generic/mman-common.h > @@ -48,6 +48,8 @@ > #define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ > #define MADV_NOHUGEPAGE 15 /* Not worth backing with hugepages */ > > +#define MADV_WORKINGSET 16 /* give one more round trip */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index cfaaa69..80532a0 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -24,6 +24,7 @@ enum mapping_flags { > AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ > AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ > AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ > + AS_WORKINGSET = __GFP_BITS_SHIFT + 4, /* give one more round trip */ > }; > > static inline void mapping_set_error(struct address_space *mapping, int error) > @@ -36,6 +37,21 @@ static inline void mapping_set_error(struct address_space *mapping, int error) > } > } > > +static inline void mapping_set_workingset(struct address_space *mapping) > +{ > + set_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline void mapping_clear_workingset(struct address_space *mapping) > +{ > + clear_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline int mapping_test_workingset(struct address_space *mapping) > +{ > + return mapping && test_bit(AS_WORKINGSET, &mapping->flags); > +} > + > static inline void mapping_set_unevictable(struct address_space *mapping) > { > set_bit(AS_UNEVICTABLE, &mapping->flags); > diff --git a/mm/madvise.c b/mm/madvise.c > index 
74bf193..8ca6c9b 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -77,6 +77,14 @@ static long madvise_behavior(struct vm_area_struct * vma, > if (error) > goto out; > break; > + case MADV_WORKINGSET: > + if (vma->vm_file && vma->vm_file->f_mapping) { > + mapping_set_workingset(vma->vm_file->f_mapping); > + } else { > + error = -EPERM; > + goto out; > + } > + break; > } > > if (new_flags == vma->vm_flags) { > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c52b235..51f745b 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -721,6 +721,15 @@ static enum page_references page_check_references(struct page *page, > if (vm_flags & VM_LOCKED) > return PAGEREF_RECLAIM; > > + /* > + * give this page one more round trip because workingset > + * flag is set. > + */ > + if (mapping_test_workingset(page_mapping(page))) { > + mapping_clear_workingset(page_mapping(page)); > + return PAGEREF_KEEP; > + } > + > if (referenced_ptes) { > if (PageAnon(page)) > return PAGEREF_ACTIVATE; > @@ -1737,6 +1746,12 @@ static void shrink_active_list(unsigned long nr_to_scan, > continue; > } > > + if (mapping_test_workingset(page_mapping(page))) { > + mapping_clear_workingset(page_mapping(page)); > + list_add(&page->lru, &l_active); > + continue; > + } > + > if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) { > nr_rotated += hpage_nr_pages(page); > /* > -- > 1.7.4.1 > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx143.postini.com [74.125.245.143]) by kanga.kvack.org (Postfix) with SMTP id D722A6B004A for ; Tue, 13 Mar 2012 03:21:13 -0400 (EDT)
Received: by bkwq16 with SMTP id q16so191272bkw.14 for ; Tue, 13 Mar 2012 00:21:12 -0700 (PDT)
Message-ID: <4F5EF563.5000700@openvz.org>
Date: Tue, 13 Mar 2012 11:21:07 +0400
From: Konstantin Khlebnikov
MIME-Version: 1.0
Subject: Re: Fwd: Control page reclaim granularity
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com>
In-Reply-To: <20120313064832.GA4968@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu

Zheng Liu wrote:
> Sorry, please forgive me. This patch has a defect. When one page is
> scanned and flag is clear, all other's flags also are clear too.

Yeah, funny patch =)
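A note on the defect acknowledged above, as a minimal userspace sketch rather than kernel code: because AS_WORKINGSET lives in struct address_space, the first page of a file that the reclaimer looks at consumes the flag for every other page of that file. The toy_* names below are hypothetical stand-ins, not kernel APIs; only the test-and-clear pattern is taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct address_space: one flag shared by every page. */
struct toy_mapping {
    bool workingset;            /* models the AS_WORKINGSET bit */
};

/* Toy stand-in for struct page: it only knows which mapping it belongs to. */
struct toy_page {
    struct toy_mapping *mapping;
};

/*
 * Models the check the patch adds to page_check_references():
 * if the mapping-wide flag is set, clear it and keep the page.
 */
static bool toy_keep_page(struct toy_page *page)
{
    if (page->mapping && page->mapping->workingset) {
        page->mapping->workingset = false;
        return true;            /* PAGEREF_KEEP in the patch */
    }
    return false;               /* normal reclaim checks would follow */
}

/* Scan npages pages and count how many got the extra round trip. */
static size_t toy_scan(struct toy_page *pages, size_t npages)
{
    size_t kept = 0;

    for (size_t i = 0; i < npages; i++)
        if (toy_keep_page(&pages[i]))
            kept++;
    return kept;
}
```

Scanning several pages that share one mapping keeps only the first of them, which is why a per-page or per-range state (or a flag that is not cleared on the first hit) would be needed for the intended behaviour.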
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id D39746B004A for ; Tue, 13 Mar 2012 03:43:15 -0400 (EDT)
Received: by ggeq1 with SMTP id q1so286277gge.14 for ; Tue, 13 Mar 2012 00:43:15 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <4F5EF563.5000700@openvz.org>
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org>
Date: Tue, 13 Mar 2012 13:13:14 +0530
Message-ID:
Subject: Re: Fwd: Control page reclaim granularity
From: Kautuk Consul
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Hi,

I noticed this discussion and decided to pitch in one small idea from my side.

It would be nice to range lock an inode's pages by storing the ranges
that are to be locked.
This could also add some good routines for the kernel in terms of
range locking for a single inode.
However, wouldn't this add some overhead to shrink_page_list(), since
that code would need to go through all these ranges while trying to
reclaim a single page?

One small suggestion from my side is:
Why don't we implement something like: "Complete page-cache reclaim
control from usermode"?
In this, we can set/unset AS_UNEVICTABLE on the mapping (as Konstantin
mentioned) for a file's inode from usermode by using ioctl or fcntl, or
maybe even go as far as implementing an O_NORECL option to the open
system call.
After setting AS_UNEVICTABLE, the usermode application can choose
which pages to keep and which to remove by using fadvise(WILLNEED) and
fadvise(DONTNEED).
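For reference, the fadvise(WILLNEED) and fadvise(DONTNEED) hints mentioned here are available from userspace today as posix_fadvise(2). A minimal sketch, assuming a regular file descriptor; the helper names cache_prefetch and cache_drop are made up for illustration, and both calls remain hints that the kernel may ignore. Nothing here sets AS_UNEVICTABLE, which has no userspace interface in this proposal yet.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose posix_fadvise() under strict modes */
#endif
#include <fcntl.h>

/*
 * Ask the kernel to start reading [offset, offset + len) into the page
 * cache. len == 0 means "to the end of the file".
 * Returns 0 on success or a positive error number (errno is not set).
 */
static int cache_prefetch(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}

/*
 * Tell the kernel the range is not expected to be used soon, so its
 * clean page-cache pages may be dropped. Same return convention.
 */
static int cache_drop(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

An application following the suggestion above would call cache_prefetch() on the index ranges it wants resident and cache_drop() on data ranges it has finished streaming.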
( I think maybe the presence of any VMA is might not really be required for this idea. ) Thanks, Kautuk. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id ABF026B004D for ; Tue, 13 Mar 2012 03:47:42 -0400 (EDT) Received: by yenm8 with SMTP id m8so296710yen.14 for ; Tue, 13 Mar 2012 00:47:41 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> Date: Tue, 13 Mar 2012 13:17:41 +0530 Message-ID: Subject: Re: Fwd: Control page reclaim granularity From: Kautuk Consul Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org On Tue, Mar 13, 2012 at 1:13 PM, Kautuk Consul wrote: > Hi, > > I noticed this discussion and decided to pitch in one small idea from my side. > > It would be nice to range lock an inode's pages by storing those > ranges which would be locked. > This could also add some good routines for the kernel in terms of > range locking for a single inode. > However, wouldn't this add some overhead to shrink_page_list() since > that code would need to go through > all these ranges while trying to reclaim a single page ? > > One small suggestion from my side is: > Why don't we implement something like : "Complete page-cache reclaim > control from usermode"? 
> In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin > mentioned) for a file's > inode from usermode by using ioctl or fcntl or maybe even go as far as > implementing an O_NORECL > option to the open system call. > Of course, only an application executing with root privileges should be allowed to set the inode's mapping flags in this manner. > After setting the AS_UNEVICTABLE, the usermode application can choose > to keep and remove pages by > using the fadvise(WILLNEED) and fadvise(DONTNEED). > > ( I think maybe the presence of any VMA is might not really be > required for this idea. ) > > Thanks, > Kautuk. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx155.postini.com [74.125.245.155]) by kanga.kvack.org (Postfix) with SMTP id 511516B004A for ; Tue, 13 Mar 2012 04:00:23 -0400 (EDT) Received: by iajr24 with SMTP id r24so640240iaj.14 for ; Tue, 13 Mar 2012 01:00:22 -0700 (PDT) Date: Tue, 13 Mar 2012 16:05:35 +0800 From: Zheng Liu Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313080535.GA5243@gmail.com> References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Kautuk Consul Cc: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Mar 13, 2012 at 01:17:41PM +0530, Kautuk Consul wrote: > On Tue, Mar 13, 2012 at 1:13 PM, Kautuk Consul wrote: > > Hi, > > > > I noticed this discussion and 
decided to pitch in one small idea from my side. > > > > It would be nice to range lock an inode's pages by storing those > > ranges which would be locked. > > This could also add some good routines for the kernel in terms of > > range locking for a single inode. > > However, wouldn't this add some overhead to shrink_page_list() since > > that code would need to go through > > all these ranges while trying to reclaim a single page ? > > > > One small suggestion from my side is: > > Why don't we implement something like : "Complete page-cache reclaim > > control from usermode"? > > In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin > > mentioned) for a file's > > inode from usermode by using ioctl or fcntl or maybe even go as far as > > implementing an O_NORECL > > option to the open system call. > > > > Of course, only an application executing with root privileges should > be allowed to set the inode's > mapping flags in this manner. Hi Kautuk, IMHO, running application with root privilege is too dangerous. We should avoid it. Regards, Zheng > > > > After setting the AS_UNEVICTABLE, the usermode application can choose > > to keep and remove pages by > > using the fadvise(WILLNEED) and fadvise(DONTNEED). > > > > ( I think maybe the presence of any VMA is might not really be > > required for this idea. ) > > > > Thanks, > > Kautuk. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx198.postini.com [74.125.245.198]) by kanga.kvack.org (Postfix) with SMTP id 672D46B004A for ; Tue, 13 Mar 2012 04:04:40 -0400 (EDT)
Received: by yhr47 with SMTP id 47so295715yhr.14 for ; Tue, 13 Mar 2012 01:04:39 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20120313080535.GA5243@gmail.com>
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com>
Date: Tue, 13 Mar 2012 13:34:39 +0530
Message-ID:
Subject: Re: Fwd: Control page reclaim granularity
From: Kautuk Consul
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: Kautuk Consul , minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org

>
> Hi Kautuk,
>
> IMHO, running application with root privilege is too dangerous. We
> should avoid it.
>

I agree, but that's not my point.

All I'm saying is that we probably don't want to give normal
unprivileged usermode apps the capability to set the mapping to
AS_UNEVICTABLE as anyone can then write an application that hogs
memory without allowing the kernel to free it through memory reclaim.
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx106.postini.com [74.125.245.106]) by kanga.kvack.org (Postfix) with SMTP id B23FC6B004A for ; Tue, 13 Mar 2012 04:08:57 -0400 (EDT)
Received: by yenm8 with SMTP id m8so315159yen.14 for ; Tue, 13 Mar 2012 01:08:56 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To:
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com>
Date: Tue, 13 Mar 2012 13:38:56 +0530
Message-ID:
Subject: Re: Fwd: Control page reclaim granularity
From: Kautuk Consul
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Kautuk Consul , minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org

>
> I agree, but that's not my point.
>
> All I'm saying is that we probably don't want to give normal
> unprivileged usermode apps
> the capability to set the mapping to AS_UNEVICTABLE as anyone can then
> write an application
> that hogs memory without allowing the kernel to free it through memory reclaim.

Sorry, I mean :
"... that hogs kernel unmapped page-cache memory without allowing the
kernel to free it through memory reclaim."
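For comparison, pinning through mlock(2), which came up earlier in the thread, is already bounded for unprivileged processes by RLIMIT_MEMLOCK (processes with CAP_IPC_LOCK are exempt). A user-settable AS_UNEVICTABLE would presumably want a comparable limit; that is an assumption, not something proposed in the thread. A minimal sketch that just reads the current limit; memlock_limit is a hypothetical helper name.

```c
#include <sys/resource.h>

/*
 * Read the soft and hard RLIMIT_MEMLOCK limits (in bytes) of the
 * calling process. Either value may be RLIM_INFINITY.
 * Returns 0 on success, -1 on failure (errno set by getrlimit()).
 */
static int memlock_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;

    *soft = rl.rlim_cur;    /* what an unprivileged mlock() is checked against */
    *hard = rl.rlim_max;    /* ceiling the soft limit can be raised to */
    return 0;
}
```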
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id B13F16B007E for ; Tue, 13 Mar 2012 04:22:53 -0400 (EDT) Received: by dadv6 with SMTP id v6so691208dad.14 for ; Tue, 13 Mar 2012 01:22:53 -0700 (PDT) Date: Tue, 13 Mar 2012 16:28:18 +0800 From: Zheng Liu Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313082818.GA5421@gmail.com> References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Kautuk Consul Cc: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org On Tue, Mar 13, 2012 at 01:38:56PM +0530, Kautuk Consul wrote: > > > > I agree, but that's not my point. > > > > All I'm saying is that we probably don't want to give normal > > unprivileged usermode apps > > the capability to set the mapping to AS_UNEVICTABLE as anyone can then > > write an application > > that hogs memory without allowing the kernel to free it through memory reclaim. Yes, I think so. But it seems that there has some codes that are possible to be abused. For example, as I said previously, applications can mmap a normal data file with PROT_EXEC flag. Then this file gets a high priority to keep in memory (commit: 8cab4754). So my point is that we cannot control applications how to use these mechanisms. We just provide them and let applications to choose how to use them. :-) Regards, Zheng > > Sorry, I mean : > "... 
that hogs kernel unmapped page-cache memory without allowing the
> kernel to free it through memory reclaim."

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id 6AE156B004A for ; Tue, 13 Mar 2012 04:36:29 -0400 (EDT)
Received: by ggeq1 with SMTP id q1so332800gge.14 for ; Tue, 13 Mar 2012 01:36:28 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20120313082818.GA5421@gmail.com>
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> <20120313082818.GA5421@gmail.com>
Date: Tue, 13 Mar 2012 14:06:28 +0530
Message-ID:
Subject: Re: Fwd: Control page reclaim granularity
From: Kautuk Consul
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
List-ID:
To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org

>
> Yes, I think so. But it seems that there has some codes that are
> possible to be abused. For example, as I said previously, applications
> can mmap a normal data file with PROT_EXEC flag. Then this file gets a
> high priority to keep in memory (commit: 8cab4754). So my point is that
> we cannot control applications how to use these mechanisms. We just
> provide them and let applications to choose how to use them.
> :-)
>

That's true, but we are not talking about higher priority here,
because in extreme memory reclaim case even PROT_EXEC pages will be
reclaimed.

But I understand your point.
It might be okay to have this for all privileges applications. The only problem that might happen might be in OOM because we will have to include selection points for these page-cache pages (proportionately) while finding the most expensive process to kill. ( I'm talking about the page-cache pages which are not mapped to usermode page-tables at all. ) If any usermode application reads in an extremely huge file, whose inode has been set to AS_UNEVICTABLE, we might want to kill those applications that read in those pages(proportionately) so that the guilty application can be killed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx207.postini.com [74.125.245.207]) by kanga.kvack.org (Postfix) with SMTP id C78546B004A for ; Tue, 13 Mar 2012 05:03:48 -0400 (EDT) Received: by yenm8 with SMTP id m8so365371yen.14 for ; Tue, 13 Mar 2012 02:03:48 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> <20120313082818.GA5421@gmail.com> Date: Tue, 13 Mar 2012 14:33:47 +0530 Message-ID: Subject: Re: Fwd: Control page reclaim granularity From: Kautuk Consul Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org > The only problem that might happen might be in OOM because we will > have to include selection points for > these page-cache pages (proportionately) while finding the most > expensive process to kill. 
> ( I'm talking about the page-cache pages which are not mapped to
> usermode page-tables at all. )
>
> If any usermode application reads in an extremely huge file, whose
> inode has been set to AS_UNEVICTABLE,
> we might want to kill those applications that read in those
> pages(proportionately) so that the guilty application
> can be killed.

On some more thought, I guess for OOM and proportionate working set
accounting, the approach mentioned by Konstantin (with fake VMA) should
work fine with respect to the way oom_kill.c accounts for the virtual
address size of kill candidates.

So, I now think that the best way might indeed be to have a fake VMA to
account for the page-cache pages not mapped to usermode.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id BD6526B004A for ; Tue, 13 Mar 2012 10:57:13 -0400 (EDT)
Message-ID: <4F5F603F.2070600@redhat.com>
Date: Tue, 13 Mar 2012 10:57:03 -0400
From: Rik van Riel
MIME-Version: 1.0
Subject: Re: Control page reclaim granularity
References: <20120308073412.GA6975@gmail.com> <20120308093514.GA28856@barrios> <4F5E0E5C.8040508@redhat.com> <20120313025756.GC7125@barrios>
In-Reply-To: <20120313025756.GC7125@barrios>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Minchan Kim
Cc: linux-mm , linux-kernel , Konstantin Khlebnikov , kosaki.motohiro@jp.fujitsu.com

On 03/12/2012 10:57 PM, Minchan Kim wrote:
> On Mon, Mar 12, 2012 at 10:55:24AM -0400, Rik van Riel wrote:
>> On 03/08/2012 04:35 AM, Minchan Kim wrote:
>>> Before we were trying to keep mapped pages in memory(See
calc_reclaim_mapped). >>> But we removed that routine when we applied split lru page replacement. >>> Rik, KOSAKI. What's the rationale? >> >> One main reason is scalability. We have to treat pages >> in such a way that we do not have to search through >> gigabytes of memory to find a few eviction candidates >> to place on the inactive list - where they could get >> reused and stopped from eviction again. > > Okay. Thanks, Rik. > Then, another question. > Why did we handle mmaped page specially at that time? > Just out of curiosity. We had to, because we had only one set of LRU lists. Something had to be done to keep streaming IO from pushing other things out of memory. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx166.postini.com [74.125.245.166]) by kanga.kvack.org (Postfix) with SMTP id EEEDB6B004A for ; Wed, 14 Mar 2012 03:19:51 -0400 (EDT) Received: by laah2 with SMTP id h2so60707laa.2 for ; Wed, 14 Mar 2012 00:19:50 -0700 (PDT) From: Greg Thelen Subject: Re: Control page reclaim granularity References: <20120308161318.GA9904@gmail.com> Date: Wed, 14 Mar 2012 00:19:43 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm , Zheng Liu Cc: linux-kernel , Konstantin Khlebnikov Zheng Liu writes: > Hi Greg, > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel > mailing list. So please Cc me. > > I am glad to receive your reply and I am very interesting for your > approach. Actually I am not very familiar with CGroup. So would you > please send your patch to me if you can? Thank you all the same. 
> > Regards, > Zheng Sorry for the delay, I had trouble finding my old prototype patch. The patch below is based on v2.6.34. The patch is just an idea not a complete solution. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752833Ab2CHH27 (ORCPT ); Thu, 8 Mar 2012 02:28:59 -0500 Received: from mail-pz0-f46.google.com ([209.85.210.46]:55980 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751487Ab2CHH25 (ORCPT ); Thu, 8 Mar 2012 02:28:57 -0500 Date: Thu, 8 Mar 2012 15:34:13 +0800 From: Zheng Liu To: linux-mm , linux-kernel Cc: Konstantin Khlebnikov Subject: Control page reclaim granularity Message-ID: <20120308073412.GA6975@gmail.com> Mail-Followup-To: linux-mm , linux-kernel , Konstantin Khlebnikov MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi list, Recently we encounter a problem about page reclaim. I abstract it in here. The problem is that there are two different file types. One is small index file, and another is large data file. The index file is mmaped into memory, and application hope that they can be kept in memory and don't be reclaimed too frequently. The data file is manipulted by read/write, and they should be reclaimed more frequently than the index file. As previously discussion [1], Konstantin suggest me to mmap index file with PROT_EXEC flag. Meanwhile he provides a patch to set a flag in mm_flags to increase the priority of mmaped file pages. However, these solutions are not perfect. I review the related patches (8cab4754 and c909e993) and I think that mmaped index file with PROT_EXEC flag is too tricky. From the view of applicaton programmer, index file is a regular file that stores some data. 
So they should be mmap with PROT_READ | PROT_WRITE rather than with PROT_EXEC. As commit log said (8cab4754), the purpose of this patch is to keep executable code in memory to improve the response of application. In addition, Kongstantin's patch needs to adjust the application program. So in some cases, we cannot touch the code of application, and this patch is useless. I have discussed with Kongstantin about this problem and we think maybe kernel should provide some mechanism. For example, user can set memory pressure priorities for vma or inode, or mmaped pages and file pages can be reclaimed separately. If someone has thought about it, please let me know. Any feedbacks are welcomed. Thank you. Previously discussion: 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2 Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755489Ab2CHIrB (ORCPT ); Thu, 8 Mar 2012 03:47:01 -0500 Received: from mail-ww0-f74.google.com ([74.125.82.74]:48430 "EHLO mail-ww0-f74.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753048Ab2CHIrA (ORCPT ); Thu, 8 Mar 2012 03:47:00 -0500 From: Greg Thelen To: linux-mm Cc: linux-kernel , Konstantin Khlebnikov Subject: Re: Control page reclaim granularity References: <20120308073412.GA6975@gmail.com> Date: Thu, 08 Mar 2012 00:39:49 -0800 In-Reply-To: <20120308073412.GA6975@gmail.com> (Zheng Liu's message of "Thu, 8 Mar 2012 15:34:13 +0800") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Zheng Liu writes: > Hi list, > > Recently we encounter a problem about page reclaim. I abstract it in here. > The problem is that there are two different file types. One is small index > file, and another is large data file. 
The index file is mmaped into memory, > and application hope that they can be kept in memory and don't be reclaimed > too frequently. The data file is manipulted by read/write, and they should > be reclaimed more frequently than the index file. > > As previously discussion [1], Konstantin suggest me to mmap index file with > PROT_EXEC flag. Meanwhile he provides a patch to set a flag in mm_flags to > increase the priority of mmaped file pages. However, these solutions are > not perfect. I review the related patches (8cab4754 and c909e993) and I > think that mmaped index file with PROT_EXEC flag is too tricky. From the > view of applicaton programmer, index file is a regular file that stores > some data. So they should be mmap with PROT_READ | PROT_WRITE rather than > with PROT_EXEC. As commit log said (8cab4754), the purpose of this patch > is to keep executable code in memory to improve the response of application. > In addition, Kongstantin's patch needs to adjust the application program. > So in some cases, we cannot touch the code of application, and this patch is > useless. > > I have discussed with Kongstantin about this problem and we think maybe > kernel should provide some mechanism. For example, user can set memory > pressure priorities for vma or inode, or mmaped pages and file pages can be > reclaimed separately. If someone has thought about it, please let me know. > Any feedbacks are welcomed. Thank you. > > Previously discussion: > 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2 > > Regards, > Zheng It's not exactly the same approach, but we have toyed with the idea of charging different inodes to different cgroups. Each cgroup would have different soft/hard limits to allow for different cache behavior. 
http://www.spinics.net/lists/linux-mm/msg06006.html From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755121Ab2CHJf2 (ORCPT ); Thu, 8 Mar 2012 04:35:28 -0500 Received: from mail-pz0-f46.google.com ([209.85.210.46]:61377 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752636Ab2CHJfX (ORCPT ); Thu, 8 Mar 2012 04:35:23 -0500 Date: Thu, 8 Mar 2012 18:35:14 +0900 From: Minchan Kim To: linux-mm , linux-kernel , Konstantin Khlebnikov Cc: riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Control page reclaim granularity Message-ID: <20120308093514.GA28856@barrios> References: <20120308073412.GA6975@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120308073412.GA6975@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote: > Hi list, > > Recently we encounter a problem about page reclaim. I abstract it in here. > The problem is that there are two different file types. One is small index > file, and another is large data file. The index file is mmaped into memory, > and application hope that they can be kept in memory and don't be reclaimed > too frequently. The data file is manipulted by read/write, and they should > be reclaimed more frequently than the index file. > > As previously discussion [1], Konstantin suggest me to mmap index file with > PROT_EXEC flag. Meanwhile he provides a patch to set a flag in mm_flags to > increase the priority of mmaped file pages. However, these solutions are > not perfect. I review the related patches (8cab4754 and c909e993) and I > think that mmaped index file with PROT_EXEC flag is too tricky. From the > view of applicaton programmer, index file is a regular file that stores > some data. 
So they should be mmap with PROT_READ | PROT_WRITE rather than > with PROT_EXEC. As commit log said (8cab4754), the purpose of this patch > is to keep executable code in memory to improve the response of application. > In addition, Kongstantin's patch needs to adjust the application program. > So in some cases, we cannot touch the code of application, and this patch is > useless. > > I have discussed with Kongstantin about this problem and we think maybe > kernel should provide some mechanism. For example, user can set memory > pressure priorities for vma or inode, or mmaped pages and file pages can be > reclaimed separately. If someone has thought about it, please let me know. > Any feedbacks are welcomed. Thank you. > > Previously discussion: > 1. http://marc.info/?l=linux-mm&m=132947026019538&w=2 > > Regards, > Zheng I think it's a regression since 2.6.28. Before that, we tried to keep mapped pages in memory (see calc_reclaim_mapped), but we removed that routine when we applied the split LRU page replacement. Rik, KOSAKI, what was the rationale? We have to decide whether to restore that routine or to create new logic that keeps mapped pages in memory.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758213Ab2CHQIB (ORCPT ); Thu, 8 Mar 2012 11:08:01 -0500 Received: from mail-pw0-f46.google.com ([209.85.160.46]:42353 "EHLO mail-pw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757830Ab2CHQIA (ORCPT ); Thu, 8 Mar 2012 11:08:00 -0500 Date: Fri, 9 Mar 2012 00:13:18 +0800 From: Zheng Liu To: Greg Thelen Cc: linux-mm , linux-kernel , Konstantin Khlebnikov Subject: Re: Control page reclaim granularity Message-ID: <20120308161318.GA9904@gmail.com> Mail-Followup-To: Greg Thelen , linux-mm , linux-kernel , Konstantin Khlebnikov MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Greg, Sorry, I forgot to say that I am not subscribed to the linux-mm and linux-kernel mailing lists, so please Cc me. I am glad to receive your reply and I am very interested in your approach. Actually, I am not very familiar with cgroups. Would you please send me your patch if you can? Thank you all the same.
Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758296Ab2CHQsr (ORCPT ); Thu, 8 Mar 2012 11:48:47 -0500 Received: from mail-pz0-f46.google.com ([209.85.210.46]:60081 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756731Ab2CHQso (ORCPT ); Thu, 8 Mar 2012 11:48:44 -0500 Date: Fri, 9 Mar 2012 00:54:03 +0800 From: Zheng Liu To: Minchan Kim Cc: linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Control page reclaim granularity Message-ID: <20120308165403.GA10005@gmail.com> Mail-Followup-To: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120308093514.GA28856@barrios> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Minchan, Sorry, I forgot to say that I am not subscribed to the linux-mm and linux-kernel mailing lists, so please Cc me. IMHO, maybe we should rethink how users use mmap(2). Let me describe the cases I know of in our product system. They fall into two categories. One maps all data files into memory with mmap(2) and sometimes uses write(2) to append data; the other uses mmap(2)/munmap(2) together with read(2)/write(2) to manipulate the files. In the second case, the application wants to keep the mmaped pages in memory and let the file pages be reclaimed first. So, IMO, when an application uses mmap(2) to manipulate files, we can infer that it wants these mmaped pages kept in memory rather than reclaimed. At the least, these pages should not be reclaimed earlier than file pages.
I think that maybe we can recover that routine and provide a sysctl parameter to let the user to set this ratio between mmaped pages and file pages. Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754545Ab2CLA2R (ORCPT ); Sun, 11 Mar 2012 20:28:17 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:37973 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754444Ab2CLA2N (ORCPT ); Sun, 11 Mar 2012 20:28:13 -0400 Date: Mon, 12 Mar 2012 09:28:06 +0900 From: Minchan Kim To: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Control page reclaim granularity Message-ID: <20120312002806.GA2436@barrios> References: <20120308093514.GA28856@barrios> <20120308165403.GA10005@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120308165403.GA10005@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > Hi Minchan, > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel > mailing list. So please Cc me. > > IMHO, maybe we should re-think about how does user use mmap(2). I > describe the cases I known in our product system. They can be > categorized into two cases. One is mmaped all data files into memory > and sometime it uses write(2) to append some data, and another uses > mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In the > second case, the application wants to keep mmaped page into memory and > let file pages to be reclaimed firstly. So, IMO, when application uses > mmap(2) to manipulate files, it is possible to imply that it wants keep > these mmaped pages into memory and do not be reclaimed. 
At least these > pages do not be reclaimed early than file pages. I think that maybe we > can recover that routine and provide a sysctl parameter to let the user > to set this ratio between mmaped pages and file pages. I am not convinced that we should handle mapped pages specially. Sometimes, someone may use mmap just to avoid the buffer copy that the read system call requires. So I don't think we can assume that mmaped pages should always win. My suggestion is that it would be better for the user to declare the intent explicitly. I think we can implement it with the WILLNEED option of madvise and fadvise. The current implementation just does readahead if a page isn't in memory, but I think we could also promote a page from the inactive list to the active list if it is already in memory. That is clearer, and it wouldn't be affected by changes to the kernel's page reclaim algorithm like this one. > > Regards, > Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754599Ab2CLCAv (ORCPT ); Sun, 11 Mar 2012 22:00:51 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:65326 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754610Ab2CLCAo (ORCPT ); Sun, 11 Mar 2012 22:00:44 -0400 Date: Mon, 12 Mar 2012 10:06:09 +0800 From: Zheng Liu To: Minchan Kim Cc: linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312020609.GA9610@gmail.com> Mail-Followup-To: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312002806.GA2436@barrios> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > I forgot to Ccing you.
> Sorry. > > ---------- Forwarded message ---------- > From: Minchan Kim > Date: Mon, Mar 12, 2012 at 9:28 AM > Subject: Re: Control page reclaim granularity > To: Minchan Kim , linux-mm , > linux-kernel , Konstantin Khlebnikov < > khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > > > On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > > Hi Minchan, > > > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel > > mailing list. So please Cc me. > > > > IMHO, maybe we should re-think about how does user use mmap(2). I > > describe the cases I known in our product system. They can be > > categorized into two cases. One is mmaped all data files into memory > > and sometime it uses write(2) to append some data, and another uses > > mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In the > > second case, the application wants to keep mmaped page into memory and > > let file pages to be reclaimed firstly. So, IMO, when application uses > > mmap(2) to manipulate files, it is possible to imply that it wants keep > > these mmaped pages into memory and do not be reclaimed. At least these > > pages do not be reclaimed early than file pages. I think that maybe we > > can recover that routine and provide a sysctl parameter to let the user > > to set this ratio between mmaped pages and file pages. > > I am not convinced why we should handle mapped page specially. > Sometimem, someone may use mmap by reducing buffer copy compared to read > system call. > So I think we can't make sure mmaped pages are always win. > > My suggestion is that it would be better to declare by user explicitly. > I think we can implement it by madvise and fadvise's WILLNEED option. > Current implementation is just readahead if there isn't a page in memory > but I think > we can promote from inactive to active if there is already a page in > memory. 
> > It's more clear and it couldn't be affected by kernel page reclaim > algorithm change > like this. Thank you for your advice. But I still have a question about this solution. If we extend the WILLNEED option of madvise(2) and fadvise(2) this way, it will leave the pages handled by madvise(2) and/or fadvise(2) in an inconsistent state. For example, when I call madvise with the WILLNEED flag, pages that are already in memory will be moved to the active list, while pages that are not in memory will be read in and placed on the inactive list. The pages on the inactive list can then be reclaimed. So from the user's point of view it is inconsistent, because some pages stay in memory and some are reclaimed, when the user actually hopes that all of the pages can be kept in memory. IMHO, this inconsistency is weird and will puzzle users. Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752495Ab2CLFTT (ORCPT ); Mon, 12 Mar 2012 01:19:19 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:55469 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751573Ab2CLFTM (ORCPT ); Mon, 12 Mar 2012 01:19:12 -0400 Date: Mon, 12 Mar 2012 14:19:04 +0900 From: Minchan Kim To: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312051904.GA3831@barrios> References: <20120312002806.GA2436@barrios> <20120312020609.GA9610@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312020609.GA9610@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > On Mon, Mar 12, 2012 at 09:29:34AM
+0900, Minchan Kim wrote: > > I forgot to Ccing you. > > Sorry. > > > > ---------- Forwarded message ---------- > > From: Minchan Kim > > Date: Mon, Mar 12, 2012 at 9:28 AM > > Subject: Re: Control page reclaim granularity > > To: Minchan Kim , linux-mm , > > linux-kernel , Konstantin Khlebnikov < > > khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > > > > > > On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > > > Hi Minchan, > > > > > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel > > > mailing list. So please Cc me. > > > > > > IMHO, maybe we should re-think about how does user use mmap(2). I > > > describe the cases I known in our product system. They can be > > > categorized into two cases. One is mmaped all data files into memory > > > and sometime it uses write(2) to append some data, and another uses > > > mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In the > > > second case, the application wants to keep mmaped page into memory and > > > let file pages to be reclaimed firstly. So, IMO, when application uses > > > mmap(2) to manipulate files, it is possible to imply that it wants keep > > > these mmaped pages into memory and do not be reclaimed. At least these > > > pages do not be reclaimed early than file pages. I think that maybe we > > > can recover that routine and provide a sysctl parameter to let the user > > > to set this ratio between mmaped pages and file pages. > > > > I am not convinced why we should handle mapped page specially. > > Sometimem, someone may use mmap by reducing buffer copy compared to read > > system call. > > So I think we can't make sure mmaped pages are always win. > > > > My suggestion is that it would be better to declare by user explicitly. > > I think we can implement it by madvise and fadvise's WILLNEED option. 
> > Current implementation is just readahead if there isn't a page in memory > > but I think > > we can promote from inactive to active if there is already a page in > > memory. > > > > It's more clear and it couldn't be affected by kernel page reclaim > > algorithm change > > like this. > > Thank you for your advice. But I still have question about this > solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > option, it will cause an inconsistently status for pages that be > manipulated by madvise(2) and/or fadvise(2). For example, when I call > madvise with WILLNEED flag, some pages will be moved into active list if > they already have been in memory, and other pages will be read into > memory and be saved in inactive list if they don't be in memory. Then > pages that are in inactive list are possible to be reclaim. So from the > view of users, it is inconsistent because some pages are in memory and > some pages are reclaimed. But actually the user hopes that all of pages > can be kept in memory. IMHO, this inconsistency is weird and makes users > puzzled. The problem now is that: 1. The user wants to keep in memory pages that are used only once in a while. 2. The kernel wants to reclaim them because, from the LRU's point of view, they are clearly reclaim targets. The most desirable approach is for the user to use mlock to guarantee that they stay in memory. But mlock has too much overhead, and the user doesn't want to bring all the pages into memory at once (i.e., he wants demand paging when he needs a page). Right? madvise is just a hint to the kernel, and the kernel doesn't have to guarantee madvise's behavior. From that point of view, such inconsistency might not be a big problem. The bigger problem, I think, is that the user would have to call madvise(WILLNEED) periodically, because the activation happens only once, when the user calls madvise. If the user doesn't touch the page frequently after the call, it ends up moving to the inactive list and could even be reclaimed. That's not good. :-( Okay. How about adding a new VM_WORKINGSET flag?
The reclaimer would then give such a page one more round trip through the active/inactive lists when reclaim happens, if the page has been referenced. Sigh. We have no room for a new VM_ flag on 32-bit. > > Regards, > Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753159Ab2CLGUq (ORCPT ); Mon, 12 Mar 2012 02:20:46 -0400 Received: from mail-bk0-f46.google.com ([209.85.214.46]:38358 "EHLO mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751246Ab2CLGUj (ORCPT ); Mon, 12 Mar 2012 02:20:39 -0400 Message-ID: <4F5D95AF.1020108@openvz.org> Date: Mon, 12 Mar 2012 10:20:31 +0400 From: Konstantin Khlebnikov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120217 Firefox/10.0.2 Iceape/2.7.2 MIME-Version: 1.0 To: Minchan Kim CC: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" Subject: Re: Fwd: Control page reclaim granularity References: <20120312002806.GA2436@barrios> <20120312020609.GA9610@gmail.com> <20120312051904.GA3831@barrios> In-Reply-To: <20120312051904.GA3831@barrios> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Minchan Kim wrote: > On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: >>> I forgot to Ccing you. >>> Sorry. >>> >>> ---------- Forwarded message ---------- >>> From: Minchan Kim >>> Date: Mon, Mar 12, 2012 at 9:28 AM >>> Subject: Re: Control page reclaim granularity >>> To: Minchan Kim, linux-mm, >>> linux-kernel, Konstantin Khlebnikov< >>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com >>> >>> >>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: >>>> Hi Minchan, >>>> >>>> Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel >>>> mailing list. So please Cc me.
>>>> >>>> IMHO, maybe we should re-think about how does user use mmap(2). I >>>> describe the cases I known in our product system. They can be >>>> categorized into two cases. One is mmaped all data files into memory >>>> and sometime it uses write(2) to append some data, and another uses >>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In the >>>> second case, the application wants to keep mmaped page into memory and >>>> let file pages to be reclaimed firstly. So, IMO, when application uses >>>> mmap(2) to manipulate files, it is possible to imply that it wants keep >>>> these mmaped pages into memory and do not be reclaimed. At least these >>>> pages do not be reclaimed early than file pages. I think that maybe we >>>> can recover that routine and provide a sysctl parameter to let the user >>>> to set this ratio between mmaped pages and file pages. >>> >>> I am not convinced why we should handle mapped page specially. >>> Sometimem, someone may use mmap by reducing buffer copy compared to read >>> system call. >>> So I think we can't make sure mmaped pages are always win. >>> >>> My suggestion is that it would be better to declare by user explicitly. >>> I think we can implement it by madvise and fadvise's WILLNEED option. >>> Current implementation is just readahead if there isn't a page in memory >>> but I think >>> we can promote from inactive to active if there is already a page in >>> memory. >>> >>> It's more clear and it couldn't be affected by kernel page reclaim >>> algorithm change >>> like this. >> >> Thank you for your advice. But I still have question about this >> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED >> option, it will cause an inconsistently status for pages that be >> manipulated by madvise(2) and/or fadvise(2). 
For example, when I call >> madvise with WILLNEED flag, some pages will be moved into active list if >> they already have been in memory, and other pages will be read into >> memory and be saved in inactive list if they don't be in memory. Then >> pages that are in inactive list are possible to be reclaim. So from the >> view of users, it is inconsistent because some pages are in memory and >> some pages are reclaimed. But actually the user hopes that all of pages >> can be kept in memory. IMHO, this inconsistency is weird and makes users >> puzzled. > > Now problem is that > > 1. User want to keep pages which are used once in a while in memory. > 2. Kernel want to reclaim them because they are surely reclaim target > pages in point of view by LRU. > > The most desriable approach is that user should use mlock to guarantee > them in memory. But mlock is too big overhead and user doesn't want to keep > memory all pages all at once.(Ie, he want demand paging when he need the page) > Right? > > madvise, it's a just hint for kernel and kernel doesn't need to make sure madvise's behavior. > In point of view, such inconsistency might not be a big problem. > > Big problem I think now is that user should use madvise(WILLNEED) periodically because such > activation happens once when user calls madvise. If user doesn't use page frequently after > user calls it, it ends up moving into inactive list and even could be reclaimed. > It's not good. :-( > > Okay. How about adding new VM_WORKINGSET? > And reclaimer would give one more round trip in active/inactive list when reclaim happens > if the page is referenced. > > Sigh. We have no room for new VM_FLAG in 32 bit. It would be nice to mark struct address_space with this flag and export AS_UNEVICTABLE somehow. 
Maybe we can reuse file-locking engine for managing these bits =) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754747Ab2CLII5 (ORCPT ); Mon, 12 Mar 2012 04:08:57 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:51422 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753080Ab2CLIIr (ORCPT ); Mon, 12 Mar 2012 04:08:47 -0400 Date: Mon, 12 Mar 2012 16:14:14 +0800 From: Zheng Liu To: Konstantin Khlebnikov Cc: linux-mm , linux-kernel , Minchan Kim , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312081413.GA10923@gmail.com> Mail-Followup-To: Konstantin Khlebnikov , linux-mm , linux-kernel , Minchan Kim , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F5D95AF.1020108@openvz.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > Minchan Kim wrote: >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: >>>> I forgot to Ccing you. >>>> Sorry. >>>> >>>> ---------- Forwarded message ---------- >>>> From: Minchan Kim >>>> Date: Mon, Mar 12, 2012 at 9:28 AM >>>> Subject: Re: Control page reclaim granularity >>>> To: Minchan Kim, linux-mm, >>>> linux-kernel, Konstantin Khlebnikov< >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com >>>> >>>> >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: >>>>> Hi Minchan, >>>>> >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and >>>>> linux-kernel >>>>> mailing list. So please Cc me. >>>>> >>>>> IMHO, maybe we should re-think about how does user use mmap(2). 
I >>>>> describe the cases I known in our product system. They can be >>>>> categorized into two cases. One is mmaped all data files into memory >>>>> and sometime it uses write(2) to append some data, and another uses >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In >>>>> the >>>>> second case, the application wants to keep mmaped page into memory >>>>> and >>>>> let file pages to be reclaimed firstly. So, IMO, when application >>>>> uses >>>>> mmap(2) to manipulate files, it is possible to imply that it wants >>>>> keep >>>>> these mmaped pages into memory and do not be reclaimed. At least >>>>> these >>>>> pages do not be reclaimed early than file pages. I think that >>>>> maybe we >>>>> can recover that routine and provide a sysctl parameter to let the >>>>> user >>>>> to set this ratio between mmaped pages and file pages. >>>> >>>> I am not convinced why we should handle mapped page specially. >>>> Sometimem, someone may use mmap by reducing buffer copy compared to >>>> read >>>> system call. >>>> So I think we can't make sure mmaped pages are always win. >>>> >>>> My suggestion is that it would be better to declare by user explicitly. >>>> I think we can implement it by madvise and fadvise's WILLNEED option. >>>> Current implementation is just readahead if there isn't a page in >>>> memory >>>> but I think >>>> we can promote from inactive to active if there is already a page in >>>> memory. >>>> >>>> It's more clear and it couldn't be affected by kernel page reclaim >>>> algorithm change >>>> like this. >>> >>> Thank you for your advice. But I still have question about this >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED >>> option, it will cause an inconsistently status for pages that be >>> manipulated by madvise(2) and/or fadvise(2). 
For example, when I call >>> madvise with WILLNEED flag, some pages will be moved into active list if >>> they already have been in memory, and other pages will be read into >>> memory and be saved in inactive list if they don't be in memory. Then >>> pages that are in inactive list are possible to be reclaim. So from the >>> view of users, it is inconsistent because some pages are in memory and >>> some pages are reclaimed. But actually the user hopes that all of pages >>> can be kept in memory. IMHO, this inconsistency is weird and makes >>> users >>> puzzled. >> >> Now problem is that >> >> 1. User want to keep pages which are used once in a while in memory. >> 2. Kernel want to reclaim them because they are surely reclaim target >> pages in point of view by LRU. >> >> The most desriable approach is that user should use mlock to guarantee >> them in memory. But mlock is too big overhead and user doesn't want to >> keep >> memory all pages all at once.(Ie, he want demand paging when he need >> the page) >> Right? >> >> madvise, it's a just hint for kernel and kernel doesn't need to make >> sure madvise's behavior. >> In point of view, such inconsistency might not be a big problem. >> >> Big problem I think now is that user should use madvise(WILLNEED) >> periodically because such >> activation happens once when user calls madvise. If user doesn't use >> page frequently after >> user calls it, it ends up moving into inactive list and even could be >> reclaimed. >> It's not good. :-( >> >> Okay. How about adding new VM_WORKINGSET? >> And reclaimer would give one more round trip in active/inactive list >> when reclaim happens >> if the page is referenced. >> >> Sigh. We have no room for new VM_FLAG in 32 bit. > > It would be nice to mark struct address_space with this flag and export > AS_UNEVICTABLE somehow. > Maybe we can reuse file-locking engine for managing these bits =) Make sense to me. 
We can mark this flag in struct address_space and check it in page_referenced_file(). If the flag is set, it is cleared and the function returns referenced > 1, so the page can be promoted to the active list. But I would prefer to set/clear this flag in madvise. PS, I have subscribed to the linux-mm mailing list. :-) Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755538Ab2CLNmi (ORCPT ); Mon, 12 Mar 2012 09:42:38 -0400 Received: from mail-gx0-f174.google.com ([209.85.161.174]:46025 "EHLO mail-gx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752604Ab2CLNmf (ORCPT ); Mon, 12 Mar 2012 09:42:35 -0400 Date: Mon, 12 Mar 2012 22:42:26 +0900 From: Minchan Kim To: Konstantin Khlebnikov , linux-mm , linux-kernel , Minchan Kim , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312134226.GA5120@barrios> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312081413.GA10923@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > > Minchan Kim wrote: > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > >>>> I forgot to Ccing you. > >>>> Sorry.
> >>>> > >>>> ---------- Forwarded message ---------- > >>>> From: Minchan Kim > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM > >>>> Subject: Re: Control page reclaim granularity > >>>> To: Minchan Kim, linux-mm, > >>>> linux-kernel, Konstantin Khlebnikov< > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > >>>> > >>>> > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > >>>>> Hi Minchan, > >>>>> > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and > >>>>> linux-kernel > >>>>> mailing list. So please Cc me. > >>>>> > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I > >>>>> describe the cases I known in our product system. They can be > >>>>> categorized into two cases. One is mmaped all data files into memory > >>>>> and sometime it uses write(2) to append some data, and another uses > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > >>>>> the > >>>>> second case, the application wants to keep mmaped page into memory > >>>>> and > >>>>> let file pages to be reclaimed firstly. So, IMO, when application > >>>>> uses > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants > >>>>> keep > >>>>> these mmaped pages into memory and do not be reclaimed. At least > >>>>> these > >>>>> pages do not be reclaimed early than file pages. I think that > >>>>> maybe we > >>>>> can recover that routine and provide a sysctl parameter to let the > >>>>> user > >>>>> to set this ratio between mmaped pages and file pages. > >>>> > >>>> I am not convinced why we should handle mapped page specially. > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to > >>>> read > >>>> system call. > >>>> So I think we can't make sure mmaped pages are always win. > >>>> > >>>> My suggestion is that it would be better to declare by user explicitly. > >>>> I think we can implement it by madvise and fadvise's WILLNEED option. 
> >>>> Current implementation is just readahead if there isn't a page in > >>>> memory > >>>> but I think > >>>> we can promote from inactive to active if there is already a page in > >>>> memory. > >>>> > >>>> It's more clear and it couldn't be affected by kernel page reclaim > >>>> algorithm change > >>>> like this. > >>> > >>> Thank you for your advice. But I still have question about this > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > >>> option, it will cause an inconsistently status for pages that be > >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call > >>> madvise with WILLNEED flag, some pages will be moved into active list if > >>> they already have been in memory, and other pages will be read into > >>> memory and be saved in inactive list if they don't be in memory. Then > >>> pages that are in inactive list are possible to be reclaim. So from the > >>> view of users, it is inconsistent because some pages are in memory and > >>> some pages are reclaimed. But actually the user hopes that all of pages > >>> can be kept in memory. IMHO, this inconsistency is weird and makes > >>> users > >>> puzzled. > >> > >> Now problem is that > >> > >> 1. User want to keep pages which are used once in a while in memory. > >> 2. Kernel want to reclaim them because they are surely reclaim target > >> pages in point of view by LRU. > >> > >> The most desriable approach is that user should use mlock to guarantee > >> them in memory. But mlock is too big overhead and user doesn't want to > >> keep > >> memory all pages all at once.(Ie, he want demand paging when he need > >> the page) > >> Right? > >> > >> madvise, it's a just hint for kernel and kernel doesn't need to make > >> sure madvise's behavior. > >> In point of view, such inconsistency might not be a big problem. 
> >> > >> Big problem I think now is that user should use madvise(WILLNEED) > >> periodically because such > >> activation happens once when user calls madvise. If user doesn't use > >> page frequently after > >> user calls it, it ends up moving into inactive list and even could be > >> reclaimed. > >> It's not good. :-( > >> > >> Okay. How about adding new VM_WORKINGSET? > >> And reclaimer would give one more round trip in active/inactive list > >> when reclaim happens > >> if the page is referenced. > >> > >> Sigh. We have no room for new VM_FLAG in 32 bit. > > > > It would be nice to mark struct address_space with this flag and export > > AS_UNEVICTABLE somehow. > > Maybe we can reuse file-locking engine for managing these bits =) > > Make sense to me. We can mark this flag in struct address_space and check > it in page_refereneced_file(). If this flag is set, it will be cleard and The disadvantage is that this lets us set reclaim granularity only per inode. I want to set it per VMA, not per inode. > the function returns referenced > 1. Then this page can be promoted into > activate list. But I prefer to set/clear this flag in madvise. Hmm, my idea is as follows. If we can set a new VM flag on the VMA or something similar, the reclaimer can check it in shrink_[in]active_list and avoid deactivating/reclaiming the page if it sees that the page is in a VMA marked with the new flag and the page has been referenced at least once recently. That means the page gets one more round trip in its current list (i.e., active or inactive) rather than being activated, so it becomes less reclaimable. > > PS, I have subscribed linux-mm mailing list. :-) Congratulations!
:) > > Regards, > Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755685Ab2CLOzk (ORCPT ); Mon, 12 Mar 2012 10:55:40 -0400 Received: from mx1.redhat.com ([209.132.183.28]:30419 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755282Ab2CLOzh (ORCPT ); Mon, 12 Mar 2012 10:55:37 -0400 Message-ID: <4F5E0E5C.8040508@redhat.com> Date: Mon, 12 Mar 2012 10:55:24 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20120216 Thunderbird/10.0.1 MIME-Version: 1.0 To: Minchan Kim CC: linux-mm , linux-kernel , Konstantin Khlebnikov , kosaki.motohiro@jp.fujitsu.com Subject: Re: Control page reclaim granularity References: <20120308073412.GA6975@gmail.com> <20120308093514.GA28856@barrios> In-Reply-To: <20120308093514.GA28856@barrios> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/08/2012 04:35 AM, Minchan Kim wrote: > On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote: >> Hi list, >> >> Recently we encounter a problem about page reclaim. I abstract it in here. >> The problem is that there are two different file types. One is small index >> file, and another is large data file. The index file is mmaped into memory, >> and application hope that they can be kept in memory and don't be reclaimed >> too frequently. The data file is manipulted by read/write, and they should >> be reclaimed more frequently than the index file. They should indeed be. The data pages should not get promoted to the active list unless they get referenced twice while on the inactive list. Mmaped pages, on the other hand, get promoted to the active list after just one reference. Also, as long as the inactive file list is larger than the active file list, we do not reclaim active file pages at all. 
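To make the promotion rules above concrete, here is a rough userspace sketch. It is only a toy model under stated assumptions: every name in it (toy_page, toy_check_references, toy_should_scan_active_file) is made up for illustration and none of it is kernel code. It just encodes the three statements above: file pages need two references on the inactive list, mmaped pages need one, and the active file list is left alone while the inactive file list is larger.

```c
#include <stdbool.h>

/*
 * Toy userspace model of the promotion rules described above.
 * All names here are hypothetical; this is not kernel code.
 */
enum toy_lru_decision {
	TOY_KEEP_INACTIVE,	/* stay on the inactive list for now */
	TOY_ACTIVATE,		/* move to the active list */
};

struct toy_page {
	bool mapped;		/* page is mmaped by some process */
	unsigned int refs;	/* references seen while on the inactive list */
};

/* Mapped pages are promoted after one reference, other file pages after two. */
static enum toy_lru_decision toy_check_references(const struct toy_page *page)
{
	unsigned int needed = page->mapped ? 1 : 2;

	return page->refs >= needed ? TOY_ACTIVATE : TOY_KEEP_INACTIVE;
}

/*
 * Active file pages are left alone as long as the inactive file list
 * is larger than the active file list.
 */
static bool toy_should_scan_active_file(unsigned long nr_inactive,
					unsigned long nr_active)
{
	return nr_inactive <= nr_active;
}
```

In this model a page of the mmaped index file is activated on its first reference, while a page of the read/write data file stays on the inactive list until it is referenced a second time.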
> I think it's a regression since 2.6.28. > Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped). > But we removed that routine when we applied split lru page replacement. > Rik, KOSAKI. What's the rationale? One main reason is scalability. We have to treat pages in such a way that we do not have to search through gigabytes of memory to find a few eviction candidates to place on the inactive list - where they could get reused and stopped from eviction again. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755895Ab2CLPKe (ORCPT ); Mon, 12 Mar 2012 11:10:34 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:61275 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755737Ab2CLPKQ (ORCPT ); Mon, 12 Mar 2012 11:10:16 -0400 Date: Mon, 12 Mar 2012 23:15:43 +0800 From: Zheng Liu To: Minchan Kim Cc: Konstantin Khlebnikov , linux-mm , linux-kernel , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120312151542.GA16253@gmail.com> Mail-Followup-To: Minchan Kim , Konstantin Khlebnikov , linux-mm , linux-kernel , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312134226.GA5120@barrios> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote: > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > > > Minchan Kim wrote: > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > > >>> On Mon, Mar 12, 2012 at 09:29:34AM 
+0900, Minchan Kim wrote: > > >>>> I forgot to Ccing you. > > >>>> Sorry. > > >>>> > > >>>> ---------- Forwarded message ---------- > > >>>> From: Minchan Kim > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM > > >>>> Subject: Re: Control page reclaim granularity > > >>>> To: Minchan Kim, linux-mm, > > >>>> linux-kernel, Konstantin Khlebnikov< > > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > > >>>> > > >>>> > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > > >>>>> Hi Minchan, > > >>>>> > > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and > > >>>>> linux-kernel > > >>>>> mailing list. So please Cc me. > > >>>>> > > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I > > >>>>> describe the cases I known in our product system. They can be > > >>>>> categorized into two cases. One is mmaped all data files into memory > > >>>>> and sometime it uses write(2) to append some data, and another uses > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > > >>>>> the > > >>>>> second case, the application wants to keep mmaped page into memory > > >>>>> and > > >>>>> let file pages to be reclaimed firstly. So, IMO, when application > > >>>>> uses > > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants > > >>>>> keep > > >>>>> these mmaped pages into memory and do not be reclaimed. At least > > >>>>> these > > >>>>> pages do not be reclaimed early than file pages. I think that > > >>>>> maybe we > > >>>>> can recover that routine and provide a sysctl parameter to let the > > >>>>> user > > >>>>> to set this ratio between mmaped pages and file pages. > > >>>> > > >>>> I am not convinced why we should handle mapped page specially. > > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to > > >>>> read > > >>>> system call. > > >>>> So I think we can't make sure mmaped pages are always win. 
> > >>>> > > >>>> My suggestion is that it would be better to declare by user explicitly. > > >>>> I think we can implement it by madvise and fadvise's WILLNEED option. > > >>>> Current implementation is just readahead if there isn't a page in > > >>>> memory > > >>>> but I think > > >>>> we can promote from inactive to active if there is already a page in > > >>>> memory. > > >>>> > > >>>> It's more clear and it couldn't be affected by kernel page reclaim > > >>>> algorithm change > > >>>> like this. > > >>> > > >>> Thank you for your advice. But I still have question about this > > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > > >>> option, it will cause an inconsistently status for pages that be > > >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call > > >>> madvise with WILLNEED flag, some pages will be moved into active list if > > >>> they already have been in memory, and other pages will be read into > > >>> memory and be saved in inactive list if they don't be in memory. Then > > >>> pages that are in inactive list are possible to be reclaim. So from the > > >>> view of users, it is inconsistent because some pages are in memory and > > >>> some pages are reclaimed. But actually the user hopes that all of pages > > >>> can be kept in memory. IMHO, this inconsistency is weird and makes > > >>> users > > >>> puzzled. > > >> > > >> Now problem is that > > >> > > >> 1. User want to keep pages which are used once in a while in memory. > > >> 2. Kernel want to reclaim them because they are surely reclaim target > > >> pages in point of view by LRU. > > >> > > >> The most desriable approach is that user should use mlock to guarantee > > >> them in memory. But mlock is too big overhead and user doesn't want to > > >> keep > > >> memory all pages all at once.(Ie, he want demand paging when he need > > >> the page) > > >> Right? 
> > >> > > >> madvise, it's a just hint for kernel and kernel doesn't need to make > > >> sure madvise's behavior. > > >> In point of view, such inconsistency might not be a big problem. > > >> > > >> Big problem I think now is that user should use madvise(WILLNEED) > > >> periodically because such > > >> activation happens once when user calls madvise. If user doesn't use > > >> page frequently after > > >> user calls it, it ends up moving into inactive list and even could be > > >> reclaimed. > > >> It's not good. :-( > > >> > > >> Okay. How about adding new VM_WORKINGSET? > > >> And reclaimer would give one more round trip in active/inactive list > > >> when reclaim happens > > >> if the page is referenced. > > >> > > >> Sigh. We have no room for new VM_FLAG in 32 bit. > > > > > > It would be nice to mark struct address_space with this flag and export > > > AS_UNEVICTABLE somehow. > > > Maybe we can reuse file-locking engine for managing these bits =) > > > > Make sense to me. We can mark this flag in struct address_space and check > > it in page_refereneced_file(). If this flag is set, it will be cleard and > > Disadvantage is that we could set reclaim granularity as per-inode. > I want to set it as per-vma, not per-inode. I don't think this is a disadvantage. Per-inode reclaim granularity is useful for us. Actually, I have thought about implementing a per-inode memcg so that different sets of files can be reclaimed separately. So maybe we can provide both mechanisms and let the user choose between them. > > > the function returns referenced > 1. Then this page can be promoted into > > activate list. But I prefer to set/clear this flag in madvise. > > Hmm, My idea is following as, > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which > are set by new VM flag and the page is referenced recently at least once.
> It means it gives one more round trip in his list(ie, active/inactive list) > rather than activation so that the page would become less reclaimable. Whether the page is given one more round trip or is promoted to the active list, either satisfies our current requirement. So the question now is which is better. If we add a new VM flag, as you said before, vma->vm_flags has no room for it on 32-bit. I noticed that this topic has been discussed before [1], and the outcome was that vm_flags is still an unsigned long. So we would need some trick to work around that. If we add a new flag to struct address_space instead, it should be easier to implement. 1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html Regards, Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755182Ab2CMCs2 (ORCPT ); Mon, 12 Mar 2012 22:48:28 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:52331 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754155Ab2CMCs1 (ORCPT ); Mon, 12 Mar 2012 22:48:27 -0400 Date: Tue, 13 Mar 2012 11:48:18 +0900 From: Minchan Kim To: Konstantin Khlebnikov Cc: Minchan Kim , linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313024818.GA7125@barrios> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <4F5E05AD.20200@openvz.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F5E05AD.20200@openvz.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote: > Minchan Kim wrote: > >On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > >>On 03/12/2012 02:20 PM, Konstantin
Khlebnikov wrote: > >>>Minchan Kim wrote: > >>>>On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > >>>>>On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > >>>>>>I forgot to Ccing you. > >>>>>>Sorry. > >>>>>> > >>>>>>---------- Forwarded message ---------- > >>>>>>From: Minchan Kim > >>>>>>Date: Mon, Mar 12, 2012 at 9:28 AM > >>>>>>Subject: Re: Control page reclaim granularity > >>>>>>To: Minchan Kim, linux-mm, > >>>>>>linux-kernel, Konstantin Khlebnikov< > >>>>>>khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > >>>>>> > >>>>>> > >>>>>>On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > >>>>>>>Hi Minchan, > >>>>>>> > >>>>>>>Sorry, I forgot to say that I don't subscribe linux-mm and > >>>>>>>linux-kernel > >>>>>>>mailing list. So please Cc me. > >>>>>>> > >>>>>>>IMHO, maybe we should re-think about how does user use mmap(2). I > >>>>>>>describe the cases I known in our product system. They can be > >>>>>>>categorized into two cases. One is mmaped all data files into memory > >>>>>>>and sometime it uses write(2) to append some data, and another uses > >>>>>>>mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > >>>>>>>the > >>>>>>>second case, the application wants to keep mmaped page into memory > >>>>>>>and > >>>>>>>let file pages to be reclaimed firstly. So, IMO, when application > >>>>>>>uses > >>>>>>>mmap(2) to manipulate files, it is possible to imply that it wants > >>>>>>>keep > >>>>>>>these mmaped pages into memory and do not be reclaimed. At least > >>>>>>>these > >>>>>>>pages do not be reclaimed early than file pages. I think that > >>>>>>>maybe we > >>>>>>>can recover that routine and provide a sysctl parameter to let the > >>>>>>>user > >>>>>>>to set this ratio between mmaped pages and file pages. > >>>>>> > >>>>>>I am not convinced why we should handle mapped page specially. > >>>>>>Sometimem, someone may use mmap by reducing buffer copy compared to > >>>>>>read > >>>>>>system call. 
> >>>>>>So I think we can't make sure mmaped pages are always win. > >>>>>> > >>>>>>My suggestion is that it would be better to declare by user explicitly. > >>>>>>I think we can implement it by madvise and fadvise's WILLNEED option. > >>>>>>Current implementation is just readahead if there isn't a page in > >>>>>>memory > >>>>>>but I think > >>>>>>we can promote from inactive to active if there is already a page in > >>>>>>memory. > >>>>>> > >>>>>>It's more clear and it couldn't be affected by kernel page reclaim > >>>>>>algorithm change > >>>>>>like this. > >>>>> > >>>>>Thank you for your advice. But I still have question about this > >>>>>solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > >>>>>option, it will cause an inconsistently status for pages that be > >>>>>manipulated by madvise(2) and/or fadvise(2). For example, when I call > >>>>>madvise with WILLNEED flag, some pages will be moved into active list if > >>>>>they already have been in memory, and other pages will be read into > >>>>>memory and be saved in inactive list if they don't be in memory. Then > >>>>>pages that are in inactive list are possible to be reclaim. So from the > >>>>>view of users, it is inconsistent because some pages are in memory and > >>>>>some pages are reclaimed. But actually the user hopes that all of pages > >>>>>can be kept in memory. IMHO, this inconsistency is weird and makes > >>>>>users > >>>>>puzzled. > >>>> > >>>>Now problem is that > >>>> > >>>>1. User want to keep pages which are used once in a while in memory. > >>>>2. Kernel want to reclaim them because they are surely reclaim target > >>>> pages in point of view by LRU. > >>>> > >>>>The most desriable approach is that user should use mlock to guarantee > >>>>them in memory. But mlock is too big overhead and user doesn't want to > >>>>keep > >>>>memory all pages all at once.(Ie, he want demand paging when he need > >>>>the page) > >>>>Right? 
> >>>> > >>>>madvise, it's a just hint for kernel and kernel doesn't need to make > >>>>sure madvise's behavior. > >>>>In point of view, such inconsistency might not be a big problem. > >>>> > >>>>Big problem I think now is that user should use madvise(WILLNEED) > >>>>periodically because such > >>>>activation happens once when user calls madvise. If user doesn't use > >>>>page frequently after > >>>>user calls it, it ends up moving into inactive list and even could be > >>>>reclaimed. > >>>>It's not good. :-( > >>>> > >>>>Okay. How about adding new VM_WORKINGSET? > >>>>And reclaimer would give one more round trip in active/inactive list > >>>>when reclaim happens > >>>>if the page is referenced. > >>>> > >>>>Sigh. We have no room for new VM_FLAG in 32 bit. > >>> > >>>It would be nice to mark struct address_space with this flag and export > >>>AS_UNEVICTABLE somehow. > >>>Maybe we can reuse file-locking engine for managing these bits =) > >> > >>Make sense to me. We can mark this flag in struct address_space and check > >>it in page_refereneced_file(). If this flag is set, it will be cleard and > > > >Disadvantage is that we could set reclaim granularity as per-inode. > >I want to set it as per-vma, not per-inode. > > But with per-inode flag we can tune all files, not only memory-mapped. I don't oppose a per-inode setting, but I believe we still need a per-file-range or per-mmapped-VMA setting. One file may have parts with different characteristics: some parts are working set, others are streaming. > See, attached patch. Currently I thinking about managing code, > file-locking engine really fits perfectly =) File-locking engine? Are you considering fcntl as the interface for it? What do you mean? > > > > >>the function returns referenced> 1. Then this page can be promoted into > >>activate list. But I prefer to set/clear this flag in madvise.
> > > >Hmm, My idea is following as, > >If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list > >and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which > >are set by new VM flag and the page is referenced recently at least once. > >It means it gives one more round trip in his list(ie, active/inactive list) > >rather than activation so that the page would become less reclaimable. > > > >> > >>PS, I have subscribed linux-mm mailing list. :-) > > > >Congratulations! :) > > > >> > >>Regards, > >>Zheng > > > > mm: introduce mapping AS_WORKINGSET flag > > From: Konstantin Khlebnikov > > This patch introduces new flag AS_WORKINGSET in mapping->flags. > If it set reclaimer will activates all pages for this inode after first usage.
> > Signed-off-by: Konstantin Khlebnikov > --- > include/linux/pagemap.h | 16 ++++++++++++++++ > mm/vmscan.c | 15 ++++++++++++--- > 2 files changed, 28 insertions(+), 3 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index cfaaa69..c15fc17 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -24,6 +24,7 @@ enum mapping_flags { > AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ > AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ > AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ > + AS_WORKINGSET = __GFP_BITS_SHIFT + 4, /* promote pages activation */ > }; > > static inline void mapping_set_error(struct address_space *mapping, int error) > @@ -53,6 +54,21 @@ static inline int mapping_unevictable(struct address_space *mapping) > return !!mapping; > } > > +static inline void mapping_set_workingset(struct address_space *mapping) > +{ > + set_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline void mapping_clear_workingset(struct address_space *mapping) > +{ > + clear_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline int mapping_test_workingset(struct address_space *mapping) > +{ > + return mapping && test_bit(AS_WORKINGSET, &mapping->flags); > +} > + > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > { > return (__force gfp_t)mapping->flags & __GFP_BITS_MASK; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 57b9658..5ccbe8c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -701,6 +701,7 @@ enum page_references { > }; > > static enum page_references page_check_references(struct page *page, > + struct address_space *mapping, > struct mem_cgroup_zone *mz, > struct scan_control *sc) > { > @@ -721,6 +722,13 @@ static enum page_references page_check_references(struct page *page, > if (vm_flags & VM_LOCKED) > return PAGEREF_RECLAIM; > > + /* > + * Activate workingset page if referenced at least once. 
> + */ > + if (mapping_test_workingset(mapping) && > + (referenced_ptes || referenced_page)) > + return PAGEREF_ACTIVATE; > + > if (referenced_ptes) { > if (PageAnon(page)) > return PAGEREF_ACTIVATE; > @@ -828,7 +836,9 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > } > > - references = page_check_references(page, mz, sc); > + mapping = page_mapping(page); > + > + references = page_check_references(page, mapping, mz, sc); > switch (references) { > case PAGEREF_ACTIVATE: > goto activate_locked; > @@ -848,11 +858,10 @@ static unsigned long shrink_page_list(struct list_head *page_list, > goto keep_locked; > if (!add_to_swap(page)) > goto activate_locked; > + mapping = &swapper_space; > may_enter_fs = 1; > } > > - mapping = page_mapping(page); > - > /* > * The page is mapped into the page tables of one or more > * processes. Try to unmap it here. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759438Ab2CMCvS (ORCPT ); Mon, 12 Mar 2012 22:51:18 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:39307 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754660Ab2CMCvP (ORCPT ); Mon, 12 Mar 2012 22:51:15 -0400 Date: Tue, 13 Mar 2012 11:51:08 +0900 From: Minchan Kim To: Minchan Kim , Konstantin Khlebnikov , linux-mm , linux-kernel , riel@redhat.com, kosaki.motohiro@jp.fujitsu.com Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313025108.GB7125@barrios> References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <20120312151542.GA16253@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120312151542.GA16253@gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 11:15:43PM 
+0800, Zheng Liu wrote: > On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote: > > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: > > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: > > > > Minchan Kim wrote: > > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > > > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote: > > > >>>> I forgot to Ccing you. > > > >>>> Sorry. > > > >>>> > > > >>>> ---------- Forwarded message ---------- > > > >>>> From: Minchan Kim > > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM > > > >>>> Subject: Re: Control page reclaim granularity > > > >>>> To: Minchan Kim, linux-mm, > > > >>>> linux-kernel, Konstantin Khlebnikov< > > > >>>> khlebnikov@openvz.org>, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com > > > >>>> > > > >>>> > > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote: > > > >>>>> Hi Minchan, > > > >>>>> > > > >>>>> Sorry, I forgot to say that I don't subscribe linux-mm and > > > >>>>> linux-kernel > > > >>>>> mailing list. So please Cc me. > > > >>>>> > > > >>>>> IMHO, maybe we should re-think about how does user use mmap(2). I > > > >>>>> describe the cases I known in our product system. They can be > > > >>>>> categorized into two cases. One is mmaped all data files into memory > > > >>>>> and sometime it uses write(2) to append some data, and another uses > > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In > > > >>>>> the > > > >>>>> second case, the application wants to keep mmaped page into memory > > > >>>>> and > > > >>>>> let file pages to be reclaimed firstly. So, IMO, when application > > > >>>>> uses > > > >>>>> mmap(2) to manipulate files, it is possible to imply that it wants > > > >>>>> keep > > > >>>>> these mmaped pages into memory and do not be reclaimed. At least > > > >>>>> these > > > >>>>> pages do not be reclaimed early than file pages. 
I think that > > > >>>>> maybe we > > > >>>>> can recover that routine and provide a sysctl parameter to let the > > > >>>>> user > > > >>>>> to set this ratio between mmaped pages and file pages. > > > >>>> > > > >>>> I am not convinced why we should handle mapped page specially. > > > >>>> Sometimem, someone may use mmap by reducing buffer copy compared to > > > >>>> read > > > >>>> system call. > > > >>>> So I think we can't make sure mmaped pages are always win. > > > >>>> > > > >>>> My suggestion is that it would be better to declare by user explicitly. > > > >>>> I think we can implement it by madvise and fadvise's WILLNEED option. > > > >>>> Current implementation is just readahead if there isn't a page in > > > >>>> memory > > > >>>> but I think > > > >>>> we can promote from inactive to active if there is already a page in > > > >>>> memory. > > > >>>> > > > >>>> It's more clear and it couldn't be affected by kernel page reclaim > > > >>>> algorithm change > > > >>>> like this. > > > >>> > > > >>> Thank you for your advice. But I still have question about this > > > >>> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED > > > >>> option, it will cause an inconsistently status for pages that be > > > >>> manipulated by madvise(2) and/or fadvise(2). For example, when I call > > > >>> madvise with WILLNEED flag, some pages will be moved into active list if > > > >>> they already have been in memory, and other pages will be read into > > > >>> memory and be saved in inactive list if they don't be in memory. Then > > > >>> pages that are in inactive list are possible to be reclaim. So from the > > > >>> view of users, it is inconsistent because some pages are in memory and > > > >>> some pages are reclaimed. But actually the user hopes that all of pages > > > >>> can be kept in memory. IMHO, this inconsistency is weird and makes > > > >>> users > > > >>> puzzled. > > > >> > > > >> Now problem is that > > > >> > > > >> 1. 
User want to keep pages which are used once in a while in memory. > > > >> 2. Kernel want to reclaim them because they are surely reclaim target > > > >> pages in point of view by LRU. > > > >> > > > >> The most desriable approach is that user should use mlock to guarantee > > > >> them in memory. But mlock is too big overhead and user doesn't want to > > > >> keep > > > >> memory all pages all at once.(Ie, he want demand paging when he need > > > >> the page) > > > >> Right? > > > >> > > > >> madvise, it's a just hint for kernel and kernel doesn't need to make > > > >> sure madvise's behavior. > > > >> In point of view, such inconsistency might not be a big problem. > > > >> > > > >> Big problem I think now is that user should use madvise(WILLNEED) > > > >> periodically because such > > > >> activation happens once when user calls madvise. If user doesn't use > > > >> page frequently after > > > >> user calls it, it ends up moving into inactive list and even could be > > > >> reclaimed. > > > >> It's not good. :-( > > > >> > > > >> Okay. How about adding new VM_WORKINGSET? > > > >> And reclaimer would give one more round trip in active/inactive list > > > >> erwhen reclaim happens > > > >> if the page is referenced. > > > >> > > > >> Sigh. We have no room for new VM_FLAG in 32 bit. > > > > p > > > > It would be nice to mark struct address_space with this flag and export > > > > AS_UNEVICTABLE somehow. > > > > Maybe we can reuse file-locking engine for managing these bits =) > > > > > > Make sense to me. We can mark this flag in struct address_space and check > > > it in page_refereneced_file(). If this flag is set, it will be cleard and > > > > Disadvantage is that we could set reclaim granularity as per-inode. > > I want to set it as per-vma, not per-inode. > > I don't think this is a disadvantage. This per-inode reclaim > granularity is useful for us. Actually I have thought to implement a > per-inode memcg to let different file sets to be reclaimed separately. 
> So maybe we can provide two mechanisms to let the user to choose how to
> use them.

I don't oppose supporting both mechanisms, but I don't want to offer only
the per-inode approach.

>
> >
> > > the function returns referenced > 1. Then this page can be promoted into
> > > activate list. But I prefer to set/clear this flag in madvise.
> >
> > Hmm, My idea is following as,
> > If we can set new VM flag into VMA or something, reclaimer can check it when shrink_[in]active_list
> > and he can prevent to deactivate/reclaim if he takes a look the page is in VMA which
> > are set by new VM flag and the page is referenced recently at least once.
> > It means it gives one more round trip in his list(ie, active/inactive list)
> > rather than activation so that the page would become less reclaimable.
>
> No matter what the page is given one more round trip or is promoted into
> active list, it can satisfy our current requirement. So now the
> question is which is better. If we add a new VM flag, as you said
> before, vma->vm_flags has no room for it in 32 bit. I have noticed that
> this topic has been discussed [1] and the result is that vm_flags is
> still a unsigned long type. So we need to use a tricky technique to solve
> it. If we add a new flag in struct address_space, it might be easy to
> implement it.

The per-inode approach is good, but it doesn't work per-vma or per file
range.

>
> 1.
http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html > > Regards, > Zheng From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759412Ab2CMC6G (ORCPT ); Mon, 12 Mar 2012 22:58:06 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:44882 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754541Ab2CMC6E (ORCPT ); Mon, 12 Mar 2012 22:58:04 -0400 Date: Tue, 13 Mar 2012 11:57:57 +0900 From: Minchan Kim To: Rik van Riel Cc: Minchan Kim , linux-mm , linux-kernel , Konstantin Khlebnikov , kosaki.motohiro@jp.fujitsu.com Subject: Re: Control page reclaim granularity Message-ID: <20120313025756.GC7125@barrios> References: <20120308073412.GA6975@gmail.com> <20120308093514.GA28856@barrios> <4F5E0E5C.8040508@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4F5E0E5C.8040508@redhat.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 12, 2012 at 10:55:24AM -0400, Rik van Riel wrote: > On 03/08/2012 04:35 AM, Minchan Kim wrote: > >On Thu, Mar 08, 2012 at 03:34:13PM +0800, Zheng Liu wrote: > >>Hi list, > >> > >>Recently we encounter a problem about page reclaim. I abstract it in here. > >>The problem is that there are two different file types. One is small index > >>file, and another is large data file. The index file is mmaped into memory, > >>and application hope that they can be kept in memory and don't be reclaimed > >>too frequently. The data file is manipulted by read/write, and they should > >>be reclaimed more frequently than the index file. > > They should indeed be. The data pages should not get promoted > to the active list unless they get referenced twice while on > the inactive list. 
>
> Mmaped pages, on the other hand, get promoted to the active
> list after just one reference.

As I read the code, a mmapped page doesn't get promoted by a single
reference. It gets promoted on its second round trip, or on the first round
trip if it was referenced through more than one mapping.

	if (referenced_page || referenced_ptes > 1)
		return PAGEREF_ACTIVATE;

>
> Also, as long as the inactive file list is larger than the
> active file list, we do not reclaim active file pages at
> all.

True.

>
> >I think it's a regression since 2.6.28.
> >Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped).
> >But we removed that routine when we applied split lru page replacement.
> >Rik, KOSAKI. What's the rationale?
>
> One main reason is scalability. We have to treat pages
> in such a way that we do not have to search through
> gigabytes of memory to find a few eviction candidates
> to place on the inactive list - where they could get
> reused and stopped from eviction again.

Okay. Thanks, Rik.
Then, another question.
Why did we handle mmapped pages specially at that time?
Just out of curiosity.
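To make the "second round trip" point concrete, here is a small user-space
model of the check quoted above. It is only a sketch: struct model_page and
the two helpers are hypothetical names, not kernel code, and the model
assumes (from my reading of mm/vmscan.c of this era, not from anything in
this thread) that a PTE reference sets PG_referenced so the next pass can see
it. Lumpy reclaim, VM_LOCKED, VM_EXEC and the clean/dirty distinction are
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcomes, named after the kernel's enum page_references. */
enum page_references {
	PAGEREF_RECLAIM,
	PAGEREF_KEEP,
	PAGEREF_ACTIVATE,
};

/* Hypothetical stand-in for struct page: only the bits this model needs. */
struct model_page {
	bool anon;		/* models PageAnon() */
	bool pg_referenced;	/* models the PG_referenced bit */
	int young_ptes;		/* PTEs whose accessed bit is currently set */
};

/* One reclaim pass over a page sitting on the inactive list. */
static enum page_references model_check_references(struct model_page *page)
{
	/* Models page_referenced(): harvest and clear the accessed bits. */
	int referenced_ptes = page->young_ptes;
	/* Models TestClearPageReferenced(). */
	bool referenced_page = page->pg_referenced;

	page->young_ptes = 0;
	page->pg_referenced = false;

	if (referenced_ptes) {
		if (page->anon)
			return PAGEREF_ACTIVATE;
		/* Assumption: remember this reference for the next pass. */
		page->pg_referenced = true;
		if (referenced_page || referenced_ptes > 1)
			return PAGEREF_ACTIVATE;
		return PAGEREF_KEEP;
	}
	return PAGEREF_RECLAIM;
}

/* Touch the page through `mappings` PTEs, then run one reclaim pass. */
static enum page_references touch_and_scan(struct model_page *page,
					   int mappings)
{
	assert(mappings >= 0);
	page->young_ptes = mappings;
	return model_check_references(page);
}
```

In this model a file page touched through a single mapping is kept on the
first pass and activated only if it is touched again before the second pass,
while a page touched through two mappings is activated immediately.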
> > -- > All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754506Ab2CMEh2 (ORCPT ); Tue, 13 Mar 2012 00:37:28 -0400 Received: from mail-bk0-f46.google.com ([209.85.214.46]:62187 "EHLO mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753833Ab2CMEh0 (ORCPT ); Tue, 13 Mar 2012 00:37:26 -0400 Message-ID: <4F5ECF01.2000402@openvz.org> Date: Tue, 13 Mar 2012 08:37:21 +0400 From: Konstantin Khlebnikov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120217 Firefox/10.0.2 Iceape/2.7.2 MIME-Version: 1.0 To: Minchan Kim CC: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com" Subject: Re: Fwd: Control page reclaim granularity References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <4F5E05AD.20200@openvz.org> <20120313024818.GA7125@barrios> In-Reply-To: <20120313024818.GA7125@barrios> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Minchan Kim wrote: > On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote: >> Minchan Kim wrote: >>> On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: >>>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: >>>>> Minchan Kim wrote: >>>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: >>>>>> >>>>>> Now problem is that >>>>>> >>>>>> 1. User want to keep pages which are used once in a while in memory. >>>>>> 2. Kernel want to reclaim them because they are surely reclaim target >>>>>> pages in point of view by LRU. >>>>>> >>>>>> The most desriable approach is that user should use mlock to guarantee >>>>>> them in memory. 
But mlock is too big overhead and user doesn't want to >>>>>> keep >>>>>> memory all pages all at once.(Ie, he want demand paging when he need >>>>>> the page) >>>>>> Right? >>>>>> >>>>>> madvise, it's a just hint for kernel and kernel doesn't need to make >>>>>> sure madvise's behavior. >>>>>> In point of view, such inconsistency might not be a big problem. >>>>>> >>>>>> Big problem I think now is that user should use madvise(WILLNEED) >>>>>> periodically because such >>>>>> activation happens once when user calls madvise. If user doesn't use >>>>>> page frequently after >>>>>> user calls it, it ends up moving into inactive list and even could be >>>>>> reclaimed. >>>>>> It's not good. :-( >>>>>> >>>>>> Okay. How about adding new VM_WORKINGSET? >>>>>> And reclaimer would give one more round trip in active/inactive list >>>>>> erwhen reclaim happens >>>>>> if the page is referenced. >>>>>> >>>>>> Sigh. We have no room for new VM_FLAG in 32 bit. >>>>> p >>>>> It would be nice to mark struct address_space with this flag and export >>>>> AS_UNEVICTABLE somehow. >>>>> Maybe we can reuse file-locking engine for managing these bits =) >>>> >>>> Make sense to me. We can mark this flag in struct address_space and check >>>> it in page_refereneced_file(). If this flag is set, it will be cleard and >>> >>> Disadvantage is that we could set reclaim granularity as per-inode. >>> I want to set it as per-vma, not per-inode. >> >> But with per-inode flag we can tune all files, not only memory-mapped. > > I don't oppose per-inode setting but I believe we need file range or mmapped vma, > still. One file may have different characteristic part, something is working set > something is streaming part. > >> See, attached patch. Currently I thinking about managing code, >> file-locking engine really fits perfectly =) > > file-locking engine? > You consider fcntl as interface for it? > What do you mean? 
>

If we set bits on the inode, we have to account for its users somehow and
clear AS_WORKINGSET and AS_UNEVICTABLE at the last file close. We can use the
file-locking engine for locking inodes in memory -- a file lock automatically
releases the inode at the last fput(). Maybe that is too tricky and we should
add a couple of simple atomic counters to the generic struct inode (like
i_writecount/i_readcount), but in that case we would add new code to the
fast path. So inventing a new kind of struct file_lock looks like the best
approach. I don't want to implement range locking for now, but I can do it if
somebody really wants it.

Yes, we can use fcntl(), but fadvise() is much better.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758044Ab2CMFAP (ORCPT ); Tue, 13 Mar 2012 01:00:15 -0400
Received: from mail-bk0-f46.google.com ([209.85.214.46]:41388 "EHLO mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752372Ab2CMFAM (ORCPT ); Tue, 13 Mar 2012 01:00:12 -0400
Message-ID: <4F5ED458.5070301@openvz.org>
Date: Tue, 13 Mar 2012 09:00:08 +0400
From: Konstantin Khlebnikov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120217 Firefox/10.0.2 Iceape/2.7.2
MIME-Version: 1.0
To: Minchan Kim
CC: linux-mm , linux-kernel , "riel@redhat.com" , "kosaki.motohiro@jp.fujitsu.com"
Subject: Re: Fwd: Control page reclaim granularity
References: <4F5D95AF.1020108@openvz.org> <20120312081413.GA10923@gmail.com> <20120312134226.GA5120@barrios> <4F5E05AD.20200@openvz.org> <20120313024818.GA7125@barrios> <4F5ECF01.2000402@openvz.org>
In-Reply-To: <4F5ECF01.2000402@openvz.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Konstantin Khlebnikov wrote:
> Minchan Kim wrote:
>> On Mon, Mar 12, 2012 at 06:18:21PM +0400, Konstantin Khlebnikov wrote:
>>> Minchan Kim wrote:
>>>> On Mon, Mar
12, 2012 at 04:14:14PM +0800, Zheng Liu wrote: >>>>> On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote: >>>>>> Minchan Kim wrote: >>>>>>> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote: > >>>>>>> >>>>>>> Now problem is that >>>>>>> >>>>>>> 1. User want to keep pages which are used once in a while in memory. >>>>>>> 2. Kernel want to reclaim them because they are surely reclaim target >>>>>>> pages in point of view by LRU. >>>>>>> >>>>>>> The most desriable approach is that user should use mlock to guarantee >>>>>>> them in memory. But mlock is too big overhead and user doesn't want to >>>>>>> keep >>>>>>> memory all pages all at once.(Ie, he want demand paging when he need >>>>>>> the page) >>>>>>> Right? >>>>>>> >>>>>>> madvise, it's a just hint for kernel and kernel doesn't need to make >>>>>>> sure madvise's behavior. >>>>>>> In point of view, such inconsistency might not be a big problem. >>>>>>> >>>>>>> Big problem I think now is that user should use madvise(WILLNEED) >>>>>>> periodically because such >>>>>>> activation happens once when user calls madvise. If user doesn't use >>>>>>> page frequently after >>>>>>> user calls it, it ends up moving into inactive list and even could be >>>>>>> reclaimed. >>>>>>> It's not good. :-( >>>>>>> >>>>>>> Okay. How about adding new VM_WORKINGSET? >>>>>>> And reclaimer would give one more round trip in active/inactive list >>>>>>> erwhen reclaim happens >>>>>>> if the page is referenced. >>>>>>> >>>>>>> Sigh. We have no room for new VM_FLAG in 32 bit. >>>>>> p >>>>>> It would be nice to mark struct address_space with this flag and export >>>>>> AS_UNEVICTABLE somehow. >>>>>> Maybe we can reuse file-locking engine for managing these bits =) >>>>> >>>>> Make sense to me. We can mark this flag in struct address_space and check >>>>> it in page_refereneced_file(). If this flag is set, it will be cleard and >>>> >>>> Disadvantage is that we could set reclaim granularity as per-inode. 
>>>> I want to set it as per-vma, not per-inode.
>>>
>>> But with per-inode flag we can tune all files, not only memory-mapped.
>>
>> I don't oppose per-inode setting but I believe we need file range or mmapped vma,
>> still. One file may have different characteristic part, something is working set
>> something is streaming part.
>>
>>> See, attached patch. Currently I thinking about managing code,
>>> file-locking engine really fits perfectly =)
>>
>> file-locking engine?
>> You consider fcntl as interface for it?
>> What do you mean?
>>
>
> If we set bits on inode we somehow account its users and clear AS_WORKINGSET and AS_UNEVICTABLE
> at last file close. We can use file-locking engine for locking inodes in memory -- file lock automatically
> release inode at last fput(). Maybe it's too tricky and we should add couple simple atomic counters to
> generic struct inode (like i_writecount/i_readcount) but in this case we will add new code on fast-path.
> So, looks like invention new kind of struct file_lock is best approach.
> I don't want implement range-locking for now, but I can do it if somebody really wants this.
>
> Yes, we can use fcntl(), but fadvise() is much better.

Another mad idea: if we mark the vma, then we can add a fake vma (belonging
to init_mm, for example) to the inode's rmap to lock a range of the inode's
pages in memory without actually mapping the file. page_referenced_one()
would have to handle this fake vma differently, because page_check_address()
will always fail for it. This way we can effectively implement AS_WORKINGSET
and AS_UNEVICTABLE for arbitrary page ranges.
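For illustration only, marking arbitrary page ranges of one inode could be
modelled in user space as a sorted table of page-index ranges with a binary
search, as sketched below. Nothing in the thread specifies a data structure:
struct ws_range, struct ws_table and ws_table_contains() are hypothetical
names, and the fake-vma idea above would hang the ranges off the inode's rmap
rather than use a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: one protected range of page indexes, both ends inclusive. */
struct ws_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Hypothetical per-inode table. The ranges are sorted by start and do not
 * overlap, which is what makes the binary search below valid.
 */
struct ws_table {
	const struct ws_range *ranges;
	size_t nr;
};

/* Return true if page index `index` falls inside a protected range. */
static bool ws_table_contains(const struct ws_table *t, unsigned long index)
{
	size_t lo = 0, hi = t->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct ws_range *r = &t->ranges[mid];

		assert(r->start <= r->end);
		if (index < r->start)
			hi = mid;	/* continue in the lower half */
		else if (index > r->end)
			lo = mid + 1;	/* continue in the upper half */
		else
			return true;
	}
	return false;
}
```

With such a table the per-page check during reclaim costs O(log n) in the
number of marked ranges instead of a walk over all of them.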
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759819Ab2CMGZA (ORCPT ); Tue, 13 Mar 2012 02:25:00 -0400
Received: from mail-iy0-f174.google.com ([209.85.210.174]:45424 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754318Ab2CMGY6 (ORCPT ); Tue, 13 Mar 2012 02:24:58 -0400
From: Zheng Liu
To: minchan@kernel.org
Cc: khlebnikov@openvz.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu
Subject: Re: Fwd: Control page reclaim granularity
Date: Tue, 13 Mar 2012 14:30:14 +0800
Message-Id: <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com>
X-Mailer: git-send-email 1.7.4.1
In-Reply-To: <20120313024818.GA7125@barrios>
References: <20120313024818.GA7125@barrios>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This is only a first trivial attempt. If this flag is set, the reclaimer just
gives the page one more round trip rather than promoting it to the active
list. Any comments or advice are welcome.

Regards,
Zheng

[PATCH] mm: per-inode mmaped page reclaim

From: Zheng Liu

In some cases, the user wants to control the reclaim granularity of mmapped
pages. A new flag is added to struct address_space to give its pages one more
round trip. The AS_WORKINGSET flag cannot be added to vma->vm_flags because
that field has no room for a new flag on 32-bit. Now the user can call
madvise(2) to set this flag for a file. If this flag is set, all of the
file's pages are given one more round trip when the reclaimer tries to shrink
pages.

Signed-off-by: Zheng Liu
---
 include/asm-generic/mman-common.h |    2 ++
 include/linux/pagemap.h           |   16 ++++++++++++++++
 mm/madvise.c                      |    8 ++++++++
 mm/vmscan.c                       |   15 +++++++++++++++
 4 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index 787abbb..7d26c9b 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -48,6 +48,8 @@
 #define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
 #define MADV_NOHUGEPAGE	15		/* Not worth backing with hugepages */
 
+#define MADV_WORKINGSET 16		/* give one more round trip */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cfaaa69..80532a0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_WORKINGSET	= __GFP_BITS_SHIFT + 4,	/* give one more round trip */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -36,6 +37,21 @@ static inline void mapping_set_error(struct address_space *mapping, int error)
 	}
 }
 
+static inline void mapping_set_workingset(struct address_space *mapping)
+{
+	set_bit(AS_WORKINGSET, &mapping->flags);
+}
+
+static inline void mapping_clear_workingset(struct address_space *mapping)
+{
+	clear_bit(AS_WORKINGSET, &mapping->flags);
+}
+
+static inline int mapping_test_workingset(struct address_space *mapping)
+{
+	return mapping && test_bit(AS_WORKINGSET, &mapping->flags);
+}
+
 static inline void mapping_set_unevictable(struct address_space *mapping)
 {
 	set_bit(AS_UNEVICTABLE, &mapping->flags);
diff --git a/mm/madvise.c b/mm/madvise.c
index 74bf193..8ca6c9b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -77,6 +77,14 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_WORKINGSET:
+		if (vma->vm_file && vma->vm_file->f_mapping) {
+			mapping_set_workingset(vma->vm_file->f_mapping);
+		} else {
+			error = -EPERM;
+			goto out;
+		}
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c52b235..51f745b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -721,6 +721,15 @@ static enum page_references page_check_references(struct page *page,
 	if (vm_flags & VM_LOCKED)
 		return PAGEREF_RECLAIM;
 
+	/*
+	 * give this page one more round trip because workingset
+	 * flag is set.
+	 */
+	if (mapping_test_workingset(page_mapping(page))) {
+		mapping_clear_workingset(page_mapping(page));
+		return PAGEREF_KEEP;
+	}
+
 	if (referenced_ptes) {
 		if (PageAnon(page))
 			return PAGEREF_ACTIVATE;
@@ -1737,6 +1746,12 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			continue;
 		}
 
+		if (mapping_test_workingset(page_mapping(page))) {
+			mapping_clear_workingset(page_mapping(page));
+			list_add(&page->lru, &l_active);
+			continue;
+		}
+
 		if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
-- 
1.7.4.1

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753420Ab2CMGnO (ORCPT ); Tue, 13 Mar 2012 02:43:14 -0400
Received: from mail-pz0-f46.google.com ([209.85.210.46]:46803 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753177Ab2CMGnL (ORCPT ); Tue, 13 Mar 2012 02:43:11 -0400
Date: Tue, 13 Mar 2012 14:48:32 +0800
From: Zheng Liu
To: minchan@kernel.org
Cc: khlebnikov@openvz.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu
Subject: Re: Fwd: Control page reclaim granularity
Message-ID: <20120313064832.GA4968@gmail.com>
Mail-Followup-To: minchan@kernel.org, khlebnikov@openvz.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com,
	kosaki.motohiro@jp.fujitsu.com, Zheng Liu
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Sorry, please forgive me. This patch has a defect. The flag is stored in the
address_space, so when one page is scanned and the flag is cleared, it is
cleared for all the other pages of that file too.

Regards,
Zheng

On Tue, Mar 13, 2012 at 02:30:14PM +0800, Zheng Liu wrote:
> This only a first trivial try. If this flag is set, reclaimer just give this
> page one more round trip rather than promote it into active list. Any comments
> or advices are welcomed.
>
> Regards,
> Zheng
>
> [PATCH] mm: per-inode mmaped page reclaim
>
> From: Zheng Liu
>
> In some cases, user wants to control mmaped page reclaim granularity. A new
> flag is added into struct address_space to give the page one more round trip.
> AS_WORKINGSET flag cannot be added in vma->vm_flags because this flag has no
> room for a new flag in 32 bit. Now user can call madvise(2) to set this flag
> for a file. If this flag is set, all pages will be given one more round trip
> when reclaimer tries to shrink pages.
> > Signed-off-by: Zheng Liu > --- > include/asm-generic/mman-common.h | 2 ++ > include/linux/pagemap.h | 16 ++++++++++++++++ > mm/madvise.c | 8 ++++++++ > mm/vmscan.c | 15 +++++++++++++++ > 4 files changed, 41 insertions(+), 0 deletions(-) > > diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h > index 787abbb..7d26c9b 100644 > --- a/include/asm-generic/mman-common.h > +++ b/include/asm-generic/mman-common.h > @@ -48,6 +48,8 @@ > #define MADV_HUGEPAGE 14 /* Worth backing with hugepages */ > #define MADV_NOHUGEPAGE 15 /* Not worth backing with hugepages */ > > +#define MADV_WORKINGSET 16 /* give one more round trip */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index cfaaa69..80532a0 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -24,6 +24,7 @@ enum mapping_flags { > AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */ > AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */ > AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */ > + AS_WORKINGSET = __GFP_BITS_SHIFT + 4, /* give one more round trip */ > }; > > static inline void mapping_set_error(struct address_space *mapping, int error) > @@ -36,6 +37,21 @@ static inline void mapping_set_error(struct address_space *mapping, int error) > } > } > > +static inline void mapping_set_workingset(struct address_space *mapping) > +{ > + set_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline void mapping_clear_workingset(struct address_space *mapping) > +{ > + clear_bit(AS_WORKINGSET, &mapping->flags); > +} > + > +static inline int mapping_test_workingset(struct address_space *mapping) > +{ > + return mapping && test_bit(AS_WORKINGSET, &mapping->flags); > +} > + > static inline void mapping_set_unevictable(struct address_space *mapping) > { > set_bit(AS_UNEVICTABLE, &mapping->flags); > diff --git a/mm/madvise.c b/mm/madvise.c > index 
74bf193..8ca6c9b 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -77,6 +77,14 @@ static long madvise_behavior(struct vm_area_struct * vma, > if (error) > goto out; > break; > + case MADV_WORKINGSET: > + if (vma->vm_file && vma->vm_file->f_mapping) { > + mapping_set_workingset(vma->vm_file->f_mapping); > + } else { > + error = -EPERM; > + goto out; > + } > + break; > } > > if (new_flags == vma->vm_flags) { > diff --git a/mm/vmscan.c b/mm/vmscan.c > index c52b235..51f745b 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -721,6 +721,15 @@ static enum page_references page_check_references(struct page *page, > if (vm_flags & VM_LOCKED) > return PAGEREF_RECLAIM; > > + /* > + * give this page one more round trip because workingset > + * flag is set. > + */ > + if (mapping_test_workingset(page_mapping(page))) { > + mapping_clear_workingset(page_mapping(page)); > + return PAGEREF_KEEP; > + } > + > if (referenced_ptes) { > if (PageAnon(page)) > return PAGEREF_ACTIVATE; > @@ -1737,6 +1746,12 @@ static void shrink_active_list(unsigned long nr_to_scan, > continue; > } > > + if (mapping_test_workingset(page_mapping(page))) { > + mapping_clear_workingset(page_mapping(page)); > + list_add(&page->lru, &l_active); > + continue; > + } > + > if (page_referenced(page, 0, mz->mem_cgroup, &vm_flags)) { > nr_rotated += hpage_nr_pages(page); > /* > -- > 1.7.4.1 > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758234Ab2CMHVO (ORCPT ); Tue, 13 Mar 2012 03:21:14 -0400 Received: from mail-bk0-f46.google.com ([209.85.214.46]:49271 "EHLO mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753915Ab2CMHVN (ORCPT ); Tue, 13 Mar 2012 03:21:13 -0400 Message-ID: <4F5EF563.5000700@openvz.org> Date: Tue, 13 Mar 2012 11:21:07 +0400 From: Konstantin Khlebnikov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120217 Firefox/10.0.2 Iceape/2.7.2 MIME-Version: 1.0 To: 
	minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu
Subject: Re: Fwd: Control page reclaim granularity
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com>
In-Reply-To: <20120313064832.GA4968@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Zheng Liu wrote:
> Sorry, please forgive me. This patch has a defect. When one page is
> scanned and flag is clear, all other's flags also are clear too.

Yeah, funny patch =)

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759689Ab2CMHnR (ORCPT ); Tue, 13 Mar 2012 03:43:17 -0400
Received: from mail-gx0-f174.google.com ([209.85.161.174]:65464 "EHLO mail-gx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754503Ab2CMHnP (ORCPT ); Tue, 13 Mar 2012 03:43:15 -0400
MIME-Version: 1.0
In-Reply-To: <4F5EF563.5000700@openvz.org>
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org>
Date: Tue, 13 Mar 2012 13:13:14 +0530
Message-ID:
Subject: Re: Fwd: Control page reclaim granularity
From: Kautuk Consul
To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

I noticed this discussion and decided to pitch in one small idea from my
side.

It would be nice to range-lock an inode's pages by storing the ranges that
are to be locked. This could also give the kernel some good general routines
for range locking on a single inode. However, wouldn't this add overhead to
shrink_page_list(), since that code would need to go through all these ranges
while trying to reclaim a single page?

One small suggestion from my side: why don't we implement something like
"complete page-cache reclaim control from usermode"? With this, we could
set/unset AS_UNEVICTABLE (as Konstantin mentioned) on the mapping of a file's
inode from usermode by using ioctl or fcntl, or maybe even go as far as
implementing an O_NORECL option to the open system call. After setting
AS_UNEVICTABLE, the usermode application can choose which pages to keep and
which to remove by using fadvise(WILLNEED) and fadvise(DONTNEED).

(I think the presence of a VMA might not really be required for this idea.)

Thanks,
Kautuk.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759948Ab2CMHrn (ORCPT ); Tue, 13 Mar 2012 03:47:43 -0400
Received: from mail-yx0-f174.google.com ([209.85.213.174]:49116 "EHLO mail-yx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754930Ab2CMHrm (ORCPT ); Tue, 13 Mar 2012 03:47:42 -0400
MIME-Version: 1.0
In-Reply-To:
References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org>
Date: Tue, 13 Mar 2012 13:17:41 +0530
Message-ID:
Subject: Re: Fwd: Control page reclaim granularity
From: Kautuk Consul
To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset=ISO-8859-1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 13, 2012 at 1:13 PM, Kautuk Consul wrote:
> Hi,
>
> I noticed this discussion and decided to pitch in one small idea from my
side. > > It would be nice to range lock an inode's pages by storing those > ranges which would be locked. > This could also add some good routines for the kernel in terms of > range locking for a single inode. > However, wouldn't this add some overhead to shrink_page_list() since > that code would need to go through > all these ranges while trying to reclaim a single page ? > > One small suggestion from my side is: > Why don't we implement something like : "Complete page-cache reclaim > control from usermode"? > In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin > mentioned) for a file's > inode from usermode by using ioctl or fcntl or maybe even go as far as > implementing an O_NORECL > option to the open system call. > Of course, only an application executing with root privileges should be allowed to set the inode's mapping flags in this manner. > After setting the AS_UNEVICTABLE, the usermode application can choose > to keep and remove pages by > using the fadvise(WILLNEED) and fadvise(DONTNEED). > > ( I think maybe the presence of any VMA might not really be > required for this idea. ) > > Thanks, > Kautuk.
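For readers following along: the two hints named above already exist in userspace as posix_fadvise(2) advice values. Below is a minimal sketch, assuming a Linux/POSIX system. The helper names hint_keep() and hint_drop() are made up for illustration and are not part of any proposal in this thread. Both calls are only hints to the kernel; neither pins pages the way the proposed AS_UNEVICTABLE control would.

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: ask the kernel to start reading the given range
 * into the page cache. A len of 0 means "until end of file".
 * Returns 0 on success or an error number (not -1/errno) on failure.
 */
static int hint_keep(int fd, off_t off, off_t len)
{
    return posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}

/*
 * Hypothetical helper: tell the kernel that the cached pages in the
 * given range are no longer needed, so it may drop them.
 * Same return convention as hint_keep().
 */
static int hint_drop(int fd, off_t off, off_t len)
{
    return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}
```

An application would call hint_keep(fd, 0, 0) on its index file after open() and hint_drop(fd, 0, 0) on a data file once it has finished streaming through it.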
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759977Ab2CMIAY (ORCPT ); Tue, 13 Mar 2012 04:00:24 -0400 Received: from mail-iy0-f174.google.com ([209.85.210.174]:56848 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756899Ab2CMIAX (ORCPT ); Tue, 13 Mar 2012 04:00:23 -0400 Date: Tue, 13 Mar 2012 16:05:35 +0800 From: Zheng Liu To: Kautuk Consul Cc: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313080535.GA5243@gmail.com> Mail-Followup-To: Kautuk Consul , minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 13, 2012 at 01:17:41PM +0530, Kautuk Consul wrote: > On Tue, Mar 13, 2012 at 1:13 PM, Kautuk Consul wrote: > > Hi, > > > > I noticed this discussion and decided to pitch in one small idea from my side. > > > > It would be nice to range lock an inode's pages by storing those > > ranges which would be locked. > > This could also add some good routines for the kernel in terms of > > range locking for a single inode. > > However, wouldn't this add some overhead to shrink_page_list() since > > that code would need to go through > > all these ranges while trying to reclaim a single page ? 
> > > > One small suggestion from my side is: > > Why don't we implement something like : "Complete page-cache reclaim > > control from usermode"? > > In this, we can set/unset the mapping to AS_UNEVICTABLE (as Konstantin > > mentioned) for a file's > > inode from usermode by using ioctl or fcntl or maybe even go as far as > > implementing an O_NORECL > > option to the open system call. > > > > Of course, only an application executing with root privileges should > be allowed to set the inode's > mapping flags in this manner. Hi Kautuk, IMHO, running an application with root privileges is too dangerous. We should avoid it. Regards, Zheng > > > > After setting the AS_UNEVICTABLE, the usermode application can choose > > to keep and remove pages by > > using the fadvise(WILLNEED) and fadvise(DONTNEED). > > > > ( I think maybe the presence of any VMA might not really be > > required for this idea. ) > > > > Thanks, > > Kautuk.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760003Ab2CMIEp (ORCPT ); Tue, 13 Mar 2012 04:04:45 -0400 Received: from mail-yx0-f174.google.com ([209.85.213.174]:45815 "EHLO mail-yx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751245Ab2CMIEk convert rfc822-to-8bit (ORCPT ); Tue, 13 Mar 2012 04:04:40 -0400 MIME-Version: 1.0 In-Reply-To: <20120313080535.GA5243@gmail.com> References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> Date: Tue, 13 Mar 2012 13:34:39 +0530 Message-ID: Subject: Re: Fwd: Control page reclaim granularity From: Kautuk Consul To: Kautuk Consul , minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > > Hi Kautuk, > > IMHO, running an application with root privileges is too dangerous. We > should avoid it. > I agree, but that's not my point. All I'm saying is that we probably don't want to give normal unprivileged usermode apps the capability to set the mapping to AS_UNEVICTABLE as anyone can then write an application that hogs memory without allowing the kernel to free it through memory reclaim.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759997Ab2CMII7 (ORCPT ); Tue, 13 Mar 2012 04:08:59 -0400 Received: from mail-yx0-f174.google.com ([209.85.213.174]:44120 "EHLO mail-yx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756496Ab2CMII5 (ORCPT ); Tue, 13 Mar 2012 04:08:57 -0400 MIME-Version: 1.0 In-Reply-To: References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> Date: Tue, 13 Mar 2012 13:38:56 +0530 Message-ID: Subject: Re: Fwd: Control page reclaim granularity From: Kautuk Consul To: Kautuk Consul , minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > > I agree, but that's not my point. > > All I'm saying is that we probably don't want to give normal > unprivileged usermode apps > the capability to set the mapping to AS_UNEVICTABLE as anyone can then > write an application > that hogs memory without allowing the kernel to free it through memory reclaim. Sorry, I mean : "... that hogs kernel unmapped page-cache memory without allowing the kernel to free it through memory reclaim." 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755099Ab2CMIW4 (ORCPT ); Tue, 13 Mar 2012 04:22:56 -0400 Received: from mail-pz0-f46.google.com ([209.85.210.46]:58858 "EHLO mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752265Ab2CMIWx (ORCPT ); Tue, 13 Mar 2012 04:22:53 -0400 Date: Tue, 13 Mar 2012 16:28:18 +0800 From: Zheng Liu To: Kautuk Consul Cc: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: Fwd: Control page reclaim granularity Message-ID: <20120313082818.GA5421@gmail.com> Mail-Followup-To: Kautuk Consul , minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 13, 2012 at 01:38:56PM +0530, Kautuk Consul wrote: > > > > I agree, but that's not my point. > > > > All I'm saying is that we probably don't want to give normal > > unprivileged usermode apps > > the capability to set the mapping to AS_UNEVICTABLE as anyone can then > > write an application > > that hogs memory without allowing the kernel to free it through memory reclaim. Yes, I think so. But it seems that there is already some code that can be abused. For example, as I said previously, applications can mmap a normal data file with the PROT_EXEC flag. Then this file gets a high priority to be kept in memory (commit: 8cab4754).
So my point is that we cannot control how applications use these mechanisms. We just provide them and let applications choose how to use them. :-) Regards, Zheng > > Sorry, I mean : > "... that hogs kernel unmapped page-cache memory without allowing the > kernel to free it through memory reclaim." From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759682Ab2CMIgc (ORCPT ); Tue, 13 Mar 2012 04:36:32 -0400 Received: from mail-yx0-f174.google.com ([209.85.213.174]:37113 "EHLO mail-yx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752294Ab2CMIg3 convert rfc822-to-8bit (ORCPT ); Tue, 13 Mar 2012 04:36:29 -0400 MIME-Version: 1.0 In-Reply-To: <20120313082818.GA5421@gmail.com> References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> <20120313082818.GA5421@gmail.com> Date: Tue, 13 Mar 2012 14:06:28 +0530 Message-ID: Subject: Re: Fwd: Control page reclaim granularity From: Kautuk Consul To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > > Yes, I think so. But it seems that there is already some code that > can be abused. For example, as I said previously, applications > can mmap a normal data file with the PROT_EXEC flag. Then this file gets a > high priority to be kept in memory (commit: 8cab4754). So my point is that > we cannot control how applications use these mechanisms. We just > provide them and let applications choose how to use them.
> :-) > That's true, but we are not talking about higher priority here, because in an extreme memory reclaim case even PROT_EXEC pages will be reclaimed. But I understand your point. It might be okay to allow this for applications at any privilege level. The only problem might be in the OOM case, because we will have to include selection points for these page-cache pages (proportionately) while finding the most expensive process to kill. ( I'm talking about the page-cache pages which are not mapped to usermode page-tables at all. ) If any usermode application reads in an extremely huge file, whose inode has been set to AS_UNEVICTABLE, we might want to kill those applications that read in those pages (proportionately) so that the guilty application can be killed. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759784Ab2CMJDu (ORCPT ); Tue, 13 Mar 2012 05:03:50 -0400 Received: from mail-yx0-f174.google.com ([209.85.213.174]:37568 "EHLO mail-yx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754045Ab2CMJDs (ORCPT ); Tue, 13 Mar 2012 05:03:48 -0400 MIME-Version: 1.0 In-Reply-To: References: <20120313024818.GA7125@barrios> <1331620214-4893-1-git-send-email-wenqing.lz@taobao.com> <20120313064832.GA4968@gmail.com> <4F5EF563.5000700@openvz.org> <20120313080535.GA5243@gmail.com> <20120313082818.GA5421@gmail.com> Date: Tue, 13 Mar 2012 14:33:47 +0530 Message-ID: Subject: Re: Fwd: Control page reclaim granularity From: Kautuk Consul To: minchan@kernel.org, riel@redhat.com, kosaki.motohiro@jp.fujitsu.com, Zheng Liu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > The only problem might be in the OOM case, because we will > have to include selection points for > these page-cache pages (proportionately) while finding the
most > expensive process to kill. > ( I'm talking about the page-cache pages which are not mapped to > usermode page-tables at all. ) > > If any usermode application reads in an extremely huge file, whose > inode has been set to AS_UNEVICTABLE, > we might want to kill those applications that read in those > pages (proportionately) so that the guilty application > can be killed. On some more thought, I guess for OOM and proportionate working set accounting, the approach mentioned by Konstantin (with a fake VMA) should work fine with respect to the way oom_kill.c accounts for the virtual address size of kill candidates. So, I now think that the best way might indeed be to have a fake VMA to account for the page-cache pages not mapped to usermode. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755507Ab2CMO5R (ORCPT ); Tue, 13 Mar 2012 10:57:17 -0400 Received: from mx1.redhat.com ([209.132.183.28]:7976 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754674Ab2CMO5Q (ORCPT ); Tue, 13 Mar 2012 10:57:16 -0400 Message-ID: <4F5F603F.2070600@redhat.com> Date: Tue, 13 Mar 2012 10:57:03 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20120216 Thunderbird/10.0.1 MIME-Version: 1.0 To: Minchan Kim CC: linux-mm , linux-kernel , Konstantin Khlebnikov , kosaki.motohiro@jp.fujitsu.com Subject: Re: Control page reclaim granularity References: <20120308073412.GA6975@gmail.com> <20120308093514.GA28856@barrios> <4F5E0E5C.8040508@redhat.com> <20120313025756.GC7125@barrios> In-Reply-To: <20120313025756.GC7125@barrios> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 03/12/2012 10:57 PM, Minchan Kim wrote: > On Mon, Mar 12, 2012 at 10:55:24AM -0400, Rik van Riel wrote: >> On 03/08/2012 04:35 AM, Minchan Kim wrote:
>>> Before we were trying to keep mapped pages in memory(See calc_reclaim_mapped). >>> But we removed that routine when we applied split lru page replacement. >>> Rik, KOSAKI. What's the rationale? >> >> One main reason is scalability. We have to treat pages >> in such a way that we do not have to search through >> gigabytes of memory to find a few eviction candidates >> to place on the inactive list - where they could get >> reused and stopped from eviction again. > > Okay. Thanks, Rik. > Then, another question. > Why did we handle mmaped page specially at that time? > Just out of curiosity. We had to, because we had only one set of LRU lists. Something had to be done to keep streaming IO from pushing other things out of memory. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030677Ab2CNHUK (ORCPT ); Wed, 14 Mar 2012 03:20:10 -0400 Received: from mail-fa0-f74.google.com ([209.85.161.74]:52609 "EHLO mail-fa0-f74.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752643Ab2CNHTv (ORCPT ); Wed, 14 Mar 2012 03:19:51 -0400 From: Greg Thelen To: linux-mm , Zheng Liu Cc: linux-kernel , Konstantin Khlebnikov Subject: Re: Control page reclaim granularity References: <20120308161318.GA9904@gmail.com> Date: Wed, 14 Mar 2012 00:19:43 -0700 Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Zheng Liu writes: > Hi Greg, > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel > mailing list. So please Cc me. > > I am glad to receive your reply and I am very interesting for your > approach. Actually I am not very familiar with CGroup. So would you > please send your patch to me if you can? Thank you all the same. 
> > Regards, > Zheng Sorry for the delay, I had trouble finding my old prototype patch. The patch below is based on v2.6.34. The patch is just an idea not a complete solution. >>From b1b127e0e1443446d51353b0d7a776bddc046009 Mon Sep 17 00:00:00 2001 From: Greg Thelen Date: Sat, 5 Jun 2010 17:26:06 -0700 Subject: [PATCH] memcg: prototype of dentry/cgroup binding. JUST A PROTOTYPE: DO NOT SUBMIT This creates a /dev/cgroup/memory/X/memory.dir_roots file which one can use to register a directory file descriptors. The idea is that future charges to registered directories, including child inodes, will be billed to memcg X rather than whatever memcg the faulting process runs within. --- fs/dcache.c | 4 +++ include/linux/dcache.h | 1 + include/linux/memcontrol.h | 2 +- mm/filemap.c | 3 ++ mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 73 insertions(+), 1 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index f1358e5..dda48d7 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -70,6 +70,7 @@ struct dentry_stat_t dentry_stat = { static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); + BUG_ON(dentry->d_mem); if (dname_external(dentry)) kfree(dentry->d_name.name); kmem_cache_free(dentry_cache, dentry); @@ -172,6 +173,7 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); + mem_cgroup_disassociate_from_dentry(dentry); dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); @@ -953,6 +955,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_inode = NULL; dentry->d_parent = NULL; dentry->d_sb = NULL; + dentry->d_mem = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -964,6 +967,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = 
parent->d_sb; + dentry->d_mem = parent->d_mem; } else { INIT_LIST_HEAD(&dentry->d_u.d_child); } diff --git a/include/linux/dcache.h b/include/linux/dcache.h index eebb617..523d58b 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -114,6 +114,7 @@ struct dentry { unsigned long d_time; /* used by d_revalidate */ const struct dentry_operations *d_op; struct super_block *d_sb; /* The root of the dentry tree */ + struct mem_cgroup *d_mem; /* Optional memcg */ void *d_fsdata; /* fs-specific data */ unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 44301c6..a8b54f9 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -71,6 +71,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, struct mem_cgroup *mem_cont, int active, int file); extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask); +void mem_cgroup_disassociate_from_dentry(struct dentry *dentry); int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem); extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page); @@ -309,4 +310,3 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ - diff --git a/mm/filemap.c b/mm/filemap.c index 140ebda..a9a525b 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -400,8 +400,11 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(page->mapping != NULL); + page->mapping = mapping; /* XXX: hack? */ error = mem_cgroup_cache_charge(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); + page->mapping = NULL; /* XXX: hack? 
*/ if (error) goto out; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8a79a6f..de9f150 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -793,6 +793,23 @@ void mem_cgroup_move_lists(struct page *page, mem_cgroup_add_lru_list(page, to); } +static void mem_cgroup_associate_dentry(struct mem_cgroup *mem, + struct dentry *dentry) +{ + css_get(&mem->css); + BUG_ON(dentry->d_mem); + dentry->d_mem = mem; +} + +void mem_cgroup_disassociate_from_dentry(struct dentry *dentry) +{ + if (!dentry->d_mem) + return; + + css_put(&dentry->d_mem->css); + dentry->d_mem = NULL; +} + int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem) { int ret; @@ -1914,6 +1931,29 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, return 0; prefetchw(pc); + /* + * If the page is inode and related dentry indicates a cgroup, then + * charge that cgroup. Otherwise fallback on the mm's cgroup. + * + * TODO(gthelen): this needs more thought. + */ + if ((memcg == NULL) && !PageAnon(page)) { + struct address_space *as; + struct inode *inode; + struct dentry *dentry; + + /* what kind of locking is needed to walk this? dcache_lock (gulp)? 
*/ + as = (struct address_space *)page_rmapping(page); + if (as != NULL) { + inode = as->host; + BUG_ON(inode == NULL); + list_for_each_entry(dentry, &inode->i_dentry, d_alias) { + memcg = dentry->d_mem; + break; + } + } + } + mem = memcg; ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true); if (ret || !mem) @@ -3539,6 +3579,26 @@ unlock: return ret; } +static int mem_cgroup_dir_roots_write(struct cgroup *cgrp, struct cftype *cft, + u64 dir_fd) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp); + struct file *dir; + int status = 0; + + dir = fget(dir_fd); + if (!dir) + return -EINVAL; + + if (dir->f_dentry->d_mem) + status = -EINVAL; + else + mem_cgroup_associate_dentry(mem, dir->f_dentry); + + fput(dir); + return status; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -3594,6 +3654,10 @@ static struct cftype mem_cgroup_files[] = { .read_u64 = mem_cgroup_move_charge_read, .write_u64 = mem_cgroup_move_charge_write, }, + { + .name = "dir_roots", + .write_u64 = mem_cgroup_dir_roots_write, + }, }; #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP -- 1.7.7.3
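To make the prototype's charging rule easier to follow, here is a small userspace model of it. This is not kernel code and not part of the patch: the names model_memcg, model_dentry, model_alloc() and charge_target() are invented for this sketch. It assumes, as the hunks above suggest, that a child dentry copies d_mem from its parent when it is allocated, and that a file page is charged to the dentry's memcg when one is set, falling back to the faulting task's memcg otherwise.

```c
#include <stddef.h>

/* Stand-in for struct mem_cgroup; only its identity matters here. */
struct model_memcg {
    const char *name;
};

/* Stand-in for struct dentry with the d_mem field the patch adds. */
struct model_dentry {
    struct model_dentry *parent;
    struct model_memcg *d_mem;   /* optional memcg, NULL when unbound */
};

/* Models the d_alloc() hunk: a child inherits d_mem from its parent. */
static void model_alloc(struct model_dentry *d, struct model_dentry *parent)
{
    d->parent = parent;
    d->d_mem = parent ? parent->d_mem : NULL;
}

/*
 * Models the mem_cgroup_charge_common() hunk: for a non-anonymous page,
 * prefer the memcg bound to the dentry; otherwise charge the memcg of
 * the task that triggered the charge.
 */
static struct model_memcg *charge_target(const struct model_dentry *d,
                                         int page_is_anon,
                                         struct model_memcg *task_memcg)
{
    if (!page_is_anon && d != NULL && d->d_mem != NULL)
        return d->d_mem;
    return task_memcg;
}
```

One consequence visible in this model: because inheritance happens only at allocation time, dentries that already existed below a directory before it was bound keep d_mem == NULL and are still charged to the task's memcg.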