[ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path

* [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
@ 2010-06-25 20:43 Greg Thelen
  2010-06-28  2:03 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Thelen @ 2010-06-25 20:43 UTC (permalink / raw)
  To: lsf10-pc; +Cc: linux-mm

For the upcoming Linux VM summit, I am interested in discussing the
following proposal.

Problem: When tasks from multiple cgroups share files the charging can be
non-deterministic.  This requires that all such cgroups have unnecessarily high
limits.  It would be nice if the charging was deterministic, using the file's
path to determine which cgroup to charge.  This would benefit charging of
commonly used files (eg: libc) as well as large databases shared by only a few
tasks.

Example: assume two tasks (T1 and T2), each in a separate cgroup.  Each task
wants to access a large (1GB) database file.  To catch memory leaks a tight
memory limit on each task's cgroup is set.  However, the large database file
presents a problem.  If the file has not been cached, then the first task to
access the file is charged, thereby requiring that task's cgroup to have a limit
large enough to include the database file.  If the order of access is unknown
(due to process restart, etc), then all cgroups accessing the file need to have
a limit large enough to include the database.  This is wasteful because the
database won't be charged to both T1 and T2.  It would be useful to introduce
determinism by declaring that a particular cgroup is charged for a particular
set of files.

/dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
/dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
/dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
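
The limit arithmetic above can be sketched as follows. This is only an illustration: the helper names are hypothetical, sizes are in MB, and 1GB is assumed to mean 1024MB.

```python
MB = 1
GB = 1024 * MB  # assumption: 1GB = 1024MB


def limits_first_touch(private_mb, shared_mb):
    """Any cgroup may touch the shared file first, so every limit must
    cover that cgroup's private usage plus the whole shared file."""
    return {name: need + shared_mb for name, need in private_mb.items()}


def limits_deterministic(private_mb, shared_mb):
    """The shared file is always charged to the parent (cg1), so the
    children only need room for their private usage."""
    limits = dict(private_mb)
    limits["cg1"] = shared_mb + sum(private_mb.values())
    return limits


private = {"cg11": 30 * MB, "cg12": 100 * MB}
print(limits_first_touch(private, 1 * GB))    # {'cg11': 1054, 'cg12': 1124}
print(limits_deterministic(private, 1 * GB))  # {'cg11': 30, 'cg12': 100, 'cg1': 1154}
```

With first-touch charging both children need more than 1GB of limit, although the file is only ever charged once.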

I have implemented a prototype that allows a file system hierarchy to be charged
to a particular cgroup using a new bind mount option:
+ mount -t cgroup none /cgroup -o memory
+ mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1

Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
other files behave normally - they charge the cgroup of the current task.
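
The charging rule described above can be modelled in user space roughly as below. This is a toy model, not the prototype's kernel code: the function name and the bindings table are hypothetical stand-ins for the cgroup= bind mount option.

```python
import os.path


def cgroup_to_charge(path, task_cgroup, bindings):
    """Return the cgroup that would be charged for a page of `path`.

    `bindings` maps a directory to the cgroup bound to it (a hypothetical
    table standing in for cgroup= bind mounts).  The longest matching
    directory wins; unbound paths charge the accessing task's cgroup.
    """
    path = os.path.normpath(path)
    best = None
    for directory, cgroup in bindings.items():
        directory = os.path.normpath(directory)
        if path == directory or path.startswith(directory + os.sep):
            if best is None or len(directory) > len(best[0]):
                best = (directory, cgroup)
    return best[1] if best else task_cgroup


bindings = {"/tmp/db": "/dev/cgroup/cg1"}
print(cgroup_to_charge("/tmp/db/table.dat", "/dev/cgroup/cg1/cg11", bindings))
print(cgroup_to_charge("/lib/libc.so.6", "/dev/cgroup/cg1/cg11", bindings))
```

The first call reports /dev/cgroup/cg1 regardless of which task touches the file; the second falls back to the task's own cgroup.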

--
Greg

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
  2010-06-25 20:43 [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path Greg Thelen
@ 2010-06-28  2:03 ` KAMEZAWA Hiroyuki
  2010-06-28  5:07   ` Balbir Singh
  2010-06-29  5:31   ` [ATTEND][LSF/VM TOPIC] " Greg Thelen
  0 siblings, 2 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-28  2:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: lsf10-pc, linux-mm, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com

On Fri, 25 Jun 2010 13:43:45 -0700
Greg Thelen <gthelen@google.com> wrote:

> For the upcoming Linux VM summit, I am interested in discussing the
> following proposal.
> 
> Problem: When tasks from multiple cgroups share files the charging can be
> non-deterministic.  This requires that all such cgroups have unnecessarily high
> limits.  It would be nice if the charging was deterministic, using the file's
> path to determine which cgroup to charge.  This would benefit charging of
> commonly used files (eg: libc) as well as large databases shared by only a few
> tasks.
> 
> Example: assume two tasks (T1 and T2), each in a separate cgroup.  Each task
> wants to access a large (1GB) database file.  To catch memory leaks a tight
> memory limit on each task's cgroup is set.  However, the large database file
> presents a problem.  If the file has not been cached, then the first task to
> access the file is charged, thereby requiring that task's cgroup to have a limit
> large enough to include the database file.  If the order of access is unknown
> (due to process restart, etc), then all cgroups accessing the file need to have
> a limit large enough to include the database.  This is wasteful because the
> database won't be charged to both T1 and T2.  It would be useful to introduce
> determinism by declaring that a particular cgroup is charged for a particular
> set of files.
> 
> /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
> /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
> /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
> 
> I have implemented a prototype that allows a file system hierarchy to be charged
> to a particular cgroup using a new bind mount option:
> + mount -t cgroup none /cgroup -o memory
> + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
> 
> Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
> other files behave normally - they charge the cgroup of the current task.
> 

Interesting, but I would rather use madvise() or similar for this kind of job
than add deep hooks into the kernel.

madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);

Then you can write a command such as:

  file_recharge [path name] [cgroup]
  - this command moves a file's cache to the specified cgroup.

A daemon program which uses this command + inotify will give us much more
flexible control of file cache in memcg. Do you have a requirement
that this move-charge not be done in a lazy manner?

Status:
We have code for move-charge and inotify, but no code for the new madvise.
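
The user-space side of such a file_recharge command might look roughly like the sketch below. The recharge advice does not exist, so the example passes an existing advice value (MADV_WILLNEED) purely as a stand-in to show the shape of the call. The function name is hypothetical; Linux and Python 3.8+ are assumed.

```python
import mmap
import os


def advise_whole_file(path, advice):
    """Map the whole file read-only and issue one madvise() over the mapping.

    Returns the number of bytes covered.  A real file_recharge would pass
    the proposed recharge advice, which no kernel provides today.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(fd, size, prot=mmap.PROT_READ) as mapping:
            mapping.madvise(advice)  # stand-in for the proposed advice
        return size
    finally:
        os.close(fd)
```

For example, advise_whole_file("/tmp/db/table.dat", mmap.MADV_WILLNEED) maps the file and asks for readahead; the command would additionally have to run inside the target cgroup.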


Thanks,
-Kame

* Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
  2010-06-28  2:03 ` KAMEZAWA Hiroyuki
@ 2010-06-28  5:07   ` Balbir Singh
  2010-06-29  6:42     ` Greg Thelen
  2010-06-29  5:31   ` [ATTEND][LSF/VM TOPIC] " Greg Thelen
  1 sibling, 1 reply; 8+ messages in thread
From: Balbir Singh @ 2010-06-28  5:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, lsf10-pc, linux-mm, nishimura@mxp.nes.nec.co.jp

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-28 11:03:27]:

> On Fri, 25 Jun 2010 13:43:45 -0700
> Greg Thelen <gthelen@google.com> wrote:
> 
> > For the upcoming Linux VM summit, I am interested in discussing the
> > following proposal.
> > 
> > Problem: When tasks from multiple cgroups share files the charging can be
> > non-deterministic.  This requires that all such cgroups have unnecessarily high
> > limits.  It would be nice if the charging was deterministic, using the file's
> > path to determine which cgroup to charge.  This would benefit charging of
> > commonly used files (eg: libc) as well as large databases shared by only a few
> > tasks.
> > 
> > Example: assume two tasks (T1 and T2), each in a separate cgroup.  Each task
> > wants to access a large (1GB) database file.  To catch memory leaks a tight
> > memory limit on each task's cgroup is set.  However, the large database file
> > presents a problem.  If the file has not been cached, then the first task to
> > access the file is charged, thereby requiring that task's cgroup to have a limit
> > large enough to include the database file.  If the order of access is unknown
> > (due to process restart, etc), then all cgroups accessing the file need to have
> > a limit large enough to include the database.  This is wasteful because the
> > database won't be charged to both T1 and T2.  It would be useful to introduce
> > determinism by declaring that a particular cgroup is charged for a particular
> > set of files.
> > 
> > /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
> > /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
> > /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
> > 
> > I have implemented a prototype that allows a file system hierarchy to be charged
> > to a particular cgroup using a new bind mount option:
> > + mount -t cgroup none /cgroup -o memory
> > + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
> > 
> > Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
> > other files behave normally - they charge the cgroup of the current task.
> > 
> 
> Interesting, but I would rather use madvise() or similar for this kind of job
> than add deep hooks into the kernel.
> 
> madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);
> 
> Then you can write a command such as:
> 
>   file_recharge [path name] [cgroup]
>   - this command moves a file's cache to the specified cgroup.
> 
> A daemon program which uses this command + inotify will give us much more
> flexible control of file cache in memcg. Do you have a requirement
> that this move-charge not be done in a lazy manner?
> 
> Status:
> We have code for move-charge and inotify, but no code for the new madvise.

I have not seen the approach yet, but ideally one would want to avoid
changing the application, otherwise we are going to get very tightly
bound up in API issues.

I want to understand why we need bind mounts.  I think this needs
more discussion.

-- 
	Three Cheers,
	Balbir


* Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
  2010-06-28  2:03 ` KAMEZAWA Hiroyuki
  2010-06-28  5:07   ` Balbir Singh
@ 2010-06-29  5:31   ` Greg Thelen
  2010-06-29  6:30     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 8+ messages in thread
From: Greg Thelen @ 2010-06-29  5:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: lsf10-pc, linux-mm, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com

On Sun, Jun 27, 2010 at 7:03 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 25 Jun 2010 13:43:45 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> For the upcoming Linux VM summit, I am interested in discussing the
>> following proposal.
>>
>> Problem: When tasks from multiple cgroups share files the charging can be
>> non-deterministic.  This requires that all such cgroups have unnecessarily high
>> limits.  It would be nice if the charging was deterministic, using the file's
>> path to determine which cgroup to charge.  This would benefit charging of
>> commonly used files (eg: libc) as well as large databases shared by only a few
>> tasks.
>>
>> Example: assume two tasks (T1 and T2), each in a separate cgroup.  Each task
>> wants to access a large (1GB) database file.  To catch memory leaks a tight
>> memory limit on each task's cgroup is set.  However, the large database file
>> presents a problem.  If the file has not been cached, then the first task to
>> access the file is charged, thereby requiring that task's cgroup to have a limit
>> large enough to include the database file.  If the order of access is unknown
>> (due to process restart, etc), then all cgroups accessing the file need to have
>> a limit large enough to include the database.  This is wasteful because the
>> database won't be charged to both T1 and T2.  It would be useful to introduce
>> determinism by declaring that a particular cgroup is charged for a particular
>> set of files.
>>
>> /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
>> /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
>> /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
>>
>> I have implemented a prototype that allows a file system hierarchy to be charged
>> to a particular cgroup using a new bind mount option:
>> + mount -t cgroup none /cgroup -o memory
>> + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
>>
>> Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
>> other files behave normally - they charge the cgroup of the current task.
>>
>
> Interesting, but I would rather use madvise() or similar for this kind of job
> than add deep hooks into the kernel.
>
> madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);
>
> Then you can write a command such as:
>
>  file_recharge [path name] [cgroup]
>  - this command moves a file's cache to the specified cgroup.
>
> A daemon program which uses this command + inotify will give us much more
> flexible control of file cache in memcg. Do you have a requirement
> that this move-charge not be done in a lazy manner?
>
> Status:
> We have code for move-charge and inotify, but no code for the new madvise.
>
>
> Thanks,
> -Kame

This is an interesting approach.  I like the idea of minimizing kernel
changes.  I want to make sure I understand the idea using terms from
my above example.

1. The daemon establishes inotify watches on /tmp/db and all of its
subdirectories to catch any accesses.

2. If cg11(T1) is the first process to mmap a portion of a /tmp/db
file (pages_1) then cg11 will be charged.  T1 will not use madvise()
because cg11 does not want to be charged.  cg11 will be temporarily
charged for pages_1.

3. inotify will inform the proposed daemon that T1 opened a file in /tmp/db,
so the daemon will use file_recharge, which runs the following within
the cg1 cgroup:
- fd = open("/tmp/db/.../path_to_file")
- va = mmap(NULL, st_size = fstat(fd).st_size, fd)
- madvise(va, st_size, MEMORY_RECHARGE_THIS_PAGES_TO_ME).  This
will move the charge of pages_1 from cg11 to cg1.

Did I state this correctly?

I am concerned that the follow-on step does not move the pages to cg1:
4. T1 then touches more /tmp/db pages (pages_2) using the same mmap.
This charges cg11.  I assume that inotify would not notify the
daemon for this case because the file is still open.  So the pages
will not be moved to cg1.  Or are you suggesting that inotify be
enhanced to advertise charge events?
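
Steps 2 to 4 can be replayed with a toy model of first-touch charging. The class and its methods are hypothetical and only track which cgroup each cached page is charged to; nothing here reflects real memcg data structures.

```python
class PageCache:
    """Toy model: remembers which cgroup each cached page is charged to."""

    def __init__(self):
        self.owner = {}  # page index -> cgroup charged for it

    def touch(self, cgroup, pages):
        # First touch charges the toucher; later touches change nothing.
        for page in pages:
            self.owner.setdefault(page, cgroup)

    def recharge_all(self, cgroup):
        # Models file_recharge: move every page cached right now.
        for page in self.owner:
            self.owner[page] = cgroup

    def charged_to(self, cgroup):
        return sorted(p for p, c in self.owner.items() if c == cgroup)


cache = PageCache()
cache.touch("cg11", range(0, 4))   # step 2: T1 faults in pages_1
cache.recharge_all("cg1")          # step 3: daemon runs file_recharge
cache.touch("cg11", range(4, 8))   # step 4: T1 faults in pages_2
print(cache.charged_to("cg1"))     # [0, 1, 2, 3]
print(cache.charged_to("cg11"))    # [4, 5, 6, 7]
```

In this model pages_2 stay charged to cg11 until something runs the recharge again.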

If the number of directories within /tmp/db is large, then inotify
may be expensive.  I don't think this is a problem.

Another worry I have is that if for some reason the daemon is started
after the job, or if the daemon crashes and is restarted, then files
may have been opened and charged to cg11 without the inotify watches being
set up.  The daemon would have problems finding the pages that were
charged to cg11 and need to be moved to cg1.  The daemon could scan
the open file table of T1, but any files that are no longer open may
still be charged to cg11 with no way for the daemon to find them.
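
Scanning a task's open file table could be done from user space along these lines. This is a minimal sketch that assumes Linux /proc; the function name is hypothetical.

```python
import os


def open_files_under(pid, directory):
    """List files under `directory` that process `pid` currently has open,
    by reading the symlinks in /proc/<pid>/fd (Linux only)."""
    directory = os.path.realpath(directory)
    fd_dir = f"/proc/{pid}/fd"
    found = set()
    for entry in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        if target.startswith(directory + os.sep):
            found.add(target)
    return sorted(found)
```

As noted above, a file that was opened, read, and closed before the scan leaves no entry here, even though its pages may still be charged to the task's cgroup.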

--
Greg


* Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
  2010-06-29  5:31   ` [ATTEND][LSF/VM TOPIC] " Greg Thelen
@ 2010-06-29  6:30     ` KAMEZAWA Hiroyuki
  2010-07-01  4:16       ` Greg Thelen
  0 siblings, 1 reply; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-29  6:30 UTC (permalink / raw)
  To: Greg Thelen
  Cc: lsf10-pc, linux-mm, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com

On Mon, 28 Jun 2010 22:31:03 -0700
Greg Thelen <gthelen@google.com> wrote:

> On Sun, Jun 27, 2010 at 7:03 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Fri, 25 Jun 2010 13:43:45 -0700
> > Greg Thelen <gthelen@google.com> wrote:

> >> /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
> >> /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
> >> /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
> >>
> >> I have implemented a prototype that allows a file system hierarchy to be charged
> >> to a particular cgroup using a new bind mount option:
> >> + mount -t cgroup none /cgroup -o memory
> >> + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
> >>
> >> Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
> >> other files behave normally - they charge the cgroup of the current task.
> >>
> >
> > Interesting, but I would rather use madvise() or similar for this kind of job
> > than add deep hooks into the kernel.
> >
> > madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);
> >
> > Then you can write a command such as:
> >
> >  file_recharge [path name] [cgroup]
> >  - this command moves a file's cache to the specified cgroup.
> >
> > A daemon program which uses this command + inotify will give us much more
> > flexible control of file cache in memcg. Do you have a requirement
> > that this move-charge not be done in a lazy manner?
> >
> > Status:
> > We have code for move-charge and inotify, but no code for the new madvise.
> >
> >
> > Thanks,
> > -Kame
> 
> This is an interesting approach.  I like the idea of minimizing kernel
> changes.  I want to make sure I understand the idea using terms from
> my above example.
> 
> 1. The daemon establishes inotify() watches on /tmp/db and all sub
> directories to catch any accesses.
> 
> 2. If cg11(T1) is the first process to mmap a portion of a /tmp/db
> file (pages_1) then cg11 will be charged.  T1 will not use madvise()
> because cg11 does not want to be charged.  cg11 will be temporarily
> charged for pages_1.
> 
yes.

> 3. inotify() will inform the proposed daemon that T1 opened /tmp/db,
> so the daemon will use file_recharge, which runs the following within
> the cg1 cgroup:
> - fd = open("/tmp/db/.../path_to_file")
> - va = mmap(NULL, st_size = fstat(fd).st_size, fd)
> - madvise(va, st_size, MEMORY_RECHARGE_THIS_PAGES_TO_ME).  This
> will move the charge of pages_1 from cg11 to cg1.
> 
> Did I state this correctly?
> 
yes.


> I am concerned that the follow-on step does not move the pages to cg1:
> 4. T1 then touches more /tmp/db pages (pages_2) using the same mmap.
> This charges cg11.  I assume that inotify() would not notify the
> daemon for this case because the file is still open. 
you're right.

> So the pages will not be moved to cg1.  Or are you suggesting
> that inotify be enhanced to advertise charge events?

IIUC, inotify currently doesn't report mmap accesses, but it does have
read/write notifications. So, let's think about mmapped pages.

For an easy implementation, I suggest that file_recharge map the whole file
and move all of its pages. But maybe this is an answer you want.

If I write an _easy_ daemon, it will do the following...

==
  register inotify and add watches.
  The watches will see IN_OPEN and IN_DELETE_SELF.

  run 2 threads.

Thread1:
  while (1) {
      read()  // check events from inotify.
      maintain opened-file information.
  }

Thread2:
  while (1) {
      check opened-file information.
      select a file  // you may implement some scheduling here.
      open()
      mmap()
      mincore()  // check which pages of the file are cached.
      madvise()
      // if you want, touch pages and set the Accessed bit on them.
      close()

      sleep if necessary.
  }
==
A batch-style cron job rather than sleep would not be too bad for usual use.
But we may need some interface to implement a cleverer algorithm.
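
The event-reading half of such a daemon (Thread1) could be sketched as below. It assumes Linux with glibc and calls inotify through ctypes; the constants are the values from <sys/inotify.h>, and the function names are hypothetical. The recharge half is left out because the proposed madvise() advice does not exist.

```python
import ctypes
import os
import struct

IN_OPEN = 0x00000020
IN_DELETE_SELF = 0x00000400
_EVENT_HEADER = struct.Struct("iIII")  # wd, mask, cookie, name length

_libc = ctypes.CDLL(None, use_errno=True)


def watch_directory(path):
    """Return a non-blocking inotify fd watching `path` for opens.
    On Linux, IN_NONBLOCK has the same value as O_NONBLOCK."""
    fd = _libc.inotify_init1(os.O_NONBLOCK)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init1 failed")
    wd = _libc.inotify_add_watch(fd, os.fsencode(path),
                                 IN_OPEN | IN_DELETE_SELF)
    if wd < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    return fd


def read_opened_names(fd):
    """Read one batch of pending events and return the names of files
    that were opened inside the watched directory."""
    try:
        data = os.read(fd, 4096)
    except BlockingIOError:
        return set()  # nothing pending
    names = set()
    offset = 0
    while offset < len(data):
        _wd, mask, _cookie, length = _EVENT_HEADER.unpack_from(data, offset)
        offset += _EVENT_HEADER.size
        name = data[offset:offset + length].rstrip(b"\0")
        offset += length
        if mask & IN_OPEN and name:
            names.add(os.fsdecode(name))
    return names
```

A watch covers a single directory only, so a daemon built on this would need one watch per subdirectory of the tree it manages.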


> If the number of directories within /tmp/db is large, then inotify()
> maybe expensive.  I don't think this is a problem.
> 
> Another worry I have is that if for some reason the daemon is started
> after the job, or if the daemon crashes and is restarted, then files
> may have been opened and charged to cg11 without the inotify being
> setup. 
yes.

> The daemon would have problems finding the pages that were
> charged to cg11 and need to be moved to cg1.  The daemon could scan
> the open file table of T1, but any files that are no longer opened may
> be charged to cg11 with no way for the daemon to find them.
> 

Thread 1 above can maintain an "opened-file" database.
Or you can run a recovery script that opens /proc/<xxxx>/fd of processes
to trigger OPEN events.

But yes, some in-kernel approach may be required, such as a new interface to
memcg rather than madvise.

/memory.move_file_caches
- when you open this and write()/ioctl() a file descriptor to this file,
  all in-memory pages of that file will be moved to this cgroup.

Hmm...we may be able to add an interface to know the last-pagecache-update time.
(Because access-time updates tend to be disabled at mount....)

Thanks,
-Kame


* Re: deterministic cgroup charging using file path
  2010-06-28  5:07   ` Balbir Singh
@ 2010-06-29  6:42     ` Greg Thelen
  0 siblings, 0 replies; 8+ messages in thread
From: Greg Thelen @ 2010-06-29  6:42 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-mm, nishimura@mxp.nes.nec.co.jp

On Sun, Jun 27, 2010 at 10:07 PM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-06-28 11:03:27]:
>
>> On Fri, 25 Jun 2010 13:43:45 -0700
>> Greg Thelen <gthelen@google.com> wrote:
>>
>> > For the upcoming Linux VM summit, I am interested in discussing the
>> > following proposal.
>> >
>> > Problem: When tasks from multiple cgroups share files the charging can be
>> > non-deterministic.  This requires that all such cgroups have unnecessarily high
>> > limits.  It would be nice if the charging was deterministic, using the file's
>> > path to determine which cgroup to charge.  This would benefit charging of
>> > commonly used files (eg: libc) as well as large databases shared by only a few
>> > tasks.
>> >
>> > Example: assume two tasks (T1 and T2), each in a separate cgroup.  Each task
>> > wants to access a large (1GB) database file.  To catch memory leaks a tight
>> > memory limit on each task's cgroup is set.  However, the large database file
>> > presents a problem.  If the file has not been cached, then the first task to
>> > access the file is charged, thereby requiring that task's cgroup to have a limit
>> > large enough to include the database file.  If the order of access is unknown
>> > (due to process restart, etc), then all cgroups accessing the file need to have
>> > a limit large enough to include the database.  This is wasteful because the
>> > database won't be charged to both T1 and T2.  It would be useful to introduce
>> > determinism by declaring that a particular cgroup is charged for a particular
>> > set of files.
>> >
>> > /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
>> > /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
>> > /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
>> >
>> > I have implemented a prototype that allows a file system hierarchy to be charged
>> > to a particular cgroup using a new bind mount option:
>> > + mount -t cgroup none /cgroup -o memory
>> > + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
>> >
>> > Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
>> > other files behave normally - they charge the cgroup of the current task.
>> >
>>
>> Interesting, but I would rather use madvise() or similar for this kind of job
>> than add deep hooks into the kernel.
>>
>> madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);
>>
>> Then you can write a command such as:
>>
>>   file_recharge [path name] [cgroup]
>>   - this command moves a file's cache to the specified cgroup.
>>
>> A daemon program which uses this command + inotify will give us much more
>> flexible control of file cache in memcg. Do you have a requirement
>> that this move-charge not be done in a lazy manner?
>>
>> Status:
>> We have code for move-charge and inotify, but no code for the new madvise.
>
> I have not seen the approach yet, but ideally one would want to avoid
> changing the application, otherwise we are going to get very tightly
> bound up in API issues.

I agree that changing the application is undesirable.  I think the
madvise suggestion (above) would not involve changing applications -
it would only be used by a manager daemon, in response to an inotify
event, as a mechanism to change the charge of previously allocated file pages.

> I want to understand why we need bind mounts.

I'm not certain that bind mounts are needed.  I chose to use bind
mounts as a way to create a file system namespace that is charged to a
particular cgroup.  There are other mechanisms.  Another approach
would be to add a dentry attribute (d_cgroup) that is
inherited by child dentries.  I tend to prefer the bind mount over the
dentry approach because it reduces the number of cgroup references.
However, there may be even better ways.

> I think this needs more discussion.

I agree that more discussion is required.

--
Greg


* Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
  2010-06-29  6:30     ` KAMEZAWA Hiroyuki
@ 2010-07-01  4:16       ` Greg Thelen
  2010-07-01  6:33         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 8+ messages in thread
From: Greg Thelen @ 2010-07-01  4:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: lsf10-pc, linux-mm, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, Ying Han

On Mon, Jun 28, 2010 at 11:30 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 28 Jun 2010 22:31:03 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> On Sun, Jun 27, 2010 at 7:03 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Fri, 25 Jun 2010 13:43:45 -0700
>> > Greg Thelen <gthelen@google.com> wrote:
>
>> >> /dev/cgroup/cg1/cg11  # T1: want memory.limit = 30MB
>> >> /dev/cgroup/cg1/cg12  # T2: want memory.limit = 100MB
>> >> /dev/cgroup/cg1       # want memory.limit = 1GB + 30MB + 100MB
>> >>
>> >> I have implemented a prototype that allows a file system hierarchy to be charged
>> >> to a particular cgroup using a new bind mount option:
>> >> + mount -t cgroup none /cgroup -o memory
>> >> + mount --bind /tmp/db /tmp/db -o cgroup=/dev/cgroup/cg1
>> >>
>> >> Any accesses to files within /tmp/db are charged to /dev/cgroup/cg1.  Accesses to
>> >> other files behave normally - they charge the cgroup of the current task.
>> >>
>> >
>> > Interesting, but I would rather use madvise() or similar for this kind of job
>> > than add deep hooks into the kernel.
>> >
>> > madvise(addr, size, MEMORY_RECHARGE_THIS_PAGES_TO_ME);
>> >
>> > Then you can write a command such as:
>> >
>> >  file_recharge [path name] [cgroup]
>> >  - this command moves a file's cache to the specified cgroup.
>> >
>> > A daemon program which uses this command + inotify will give us much more
>> > flexible control of file cache in memcg. Do you have a requirement
>> > that this move-charge not be done in a lazy manner?
>> >
>> > Status:
>> > We have code for move-charge and inotify, but no code for the new madvise.
>> >
>> >
>> > Thanks,
>> > -Kame
>>
>> This is an interesting approach.  I like the idea of minimizing kernel
>> changes.  I want to make sure I understand the idea using terms from
>> my above example.
>>
>> 1. The daemon establishes inotify() watches on /tmp/db and all sub
>> directories to catch any accesses.
>>
>> 2. If cg11(T1) is the first process to mmap a portion of a /tmp/db
>> file (pages_1) then cg11 will be charged.  T1 will not use madvise()
>> because cg11 does not want to be charged.  cg11 will be temporarily
>> charged for pages_1.
>>
> yes.
>
>> 3. inotify() will inform the proposed daemon that T1 opened /tmp/db,
>> so the daemon will use file_recharge, which runs the following within
>> the cg1 cgroup:
>> - fd = open("/tmp/db/.../path_to_file")
>> - va = mmap(NULL, st_size = fstat(fd).st_size, fd)
>> - madvise(va, st_size, MEMORY_RECHARGE_THIS_PAGES_TO_ME).  This
>> will move the charge of pages_1 from cg11 to cg1.
>>
>> Did I state this correctly?
>>
> yes.
>
>
>> I am concerned that the follow-on step does not move the pages to cg1:
>> 4. T1 then touches more /tmp/db pages (pages_2) using the same mmap.
>> This charges cg11.  I assume that inotify() would not notify the
>> daemon for this case because the file is still open.
> you're right.
>
>> So the pages will not be moved to cg1.  Or are you suggesting
>> that inotify be enhanced to advertise charge events?
>
> IIUC, inotify currently doesn't report mmap accesses, but it does have
> read/write notifications. So, let's think about mmapped pages.
>
> For an easy implementation, I suggest that file_recharge map the whole file
> and move all of its pages. But maybe this is an answer you want.
>
> If I write an _easy_ daemon, it will do the following...
>
> ==
>  register inotify and add watches.
>  The watches will see IN_OPEN and IN_DELETE_SELF.
>
>  run 2 threads.
>
> Thread1:
>  while (1) {
>      read()  // check events from inotify.
>      maintain opened-file information.
>  }
>
> Thread2:
>  while (1) {
>      check opened-file information.
>      select a file  // you may implement some scheduling here.
>      open()
>      mmap()
>      mincore()  // check which pages of the file are cached.
>      madvise()
>      // if you want, touch pages and set the Accessed bit on them.
>      close()
>
>      sleep if necessary.
>  }
> ==
> A batch-style cron job rather than sleep would not be too bad for usual use.
> But we may need some interface to implement a cleverer algorithm.

I have to collect some data about expected usages of this feature.  I
will have more information tomorrow.  Depending on how quickly the
charges need to be corrected and on the number of open files, this
daemon may end up doing a lot of polling to correct memory charges.

>> If the number of directories within /tmp/db is large, then inotify()
>> maybe expensive.  I don't think this is a problem.
>>
>> Another worry I have is that if for some reason the daemon is started
>> after the job, or if the daemon crashes and is restarted, then files
>> may have been opened and charged to cg11 without the inotify being
>> setup.
> yes.
>
>> The daemon would have problems finding the pages that were
>> charged to cg11 and need to be moved to cg1.  The daemon could scan
>> the open file table of T1, but any files that are no longer opened may
>> be charged to cg11 with no way for the daemon to find them.
>>
>
> Thread 1 above can maintain an "opened-file" database.
> Or you can run a recovery script that opens /proc/<xxxx>/fd of processes
> to trigger OPEN events.

If a file has been unlinked, then triggering the OPEN event would require
scanning /proc/xxx/fd to find an open file handle to open.  This is probably a
corner case, but I wanted to mention it.

> But yes, some in-kernel approach may be required, such as a new interface to
> memcg rather than madvise.
>
> /memory.move_file_caches
> - when you open this and write()/ioctl() a file descriptor to this file,
>  all in-memory pages of that file will be moved to this cgroup.

Are you suggesting that this move_file_caches interface would
associate the given file, dentry, or inode with the cgroup so that
future charges are charged to the intended cgroup?  Or (I suspect)
that the daemon would need to periodically use this routine to
correct any incorrect charges?

> Hmm...we may be able to add an interface to know the last-pagecache-update time.
> (Because access-time updates tend to be disabled at mount....)

Are you thinking that we could introduce a cgroup-wide attribute
(maybe a timestamp, or increasing sequence number, or even just a bit)
that would be set whenever a cgroup statistic (page cache usage in
this case) was updated?  This bit would be cleared whenever all needed
migrations occurred.  The daemon could poll this bit to know if any
migrations were needed.

Another aspect that I think would have to be added to the daemon
is oom handling.  If cg11 is charged for non-reclaimable files
(tmpfs) that belong to cg1, then the task may oom.  The daemon would
have to listen for oom and then immediately migrate the charge from
cg11 to cg1 to lower memory pressure in cg11.

--
Greg


* Re: [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path
  2010-07-01  4:16       ` Greg Thelen
@ 2010-07-01  6:33         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 8+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-07-01  6:33 UTC (permalink / raw)
  To: Greg Thelen
  Cc: lsf10-pc, linux-mm, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, Ying Han

On Wed, 30 Jun 2010 21:16:43 -0700
Greg Thelen <gthelen@google.com> wrote:

> > ==
> >   register inotify and add watches.
> >   The watches will see OPEN and IN_DELETE_SELF.
> >
> >   run 2 threads.
> >
> > Thread1:
> >   while (1) {
> >       read() // check events from inotify.
> >       maintain opened-file information.
> >   }
> >
> > Thread2:
> >   while (1) {
> >       check opened-file information.
> >       select a file // you may implement some scheduling here.
> >       open(),
> >       mmap(),
> >       mincore() .... checks whether the file is cached.
> >       madvise()
> >       // if you want, touch pages and add Access bit to them.
> >       close(),
> >
> >       sleep if necessary.
> >   }
> > ==
> > A batch-style cron job rather than sleep will not be very bad for usual use.
> > But we may need some interface to implement a clever algorithm.
> 
> I have to collect some data about expected usages of this feature.  I
> will have more information tomorrow.  Depending on how quickly the
> charges need to be corrected or the number of opened files, this
> daemon may end up doing a lot of polling to correct memory charges.
> 
Maybe. But many applications work with a lot of jobs without special
kernel support.
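
A minimal sketch of the mincore() step in Thread2 above, assuming the
daemon can open the file read-only.  The helper name count_cached_pages()
is hypothetical, and the madvise()/recharge step is left out because the
thread does not pin down how it would work.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for mincore() under strict -std= modes */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: report how many pages of `path` are resident in
 * the page cache.  Stores the file's page count in *total and returns
 * the resident count, or -1 on error.
 * Note: newer kernels may report every page as resident for files the
 * caller neither owns nor can write.
 */
long count_cached_pages(const char *path, long *total)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *vec;
	long resident = -1;
	struct stat st;
	size_t npages, i;
	void *map;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {		/* mmap() rejects a zero length */
		close(fd);
		*total = 0;
		return 0;
	}
	npages = ((size_t)st.st_size + page - 1) / page;

	/* Mapping the file does not fault its pages in. */
	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	vec = malloc(npages);
	if (vec && mincore(map, st.st_size, vec) == 0) {
		resident = 0;
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;	/* bit 0: page is resident */
	}
	free(vec);
	munmap(map, st.st_size);
	*total = (long)npages;
	return resident;
}
```

Thread2 could skip files for which this returns 0, since they have no
cached pages to recharge.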



> >> If the number of directories within /tmp/db is large, then inotify()
> >> may be expensive.  I don't think this is a problem.
> >>
> >> Another worry I have is that if for some reason the daemon is started
> >> after the job, or if the daemon crashes and is restarted, then files
> >> may have been opened and charged to cg11 without the inotify watches
> >> being set up.
> > yes.
> >
> >> The daemon would have problems finding the pages that were
> >> charged to cg11 and need to be moved to cg1.  The daemon could scan
> >> the open file table of T1, but any files that are no longer open may
> >> be charged to cg11 with no way for the daemon to find them.
> >>
> >
> > Thread1 above can maintain an "opened-file" database.
> > Or you can run a recovery script that opens the /proc/<xxxx>/fd entries
> > of the processes to trigger OPEN events.
> 
> If a file has been unlinked, then the recovery script would need to scan
> /proc/xxx/fd to find an open file handle to open.  This is probably a
> corner case, but I wanted to mention it.
> 
sure.

> > But yes, some in-kernel approach may be required, such as a new interface
> > to memcg rather than madvise.
> >
> > /memory.move_file_caches
> > - when you open this and write()/ioctl() a file descriptor to it,
> >   all in-memory pages of that file will be moved to this cgroup.
> 
> Are you suggesting that this move_file_caches interface would
> associate the given file, dentry, or inode with the cgroup so that
> future charges are charged to the intended cgroup?  Or (I suspect)
> that the daemon would need to periodically use this routine to
> correct any incorrect charges?
> 
My idea is that it would be used for recharging, in place of mincore()+madvise().


> > Hmm...we may be able to add an interface to report the last-pagecache-update time.
> > (Because access-time updates tend to be turned off at mount time....)
> 
> Are you thinking that we could introduce a cgroup-wide attribute
> (maybe a timestamp, or increasing sequence number, or even just a bit)
> that would be set whenever a cgroup statistic (page cache usage in
> this case) was updated?  This bit would be cleared whenever all needed
> migrations occurred.  The daemon could poll this bit to know if any
> migrations were needed.

Now, the memory cgroup has a "threshold" cgroup notifier.
I think it's useful in this case.
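
A minimal sketch of arming that threshold notifier from the daemon,
assuming the cgroup v1 memory controller with its cgroup.event_control
file.  The helper name memcg_watch_threshold() and the 25MB value in the
usage note are illustrative assumptions.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper (cgroup v1 memory controller assumed).
 * Arms a usage threshold on the memcg at `cgdir` and returns an eventfd
 * that becomes readable when memory.usage_in_bytes crosses `threshold`
 * bytes.  Returns -1 on error.
 */
int memcg_watch_threshold(const char *cgdir, unsigned long long threshold)
{
	char path[512], ctl[64];
	int efd, ufd, cfd, len, ret = -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cgdir);
	ufd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgdir);
	cfd = open(path, O_WRONLY);

	if (ufd >= 0 && cfd >= 0) {
		/* "<event_fd> <fd of memory.usage_in_bytes> <threshold>" */
		len = snprintf(ctl, sizeof(ctl), "%d %d %llu",
			       efd, ufd, threshold);
		if (write(cfd, ctl, len) == len)
			ret = efd;
	}

	if (cfd >= 0)
		close(cfd);
	if (ufd >= 0)
		close(ufd);
	if (ret < 0)
		close(efd);
	return ret;
}
```

For example, the daemon could call
memcg_watch_threshold("/dev/cgroup/cg1/cg11", 25ULL << 20), block in
read() on the returned descriptor (8 bytes), and run its recharge pass
each time the read returns, instead of polling.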

> 
> Another aspect that I think would have to be added to the daemon
> is oom handling.  If cg11 is charged for non-reclaimable files
> (tmpfs) that belong to cg1, then the task may oom.  The daemon would
> have to listen for oom and then immediately migrate the charge from
> cg11 to cg1 to lower memory pressure in cg11.
> 

Now, the memory cgroup has an interface for disable-oom-kill plus an oom-notifier.
I think it's useful.
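
A minimal sketch of combining the two, assuming the cgroup v1
memory.oom_control file.  The helper name memcg_watch_oom() is
hypothetical, and how the daemon then migrates the charge is not shown
because no such interface exists in this thread yet.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper (cgroup v1 memory controller assumed).
 * Disables the OOM killer for the memcg at `cgdir` and returns an
 * eventfd that becomes readable when the cgroup hits OOM.
 * Returns -1 on error.
 */
int memcg_watch_oom(const char *cgdir)
{
	char path[512], ctl[32];
	int efd, ofd, cfd, len, ret = -1;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", cgdir);
	ofd = open(path, O_RDWR);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cgdir);
	cfd = open(path, O_WRONLY);

	if (ofd >= 0 && cfd >= 0) {
		/* "<event_fd> <fd of memory.oom_control>" */
		len = snprintf(ctl, sizeof(ctl), "%d %d", efd, ofd);
		/*
		 * Writing "1" disables the OOM killer: tasks that hit the
		 * limit wait until memory is freed instead of being killed.
		 */
		if (write(ofd, "1", 1) == 1 && write(cfd, ctl, len) == len)
			ret = efd;
	}

	if (cfd >= 0)
		close(cfd);
	if (ofd >= 0)
		close(ofd);
	if (ret < 0)
		close(efd);
	return ret;
}
```

With the OOM killer disabled, the tasks in cg11 stay blocked until the
daemon wakes up on the eventfd and relieves the pressure, so the daemon
has to react promptly.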

Thanks,
-Kame


Thread overview: 8+ messages
2010-06-25 20:43 [ATTEND][LSF/VM TOPIC] deterministic cgroup charging using file path Greg Thelen
2010-06-28  2:03 ` KAMEZAWA Hiroyuki
2010-06-28  5:07   ` Balbir Singh
2010-06-29  6:42     ` Greg Thelen
2010-06-29  5:31   ` [ATTEND][LSF/VM TOPIC] " Greg Thelen
2010-06-29  6:30     ` KAMEZAWA Hiroyuki
2010-07-01  4:16       ` Greg Thelen
2010-07-01  6:33         ` KAMEZAWA Hiroyuki
