* Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: Minchan Kim @ 2012-06-12  7:16 UTC
To: John Stultz
Cc: KOSAKI Motohiro, Dave Hansen, Dmitry Adamushko, LKML, Andrew Morton,
    Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
    Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
    linux-mm@kvack.org

Please Cc linux-mm.

On 06/09/2012 12:45 PM, John Stultz wrote:

> On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
>> (6/7/12 11:03 PM), John Stultz wrote:
>>
>>> So I'm falling back to using a shrinker for now, but I think Dmitry's
>>> point is an interesting one, and I am interested in finding a better
>>> place to trigger purging volatile ranges from the mm code. If anyone
>>> has any suggestions, let me know; otherwise I'll go back to trying
>>> to better grok the mm code.
>>
>> I hate VM features abusing shrink_slab(), because it was not designed
>> as a generic callback; it was designed for shrinking filesystem
>> metadata. The VM therefore keeps a balance between page scanning and
>> slab scanning, so widespread misuse of shrink_slab() may break that
>> balancing logic, i.e. drop too much icache/dcache and hurt
>> performance.
>>
>> As long as the code impact is small, I'd prefer to connect with the
>> VM reclaim code directly.
>
> I can see your concern about misusing the shrinker code. Your other
> email's point about the problem of having LRU range-purging behavior
> on a NUMA system makes some sense too. Unfortunately I'm not yet
> familiar enough with the reclaim core to sort out how best to track
> and connect the volatile range purging into the vm's reclaim core.
>
> So for now, I've moved the code back to using the shrinker (along with
> fixing a few bugs along the way). Thus, currently we manage the ranges
> as so:
>
>   [per-fs volatile range lru head] -> [volatile range] ->
>   [volatile range] -> [volatile range]
>
> with the per-fs shrinker zapping the volatile ranges from the lru.
>
> I *think* that, ideally, the pages in a volatile range should be
> similar to non-dirty file-backed pages. There is a cost to restore
> them, but freeing them is very cheap. The trick is that volatile
> ranges introduce a new relationship between pages: since the
> neighboring virtual pages in a volatile range are in effect tied
> together, purging one effectively ruins the value of keeping the
> others, regardless of which zone each is in physically.
>
> So maybe the right approach is to give up the per-fs volatile range
> lru and try a variant of what DaveC and DaveH have suggested: let the
> page-based lru reclamation handle the selection on a physical-page
> basis, but then zap the entirety of the neighboring range if any one
> page is reclaimed. In order to try to preserve the range-based LRU
> behavior, activate all the pages in the range together when the range
> is marked

You mean deactivation for fast reclaiming, not activation when memory
pressure happens?

> volatile. Since we assume ranges are untouched while volatile, that
> should preserve LRU purging behavior on single-node systems, and on
> multi-node systems it will approximate it fairly closely.
>
> My main concern with this approach is that marking and unmarking
> volatile ranges needs to be fast, so I'm worried about the additional
> overhead of activating each of the contained pages on mark_volatile.

Yes, it could be a problem if the range is very large and already
populated. Why can't we make new hooks? Just a concept to show my
intention:

+int shrink_volatile_pages(struct zone *zone)
+{
+        int ret = 0;
+        if (zone_page_state(zone, NR_ZONE_VOLATILE))
+                ret = shmem_purge_one_volatile_range(zone);
+        return ret;
+}
+
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
         struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
                 .priority = sc->priority,
         };
         struct mem_cgroup *memcg;
+        int ret;
+
+        /*
+         * Before we dive into the trouble maker, let's look at easily
+         * reclaimable pages and avoid costly reclaim if possible.
+         */
+        do {
+                ret = shrink_volatile_pages(zone);
+                if (ret && zone_watermark_ok(zone, sc->order, xxx))
+                        return;
+        } while (ret);

Off-topic: I want to drive low-memory notification by level-triggering
instead of the raw vmstat trigger (it's a rather long thread:
https://lkml.org/lkml/2012/5/1/97):

  level 1: out of easily reclaimable pages
           (NR_VOLATILE + NR_UNMAPPED_CLEAN_PAGE)
  level 2 (more severe VM pressure than level 1):
           level 1 plus reclaimable dirty pages

Running out of easily reclaimable pages might be a good indication for
a low-memory notification.

>
> The other question I have with this approach is that if we're on a
> system that doesn't have swap, it *seems* (I'm not totally sure I
> understand it yet) that the tmpfs file pages will be skipped over when
> we call shrink_lruvec. So it seems we may need to add a new lru_list
> enum and nr[] entry (maybe LRU_VOLATILE?). Then it may be that when we
> mark a range as volatile, instead of just activating it, we move it to
> the volatile lru, and when we shrink from that list, we call back to
> the filesystem to trigger the entire range purging.

The new-LRU idea might make fallocate(VOLATILE) very slow, so I hope we
can avoid it if possible.

Off-topic: but I'm not sure, because I might try to make a new
easily-reclaimable LRU list for low-memory notification. That LRU list
would contain non-mapped clean cache pages and volatile pages, if I
decide to add it. Both kinds of page share the characteristic that
recreating them is cheap. That's true for eMMC/SSD-like devices, at
least.

>
> Does that sound reasonable? Any other suggested approaches? I'll think
> some more about it this weekend and try to get a patch scratched out
> early next week.
>
> thanks
> -john

--
Kind regards,
Minchan Kim
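[For illustration: a minimal sketch of what the shmem-side helper named
in the hunk above might lookk like in practice. Only the
shmem_purge_one_volatile_range() name comes from this thread; the
volatile_range structure, the per-zone volatile_lru/volatile_lock
fields, and the locking are assumptions for the sketch, not code from
the patchset.]

        struct volatile_range {
                struct list_head      lru;        /* links into the zone's volatile LRU */
                struct address_space  *mapping;   /* backing shmem file */
                pgoff_t               start, end; /* page offsets covered */
        };

        /*
         * Purge the oldest volatile range on this zone's LRU.
         * Returns the number of pages freed, or 0 if the LRU was empty.
         */
        int shmem_purge_one_volatile_range(struct zone *zone)
        {
                struct volatile_range *range;

                spin_lock(&zone->volatile_lock);          /* assumed field */
                if (list_empty(&zone->volatile_lru)) {    /* assumed field */
                        spin_unlock(&zone->volatile_lock);
                        return 0;
                }
                range = list_first_entry(&zone->volatile_lru,
                                         struct volatile_range, lru);
                list_del_init(&range->lru);
                spin_unlock(&zone->volatile_lock);

                /* Drop the backing pages; userspace must recreate them. */
                shmem_truncate_range(range->mapping->host,
                                     (loff_t)range->start << PAGE_SHIFT,
                                     ((loff_t)(range->end + 1) << PAGE_SHIFT) - 1);
                return range->end - range->start + 1;
        }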
* Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: KOSAKI Motohiro @ 2012-06-12 16:03 UTC
To: Minchan Kim
Cc: John Stultz, Dave Hansen, Dmitry Adamushko, LKML, Andrew Morton,
    Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
    Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
    linux-mm@kvack.org

> Off-topic: but I'm not sure, because I might try to make a new
> easily-reclaimable LRU list for low-memory notification. That LRU list
> would contain non-mapped clean cache pages and volatile pages, if I
> decide to add it. Both kinds of page share the characteristic that
> recreating them is cheap. That's true for eMMC/SSD-like devices, at
> least.

+1. I like the L2 inactive list.
* Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: John Stultz @ 2012-06-12 19:35 UTC
To: Minchan Kim
Cc: KOSAKI Motohiro, Dave Hansen, Dmitry Adamushko, LKML, Andrew Morton,
    Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
    Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
    linux-mm@kvack.org

On 06/12/2012 12:16 AM, Minchan Kim wrote:
> Please Cc linux-mm.
>
> On 06/09/2012 12:45 PM, John Stultz wrote:
>
>> On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
>>> (6/7/12 11:03 PM), John Stultz wrote:
>>>
>>>> So I'm falling back to using a shrinker for now, but I think
>>>> Dmitry's point is an interesting one, and I am interested in
>>>> finding a better place to trigger purging volatile ranges from the
>>>> mm code. If anyone has any suggestions, let me know; otherwise I'll
>>>> go back to trying to better grok the mm code.
>>>
>>> I hate VM features abusing shrink_slab(), because it was not
>>> designed as a generic callback; it was designed for shrinking
>>> filesystem metadata. The VM therefore keeps a balance between page
>>> scanning and slab scanning, so widespread misuse of shrink_slab()
>>> may break that balancing logic, i.e. drop too much icache/dcache
>>> and hurt performance.
>>>
>>> As long as the code impact is small, I'd prefer to connect with the
>>> VM reclaim code directly.
>> I can see your concern about misusing the shrinker code. Your other
>> email's point about the problem of having LRU range-purging behavior
>> on a NUMA system makes some sense too. Unfortunately I'm not yet
>> familiar enough with the reclaim core to sort out how best to track
>> and connect the volatile range purging into the vm's reclaim core.
>>
>> So for now, I've moved the code back to using the shrinker (along
>> with fixing a few bugs along the way). Thus, currently we manage the
>> ranges as so:
>>
>>   [per-fs volatile range lru head] -> [volatile range] ->
>>   [volatile range] -> [volatile range]
>>
>> with the per-fs shrinker zapping the volatile ranges from the lru.
>>
>> I *think* that, ideally, the pages in a volatile range should be
>> similar to non-dirty file-backed pages. There is a cost to restore
>> them, but freeing them is very cheap. The trick is that volatile
>> ranges introduce a new relationship between pages: since the
>> neighboring virtual pages in a volatile range are in effect tied
>> together, purging one effectively ruins the value of keeping the
>> others, regardless of which zone each is in physically.
>>
>> So maybe the right approach is to give up the per-fs volatile range
>> lru and try a variant of what DaveC and DaveH have suggested: let the
>> page-based lru reclamation handle the selection on a physical-page
>> basis, but then zap the entirety of the neighboring range if any one
>> page is reclaimed. In order to try to preserve the range-based LRU
>> behavior, activate all the pages in the range together when the range
>> is marked
>
> You mean deactivation for fast reclaiming, not activation when memory
> pressure happens?

Yes. Sorry for mixing up terms here. The point is moving all the pages
together to the inactive list, to preserve relative LRU behavior for
purging ranges.
>> volatile. Since we assume ranges are untouched while volatile, that
>> should preserve LRU purging behavior on single-node systems, and on
>> multi-node systems it will approximate it fairly closely.
>>
>> My main concern with this approach is that marking and unmarking
>> volatile ranges needs to be fast, so I'm worried about the additional
>> overhead of activating each of the contained pages on mark_volatile.
>
> Yes, it could be a problem if the range is very large and already
> populated. Why can't we make new hooks? Just a concept to show my
> intention:
>
> +int shrink_volatile_pages(struct zone *zone)
> +{
> +        int ret = 0;
> +        if (zone_page_state(zone, NR_ZONE_VOLATILE))
> +                ret = shmem_purge_one_volatile_range(zone);
> +        return ret;
> +}
> +
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
>          struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>                  .priority = sc->priority,
>          };
>          struct mem_cgroup *memcg;
> +        int ret;
> +
> +        /*
> +         * Before we dive into the trouble maker, let's look at easily
> +         * reclaimable pages and avoid costly reclaim if possible.
> +         */
> +        do {
> +                ret = shrink_volatile_pages(zone);
> +                if (ret && zone_watermark_ok(zone, sc->order, xxx))
> +                        return;
> +        } while (ret);

Hmm, I'm confused. This doesn't seem that different from the shrinker
approach. How does this resolve the NUMA-unawareness issue that
KOSAKI-san brought up?

>> The other question I have with this approach is that if we're on a
>> system that doesn't have swap, it *seems* (I'm not totally sure I
>> understand it yet) that the tmpfs file pages will be skipped over
>> when we call shrink_lruvec. So it seems we may need to add a new
>> lru_list enum and nr[] entry (maybe LRU_VOLATILE?). Then it may be
>> that when we mark a range as volatile, instead of just activating it,
>> we move it to the volatile lru, and when we shrink from that list, we
>> call back to the filesystem to trigger the entire range purging.
>
> The new-LRU idea might make fallocate(VOLATILE) very slow, so I hope
> we can avoid it if possible.

Indeed, this is a major concern. I'm currently prototyping it out so I
have a concrete sense of the performance cost.

thanks
-john
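[For illustration: the lru_list enum John refers to lives in
include/linux/mmzone.h; this sketch shows the kind of change the
LRU_VOLATILE idea implies. The placement of the new entry, and the name
itself, are assumptions taken from the discussion, not from a posted
patch.]

        enum lru_list {
                LRU_INACTIVE_ANON = LRU_BASE,
                LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
                LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
                LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
                LRU_VOLATILE,   /* proposed: purge before any other reclaim */
                LRU_UNEVICTABLE,
                NR_LRU_LISTS
        };

Every place that iterates the LRUs (for_each_lru(), the nr[] accounting
in get_scan_count(), the isolation paths) would then need to skip or
special-case LRU_VOLATILE. That is part of why the fallocate(VOLATILE)
cost matters: pages would be physically moved onto this list at mark
time.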
* Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: Minchan Kim @ 2012-06-13  0:10 UTC
To: John Stultz
Cc: KOSAKI Motohiro, Dave Hansen, Dmitry Adamushko, LKML, Andrew Morton,
    Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
    Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
    linux-mm@kvack.org

On 06/13/2012 04:35 AM, John Stultz wrote:
> On 06/12/2012 12:16 AM, Minchan Kim wrote:
>> Please Cc linux-mm.
>>
>> On 06/09/2012 12:45 PM, John Stultz wrote:
>>
>>> On 06/07/2012 09:50 PM, KOSAKI Motohiro wrote:
>>>> (6/7/12 11:03 PM), John Stultz wrote:
>>>>
>>>>> So I'm falling back to using a shrinker for now, but I think
>>>>> Dmitry's point is an interesting one, and I am interested in
>>>>> finding a better place to trigger purging volatile ranges from
>>>>> the mm code. If anyone has any suggestions, let me know;
>>>>> otherwise I'll go back to trying to better grok the mm code.
>>>>
>>>> I hate VM features abusing shrink_slab(), because it was not
>>>> designed as a generic callback; it was designed for shrinking
>>>> filesystem metadata. The VM therefore keeps a balance between page
>>>> scanning and slab scanning, so widespread misuse of shrink_slab()
>>>> may break that balancing logic, i.e. drop too much icache/dcache
>>>> and hurt performance.
>>>>
>>>> As long as the code impact is small, I'd prefer to connect with
>>>> the VM reclaim code directly.
>>> I can see your concern about misusing the shrinker code. Your other
>>> email's point about the problem of having LRU range-purging behavior
>>> on a NUMA system makes some sense too. Unfortunately I'm not yet
>>> familiar enough with the reclaim core to sort out how best to track
>>> and connect the volatile range purging into the vm's reclaim core.
>>>
>>> So for now, I've moved the code back to using the shrinker (along
>>> with fixing a few bugs along the way). Thus, currently we manage the
>>> ranges as so:
>>>
>>>   [per-fs volatile range lru head] -> [volatile range] ->
>>>   [volatile range] -> [volatile range]
>>>
>>> with the per-fs shrinker zapping the volatile ranges from the lru.
>>>
>>> I *think* that, ideally, the pages in a volatile range should be
>>> similar to non-dirty file-backed pages. There is a cost to restore
>>> them, but freeing them is very cheap. The trick is that volatile
>>> ranges introduce a new relationship between pages: since the
>>> neighboring virtual pages in a volatile range are in effect tied
>>> together, purging one effectively ruins the value of keeping the
>>> others, regardless of which zone each is in physically.
>>>
>>> So maybe the right approach is to give up the per-fs volatile range
>>> lru and try a variant of what DaveC and DaveH have suggested: let
>>> the page-based lru reclamation handle the selection on a
>>> physical-page basis, but then zap the entirety of the neighboring
>>> range if any one page is reclaimed. In order to try to preserve the
>>> range-based LRU behavior, activate all the pages in the range
>>> together when the range is marked
>>
>> You mean deactivation for fast reclaiming, not activation when memory
>> pressure happens?
> Yes. Sorry for mixing up terms here. The point is moving all the pages
> together to the inactive list, to preserve relative LRU behavior for
> purging ranges.

No problem :)
>>> volatile. Since we assume ranges are untouched while volatile, that
>>> should preserve LRU purging behavior on single-node systems, and on
>>> multi-node systems it will approximate it fairly closely.
>>>
>>> My main concern with this approach is that marking and unmarking
>>> volatile ranges needs to be fast, so I'm worried about the
>>> additional overhead of activating each of the contained pages on
>>> mark_volatile.
>>
>> Yes, it could be a problem if the range is very large and already
>> populated. Why can't we make new hooks? Just a concept to show my
>> intention:
>>
>> +int shrink_volatile_pages(struct zone *zone)
>> +{
>> +        int ret = 0;
>> +        if (zone_page_state(zone, NR_ZONE_VOLATILE))
>> +                ret = shmem_purge_one_volatile_range(zone);
>> +        return ret;
>> +}
>> +
>>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>  {
>>          struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>                  .priority = sc->priority,
>>          };
>>          struct mem_cgroup *memcg;
>> +        int ret;
>> +
>> +        /*
>> +         * Before we dive into the trouble maker, let's look at
>> +         * easily reclaimable pages and avoid costly reclaim if
>> +         * possible.
>> +         */
>> +        do {
>> +                ret = shrink_volatile_pages(zone);
>> +                if (ret && zone_watermark_ok(zone, sc->order, xxx))
>> +                        return;
>> +        } while (ret);
>
> Hmm, I'm confused. This doesn't seem that different from the shrinker
> approach.

The shrinker is called after shrink_list, which means normal pages can
be reclaimed before we reclaim volatile pages. We shouldn't do that.

> How does this resolve the NUMA-unawareness issue that KOSAKI-san
> brought up?

Basically, I think your shrink function should be smarter.

When fallocate() is called, we can get the mem_policy from
shmem_inode_info and pass it to the volatile_range, so that the
volatile_range can keep the NUMA information.

When shmem_purge_one_volatile_range() is called, it receives zone
information, so it should find a range that matches the NUMA policy and
the passed zone.

Assumption: a range includes pages from the same node/zone where
possible.

I'm not familiar with the NUMA handling code, so KOSAKI/Rik can point
out if I'm wrong.

>>> The other question I have with this approach is that if we're on a
>>> system that doesn't have swap, it *seems* (I'm not totally sure I
>>> understand it yet) that the tmpfs file pages will be skipped over
>>> when we call shrink_lruvec. So it seems we may need to add a new
>>> lru_list enum and nr[] entry (maybe LRU_VOLATILE?). Then it may be
>>> that when we mark a range as volatile, instead of just activating
>>> it, we move it to the volatile lru, and when we shrink from that
>>> list, we call back to the filesystem to trigger the entire range
>>> purging.
>> The new-LRU idea might make fallocate(VOLATILE) very slow, so I hope
>> we can avoid it if possible.
>
> Indeed, this is a major concern. I'm currently prototyping it out so I
> have a concrete sense of the performance cost.

If the performance loss isn't big, that would be an approach!

> thanks
> -john

--
Kind regards,
Minchan Kim
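[For illustration: a minimal sketch of the NUMA-aware range selection
Minchan describes above. The nid field, the global volatile_lru list,
and the helper name are assumptions made for the sketch; the node
information would presumably be recorded from the mem_policy at
fallocate() time.]

        /*
         * Prefer the oldest range whose backing pages live on the node
         * we are reclaiming for; fall back to the globally oldest one.
         */
        static struct volatile_range *pick_volatile_range(struct zone *zone)
        {
                struct volatile_range *range;

                list_for_each_entry(range, &volatile_lru, lru) {
                        if (range->nid == zone_to_nid(zone))
                                return range;
                }
                if (!list_empty(&volatile_lru))
                        return list_first_entry(&volatile_lru,
                                                struct volatile_range, lru);
                return NULL;
        }

Note that the fallback scan is exactly the non-constant lru scanning
John worries about in his reply below, which is why per-node (or
per-zone) volatile LRUs come up as the fix.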
* Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: John Stultz @ 2012-06-13  1:21 UTC
To: Minchan Kim
Cc: KOSAKI Motohiro, Dave Hansen, Dmitry Adamushko, LKML, Andrew Morton,
    Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
    Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
    linux-mm@kvack.org

On 06/12/2012 05:10 PM, Minchan Kim wrote:
> On 06/13/2012 04:35 AM, John Stultz wrote:
>
>> On 06/12/2012 12:16 AM, Minchan Kim wrote:
>>> Please Cc linux-mm.
>>>
>>> On 06/09/2012 12:45 PM, John Stultz wrote:
>>>
>>>> volatile. Since we assume ranges are untouched while volatile, that
>>>> should preserve LRU purging behavior on single-node systems, and on
>>>> multi-node systems it will approximate it fairly closely.
>>>>
>>>> My main concern with this approach is that marking and unmarking
>>>> volatile ranges needs to be fast, so I'm worried about the
>>>> additional overhead of activating each of the contained pages on
>>>> mark_volatile.
>>> Yes, it could be a problem if the range is very large and already
>>> populated. Why can't we make new hooks? Just a concept to show my
>>> intention:
>>>
>>> +int shrink_volatile_pages(struct zone *zone)
>>> +{
>>> +        int ret = 0;
>>> +        if (zone_page_state(zone, NR_ZONE_VOLATILE))
>>> +                ret = shmem_purge_one_volatile_range(zone);
>>> +        return ret;
>>> +}
>>> +
>>>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>>  {
>>>          struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>>                  .priority = sc->priority,
>>>          };
>>>          struct mem_cgroup *memcg;
>>> +        int ret;
>>> +
>>> +        /*
>>> +         * Before we dive into the trouble maker, let's look at
>>> +         * easily reclaimable pages and avoid costly reclaim if
>>> +         * possible.
>>> +         */
>>> +        do {
>>> +                ret = shrink_volatile_pages(zone);
>>> +                if (ret && zone_watermark_ok(zone, sc->order, xxx))
>>> +                        return;
>>> +        } while (ret);
>> Hmm, I'm confused. This doesn't seem that different from the shrinker
>> approach.
>
> The shrinker is called after shrink_list, which means normal pages can
> be reclaimed before we reclaim volatile pages. We shouldn't do that.

Ah, OK. Maybe that's a reasonable compromise between the shrinker
approach and the more complex approach I just posted to lkml?
(Forgive me for forgetting to CC you and linux-mm on my latest post!)

>> How does this resolve the NUMA-unawareness issue that KOSAKI-san
>> brought up?
>
> Basically, I think your shrink function should be smarter.
>
> When fallocate() is called, we can get the mem_policy from
> shmem_inode_info and pass it to the volatile_range, so that the
> volatile_range can keep the NUMA information.

Hrm, that sounds reasonable. I'll look into the mem_policy bits and try
to learn more.

> When shmem_purge_one_volatile_range() is called, it receives zone
> information, so it should find a range that matches the NUMA policy
> and the passed zone.
>
> Assumption: a range includes pages from the same node/zone where
> possible.
>
> I'm not familiar with the NUMA handling code, so KOSAKI/Rik can point
> out if I'm wrong.

Right, the range may cross nodes/zones, but maybe that's not a huge
deal? The only bit I'd worry about is the lru scanning being
non-constant as we search for a range that matches the node we want to
free from.
I guess we could have per-node/zone lrus.

>>>> The other question I have with this approach is that if we're on a
>>>> system that doesn't have swap, it *seems* (I'm not totally sure I
>>>> understand it yet) that the tmpfs file pages will be skipped over
>>>> when we call shrink_lruvec. So it seems we may need to add a new
>>>> lru_list enum and nr[] entry (maybe LRU_VOLATILE?). Then it may be
>>>> that when we mark a range as volatile, instead of just activating
>>>> it, we move it to the volatile lru, and when we shrink from that
>>>> list, we call back to the filesystem to trigger the entire range
>>>> purging.
>>> The new-LRU idea might make fallocate(VOLATILE) very slow, so I hope
>>> we can avoid it if possible.
>> Indeed, this is a major concern. I'm currently prototyping it out so
>> I have a concrete sense of the performance cost.
> If the performance loss isn't big, that would be an approach!

I've not had a chance to measure it yet, as I wanted to get my very
rough patches out for discussion first. But if folks don't NACK it
outright, I'll provide some data there. The hard part is that range
creation would have a cost linear in the number of pages in the range,
which at some point will be a pain.

Thanks again for your input!
-john
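[For context: a sketch of how userspace would consume the interface
this thread is about. The flag values below are hypothetical
placeholders; the real ones are defined by the RFC patchset itself, and
the exact return convention for a purged range is an assumption here.]

        #define _GNU_SOURCE
        #include <fcntl.h>

        /* Hypothetical values; the RFC patch defines the real ones. */
        #ifndef FALLOC_FL_MARK_VOLATILE
        #define FALLOC_FL_MARK_VOLATILE   0x08
        #define FALLOC_FL_UNMARK_VOLATILE 0x10
        #endif

        /* Done with a cache region on a tmpfs file: let the kernel
         * purge it under memory pressure instead of swapping it. */
        static int cache_release(int fd, off_t off, off_t len)
        {
                return fallocate(fd, FALLOC_FL_MARK_VOLATILE, off, len);
        }

        /* Reuse the region: a non-zero return would indicate the
         * contents were purged and must be regenerated. */
        static int cache_reuse(int fd, off_t off, off_t len)
        {
                return fallocate(fd, FALLOC_FL_UNMARK_VOLATILE, off, len);
        }

The mark/unmark calls sit on the hot paths of userspace cache managers
(ashmem-style users are the motivating case), which is why any per-page
work at mark/unmark time is a concern throughout this thread.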
* Re: [PATCH 3/3] [RFC] tmpfs: Add FALLOC_FL_MARK_VOLATILE/UNMARK_VOLATILE handlers
From: Minchan Kim @ 2012-06-13  4:42 UTC
To: John Stultz
Cc: KOSAKI Motohiro, Dave Hansen, Dmitry Adamushko, LKML, Andrew Morton,
    Android Kernel Team, Robert Love, Mel Gorman, Hugh Dickins,
    Rik van Riel, Dave Chinner, Neil Brown, Andrea Righi,
    Aneesh Kumar K.V, Taras Glek, Mike Hommey, Jan Kara,
    linux-mm@kvack.org

On 06/13/2012 10:21 AM, John Stultz wrote:
> On 06/12/2012 05:10 PM, Minchan Kim wrote:
>> On 06/13/2012 04:35 AM, John Stultz wrote:
>>
>>> On 06/12/2012 12:16 AM, Minchan Kim wrote:
>>>> Please Cc linux-mm.
>>>>
>>>> On 06/09/2012 12:45 PM, John Stultz wrote:
>>>>
>>>>> volatile. Since we assume ranges are untouched while volatile,
>>>>> that should preserve LRU purging behavior on single-node systems,
>>>>> and on multi-node systems it will approximate it fairly closely.
>>>>>
>>>>> My main concern with this approach is that marking and unmarking
>>>>> volatile ranges needs to be fast, so I'm worried about the
>>>>> additional overhead of activating each of the contained pages on
>>>>> mark_volatile.
>>>> Yes, it could be a problem if the range is very large and already
>>>> populated. Why can't we make new hooks? Just a concept to show my
>>>> intention:
>>>>
>>>> +int shrink_volatile_pages(struct zone *zone)
>>>> +{
>>>> +        int ret = 0;
>>>> +        if (zone_page_state(zone, NR_ZONE_VOLATILE))
>>>> +                ret = shmem_purge_one_volatile_range(zone);
>>>> +        return ret;
>>>> +}
>>>> +
>>>>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>>>  {
>>>>          struct mem_cgroup *root = sc->target_mem_cgroup;
>>>> @@ -1827,6 +1835,18 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>>>>                  .priority = sc->priority,
>>>>          };
>>>>          struct mem_cgroup *memcg;
>>>> +        int ret;
>>>> +
>>>> +        /*
>>>> +         * Before we dive into the trouble maker, let's look at
>>>> +         * easily reclaimable pages and avoid costly reclaim if
>>>> +         * possible.
>>>> +         */
>>>> +        do {
>>>> +                ret = shrink_volatile_pages(zone);
>>>> +                if (ret && zone_watermark_ok(zone, sc->order, xxx))
>>>> +                        return;
>>>> +        } while (ret);
>>> Hmm, I'm confused. This doesn't seem that different from the
>>> shrinker approach.
>>
>> The shrinker is called after shrink_list, which means normal pages
>> can be reclaimed before we reclaim volatile pages. We shouldn't do
>> that.
>
> Ah, OK. Maybe that's a reasonable compromise between the shrinker
> approach and the more complex approach I just posted to lkml?
> (Forgive me for forgetting to CC you and linux-mm on my latest post!)

NP.

>>> How does this resolve the NUMA-unawareness issue that KOSAKI-san
>>> brought up?
>> Basically, I think your shrink function should be smarter.
>>
>> When fallocate() is called, we can get the mem_policy from
>> shmem_inode_info and pass it to the volatile_range, so that the
>> volatile_range can keep the NUMA information.
> Hrm, that sounds reasonable. I'll look into the mem_policy bits and
> try to learn more.
>
>> When shmem_purge_one_volatile_range() is called, it receives zone
>> information, so it should find a range that matches the NUMA policy
>> and the passed zone.
>>
>> Assumption: a range includes pages from the same node/zone where
>> possible.
>>
>> I'm not familiar with the NUMA handling code, so KOSAKI/Rik can point
>> out if I'm wrong.
> Right, the range may cross nodes/zones, but maybe that's not a huge
> deal?
> The only bit I'd worry about is the lru scanning being non-constant as
> we search for a range that matches the node we want to free from. I
> guess we could have per-node/zone lrus.

Good.

>>>>> The other question I have with this approach is that if we're on
>>>>> a system that doesn't have swap, it *seems* (I'm not totally sure
>>>>> I understand it yet) that the tmpfs file pages will be skipped
>>>>> over when we call shrink_lruvec. So it seems we may need to add a
>>>>> new lru_list enum and nr[] entry (maybe LRU_VOLATILE?). Then it
>>>>> may be that when we mark a range as volatile, instead of just
>>>>> activating it, we move it to the volatile lru, and when we shrink
>>>>> from that list, we call back to the filesystem to trigger the
>>>>> entire range purging.
>>>> The new-LRU idea might make fallocate(VOLATILE) very slow, so I
>>>> hope we can avoid it if possible.
>>> Indeed, this is a major concern. I'm currently prototyping it out so
>>> I have a concrete sense of the performance cost.
>> If the performance loss isn't big, that would be an approach!
> I've not had a chance to measure it yet, as I wanted to get my very
> rough patches out for discussion first. But if folks don't NACK it
> outright, I'll provide some data there. The hard part is that range
> creation would have a cost linear in the number of pages in the range,
> which at some point will be a pain.

That's right, so IMHO my suggestion could be a solution.

I looked through your new patchset [5/6]. I understand your intention,
but the code still has problems; I haven't commented on them yet.
Before a detailed review, I would like to hear opinions from others,
and I'm curious whether you will decide to change the approach or not.
It could save our precious time. :)

> Thanks again for your input!
> -john

Thanks for your effort!

--
Kind regards,
Minchan Kim