* (no subject)
@ 2010-06-16 16:33 Jan Kara
2010-06-16 16:33 ` [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged Jan Kara
` (4 more replies)
0 siblings, 5 replies; 67+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin
Hello,
here is the fourth version of the writeback livelock avoidance patches
for data integrity writes. To quickly summarize the idea: we tag dirty
pages at the beginning of write_cache_pages with a new TOWRITE tag and
then write only the tagged pages, so that parallel writers cannot livelock us.
See changelogs of the patches for more details.
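In rough pseudo-C, the data integrity path of write_cache_pages() then looks
like this (just a sketch of the flow implemented by patch 2, not the exact code):

        if (wbc->sync_mode == WB_SYNC_ALL) {
                /* Phase 1: quickly tag every page that is dirty right now. */
                tag_pages_for_writeback(mapping, index, end);
                tag = PAGECACHE_TAG_TOWRITE;
        } else {
                tag = PAGECACHE_TAG_DIRTY;
        }
        /*
         * Phase 2: lock and write only pages carrying 'tag'. Pages dirtied
         * after phase 1 are not tagged, so a parallel writer cannot keep us
         * in this loop forever.
         */
        while (pagevec_lookup_tag(&pvec, mapping, &index, tag, PAGEVEC_SIZE)) {
                /* ... lock each page in pvec and call ->writepage() ... */
        }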
I have tested the patches with fsx and a test program I wrote which
checks that if we crash after fsync, the data is indeed on disk.
If there are no more concerns, can these patches get merged?
Honza
Changes since last version:
- the tagging function was changed to stop after a given number of pages to
avoid keeping tree_lock and irqs disabled for too long
- changed names and updated comments as Andrew suggested
- measured memory impact and reported it in the changelog
Things suggested but not changed (I want to avoid going in circles ;):
- use tagging also for WB_SYNC_NONE writeback - there's a problem with the
interaction with wbc->nr_to_write. If we tag all dirty pages, we can
spend too much time tagging when we write only a few pages in the end
because of nr_to_write. If we tag only, say, nr_to_write pages, we may
not have enough pages tagged because some pages are written out by
someone else and so we would have to restart, and tagging would become
essentially useless. So my preferred option is: switch to tagging for
WB_SYNC_NONE writeback if we can get rid of nr_to_write. But that's a story
for a different patch set.
- implement a function for clearing several tags (TOWRITE, DIRTY) at once
- IMHO not worth it because we would only save the conversion of the page
index to radix tree offsets. The rest would have to be separate anyway. And
the interface would be inconsistent as well...
- use __lookup_tag to implement radix_tree_range_tag_if_tagged - doesn't
quite work because __lookup_tag returns only leaf nodes, so we'd have to
implement the tree traversal anyway to also tag internal nodes.
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
2010-06-16 16:33 Jan Kara
@ 2010-06-16 16:33 ` Jan Kara
2010-06-18 22:18 ` Andrew Morton
2010-06-16 16:33 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
` (3 subsequent siblings)
4 siblings, 1 reply; 67+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin, Jan Kara
Implement function for setting one tag if another tag is set
for each item in given range.
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/radix-tree.h | 4 ++
lib/radix-tree.c | 92 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+), 0 deletions(-)
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 55ca73c..a4b00e9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -192,6 +192,10 @@ unsigned int
radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
unsigned long first_index, unsigned int max_items,
unsigned int tag);
+unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
+ unsigned long *first_indexp, unsigned long last_index,
+ unsigned long nr_to_tag,
+ unsigned int fromtag, unsigned int totag);
int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
static inline void radix_tree_preload_end(void)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 05da38b..549ce9c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -609,6 +609,98 @@ int radix_tree_tag_get(struct radix_tree_root *root,
EXPORT_SYMBOL(radix_tree_tag_get);
/**
+ * radix_tree_range_tag_if_tagged - for each item in given range set given
+ * tag if item has another tag set
+ * @root: radix tree root
+ * @first_indexp: pointer to a starting index of a range to scan
+ * @last_index: last index of a range to scan
+ * @nr_to_tag: maximum number of items to tag
+ * @iftag: tag index to test
+ * @settag: tag index to set if tested tag is set
+ *
+ * This function scans the range of the radix tree from first_index to
+ * last_index (inclusive). For each item in the range, if iftag is set,
+ * the function also sets settag. The function stops either after tagging
+ * nr_to_tag items or after reaching last_index.
+ *
+ * The function returns the number of leaves where the tag was set and sets
+ * *first_indexp to the first unscanned index.
+ */
+unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root,
+               unsigned long *first_indexp, unsigned long last_index,
+               unsigned long nr_to_tag,
+               unsigned int iftag, unsigned int settag)
+{
+       unsigned int height = root->height, shift;
+       unsigned long tagged = 0, index = *first_indexp;
+       struct radix_tree_node *open_slots[height], *slot;
+
+       last_index = min(last_index, radix_tree_maxindex(height));
+       if (index > last_index)
+               return 0;
+       if (!root_tag_get(root, iftag)) {
+               *first_indexp = last_index + 1;
+               return 0;
+       }
+       if (height == 0) {
+               *first_indexp = last_index + 1;
+               root_tag_set(root, settag);
+               return 1;
+       }
+
+       shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+       slot = radix_tree_indirect_to_ptr(root->rnode);
+
+       for (;;) {
+               int offset;
+
+               offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+               if (!slot->slots[offset])
+                       goto next;
+               if (!tag_get(slot, iftag, offset))
+                       goto next;
+               tag_set(slot, settag, offset);
+               if (height == 1) {
+                       tagged++;
+                       goto next;
+               }
+               /* Go down one level */
+               height--;
+               shift -= RADIX_TREE_MAP_SHIFT;
+               open_slots[height] = slot;
+               slot = slot->slots[offset];
+               continue;
+next:
+               /* Go to next item at level determined by 'shift' */
+               index = ((index >> shift) + 1) << shift;
+               if (index > last_index)
+                       break;
+               if (tagged > nr_to_tag)
+                       break;
+               while (((index >> shift) & RADIX_TREE_MAP_MASK) == 0) {
+                       /*
+                        * We've fully scanned this node. Go up. Because
+                        * last_index is guaranteed to be in the tree, what
+                        * we do below cannot wander astray.
+                        */
+                       slot = open_slots[height];
+                       height++;
+                       shift += RADIX_TREE_MAP_SHIFT;
+               }
+       }
+       /*
+        * The iftag must have been set somewhere because otherwise
+        * we would have returned immediately at the beginning of the function.
+        */
+       root_tag_set(root, settag);
+       *first_indexp = index;
+
+       return tagged;
+}
+EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);
+
+
+/**
* radix_tree_next_hole - find the next hole (not-present entry)
* @root: tree root
* @index: index key
--
1.6.4.2
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
2010-06-16 16:33 ` [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged Jan Kara
@ 2010-06-18 22:18 ` Andrew Morton
2010-06-21 12:09 ` Nick Piggin
0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2010-06-18 22:18 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, npiggin
On Wed, 16 Jun 2010 18:33:50 +0200
Jan Kara <jack@suse.cz> wrote:
> Implement function for setting one tag if another tag is set
> for each item in given range.
>
These two patches look OK to me.
fwiw I have a userspace test harness for radix-tree.c:
http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz. Nick used it for a
while and updated it somewhat, but it's probably rather bitrotted and
surely needs to be taught how to test the post-2006 additions.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
2010-06-18 22:18 ` Andrew Morton
@ 2010-06-21 12:09 ` Nick Piggin
2010-06-21 22:43 ` Jan Kara
2010-06-23 13:42 ` Jan Kara
0 siblings, 2 replies; 67+ messages in thread
From: Nick Piggin @ 2010-06-21 12:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, linux-mm
On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:33:50 +0200
> Jan Kara <jack@suse.cz> wrote:
>
> > Implement function for setting one tag if another tag is set
> > for each item in given range.
> >
>
> These two patches look OK to me.
>
> fwiw I have a userspace test harness for radix-tree.c:
> http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz. Nick used it for a
> while and updated it somewhat, but it's probably rather bitrotted and
> surely needs to be taught how to test the post-2006 additions.
>
Main thing I did was add RCU support (pretty dumb RCU but it found
a couple of bugs), and add some more tests. I'll try to find it...
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
2010-06-21 12:09 ` Nick Piggin
@ 2010-06-21 22:43 ` Jan Kara
2010-06-23 13:42 ` Jan Kara
1 sibling, 0 replies; 67+ messages in thread
From: Jan Kara @ 2010-06-21 22:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Jan Kara, linux-fsdevel, linux-mm
On Mon 21-06-10 22:09:34, Nick Piggin wrote:
> On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> > On Wed, 16 Jun 2010 18:33:50 +0200
> > Jan Kara <jack@suse.cz> wrote:
> >
> > > Implement function for setting one tag if another tag is set
> > > for each item in given range.
> > >
> >
> > These two patches look OK to me.
> >
> > fwiw I have a userspace test harness for radix-tree.c:
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz. Nick used it for a
> > while and updated it somewhat, but it's probably rather bitrotted and
> > surely needs to be taught how to test the post-2006 additions.
> >
>
> Main thing I did was add RCU support (pretty dumb RCU but it found
> a couple of bugs), and add some more tests. I'll try to find it...
Please send them my way if you can find them. I'll gladly run those tests
(and extend them to check also my new function).
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged
2010-06-21 12:09 ` Nick Piggin
2010-06-21 22:43 ` Jan Kara
@ 2010-06-23 13:42 ` Jan Kara
1 sibling, 0 replies; 67+ messages in thread
From: Jan Kara @ 2010-06-23 13:42 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Jan Kara, linux-fsdevel, linux-mm
On Mon 21-06-10 22:09:34, Nick Piggin wrote:
> On Fri, Jun 18, 2010 at 03:18:24PM -0700, Andrew Morton wrote:
> > On Wed, 16 Jun 2010 18:33:50 +0200
> > Jan Kara <jack@suse.cz> wrote:
> >
> > > Implement function for setting one tag if another tag is set
> > > for each item in given range.
> > >
> >
> > These two patches look OK to me.
> >
> > fwiw I have a userspace test harness for radix-tree.c:
> > http://userweb.kernel.org/~akpm/stuff/rtth.tar.gz. Nick used it for a
> > while and updated it somewhat, but it's probably rather bitrotted and
> > surely needs to be taught how to test the post-2006 additions.
> >
>
> Main thing I did was add RCU support (pretty dumb RCU but it found
> a couple of bugs), and add some more tests. I'll try to find it...
Nick, any luck with finding updated tests?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
2010-06-16 16:33 Jan Kara
2010-06-16 16:33 ` [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged Jan Kara
@ 2010-06-16 16:33 ` Jan Kara
2010-06-18 22:21 ` Andrew Morton
2010-06-16 22:15 ` your mail Dave Chinner
` (2 subsequent siblings)
4 siblings, 1 reply; 67+ messages in thread
From: Jan Kara @ 2010-06-16 16:33 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin, Jan Kara
We try to avoid livelocks of writeback when someone steadily creates
dirty pages in a mapping we are writing out. For memory-cleaning
writeback, using nr_to_write works reasonably well but we cannot
really use it for data integrity writeback. This patch tries to
solve the problem.
The idea is simple: Tag all pages that should be written back
with a special tag (TOWRITE) in the radix tree. This can be done
rather quickly and thus livelocks should not happen in practice.
Then we start doing the hard work of locking pages and sending
them to disk only for those pages that have TOWRITE tag set.
Note: Adding new radix tree tag grows radix tree node from 288 to
296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
However, the number of slab/slub items per page remains the same
(13 and 7 respectively).
Signed-off-by: Jan Kara <jack@suse.cz>
---
include/linux/fs.h | 1 +
include/linux/radix-tree.h | 2 +-
mm/page-writeback.c | 69 +++++++++++++++++++++++++++++++++-----------
3 files changed, 54 insertions(+), 18 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..664674e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -685,6 +685,7 @@ struct block_device {
*/
#define PAGECACHE_TAG_DIRTY 0
#define PAGECACHE_TAG_WRITEBACK 1
+#define PAGECACHE_TAG_TOWRITE 2
int mapping_tagged(struct address_space *mapping, int tag);
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index a4b00e9..634b8e6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -55,7 +55,7 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
/*** radix-tree API starts here ***/
-#define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_TAGS 3
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
struct radix_tree_root {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bbd396a..1cb043e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -807,6 +807,40 @@ void __init page_writeback_init(void)
}
/**
+ * tag_pages_for_writeback - tag pages to be written by write_cache_pages
+ * @mapping: address space structure to write
+ * @start: starting page index
+ * @end: ending page index (inclusive)
+ *
+ * This function scans the page range from @start to @end (inclusive) and tags
+ * all pages that have DIRTY tag set with a special TOWRITE tag. The idea is
+ * that write_cache_pages (or whoever calls this function) will then use
+ * TOWRITE tag to identify pages eligible for writeback. This mechanism is
+ * used to avoid livelocking of writeback by a process steadily creating new
+ * dirty pages in the file (thus it is important for this function to be quick
+ * so that it can tag pages faster than a dirtying process can create them).
+ */
+/*
+ * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce tree_lock latency.
+ */
+#define WRITEBACK_TAG_BATCH 4096
+void tag_pages_for_writeback(struct address_space *mapping,
+                            pgoff_t start, pgoff_t end)
+{
+       unsigned long tagged;
+
+       do {
+               spin_lock_irq(&mapping->tree_lock);
+               tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
+                               &start, end, WRITEBACK_TAG_BATCH,
+                               PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+               spin_unlock_irq(&mapping->tree_lock);
+               cond_resched();
+       } while (tagged >= WRITEBACK_TAG_BATCH);
+}
+EXPORT_SYMBOL(tag_pages_for_writeback);
+
+/**
* write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
* @mapping: address space structure to write
* @wbc: subtract the number of written pages from *@wbc->nr_to_write
@@ -820,6 +854,13 @@ void __init page_writeback_init(void)
* the call was made get new I/O started against them. If wbc->sync_mode is
* WB_SYNC_ALL then we were called for data integrity and we must wait for
* existing IO to complete.
+ *
+ * To avoid livelocks (when other process dirties new pages), we first tag
+ * pages which should be written back with TOWRITE tag and only then start
+ * writing them. For data-integrity sync we have to be careful so that we do
+ * not miss some pages (e.g., because some other process has cleared TOWRITE
+ * tag we set). The rule we follow is that TOWRITE tag can be cleared only
+ * by the process clearing the DIRTY tag (and submitting the page for IO).
*/
int write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc, writepage_t writepage,
@@ -835,6 +876,7 @@ int write_cache_pages(struct address_space *mapping,
        pgoff_t done_index;
        int cycled;
        int range_whole = 0;
+       int tag;

        pagevec_init(&pvec, 0);
        if (wbc->range_cyclic) {
@@ -851,29 +893,19 @@ int write_cache_pages(struct address_space *mapping,
                if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
                        range_whole = 1;
                cycled = 1; /* ignore range_cyclic tests */
-
-               /*
-                * If this is a data integrity sync, cap the writeback to the
-                * current end of file. Any extension to the file that occurs
-                * after this is a new write and we don't need to write those
-                * pages out to fulfil our data integrity requirements. If we
-                * try to write them out, we can get stuck in this scan until
-                * the concurrent writer stops adding dirty pages and extending
-                * EOF.
-                */
-               if (wbc->sync_mode == WB_SYNC_ALL &&
-                   wbc->range_end == LLONG_MAX) {
-                       end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
-               }
        }
-
+       if (wbc->sync_mode == WB_SYNC_ALL)
+               tag = PAGECACHE_TAG_TOWRITE;
+       else
+               tag = PAGECACHE_TAG_DIRTY;
retry:
+       if (wbc->sync_mode == WB_SYNC_ALL)
+               tag_pages_for_writeback(mapping, index, end);
        done_index = index;
        while (!done && (index <= end)) {
                int i;

-               nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
-                             PAGECACHE_TAG_DIRTY,
+               nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
                              min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
                if (nr_pages == 0)
                        break;
@@ -1329,6 +1361,9 @@ int test_set_page_writeback(struct page *page)
                        radix_tree_tag_clear(&mapping->page_tree,
                                                page_index(page),
                                                PAGECACHE_TAG_DIRTY);
+                       radix_tree_tag_clear(&mapping->page_tree,
+                                               page_index(page),
+                                               PAGECACHE_TAG_TOWRITE);
                spin_unlock_irqrestore(&mapping->tree_lock, flags);
        } else {
                ret = TestSetPageWriteback(page);
--
1.6.4.2
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
2010-06-16 16:33 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
@ 2010-06-18 22:21 ` Andrew Morton
2010-06-21 12:42 ` Jan Kara
0 siblings, 1 reply; 67+ messages in thread
From: Andrew Morton @ 2010-06-18 22:21 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, npiggin
On Wed, 16 Jun 2010 18:33:51 +0200
Jan Kara <jack@suse.cz> wrote:
> We try to avoid livelocks of writeback when some steadily creates
> dirty pages in a mapping we are writing out. For memory-cleaning
> writeback, using nr_to_write works reasonably well but we cannot
> really use it for data integrity writeback. This patch tries to
> solve the problem.
>
> The idea is simple: Tag all pages that should be written back
> with a special tag (TOWRITE) in the radix tree. This can be done
> rather quickly and thus livelocks should not happen in practice.
> Then we start doing the hard work of locking pages and sending
> them to disk only for those pages that have TOWRITE tag set.
>
> Note: Adding new radix tree tag grows radix tree node from 288 to
> 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
> However, the number of slab/slub items per page remains the same
> (13 and 7 respectively).
>
>
> ...
>
> +void tag_pages_for_writeback(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + unsigned long tagged;
> +
> + do {
> + spin_lock_irq(&mapping->tree_lock);
> + tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
> + &start, end, WRITEBACK_TAG_BATCH,
> + PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> + spin_unlock_irq(&mapping->tree_lock);
> + cond_resched();
> + } while (tagged >= WRITEBACK_TAG_BATCH);
> +}
grumble. (tagged > WRITEBACK_TAG_BATCH) would be a bug, wouldn't it?
So the ">=" is hiding a bug.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging
2010-06-18 22:21 ` Andrew Morton
@ 2010-06-21 12:42 ` Jan Kara
0 siblings, 0 replies; 67+ messages in thread
From: Jan Kara @ 2010-06-21 12:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jan Kara, linux-fsdevel, linux-mm, npiggin
On Fri 18-06-10 15:21:28, Andrew Morton wrote:
> On Wed, 16 Jun 2010 18:33:51 +0200
> Jan Kara <jack@suse.cz> wrote:
>
> > We try to avoid livelocks of writeback when some steadily creates
> > dirty pages in a mapping we are writing out. For memory-cleaning
> > writeback, using nr_to_write works reasonably well but we cannot
> > really use it for data integrity writeback. This patch tries to
> > solve the problem.
> >
> > The idea is simple: Tag all pages that should be written back
> > with a special tag (TOWRITE) in the radix tree. This can be done
> > rather quickly and thus livelocks should not happen in practice.
> > Then we start doing the hard work of locking pages and sending
> > them to disk only for those pages that have TOWRITE tag set.
> >
> > Note: Adding new radix tree tag grows radix tree node from 288 to
> > 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
> > However, the number of slab/slub items per page remains the same
> > (13 and 7 respectively).
> >
> >
> > ...
> >
> > +void tag_pages_for_writeback(struct address_space *mapping,
> > + pgoff_t start, pgoff_t end)
> > +{
> > + unsigned long tagged;
> > +
> > + do {
> > + spin_lock_irq(&mapping->tree_lock);
> > + tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
> > + &start, end, WRITEBACK_TAG_BATCH,
> > + PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
> > + spin_unlock_irq(&mapping->tree_lock);
> > + cond_resched();
> > + } while (tagged >= WRITEBACK_TAG_BATCH);
> > +}
>
> grumble. (tagged > WRITEBACK_TAG_BATCH) would be a bug, wouldn't it?
> So the ">=" is hiding a bug.
Good point. I'll add WARN_ON_ONCE when tagged is > WRITEBACK_TAG_BATCH.
That will make the bug visible while still continuing the writeback
(because it's a situation in which we can still happily continue). Should
I send you a new version of the patch or will you just fold that one line
in?
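Something like the following, i.e. just one extra line in
tag_pages_for_writeback() (a sketch; the exact placement of the check is of
course open to discussion):

        do {
                spin_lock_irq(&mapping->tree_lock);
                tagged = radix_tree_range_tag_if_tagged(&mapping->page_tree,
                                &start, end, WRITEBACK_TAG_BATCH,
                                PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_TOWRITE);
+               /* Returning more than the batch we asked for would be a bug. */
+               WARN_ON_ONCE(tagged > WRITEBACK_TAG_BATCH);
                spin_unlock_irq(&mapping->tree_lock);
                cond_resched();
        } while (tagged >= WRITEBACK_TAG_BATCH);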
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2010-06-16 16:33 Jan Kara
2010-06-16 16:33 ` [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged Jan Kara
2010-06-16 16:33 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
@ 2010-06-16 22:15 ` Dave Chinner
2010-06-17 7:43 ` [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes Jan Kara
2010-06-17 9:11 ` Jan Kara
2010-06-22 2:59 ` your mail Wu Fengguang
4 siblings, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2010-06-16 22:15 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin
On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> Hello,
>
> here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
> I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
> If there are no more concerns, can these patches get merged?
Has it been run through xfstests? I'd suggest doing that at least
with XFS as there are several significant sync sanity tests for XFS
in the suite...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
2010-06-16 22:15 ` your mail Dave Chinner
@ 2010-06-17 7:43 ` Jan Kara
2010-06-18 6:11 ` Dave Chinner
0 siblings, 1 reply; 67+ messages in thread
From: Jan Kara @ 2010-06-17 7:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin
On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> > Hello,
> >
> > here is the fourth version of the writeback livelock avoidance patches
> > for data integrity writes. To quickly summarize the idea: we tag dirty
> > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > then write only tagged pages to avoid parallel writers to livelock us.
> > See changelogs of the patches for more details.
> > I have tested the patches with fsx and a test program I wrote which
> > checks that if we crash after fsync, the data is indeed on disk.
> > If there are no more concerns, can these patches get merged?
>
> Has it been run through xfstests? I'd suggest doing that at least
> with XFS as there are several significant sync sanity tests for XFS
> in the suite...
I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
a test with xfs now.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
2010-06-17 7:43 ` [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes Jan Kara
@ 2010-06-18 6:11 ` Dave Chinner
2010-06-18 7:01 ` Nick Piggin
0 siblings, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2010-06-18 6:11 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin
On Thu, Jun 17, 2010 at 09:43:50AM +0200, Jan Kara wrote:
> On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> > On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> > > Hello,
> > >
> > > here is the fourth version of the writeback livelock avoidance patches
> > > for data integrity writes. To quickly summarize the idea: we tag dirty
> > > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > > then write only tagged pages to avoid parallel writers to livelock us.
> > > See changelogs of the patches for more details.
> > > I have tested the patches with fsx and a test program I wrote which
> > > checks that if we crash after fsync, the data is indeed on disk.
> > > If there are no more concerns, can these patches get merged?
> >
> > Has it been run through xfstests? I'd suggest doing that at least
> > with XFS as there are several significant sync sanity tests for XFS
> > in the suite...
> I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
> a test with xfs now.
Cool. If there are no problems then I'm happy with this ;)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
2010-06-18 6:11 ` Dave Chinner
@ 2010-06-18 7:01 ` Nick Piggin
0 siblings, 0 replies; 67+ messages in thread
From: Nick Piggin @ 2010-06-18 7:01 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton
On Fri, Jun 18, 2010 at 04:11:11PM +1000, Dave Chinner wrote:
> On Thu, Jun 17, 2010 at 09:43:50AM +0200, Jan Kara wrote:
> > On Thu 17-06-10 08:15:41, Dave Chinner wrote:
> > > On Wed, Jun 16, 2010 at 06:33:49PM +0200, Jan Kara wrote:
> > > > Hello,
> > > >
> > > > here is the fourth version of the writeback livelock avoidance patches
> > > > for data integrity writes. To quickly summarize the idea: we tag dirty
> > > > pages at the beginning of write_cache_pages with a new TOWRITE tag and
> > > > then write only tagged pages to avoid parallel writers to livelock us.
> > > > See changelogs of the patches for more details.
> > > > I have tested the patches with fsx and a test program I wrote which
> > > > checks that if we crash after fsync, the data is indeed on disk.
> > > > If there are no more concerns, can these patches get merged?
> > >
> > > Has it been run through xfstests? I'd suggest doing that at least
> > > with XFS as there are several significant sync sanity tests for XFS
> > > in the suite...
> > I've run it through XFSQA with ext3 & ext4 before submitting. I'm running
> > a test with xfs now.
>
> Cool. if there are no problems then I'm happy with this ;)
Agreed.
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes
2010-06-16 16:33 Jan Kara
` (2 preceding siblings ...)
2010-06-16 22:15 ` your mail Dave Chinner
@ 2010-06-17 9:11 ` Jan Kara
2010-06-22 2:59 ` your mail Wu Fengguang
4 siblings, 0 replies; 67+ messages in thread
From: Jan Kara @ 2010-06-17 9:11 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-mm, Andrew Morton, npiggin
Sorry for replying to my own email, but I forgot to set a subject while doing
git send-email, so I'm at least setting it now.
Honza
On Wed 16-06-10 18:33:49, Jan Kara wrote:
> Hello,
>
> here is the fourth version of the writeback livelock avoidance patches
> for data integrity writes. To quickly summarize the idea: we tag dirty
> pages at the beginning of write_cache_pages with a new TOWRITE tag and
> then write only tagged pages to avoid parallel writers to livelock us.
> See changelogs of the patches for more details.
> I have tested the patches with fsx and a test program I wrote which
> checks that if we crash after fsync, the data is indeed on disk.
> If there are no more concerns, can these patches get merged?
>
> Honza
>
> Changes since last version:
> - tagging function was changed to stop after given amount of pages to
> avoid keeping tree_lock and irqs disabled for too long
> - changed names and updated comments as Andrew suggested
> - measured memory impact and reported it in the changelog
>
> Things suggested but not changed (I want to avoid going in circles ;):
> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> spend too much time tagging when we write only a few pages in the end
> because of nr_to_write. If we tag only say nr_to_write pages, we may
> not have enough pages tagged because some pages are written out by
> someone else and so we would have to restart and tagging would become
> essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> writeback if we can get rid of nr_to_write. But that's a story for
> a different patch set.
> - implement function for clearing several tags (TOWRITE, DIRTY) at once
> - IMHO not worth it because we would save only conversion of page index
> to radix tree offsets. The rest would have to be separate anyways. And
> the interface would be incosistent as well...
> - use __lookup_tag to implement radix_tree_range_tag_if_tagged - doesn't
> quite work because __lookup_tag returns only leaf nodes so we'd have to
> implement tree traversal anyways to tag also internal nodes.
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2010-06-16 16:33 Jan Kara
` (3 preceding siblings ...)
2010-06-17 9:11 ` Jan Kara
@ 2010-06-22 2:59 ` Wu Fengguang
2010-06-22 13:54 ` Jan Kara
4 siblings, 1 reply; 67+ messages in thread
From: Wu Fengguang @ 2010-06-22 2:59 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Andrew Morton, npiggin
> - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> spend too much time tagging when we write only a few pages in the end
> because of nr_to_write. If we tag only say nr_to_write pages, we may
> not have enough pages tagged because some pages are written out by
> someone else and so we would have to restart and tagging would become
This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
writeback triggered by sync(). write_cache_pages() already ignored
nr_to_write for WB_SYNC_ALL.
> essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> writeback if we can get rid of nr_to_write. But that's a story for
> a different patch set.
Besides introducing overheads, it will be a policy change in which the
system loses the ability to somehow "throttle" writeback of huge files.
So it may be safer to enlarge nr_to_write instead of canceling it totally.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2010-06-22 2:59 ` your mail Wu Fengguang
@ 2010-06-22 13:54 ` Jan Kara
2010-06-22 14:12 ` Wu Fengguang
0 siblings, 1 reply; 67+ messages in thread
From: Jan Kara @ 2010-06-22 13:54 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton, npiggin
On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> > interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> > spend too much time tagging when we write only a few pages in the end
> > because of nr_to_write. If we tag only say nr_to_write pages, we may
> > not have enough pages tagged because some pages are written out by
> > someone else and so we would have to restart and tagging would become
>
> This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> writeback triggered by sync(). write_cache_pages() already ignored
> nr_to_write for WB_SYNC_ALL.
We could do that but frankly, I'm not very fond of adding more special
cases to writeback code than strictly necessary...
> > essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> > writeback if we can get rid of nr_to_write. But that's a story for
> > a different patch set.
>
> Besides introducing overheads, it will be a policy change in which the
> system loses control to somehow "throttle" writeback of huge files.
Yes, but if we guarantee we cannot livelock on a single file, do we care?
Memory management does not care because it's getting rid of dirty pages,
which is what it wants. The user might care, but actually writing out files
in the order they were dirtied (i.e., the order the user wrote them) is quite
natural, so it's not a "surprising" behavior. And I don't think we can
assume that the data in those small files is more valuable than the data in
the large file and thus should be written earlier...
Regarding the overhead, you are right that tagging is more expensive than
checking the nr_to_write limit...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2010-06-22 13:54 ` Jan Kara
@ 2010-06-22 14:12 ` Wu Fengguang
0 siblings, 0 replies; 67+ messages in thread
From: Wu Fengguang @ 2010-06-22 14:12 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
npiggin@suse.de
On Tue, Jun 22, 2010 at 09:54:58PM +0800, Jan Kara wrote:
> On Tue 22-06-10 10:59:41, Wu Fengguang wrote:
> > > - use tagging also for WB_SYNC_NONE writeback - there's problem with an
> > > interaction with wbc->nr_to_write. If we tag all dirty pages, we can
> > > spend too much time tagging when we write only a few pages in the end
> > > because of nr_to_write. If we tag only say nr_to_write pages, we may
> > > not have enough pages tagged because some pages are written out by
> > > someone else and so we would have to restart and tagging would become
> >
> > This could be addressed by ignoring nr_to_write for the WB_SYNC_NONE
> > writeback triggered by sync(). write_cache_pages() already ignored
> > nr_to_write for WB_SYNC_ALL.
> We could do that but frankly, I'm not very fond of adding more special
> cases to writeback code than strictly necessary...
Neither am I. However, for this case we only need to broaden the special case test:
        if (nr_to_write > 0) {
                nr_to_write--;
                if (nr_to_write == 0 &&
-                   wbc->sync_mode == WB_SYNC_NONE) {
+                   !wbc->for_sync) {
> > > essentially useless. So my option is - switch to tagging for WB_SYNC_NONE
> > > writeback if we can get rid of nr_to_write. But that's a story for
> > > a different patch set.
> >
> > Besides introducing overheads, it will be a policy change in which the
> > system loses control to somehow "throttle" writeback of huge files.
> Yes, but if we guarantee we cannot livelock on a single file, do we care?
> Memory management does not care because it's getting rid of dirty pages
> which is what it wants. User might care but actually writing out files in
> the order they were dirtied (i.e., the order user written them) is quite
> natural so it's not a "surprising" behavior. And I don't think we can
> assume that data in those small files are more valuable than data in the
> large file and thus should be written earlier...
It could be a surprising behavior when there is a 4GB dirty file and
100 small dirty files. The user may expect the 100 small dirty files
to be synced to disk after 30s. However, without nr_to_write, they could
be delayed by the 4GB file for 40 more seconds. Now imagine something goes
wrong in between and the small dirty files happen to include some
.c/.tex/.txt/... files.
Small files are more likely to be your precious documents that are typed in
word by word and are perhaps the most important data you want to protect.
Naturally you'll want them to enjoy higher priority than large files.
> With the overhead you are right that tagging is more expensive than
> checking nr_to_write limit...
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v7 0/7] mseal system mappings
@ 2025-02-24 22:52 jeffxu
2025-02-25 15:18 ` Lorenzo Stoakes
0 siblings, 1 reply; 67+ messages in thread
From: jeffxu @ 2025-02-24 22:52 UTC (permalink / raw)
To: akpm, keescook, jannh, torvalds, vbabka, lorenzo.stoakes,
Liam.Howlett, adhemerval.zanella, oleg, avagin, benjamin
Cc: linux-kernel, linux-hardening, linux-mm, jorgelo, sroettger, hch,
ojeda, thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca,
willy, anna-maria, mark.rutland, linus.walleij, Jason, deller,
rdunlap, davem, peterx, f.fainelli, gerg, dave.hansen, mingo,
ardb, mhocko, 42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
aleksandr.mikhalitsyn, mike.rapoport, Jeff Xu
From: Jeff Xu <jeffxu@chromium.org>
This is the V7 version, addressing comments from V6, with no changes to
the code logic.
--------------------------------------------------
History:
V7:
- Remove cover letter from the first patch (Liam R. Howlett)
- Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
- logging and fclose() in selftest (Liam R. Howlett)
V6:
https://lore.kernel.org/all/20250224174513.3600914-1-jeffxu@google.com/
V5
https://lore.kernel.org/all/20250212032155.1276806-1-jeffxu@google.com/
V4:
https://lore.kernel.org/all/20241125202021.3684919-1-jeffxu@google.com/
V3:
https://lore.kernel.org/all/20241113191602.3541870-1-jeffxu@google.com/
V2:
https://lore.kernel.org/all/20241014215022.68530-1-jeffxu@google.com/
V1:
https://lore.kernel.org/all/20241004163155.3493183-1-jeffxu@google.com/
--------------------------------------------------
As discussed during mseal() upstream process [1], mseal() protects
the VMAs of a given virtual memory range against modifications, such
as the read/write (RW) and no-execute (NX) bits. For complete
descriptions of memory sealing, please see mseal.rst [2].
The mseal() is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped.
The system mappings are read-only; memory sealing can protect them from
ever being made writable, or from being unmapped or remapped with
different attributes.
System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
are created by the kernel during program initialization, and could
be sealed after creation.
Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [3]. It could be sealed from creation.
The vsyscall on x86-64 uses a special address (0xffffffffff600000),
which is outside the mm managed range. This means mprotect, munmap, and
mremap won't work on the vsyscall. Since sealing doesn't enhance
the vsyscall's security, it is skipped in this patch. If we ever seal
the vsyscall, it is probably only for decorative purpose, i.e. showing
the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.
It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML(User Mode Linux)
and gVisor, rr are also known to change the vdso/vvar mappings.
Consequently, this feature cannot be universally enabled across all
systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.
To support mseal of system mappings, architectures must define
CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
calls to pass mseal flag. Additionally, architectures must confirm they
do not unmap/remap system mappings during the process lifetime.
In this version, we've improved the handling of system mapping sealing from
previous versions, instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) It limits the code change's impact to the necessary
architectures, and 2) It aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require sealing.
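For illustration, the per-architecture change has roughly this shape (a sketch
only; the variable names and the other VM_* flags below are placeholders, while
VM_SEALED_SYSMAP is the helper macro this series introduces, which expands to
nothing when CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled):

        /* e.g. in an architecture's vdso setup code */
        vma = _install_special_mapping(mm, vdso_base, vdso_len,
                                       VM_READ | VM_EXEC |
                                       VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC |
                                       VM_SEALED_SYSMAP,
                                       &vdso_mapping);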
Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker. This approach revealed several
issues:
- The PT_LOAD header may report an incorrect length for vdso (smaller
than its actual size). The dynamic linker, which relies on PT_LOAD
information to determine mapping size, would then split and partially
seal the vdso mapping. Since each architecture has its own vdso/vvar
code, fixing this in the kernel would require going through each
architecture. Our initial goal was to enable sealing readonly mappings,
e.g. .text, across all architectures, sealing vdso from kernel since
creation appears to be simpler than sealing vdso at glibc.
- The [vvar] mapping header only contains address information, not length
information. Similar issues might exist for other special mappings.
- Mappings like uprobe are not covered by the dynamic linker,
and there is no effective solution for them.
This feature's security enhancements will benefit ChromeOS, Android,
and other high security systems.
Testing:
This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
- Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
- Passing various automation tests (e.g. pre-checkin) on ChromeOS and
Android to ensure the sealing doesn't affect the functionality of
Chromebook and Android phone.
I also tested the feature on Ubuntu on x86-64:
- With config disabled, vdso/vvar is not sealed,
- with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
normal operations such as browsing the web, open/edit doc are OK.
In addition, Benjamin Berg tested this on UML.
Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]
Jeff Xu (7):
mseal, system mappings: kernel config and header change
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal, system mappings: enable x86-64
mseal, system mappings: enable arm64
mseal, system mappings: enable uml architecture
mseal, system mappings: uprobe mapping
mseal, system mappings: update mseal.rst
Documentation/userspace-api/mseal.rst | 7 +++
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/vdso.c | 22 +++++++---
arch/um/Kconfig | 1 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/vma.c | 16 ++++---
arch/x86/um/vdso/vma.c | 6 ++-
include/linux/mm.h | 10 +++++
init/Kconfig | 18 ++++++++
kernel/events/uprobes.c | 5 ++-
security/Kconfig | 18 ++++++++
.../testing/selftests/x86/test_mremap_vdso.c | 43 +++++++++++++++++++
12 files changed, 132 insertions(+), 16 deletions(-)
--
2.48.1.658.g4767266eb4-goog
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 0/7] mseal system mappings
2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
@ 2025-02-25 15:18 ` Lorenzo Stoakes
2025-02-26 0:12 ` Jeff Xu
0 siblings, 1 reply; 67+ messages in thread
From: Lorenzo Stoakes @ 2025-02-25 15:18 UTC (permalink / raw)
To: jeffxu
Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca, willy,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
aleksandr.mikhalitsyn, mike.rapoport
Jeff - looking further in this series, I asked for a couple things for this
series which you've not provided:
1. Some assurance based on code that the kernel-side code doesn't rely on
VDSO/VVAR etc. mapping. I gave up waiting for this and went and checked
myself, it looks fine for arm64, x86-64. But it might have been nice had
you done it :) Apologies if you had and I just missed it.
2. Tests - could you please add some tests to assert that mremap() fails
for VDSO for instance? You've edited an existing test for VDSO in x86 to
skip the test should this be enabled, but this is not the same as a self
test. And obviously doesn't cover arm64.
This should be relatively straightforward, right? You already have code
for finding out whether VDSO is msealed, so just use that to see if you
skip, then attempt mremap(), mmap() over etc. + assert it fails.
Ideally these tests would cover all the cases you've changed.
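For instance, something along these lines (a rough sketch only - the smaps
parsing and the output format are my assumptions, and a real version should use
the kselftest harness and do the mremap() in a forked child, as
test_mremap_vdso already does):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Find the [vdso] mapping and report whether its VmFlags contain "sl". */
static int find_vdso(unsigned long *start, unsigned long *end, int *sealed)
{
        char line[1024];
        int found = 0;
        FILE *f = fopen("/proc/self/smaps", "r");

        if (!f)
                return -1;
        *sealed = 0;
        while (fgets(line, sizeof(line), f)) {
                if (strstr(line, "[vdso]")) {
                        if (sscanf(line, "%lx-%lx", start, end) == 2)
                                found = 1;
                } else if (found && !strncmp(line, "VmFlags:", 8)) {
                        *sealed = strstr(line, " sl") != NULL;
                        break;
                }
        }
        fclose(f);
        return found ? 0 : -1;
}

int main(void)
{
        unsigned long start, end;
        int sealed;

        if (find_vdso(&start, &end, &sealed)) {
                printf("SKIP: no [vdso] mapping found\n");
                return 0;
        }
        if (!sealed) {
                printf("SKIP: [vdso] not sealed (CONFIG_MSEAL_SYSTEM_MAPPINGS off?)\n");
                return 0;
        }
        /* Moving a sealed mapping must fail. */
        if (mremap((void *)start, end - start, end - start,
                   MREMAP_MAYMOVE) == MAP_FAILED)
                printf("PASS: mremap() of sealed [vdso] was rejected\n");
        else
                printf("FAIL: mremap() of sealed [vdso] succeeded\n");
        return 0;
}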
Please do try to ensure you address requests from maintainers to save on
iterations. While I get the desire to shoot out new versions (I've been
guilty of this in the past), it makes life so much easier and will reduce
the version count if you try to get everything done in one go.
Having said the above, we're really not far off this being a viable
series. You just need to address comments here (+ in v6...) + provide some
tests.
Thanks!
On Mon, Feb 24, 2025 at 10:52:39PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
>
> This is V7 version, addressing comments from V6, without code logic
> change.
>
> --------------------------------------------------
>
> History:
> V7:
> - Remove cover letter from the first patch (Liam R. Howlett)
> - Change macro name to VM_SEALED_SYSMAP (Liam R. Howlett)
> - logging and fclose() in selftest (Liam R. Howlett)
>
> V6:
> https://lore.kernel.org/all/20250224174513.3600914-1-jeffxu@google.com/
>
Nitty, but it's really useful to actually include the history for all of
these.
> V5
> https://lore.kernel.org/all/20250212032155.1276806-1-jeffxu@google.com/
>
> V4:
> https://lore.kernel.org/all/20241125202021.3684919-1-jeffxu@google.com/
>
> V3:
> https://lore.kernel.org/all/20241113191602.3541870-1-jeffxu@google.com/
>
> V2:
> https://lore.kernel.org/all/20241014215022.68530-1-jeffxu@google.com/
>
> V1:
> https://lore.kernel.org/all/20241004163155.3493183-1-jeffxu@google.com/
>
> --------------------------------------------------
> As discussed during mseal() upstream process [1], mseal() protects
> the VMAs of a given virtual memory range against modifications, such
> as the read/write (RW) and no-execute (NX) bits. For complete
> descriptions of memory sealing, please see mseal.rst [2].
>
> The mseal() is useful to mitigate memory corruption issues where a
> corrupted pointer is passed to a memory management system. For
> example, such an attacker primitive can break control-flow integrity
> guarantees since read-only memory that is supposed to be trusted can
> become writable or .text pages can get remapped.
>
> The system mappings are readonly only, memory sealing can protect
> them from ever changing to writable or unmmap/remapped as different
> attributes.
>
> System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
> are created by the kernel during program initialization, and could
> be sealed after creation.
>
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [3]. It could be sealed from creation.
>
> The vsyscall on x86-64 uses a special address (0xffffffffff600000),
> which is outside the mm managed range. This means mprotect, munmap, and
> mremap won't work on the vsyscall. Since sealing doesn't enhance
> the vsyscall's security, it is skipped in this patch. If we ever seal
> the vsyscall, it is probably only for decorative purpose, i.e. showing
> the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.
>
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the system mappings during restore operations. UML(User Mode Linux)
> and gVisor, rr are also known to change the vdso/vvar mappings.
> Consequently, this feature cannot be universally enabled across all
> systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.
>
> To support mseal of system mappings, architectures must define
> CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
> calls to pass mseal flag. Additionally, architectures must confirm they
> do not unmap/remap system mappings during the process lifetime.
>
> In this version, we've improved the handling of system mapping sealing from
> previous versions, instead of modifying the _install_special_mapping
> function itself, which would affect all architectures, we now call
> _install_special_mapping with a sealing flag only within the specific
> architecture that requires it. This targeted approach offers two key
> advantages: 1) It limits the code change's impact to the necessary
> architectures, and 2) It aligns with the software architecture by keeping
> the core memory management within the mm layer, while delegating the
> decision of sealing system mappings to the individual architecture, which
> is particularly relevant since 32-bit architectures never require sealing.
>
> Prior to this patch series, we explored sealing special mappings from
> userspace using glibc's dynamic linker. This approach revealed several
> issues:
> - The PT_LOAD header may report an incorrect length for vdso, (smaller
> than its actual size). The dynamic linker, which relies on PT_LOAD
> information to determine mapping size, would then split and partially
> seal the vdso mapping. Since each architecture has its own vdso/vvar
> code, fixing this in the kernel would require going through each
> archiecture. Our initial goal was to enable sealing readonly mappings,
> e.g. .text, across all architectures, sealing vdso from kernel since
> creation appears to be simpler than sealing vdso at glibc.
> - The [vvar] mapping header only contains address information, not length
> information. Similar issues might exist for other special mappings.
> - Mappings like uprobe are not covered by the dynamic linker,
> and there is no effective solution for them.
>
> This feature's security enhancements will benefit ChromeOS, Android,
> and other high security systems.
>
> Testing:
> This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
> - Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
> i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
> - Passing various automation tests (e.g. pre-checkin) on ChromeOS and
> Android to ensure the sealing doesn't affect the functionality of
> Chromebook and Android phone.
>
> I also tested the feature on Ubuntu on x86-64:
> - With config disabled, vdso/vvar is not sealed,
> - with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
> normal operations such as browsing the web, open/edit doc are OK.
>
> In addition, Benjamin Berg tested this on UML.
>
> Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
> Link: Documentation/userspace-api/mseal.rst [2]
> Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]
>
>
>
>
> Jeff Xu (7):
> mseal, system mappings: kernel config and header change
> selftests: x86: test_mremap_vdso: skip if vdso is msealed
> mseal, system mappings: enable x86-64
> mseal, system mappings: enable arm64
> mseal, system mappings: enable uml architecture
> mseal, system mappings: uprobe mapping
> mseal, system mappings: update mseal.rst
>
> Documentation/userspace-api/mseal.rst | 7 +++
> arch/arm64/Kconfig | 1 +
> arch/arm64/kernel/vdso.c | 22 +++++++---
> arch/um/Kconfig | 1 +
> arch/x86/Kconfig | 1 +
> arch/x86/entry/vdso/vma.c | 16 ++++---
> arch/x86/um/vdso/vma.c | 6 ++-
> include/linux/mm.h | 10 +++++
> init/Kconfig | 18 ++++++++
> kernel/events/uprobes.c | 5 ++-
> security/Kconfig | 18 ++++++++
> .../testing/selftests/x86/test_mremap_vdso.c | 43 +++++++++++++++++++
> 12 files changed, 132 insertions(+), 16 deletions(-)
>
> --
> 2.48.1.658.g4767266eb4-goog
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
2025-02-25 15:18 ` Lorenzo Stoakes
@ 2025-02-26 0:12 ` Jeff Xu
2025-02-26 5:42 ` your mail Lorenzo Stoakes
0 siblings, 1 reply; 67+ messages in thread
From: Jeff Xu @ 2025-02-26 0:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca, willy,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
aleksandr.mikhalitsyn, mike.rapoport
On Tue, Feb 25, 2025 at 7:18 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Jeff - looking further in this series, I asked for a couple things for this
> series which you've not provided:
>
> 1. Some assurance based on code that the kernel-side code doesn't rely on
> VDSO/VVAR etc. mapping. I gave up waiting for this and went and checked
> myself, it looks fine for arm64, x86-64. But it might have been nice had
> you done it :) Apologies if you had and I just missed it.
>
Thanks for checking this.
Does ppc kernel code unmap/remap the vdso?
I am aware that userspace can remap/unmap special mappings, but I
don't know if the kernel will remap/unmap a special mapping. Could you
please point out the code?
> 2. Tests - could you please add some tests to assert that mremap() fails
> for VDSO for instance? You've edited an existing test for VDSO in x86 to
> skip the test should this be enabled, but this is not the same as a self
> test. And obviously doesn't cover arm64.
>
> This should be relatively straightforward right? You already have code
> for finding out whether VDSO is msealed, so just use that to see if you
> skip, then attempt mremap(), mmap() over etc. + assert it fails.
>
> Ideally these tests would cover all the cases you've changed.
>
It is not as easy.
The config is disabled by default. And I don't know a way to detect
KCONFIG from selftest itself. Without this, I can't reasonably
determine the test result.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2025-02-26 0:12 ` Jeff Xu
@ 2025-02-26 5:42 ` Lorenzo Stoakes
2025-02-28 0:55 ` Jeff Xu
0 siblings, 1 reply; 67+ messages in thread
From: Lorenzo Stoakes @ 2025-02-26 5:42 UTC (permalink / raw)
To: Jeff Xu
Cc: akpm, keescook, jannh, torvalds, vbabka, Liam.Howlett,
adhemerval.zanella, oleg, avagin, benjamin, linux-kernel,
linux-hardening, linux-mm, jorgelo, sroettger, hch, ojeda,
thomas.weissschuh, adobriyan, johannes, pedro.falcato, hca, willy,
anna-maria, mark.rutland, linus.walleij, Jason, deller, rdunlap,
davem, peterx, f.fainelli, gerg, dave.hansen, mingo, ardb, mhocko,
42.hyeyoo, peterz, ardb, enh, rientjes, groeck, mpe,
aleksandr.mikhalitsyn, mike.rapoport
On Tue, Feb 25, 2025 at 04:12:40PM -0800, Jeff Xu wrote:
> On Tue, Feb 25, 2025 at 7:18 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Jeff - looking further in this series, I asked for a couple things for this
> > series which you've not provided:
> >
> > 1. Some assurance based on code that the kernel-side code doesn't rely on
> > VDSO/VVAR etc. mapping. I gave up waiting for this and went and checked
> > myself, it looks fine for arm64, x86-64. But it might have been nice had
> > you done it :) Apologies if you had and I just missed it.
> >
> Thanks for checking this.
> Does ppc kernel code unmap/remap the vdso?
>
> I am aware that userspace can remap/unmap special mappings, but I
> don't know if the kernel will remap/unmap a special mapping. Could you
> please point out the code?
Again, as discussed in the other thread, let's leave this until as/when you try
to attack PPC. I am not 100% sure this is the case, I may be mistaken, but I
gathered it _might_ be.
>
>
> > 2. Tests - could you please add some tests to assert that mremap() fails
> > for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > skip the test should this be enabled, but this is not the same as a self
> > test. And obviously doesn't cover arm64.
> >
> > This should be relatively strightforward right? You already have code
> > for finding out whether VDSO is msealed, so just use that to see if you
> > skip, then attempt mremap(), mmap() over etc. + assert it fails.
> >
> > Ideally these tests would cover all the cases you've changed.
> >
> It is not as easy.
>
> The config is disabled by default. And I don't know a way to detect
> KCONFIG from selftest itself. Without this, I can't reasonably
> determine the test result.
Please in future let's try to get this kind of response at the point of the
request being made :) makes life much easier.
So I think it is easy actually. As I say here (perhaps you missed it?) you
literally already have code you added to the x86 test to prevent test
failure.
So you essentially look for [vdso] or whatever, then you look up to see if
it is sealed, ensure an mremap() fails.
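
For what it's worth, a minimal sketch of that check could look like the
following; it assumes plain /proc/self/maps parsing and that mremap() of a
sealed vdso fails with EPERM, and it is not the series' actual selftest:

/*
 * Hedged sketch: find the [vdso] range in /proc/self/maps and verify
 * that mremap() on it is rejected.  It cannot tell whether the sealing
 * config is merely disabled, so an unsealed vdso is reported as a skip.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	FILE *f = fopen("/proc/self/maps", "r");
	unsigned long start = 0, end = 0;
	char line[512];
	void *p;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "[vdso]")) {
			sscanf(line, "%lx-%lx", &start, &end);
			break;
		}
	}
	fclose(f);
	if (!start) {
		printf("SKIP: no [vdso] mapping found\n");
		return 0;
	}

	p = mremap((void *)start, end - start, end - start, MREMAP_MAYMOVE);
	if (p == MAP_FAILED && errno == EPERM)
		printf("PASS: mremap() of [vdso] blocked\n");
	else
		printf("SKIP: [vdso] not sealed (config likely disabled)\n");
	return 0;
}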
Of course this doesn't check to see if it _should_ be enabled or not. I'm
being nice by not _insisting_ we export a way for you to determine whether
this config option is enabled or not for the sake of a test (since I don't
want to hold up this series). But that'd be nice! Maybe in a
/sys/kernel/security endpoint...
...but I think we can potentially add this later on so we don't hold things
up here/add yet more to the series. I suspect you and Kees alike would
prioritise getting _this series_ in at this point :)
You could, if you wanted to, check to see if /proc/config.gz exists and
zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
that, but you know kinda hacky.
But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
fails here for system mappings for the sake of it. I guess this is, in
effect, only checking mseal-ing works right? But it at least asserts the
existence of the thing, and that it behaves.
Later on, when you add some way of observing that it's enabled or not, you
can extend the test to check this.
Thanks!
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2025-02-26 5:42 ` your mail Lorenzo Stoakes
@ 2025-02-28 0:55 ` Jeff Xu
2025-02-28 9:35 ` Lorenzo Stoakes
0 siblings, 1 reply; 67+ messages in thread
From: Jeff Xu @ 2025-02-28 0:55 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
linux-hardening@vger.kernel.org, linux-mm, Jorge Lucangeli Obes,
Stephen Röttger, hch@lst.de, ojeda@kernel.org,
Thomas Weißschuh, adobriyan@gmail.com,
johannes@sipsolutions.net, Pedro Falcato, hca@linux.ibm.com,
Matthew Wilcox, anna-maria@linutronix.de, mark.rutland@arm.com,
linus.walleij@linaro.org, Jason@zx2c4.com, Helge Deller,
Randy Dunlap, davem@davemloft.net, Peter Xu, f.fainelli@gmail.com,
gerg@kernel.org, dave.hansen@linux.intel.com, Ingo Molnar,
ardb@kernel.org, mhocko@suse.com, 42.hyeyoo@gmail.com,
peterz@infradead.org, ardb@google.com, Elliott Hughes,
rientjes@google.com, Guenter Roeck, Michael Ellerman,
Alexander Mikhalitsyn, Mike Rapoport
Hi Lorenzo,
On Tue, Feb 25, 2025, 9:43 PM Lorenzo Stoakes
>
> > > 2. Tests - could you please add some tests to assert that mremap() fails
> > > for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > > skip the test should this be enabled, but this is not the same as a self
> > > test. And obviously doesn't cover arm64.
> > >
> > > This should be relatively straightforward right? You already have code
> > > for finding out whether VDSO is msealed, so just use that to see if you
> > > skip, then attempt mremap(), mmap() over etc. + assert it fails.
> > >
> > > Ideally these tests would cover all the cases you've changed.
> > >
> > It is not as easy.
> >
> > The config is disabled by default. And I don't know a way to detect
> > KCONFIG from selftest itself. Without this, I can't reasonably
> > determine the test result.
>
> Please in future let's try to get this kind of response at the point of the
> request being made :) makes life much easier.
>
There might be a miscommunication?
This version is the first time you have asked about adding a selftest.
IIRC, you had comments about providing the details of what tests I did, in V4.
As a follow-up, I added a test-info section in the cover letter of V5
Though I have thought about adding a selftest since the beginning,
there are two problems:
1> This config is by default disabled,
2> No good pattern to check KCONFIG from selftest.
>
> So I think it is easy actually. As I say here (perhaps you missed it?) you
> literally already have code you added to the x86 test to prevent test
> failure.
>
> So you essentially look for [vdso] or whatever, then you look up to see if
> it is sealed, ensure an mremap() fails.
>
This suggestion doesn't test the core change of this series, which is
to enable mseal for vdso.
When the vdso is marked with "sl", mremap() will fail; that's part of
the mseal() business logic and belongs in mseal_test. The mseal_test
already covers mseal, and this series doesn't change mseal.
As for the "sl", as I responded in the "refactor mseal_test" [1], it
will be part of the verifying step:
[1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/
> Of course this doesn't check to see if it _should_ be enabled or not. I'm
> being nice by not _insisting_ we export a way for you to determine whether
> this config option is enabled or not for the sake of a test (since I don't
> want to hold up this series).
>
> But that'd be nice! Maybe in a
> /sys/kernel/security endpoint...
>
>
> ...but I think we can potentially add this later on so we don't hold things
> up here/add yet more to the series. I suspect you and Kees alike would
> prioritise getting _this series_ in at this point :)
>
> You could, if you wanted to, check to see if /proc/config.gz exists and
> zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
> that, but you know kinda hacky.
Ya, it is hacky. None of the existing selftests use this pattern, and
I'm not sure /proc/config.gz is enabled in the default kernel config.
One option is to have ChromeOS or Android maintain an out-of-tree
test, since the configuration will be enabled there.
Option two is to create a new path:
tools/testing/selftests/sealsysmap. Then, add
CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
this path. This seems to be the way preferred by selftests, but we need
a new dir for that.
Option three is to add a self-test when we have a process-level opt-in
solution. This would allow the test to deterministically know whether
the vdso should be sealed or not.
>
> But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
> fails here for system mappings for the sake of it. I guess this is, in
> effect, only checking mseal-ing works right? But it at least asserts the
> existence of the thing, and that it behaves.
>
> Later on, when you add some way of observing that it's enabled or not, you
> can extend the test to check this.
I think it is best that we don't add a test that doesn't actually
check the code change. Do you think one of the above three options
works? Maybe the second option (with a new path)?
In any case, I think the risk is low, and the code changes are quite
simple and fully tested.
Thanks.
-Jeff.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2025-02-28 0:55 ` Jeff Xu
@ 2025-02-28 9:35 ` Lorenzo Stoakes
2025-02-28 17:24 ` Jeff Xu
0 siblings, 1 reply; 67+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 9:35 UTC (permalink / raw)
To: Jeff Xu
Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
linux-hardening@vger.kernel.org, linux-mm, Jorge Lucangeli Obes,
Stephen Röttger, hch@lst.de, ojeda@kernel.org,
Thomas Weißschuh, adobriyan@gmail.com,
johannes@sipsolutions.net, Pedro Falcato, hca@linux.ibm.com,
Matthew Wilcox, anna-maria@linutronix.de, mark.rutland@arm.com,
linus.walleij@linaro.org, Jason@zx2c4.com, Helge Deller,
Randy Dunlap, davem@davemloft.net, Peter Xu, f.fainelli@gmail.com,
gerg@kernel.org, dave.hansen@linux.intel.com, Ingo Molnar,
ardb@kernel.org, mhocko@suse.com, 42.hyeyoo@gmail.com,
peterz@infradead.org, ardb@google.com, Elliott Hughes,
rientjes@google.com, Guenter Roeck, Michael Ellerman,
Alexander Mikhalitsyn, Mike Rapoport
On Thu, Feb 27, 2025 at 04:55:06PM -0800, Jeff Xu wrote:
> Hi Lorenzo,
>
> On Tue, Feb 25, 2025, 9:43 PM Lorenzo Stoakes
> >
> > > > 2. Tests - could you please add some tests to assert that mremap() fails
> > > > for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > > > skip the test should this be enabled, but this is not the same as a self
> > > > test. And obviously doesn't cover arm64.
> > > >
> > > > This should be relatively straightforward right? You already have code
> > > > for finding out whether VDSO is msealed, so just use that to see if you
> > > > skip, then attempt mremap(), mmap() over etc. + assert it fails.
> > > >
> > > > Ideally these tests would cover all the cases you've changed.
> > > >
> > > It is not as easy.
> > >
> > > The config is disabled by default. And I don't know a way to detect
> > > KCONFIG from selftest itself. Without this, I can't reasonably
> > > determine the test result.
> >
> > Please in future let's try to get this kind of response at the point of the
> > request being made :) makes life much easier.
> >
> There might be a miscommunication?
> This version is the first time you have asked about adding a selftest.
Yeah I thought I'd been clear but this might _very well_ have been me not
having been, so apologies if so.
I mean 'make sure it's tested' is an overloaded term, right? So the fact you've
tested on android, chromeos, etc. implies 'tested', but it also covers self
tests/kunit/whatever.
>
> IIRC, you had comments about providing the details of what tests I did, in V4.
> As a follow-up, I added a test-info section in the cover letter of V5
Thanks. Appreciate that! And this really does point towards a
miscommunication on my part, will try to be super explicit in future.
>
> Though I have thought about adding a selftest since the beginning,
> there are two problems:
> 1> This config is by default disabled,
> 2> No good pattern to check KCONFIG from selftest.
Yeah agreed, it's a TOTAL pain.
I wish we had a better way of doing this. Maybe a self-volunteering
exercise (*goes to write on whiteboard :P*)
>
> >
> > So I think it is easy actually. As I say here (perhaps you missed it?) you
> > literally already have code you added to the x86 test to prevent test
> > failure.
> >
> > So you essentially look for [vdso] or whatever, then you look up to see if
> > it is sealed, ensure an mremap() fails.
> >
> This suggestion doesn't test the core change of this series, which is
> to enable mseal for vdso.
Right, and thinking about it, what does this test? Just that mseal works
right?
It's sort of implicit that, if a VMA is sealed, the seal should work (or
rather, tested in mseal tests themselves, rather than mseal system
settings).
>
> When the vdso is marked with "sl", mremap() will fail, that's part of
> the mseal() business logic and belongs in mseal_test. The mseal_test
> already covers the mseal, and this series doesn't change mseal.
>
> As for the "sl", as I responded in the "refactor mseal_test" [1] , it
> will be part of the verifying step:
>
> [1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/
OK cool thanks. I will look at this when I can, I'm just snowed under
pre-LSF and have been sick so backlog, backlog as discussed off-list. But
it's not been forgotten (whiteboard etc. etc.)
>
> > Of course this doesn't check to see if it _should_ be enabled or not. I'm
> > being nice by not _insisting_ we export a way for you to determine whether
> > this config option is enabled or not for the sake of a test (since I don't
> > want to hold up this series).
> >
> > But that'd be nice! Maybe in a
> > /sys/kernel/security endpoint...
> >
> >
> > ...but I think we can potentially add this later on so we don't hold things
> > up here/add yet more to the series. I suspect you and Kees alike would
> > prioritise getting _this series_ in at this point :)
> >
> > You could, if you wanted to, check to see if /proc/config.gz exists and
> > zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
> > that, but you know kinda hacky.
>
> Ya, it is hacky. None of the existing selftest uses this pattern, and
> I'm not sure /proc/config.gz is enabled in the default kernel config.
Yeah and I'm not sure I even like my hacky suggestion here, I mean let's be
honest, it sucks.
>
> One option is to have ChromeOS or Android to maintain an out-of-tree
> test, since the configuration will be enabled there.
Nah haha, though of course you can do what you want. Really want something
upstream.
>
> Option two is to create a new path:
> tools/testing/selftests/sealsysmap. Then, add
> CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
> this path. This seems to be the preferred way by selftest, but we need
> a new dir for that.
OK I like option 2 let's do this.
But let's call it mseal_system_mappings (yes I'm being nitty again :) I
really want to try to _explicitly_ say 'mseal' because we have other forms
of sealing.
Not your fault, but we overload terms like _crazy_ in mm and need to be
cautious as not to confuse vs. e.g. memfd seals.
>
> Option three is to add a self-test when we have a process-level opt-in
> solution. This would allow the test to deterministically know whether
> the vdso should be sealed or not.
Yeah one for future.
>
> >
> > But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
> > fails here for system mappings for the sake of it. I guess this is, in
> > effect, only checking mseal-ing works right? But it at least asserts the
> > existence of the thing, and that it behaves.
> >
> > Later on, when you add some way of observing that it's enabled or not, you
> > can extend the test to check this.
>
> I think it is best that we don't add a test that doesn't actually
> check the code change. Do you think one of the above three options
> works ? maybe the second option (with a new path) ?
Yeah I actually agree on reflection. And yes agreed option 2 is great,
thanks!
>
> In any case, I think the risk is low, and the code changes are quite
> simple, and fully tested.
Yeah indeed, but I'd really like _something_ if possible. If option 2 is
relatively quick let's get that sorted out!
>
> Thanks.
> -Jeff.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2025-02-28 9:35 ` Lorenzo Stoakes
@ 2025-02-28 17:24 ` Jeff Xu
2025-02-28 17:30 ` Lorenzo Stoakes
0 siblings, 1 reply; 67+ messages in thread
From: Jeff Xu @ 2025-02-28 17:24 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
linux-hardening@vger.kernel.org, linux-mm, Jorge Lucangeli Obes,
Stephen Röttger, hch@lst.de, ojeda@kernel.org,
Thomas Weißschuh, adobriyan@gmail.com,
johannes@sipsolutions.net, Pedro Falcato, hca@linux.ibm.com,
Matthew Wilcox, anna-maria@linutronix.de, mark.rutland@arm.com,
linus.walleij@linaro.org, Jason@zx2c4.com, Helge Deller,
Randy Dunlap, davem@davemloft.net, Peter Xu, f.fainelli@gmail.com,
gerg@kernel.org, dave.hansen@linux.intel.com, Ingo Molnar,
ardb@kernel.org, mhocko@suse.com, 42.hyeyoo@gmail.com,
peterz@infradead.org, ardb@google.com, Elliott Hughes,
rientjes@google.com, Guenter Roeck, Michael Ellerman,
Alexander Mikhalitsyn, Mike Rapoport
On Fri, Feb 28, 2025 at 1:35 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Feb 27, 2025 at 04:55:06PM -0800, Jeff Xu wrote:
> > Hi Lorenzo,
> >
> > On Tue, Feb 25, 2025, 9:43 PM Lorenzo Stoakes
> > >
> > > > > 2. Tests - could you please add some tests to assert that mremap() fails
> > > > > for VDSO for instance? You've edited an existing test for VDSO in x86 to
> > > > > skip the test should this be enabled, but this is not the same as a self
> > > > > test. And obviously doesn't cover arm64.
> > > > >
> > > > > This should be relatively straightforward right? You already have code
> > > > > for finding out whether VDSO is msealed, so just use that to see if you
> > > > > skip, then attempt mremap(), mmap() over etc. + assert it fails.
> > > > >
> > > > > Ideally these tests would cover all the cases you've changed.
> > > > >
> > > > It is not as easy.
> > > >
> > > > The config is disabled by default. And I don't know a way to detect
> > > > KCONFIG from selftest itself. Without this, I can't reasonably
> > > > determine the test result.
> > >
> > > Please in future let's try to get this kind of response at the point of the
> > > request being made :) makes life much easier.
> > >
> > There might be a miscommunication?
> > This version is the first time you have asked about adding a selftest.
>
> Yeah I thought I'd been clear but this might _very well_ have been me not
> having been, so apologies if so.
>
> I mean 'make sure it's tested' is an overloaded term right? So fact you've
> tested on android, chromeos, etc. implies 'tested', but also self
> tests/kunit/whatever.
>
> >
> > IIRC, you had comments about providing the details of what tests I did, in V4.
> > As a follow-up, I added a test-info section in the cover letter of V5
>
> Thanks. Appreciate that! And this really does point towards a
> miscommunication on my part, will try to be super explicit in future.
>
> >
> > Though I have thought about adding a selftest since the beginning,
> > there are two problems:
> > 1> This config is by default disabled,
> > 2> No good pattern to check KCONFIG from selftest.
>
> Yeah agreed, it's a TOTAL pain.
>
> I wish we had a better way of doing this. Maybe a self-volunteering
> exercise (*goes to write on whiteboard :P*)
>
> >
> > >
> > > So I think it is easy actually. As I say here (perhaps you missed it?) you
> > > literally already have code you added to the x86 test to prevent test
> > > failure.
> > >
> > > So you essentially look for [vdso] or whatever, then you look up to see if
> > > it is sealed, ensure an mremap() fails.
> > >
> > This suggestion doesn't test the core change of this series, which is
> > to enable mseal for vdso.
>
> Right, and thinking about it, what does this test? Just that mseal works
> right?
>
> It's sort of implicit that, if a VMA is sealed, the seal should work (or
> rather, tested in mseal tests themselves, rather than mseal system
> settings).
>
> >
> > When the vdso is marked with "sl", mremap() will fail, that's part of
> > the mseal() business logic and belongs in mseal_test. The mseal_test
> > already covers the mseal, and this series doesn't change mseal.
> >
> > As for the "sl", as I responded in the "refactor mseal_test" [1] , it
> > will be part of the verifying step:
> >
> > [1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/
>
> OK cool thanks. I will look at this when I can, I'm just snowed under
> pre-LSF and have been sick so backlog, backlog as discussed off-list. But
> it's not been forgotten (whiteboard etc. etc.)
>
Ya, no worries about that review, please take time to recover, I will wait.
And I appreciate your time and the priority you are giving to reviewing this series.
> >
> > > Of course this doesn't check to see if it _should_ be enabled or not. I'm
> > > being nice by not _insisting_ we export a way for you to determine whether
> > > this config option is enabled or not for the sake of a test (since I don't
> > > want to hold up this series).
> > >
> > > But that'd be nice! Maybe in a
> > > /sys/kernel/security endpoint...
> > >
> > >
> > > ...but I think we can potentially add this later on so we don't hold things
> > > up here/add yet more to the series. I suspect you and Kees alike would
> > > prioritise getting _this series_ in at this point :)
> > >
> > > You could, if you wanted to, check to see if /proc/config.gz exists and
> > > zgrep it (only there if CONFIG_IKCONFIG_PROC is set) and assert based on
> > > that, but you know kinda hacky.
> >
> > Ya, it is hacky. None of the existing selftest uses this pattern, and
> > I'm not sure /proc/config.gz is enabled in the default kernel config.
>
> Yeah and I'm not sure I even like my hacky suggestion here, I mean let's be
> honest, it sucks.
>
> >
> > One option is to have ChromeOS or Android to maintain an out-of-tree
> > test, since the configuration will be enabled there.
>
> Nah haha, though of course you can do what you want. Really want something
> upstream.
>
> >
> > Option two is to create a new path:
> > tools/testing/selftests/sealsysmap. Then, add
> > CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
> > this path. This seems to be the preferred way by selftest, but we need
> > a new dir for that.
>
> OK I like option 2 let's do this.
>
> > But let's call it mseal_system_mappings (yes I'm being nitty again :) I
> really want to try to _explicitly_ say 'mseal' because we have other forms
> of sealing.
>
Sure.
If long path names aren't a problem, I will use mseal_system_mappings,
otherwise mseal_sysmap.
> Not your fault, but we overload terms like _crazy_ in mm and need to be
> cautious as not to confuse vs. e.g. memfd seals.
>
>
> >
> > Option three is to add a self-test when we have a process-level opt-in
> > solution. This would allow the test to deterministically know whether
> > the vdso should be sealed or not.
>
> Yeah one for future.
>
> >
> > >
> > > But anyway, FWIW I think it'd be useful to assert that mremap() or munmap()
> > > fails here for system mappings for the sake of it. I guess this is, in
> > > effect, only checking mseal-ing works right? But it at least asserts the
> > > existence of the thing, and that it behaves.
> > >
> > > Later on, when you add some way of observing that it's enabled or not, you
> > > can extend the test to check this.
> >
> > I think it is best that we don't add a test that doesn't actually
> > check the code change. Do you think one of the above three options
> > works ? maybe the second option (with a new path) ?
>
> Yeah I actually agree on reflection. And yes agreed option 2 is great,
> thanks!
>
> >
> > In any case, I think the risk is low, and the code changes are quite
> > simple, and fully tested.
>
> Yeah indeed, but I'd really like _something_ if possible. If option 2 is
> relatively quick let's get that sorted out!
>
Great ! I will work on option 2.
Thanks
-Jeff
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2025-02-28 17:24 ` Jeff Xu
@ 2025-02-28 17:30 ` Lorenzo Stoakes
0 siblings, 0 replies; 67+ messages in thread
From: Lorenzo Stoakes @ 2025-02-28 17:30 UTC (permalink / raw)
To: Jeff Xu
Cc: Andrew Morton, Kees Cook, Jann Horn, Linus Torvalds,
Vlastimil Babka, Liam R. Howlett, Adhemerval Zanella Netto,
Oleg Nesterov, Andrei Vagin, Benjamin Berg, LKML,
linux-hardening@vger.kernel.org, linux-mm, Jorge Lucangeli Obes,
Stephen Röttger, hch@lst.de, ojeda@kernel.org,
Thomas Weißschuh, adobriyan@gmail.com,
johannes@sipsolutions.net, Pedro Falcato, hca@linux.ibm.com,
Matthew Wilcox, anna-maria@linutronix.de, mark.rutland@arm.com,
linus.walleij@linaro.org, Jason@zx2c4.com, Helge Deller,
Randy Dunlap, davem@davemloft.net, Peter Xu, f.fainelli@gmail.com,
gerg@kernel.org, dave.hansen@linux.intel.com, Ingo Molnar,
ardb@kernel.org, mhocko@suse.com, 42.hyeyoo@gmail.com,
peterz@infradead.org, ardb@google.com, Elliott Hughes,
rientjes@google.com, Guenter Roeck, Michael Ellerman,
Alexander Mikhalitsyn, Mike Rapoport
On Fri, Feb 28, 2025 at 09:24:11AM -0800, Jeff Xu wrote:
[snip]
> > >
> > > When the vdso is marked with "sl", mremap() will fail, that's part of
> > > the mseal() business logic and belongs in mseal_test. The mseal_test
> > > already covers the mseal, and this series doesn't change mseal.
> > >
> > > As for the "sl", as I responded in the "refactor mseal_test" [1] , it
> > > will be part of the verifying step:
> > >
> > > [1] https://lore.kernel.org/all/CABi2SkUv_1gsuGh86AON-xRLAggCvDqJMHxT17mGy-XutSTAwg@mail.gmail.com/
> >
> > OK cool thanks. I will look at this when I can, I'm just snowed under
> > pre-LSF and have been sick so backlog, backlog as discussed off-list. But
> > it's not been forgotten (whiteboard etc. etc.)
> >
> Ya, no worries about that review, please take time to recover, I will wait.
> And I appreciate your time and the priority you are giving to reviewing this series.
Thanks :)
[snip]
> > >
> > > Option two is to create a new path:
> > > tools/testing/selftests/sealsysmap. Then, add
> > > CONFIG_SEAL_SYSTEM_MAPPING=y to the config file and add a selftest to
> > > this path. This seems to be the preferred way by selftest, but we need
> > > a new dir for that.
> >
> > OK I like option 2 let's do this.
> >
> > But let's call it mseal_system_mappings (yes I"m being nitty again :) I
> > really want to try to _explicitly_ say 'mseal' because we have other forms
> > of sealing.
> >
> Sure.
Thanks!
>
> If long path names aren't a problem, I will use mseal_system_mappings,
> otherwise mseal_sysmap.
I think long form is fine.
[snip]
> > >
> > > In any case, I think the risk is low, and the code changes are quite
> > > simple, and fully tested.
> >
> > Yeah indeed, but I'd really like _something_ if possible. If option 2 is
> > relatively quick let's get that sorted out!
> >
> Great ! I will work on option 2.
Perfect cheers!
>
> Thanks
> -Jeff
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH] maple_tree: Fix a few documentation issues,
@ 2023-05-10 19:01 Thomas Gleixner
2023-05-15 19:27 ` your mail Liam R. Howlett
0 siblings, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2023-05-10 19:01 UTC (permalink / raw)
To: LKML; +Cc: Liam R. Howlett, Matthew Wilcox, linux-mm, Shanker Donthineni
The documentation of mt_next() claims that it starts the search at the
provided index. That's incorrect as it starts the search after the provided
index.
The documentation of mt_find() is slightly confusing. "Handles locking" is
not really helpful as it does not explain how the "locking" works. Also the
documentation of index talks about a range, while in reality the index
is updated on a successful search to the index of the found entry plus one.
Fix similar issues for mt_find_after() and mt_prev().
Remove the completely confusing and pointless "Note: Will not return the
zero entry." comment from mt_for_each() and document @__index correctly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
include/linux/maple_tree.h | 4 +---
lib/maple_tree.c | 23 ++++++++++++++++++-----
2 files changed, 19 insertions(+), 8 deletions(-)
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -659,10 +659,8 @@ void *mt_next(struct maple_tree *mt, uns
* mt_for_each - Iterate over each entry starting at index until max.
* @__tree: The Maple Tree
* @__entry: The current entry
- * @__index: The index to update to track the location in the tree
+ * @__index: The index to start the search from. Subsequently used as iterator.
* @__max: The maximum limit for @index
- *
- * Note: Will not return the zero entry.
*/
#define mt_for_each(__tree, __entry, __index, __max) \
for (__entry = mt_find(__tree, &(__index), __max); \
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5947,7 +5947,10 @@ EXPORT_SYMBOL_GPL(mas_next);
* @index: The start index
* @max: The maximum index to check
*
- * Return: The entry at @index or higher, or %NULL if nothing is found.
+ * Takes RCU read lock internally to protect the search, which does not
+ * protect the returned pointer after dropping RCU read lock.
+ *
+ * Return: The entry higher than @index or %NULL if nothing is found.
*/
void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max)
{
@@ -6012,7 +6015,10 @@ EXPORT_SYMBOL_GPL(mas_prev);
* @index: The start index
* @min: The minimum index to check
*
- * Return: The entry at @index or lower, or %NULL if nothing is found.
+ * Takes RCU read lock internally to protect the search, which does not
+ * protect the returned pointer after dropping RCU read lock.
+ *
+ * Return: The entry before @index or %NULL if nothing is found.
*/
void *mt_prev(struct maple_tree *mt, unsigned long index, unsigned long min)
{
@@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
* mt_find() - Search from the start up until an entry is found.
* @mt: The maple tree
* @index: Pointer which contains the start location of the search
- * @max: The maximum value to check
+ * @max: The maximum value of the search range
+ *
+ * Takes RCU read lock internally to protect the search, which does not
+ * protect the returned pointer after dropping RCU read lock.
*
- * Handles locking. @index will be incremented to one beyond the range.
+ * In case that an entry is found @index contains the index of the found
+ * entry plus one, so it can be used as iterator index to find the next
+ * entry.
*
* Return: The entry at or after the @index or %NULL
*/
@@ -6548,7 +6559,9 @@ EXPORT_SYMBOL(mt_find);
* @index: Pointer which contains the start location of the search
* @max: The maximum value to check
*
- * Handles locking, detects wrapping on index == 0
+ * Same as mt_find() except that it checks @index for 0 before
+ * searching. If @index == 0, the search is aborted. This covers a wrap
+ * around of @index to 0 in an iterator loop.
*
* Return: The entry at or after the @index or %NULL
*/
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2023-05-10 19:01 [PATCH] maple_tree: Fix a few documentation issues, Thomas Gleixner
@ 2023-05-15 19:27 ` Liam R. Howlett
2023-05-15 21:16 ` Thomas Gleixner
2023-05-16 22:47 ` Thomas Gleixner
0 siblings, 2 replies; 67+ messages in thread
From: Liam R. Howlett @ 2023-05-15 19:27 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni
* Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
> The documentation of mt_next() claims that it starts the search at the
> provided index. That's incorrect as it starts the search after the provided
> index.
>
> The documentation of mt_find() is slightly confusing. "Handles locking" is
> not really helpful as it does not explain how the "locking" works.
More locking notes can be found in Documentation/core-api/maple_tree.rst
which lists mt_find() under the "Takes RCU read lock" list. I'm okay
with duplicating the comment of taking the RCU read lock in here.
>Also the
> documentation of index talks about a range, while in reality the index
> is updated on a successful search to the index of the found entry plus one.
This is a range based tree, so the index is incremented beyond the last
index which would return the entry. That is, if you search for 5 and
there is an entry at 4-100, the index would be 101 after the search -
or, one beyond the range. If you have single entries at a specific
index, then index would be equal to last and it would be one beyond the
index you found - but only because index == last in this case.
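
A small kernel-style sketch of that behaviour (illustrative only; the tree and
the stored range are made up):

/*
 * Sketch of mt_find() index behaviour: after storing one entry over the
 * range 4-100, a search starting at index 5 returns the entry and leaves
 * the index one past the end of the range (101), ready to be reused as
 * the start of the next lookup.
 */
#include <linux/bug.h>
#include <linux/maple_tree.h>

static DEFINE_MTREE(example_tree);

static void mt_find_example(void *item)
{
	unsigned long index = 5;
	void *entry;

	mtree_store_range(&example_tree, 4, 100, item, GFP_KERNEL);

	entry = mt_find(&example_tree, &index, ULONG_MAX);
	WARN_ON(entry != item || index != 101);

	entry = mt_find_after(&example_tree, &index, ULONG_MAX);
	WARN_ON(entry != NULL);	/* nothing stored beyond index 100 */
}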
>
> Fix similar issues for mt_find_after() and mt_prev().
>
> Remove the completely confusing and pointless "Note: Will not return the
> zero entry." comment from mt_for_each() and document @__index correctly.
The zero entry concept is an advanced API concept which allows you to
store something that cannot be seen by the mt_* family of users, so it
will not be returned; NULL will be returned instead. Think of it as
a reservation for an entry that isn't fully initialized. Perhaps it
should read "Will not return the XA_ZERO_ENTRY"?
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> include/linux/maple_tree.h | 4 +---
> lib/maple_tree.c | 23 ++++++++++++++++++-----
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -659,10 +659,8 @@ void *mt_next(struct maple_tree *mt, uns
> * mt_for_each - Iterate over each entry starting at index until max.
> * @__tree: The Maple Tree
> * @__entry: The current entry
> - * @__index: The index to update to track the location in the tree
> + * @__index: The index to start the search from. Subsequently used as iterator.
> * @__max: The maximum limit for @index
> - *
> - * Note: Will not return the zero entry.
This function "will not return the zero entry", meaning it will return
NULL if xa_is_zero(entry).
> */
> #define mt_for_each(__tree, __entry, __index, __max) \
> for (__entry = mt_find(__tree, &(__index), __max); \
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5947,7 +5947,10 @@ EXPORT_SYMBOL_GPL(mas_next);
> * @index: The start index
> * @max: The maximum index to check
> *
> - * Return: The entry at @index or higher, or %NULL if nothing is found.
> + * Takes RCU read lock internally to protect the search, which does not
> + * protect the returned pointer after dropping RCU read lock.
> + *
> + * Return: The entry higher than @index or %NULL if nothing is found.
> */
> void *mt_next(struct maple_tree *mt, unsigned long index, unsigned long max)
> {
> @@ -6012,7 +6015,10 @@ EXPORT_SYMBOL_GPL(mas_prev);
> * @index: The start index
> * @min: The minimum index to check
> *
> - * Return: The entry at @index or lower, or %NULL if nothing is found.
> + * Takes RCU read lock internally to protect the search, which does not
> + * protect the returned pointer after dropping RCU read lock.
> + *
> + * Return: The entry before @index or %NULL if nothing is found.
> */
> void *mt_prev(struct maple_tree *mt, unsigned long index, unsigned long min)
> {
> @@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
> * mt_find() - Search from the start up until an entry is found.
> * @mt: The maple tree
> * @index: Pointer which contains the start location of the search
> - * @max: The maximum value to check
> + * @max: The maximum value of the search range
> + *
> + * Takes RCU read lock internally to protect the search, which does not
> + * protect the returned pointer after dropping RCU read lock.
> *
> - * Handles locking. @index will be incremented to one beyond the range.
> + * In case that an entry is found @index contains the index of the found
> + * entry plus one, so it can be used as iterator index to find the next
> + * entry.
What about:
"In case that an entry is found @index contains the last index of the
found entry plus one"
> *
> * Return: The entry at or after the @index or %NULL
> */
> @@ -6548,7 +6559,9 @@ EXPORT_SYMBOL(mt_find);
> * @index: Pointer which contains the start location of the search
> * @max: The maximum value to check
> *
> - * Handles locking, detects wrapping on index == 0
> + * Same as mt_find() except that it checks @index for 0 before
> + * searching. If @index == 0, the search is aborted. This covers a wrap
> + * around of @index to 0 in an iterator loop.
> *
> * Return: The entry at or after the @index or %NULL
> */
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2023-05-15 19:27 ` your mail Liam R. Howlett
@ 2023-05-15 21:16 ` Thomas Gleixner
2023-05-16 22:47 ` Thomas Gleixner
1 sibling, 0 replies; 67+ messages in thread
From: Thomas Gleixner @ 2023-05-15 21:16 UTC (permalink / raw)
To: Liam R. Howlett; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni
Liam!
On Mon, May 15 2023 at 15:27, Liam R. Howlett wrote:
> * Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
>>Also the
>> documentation of index talks about a range, while in reality the index
>> is updated on a successful search to the index of the found entry plus one.
>
> This is a range based tree, so the index is incremented beyond the last
> entry which would return the entry. That is, if you search for 5 and
> there is an entry at 4-100, the index would be 101 after the search -
> or, one beyond the range. If you have single entries at a specific
> index, then index would be equal to last and it would be one beyond the
> index you found - but only because index == last in this case.
Thanks for the explanation
>>
>> Fix similar issues for mt_find_after() and mt_prev().
>>
>> Remove the completely confusing and pointless "Note: Will not return the
>> zero entry." comment from mt_for_each() and document @__index correctly.
>
> The zero entry concept is an advanced API concept which allows you to
> store something that cannot be seen by the mt_* family of users, so it
> will not be returned and, instead, it will return NULL. Think of it as
> a reservation for an entry that isn't fully initialized. Perhaps it
> should read "Will not return the XA_ZERO_ENTRY" ?
That actually makes sense.
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -659,10 +659,8 @@ void *mt_next(struct maple_tree *mt, uns
>> * mt_for_each - Iterate over each entry starting at index until max.
>> * @__tree: The Maple Tree
>> * @__entry: The current entry
>> - * @__index: The index to update to track the location in the tree
>> + * @__index: The index to start the search from. Subsequently used as iterator.
>> * @__max: The maximum limit for @index
>> - *
>> - * Note: Will not return the zero entry.
>
> This function "will not return the zero entry", meaning it will return
> NULL if xa_is_zero(entry).
Ack.
>> + * Takes RCU read lock internally to protect the search, which does not
>> + * protect the returned pointer after dropping RCU read lock.
>> *
>> - * Handles locking. @index will be incremented to one beyond the range.
>> + * In case that an entry is found @index contains the index of the found
>> + * entry plus one, so it can be used as iterator index to find the next
>> + * entry.
>
> What about:
> "In case that an entry is found @index contains the last index of the
> found entry plus one"
Something like that, yes.
Let me try again.
Thanks,
tglx
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2023-05-15 19:27 ` your mail Liam R. Howlett
2023-05-15 21:16 ` Thomas Gleixner
@ 2023-05-16 22:47 ` Thomas Gleixner
2023-05-23 13:46 ` Liam R. Howlett
1 sibling, 1 reply; 67+ messages in thread
From: Thomas Gleixner @ 2023-05-16 22:47 UTC (permalink / raw)
To: Liam R. Howlett; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni
On Mon, May 15 2023 at 15:27, Liam R. Howlett wrote:
> * Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
>> The documentation of mt_next() claims that it starts the search at the
>> provided index. That's incorrect as it starts the search after the provided
>> index.
>>
>> The documentation of mt_find() is slightly confusing. "Handles locking" is
>> not really helpful as it does not explain how the "locking" works.
>
> More locking notes can be found in Documentation/core-api/maple_tree.rst
> which lists mt_find() under the "Takes RCU read lock" list. I'm okay
> with duplicating the comment of taking the RCU read lock in here.
Without a reference to the actual locking documentation such comments
are not super helpful.
>> Fix similar issues for mt_find_after() and mt_prev().
>>
>> Remove the completely confusing and pointless "Note: Will not return the
>> zero entry." comment from mt_for_each() and document @__index correctly.
>
> The zero entry concept is an advanced API concept which allows you to
> store something that cannot be seen by the mt_* family of users, so it
> will not be returned and, instead, it will return NULL. Think of it as
> a reservation for an entry that isn't fully initialized. Perhaps it
> should read "Will not return the XA_ZERO_ENTRY" ?
>>
>> - *
>> - * Note: Will not return the zero entry.
>
> This function "will not return the zero entry", meaning it will return
> NULL if xa_is_zero(entry).
If I understand correctly, this translates to:
This iterator skips entries, which have been reserved for future use
but have not yet been fully initialized.
Right?
>> @@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
>> * mt_find() - Search from the start up until an entry is found.
>> * @mt: The maple tree
>> * @index: Pointer which contains the start location of the search
>> - * @max: The maximum value to check
>> + * @max: The maximum value of the search range
>> + *
>> + * Takes RCU read lock internally to protect the search, which does not
>> + * protect the returned pointer after dropping RCU read lock.
>> *
>> - * Handles locking. @index will be incremented to one beyond the range.
>> + * In case that an entry is found @index contains the index of the found
>> + * entry plus one, so it can be used as iterator index to find the next
>> + * entry.
>
> What about:
> "In case that an entry is found @index contains the last index of the
> found entry plus one"
Still confusing to the casual reader like me :)
"In case that an entry is found @index is updated to point to the next
possible entry independent of whether the found entry is occupying a
single index or a range of indices."
Hmm?
Thanks,
tglx
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2023-05-16 22:47 ` Thomas Gleixner
@ 2023-05-23 13:46 ` Liam R. Howlett
0 siblings, 0 replies; 67+ messages in thread
From: Liam R. Howlett @ 2023-05-23 13:46 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, Matthew Wilcox, linux-mm, Shanker Donthineni
* Thomas Gleixner <tglx@linutronix.de> [230516 18:48]:
> On Mon, May 15 2023 at 15:27, Liam R. Howlett wrote:
> > * Thomas Gleixner <tglx@linutronix.de> [230510 15:01]:
> >> The documentation of mt_next() claims that it starts the search at the
> >> provided index. That's incorrect as it starts the search after the provided
> >> index.
> >>
> >> The documentation of mt_find() is slightly confusing. "Handles locking" is
> >> not really helpful as it does not explain how the "locking" works.
> >
> > More locking notes can be found in Documentation/core-api/maple_tree.rst
> > which lists mt_find() under the "Takes RCU read lock" list. I'm okay
> > with duplicating the comment of taking the RCU read lock in here.
>
> Without a reference to the actual locking documentation such comments
> are not super helpful.
Noted. A reference to the larger document should probably be added.
Thanks.
>
> >> Fix similar issues for mt_find_after() and mt_prev().
> >>
> >> Remove the completely confusing and pointless "Note: Will not return the
> >> zero entry." comment from mt_for_each() and document @__index correctly.
> >
> > The zero entry concept is an advanced API concept which allows you to
> > store something that cannot be seen by the mt_* family of users, so it
> > will not be returned and, instead, it will return NULL. Think of it as
> > a reservation for an entry that isn't fully initialized. Perhaps it
> > should read "Will not return the XA_ZERO_ENTRY" ?
> >>
> >> - *
> >> - * Note: Will not return the zero entry.
> >
> > This function "will not return the zero entry", meaning it will return
> > NULL if xa_is_zero(entry).
>
> If I understand correctly, this translates to:
>
> This iterator skips entries, which have been reserved for future use
> but have not yet been fully initialized.
>
> Right?
Well, that's one use of the XA_ZERO_ENTRY, but it's really up to
the user to decide why they are adding something that returns NULL in a
specific range for the not-advanced API. It might be worth adding the
XA_ZERO_ENTRY in here, since that's the only special value right now?
>
> >> @@ -6487,9 +6493,14 @@ EXPORT_SYMBOL(mtree_destroy);
> >> * mt_find() - Search from the start up until an entry is found.
> >> * @mt: The maple tree
> >> * @index: Pointer which contains the start location of the search
> >> - * @max: The maximum value to check
> >> + * @max: The maximum value of the search range
> >> + *
> >> + * Takes RCU read lock internally to protect the search, which does not
> >> + * protect the returned pointer after dropping RCU read lock.
> >> *
> >> - * Handles locking. @index will be incremented to one beyond the range.
> >> + * In case that an entry is found @index contains the index of the found
> >> + * entry plus one, so it can be used as iterator index to find the next
> >> + * entry.
> >
> > What about:
> > "In case that an entry is found @index contains the last index of the
> > found entry plus one"
>
> Still confusing to the casual reader like me :)
>
> "In case that an entry is found @index is updated to point to the next
> possible entry independent of whether the found entry is occupying a
> single index or a range of indices."
>
> Hmm?
That makes sense to me.
Thanks,
Liam
^ permalink raw reply [flat|nested] 67+ messages in thread
[parent not found: <20190225201635.4648-1-hannes@cmpxchg.org>]
* Re: your mail
[not found] <20190225201635.4648-1-hannes@cmpxchg.org>
@ 2019-02-26 23:49 ` Roman Gushchin
0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2019-02-26 23:49 UTC (permalink / raw)
To: up@kvack.org, the@kvack.org, LRU@kvack.org, counts@kvack.org,
tracking@kvack.org
Cc: Andrew Morton, Tejun Heo, linux-mm@kvack.org,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
Kernel Team
On Mon, Feb 25, 2019 at 03:16:29PM -0500, Johannes Weiner wrote:
> [resending, rebased on top of latest mmots]
>
> The memcg LRU stats usage is currently a bit messy. Memcg has private
> per-zone counters because reclaim needs zone granularity sometimes,
> but we also have plenty of users that need to awkwardly sum them up to
> node or memcg granularity. Meanwhile the canonical per-memcg vmstats
> do not track the LRU counts (NR_INACTIVE_ANON etc.) as you'd expect.
>
> This series enables LRU count tracking in the per-memcg vmstats array
> such that lruvec_page_state() and memcg_page_state() work on the enum
> node_stat_item items for the LRU counters. Then it converts all the
> callers that don't specifically need per-zone numbers over to that.
The updated version looks very good* to me!
Please, feel free to use:
Reviewed-by: Roman Gushchin <guro@fb.com>
Looking through the patchset, I have a feeling that we're sometimes
gathering too much data. Perhaps we don't need the whole set
of counters to be per-cpu on both memcg- and memcg-per-node levels.
Merging them can save quite a lot of space. Anyway, it's a separate
topic.
* except "to" and "subject" of the cover letter
Thanks!
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH -v2 0/9] mm: make movable onlining suck less
@ 2017-04-10 11:03 Michal Hocko
2017-04-15 12:17 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-10 11:03 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Andrea Arcangeli,
Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML, Dan Williams,
Heiko Carstens, Lai Jiangshan, Martin Schwidefsky, Michal Hocko,
Tobias Regnery
Hi,
The last version of this series has been posted here [1]. It has seen
some more serious testing (thanks to Reza Arbab) and fixes for the found
issues. I have also decided to drop patch 1 [2] because it turned out to
be more complicated than I initially thought [3]. Few more patches were
added to deal with expectation on zone/node initialization.
I have rebased on top of the current mmotm-2017-04-07-15-53. It
conflicts with HMM because it touches memory hotplug as
well. We have discussed [4] with Jérôme and he agreed to
rebase on top of this rework [5] so I have reverted his series
before applying mine. I will help him to resolve the resulting
conflicts. You can find the whole series including the HMM revers in
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git branch
attempts/rewrite-mem_hotplug
Motivation:
Movable onlining is a real hack with many downsides - mainly
reintroduction of lowmem/highmem issues we used to have on 32b systems -
but it is the only way to make the memory hotremove more reliable which
is something that people are asking for.
The current semantic of memory movable onlining is really cumbersome,
however. The main reason for this is that the udev driven approach is
basically unusable because udev races with the memory probing while only
the last memory block or the one adjacent to the existing zone_movable
are allowed to be onlined movable. In short the criterion for the
successful online_movable changes under udev's feet. A reliable udev
approach would require a 2 phase approach where the first successful
movable online would have to check all the previous blocks and online
them in descending order. This is hard to be considered sane.
This patchset aims at making the onlining semantic more usable. First of
all it allows onlining memory as movable as long as it doesn't clash with
the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE
cannot overlap. Currently I preserve the original ordering semantic so
the normal zone always precedes the movable zone but I have plans to remove this
restriction in future because it is not really necessary.
First 3 patches are cleanups which should be ready to be merged right
away (unless I have missed something subtle of course).
Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
Patch 5 deals with implicit assumptions of register_one_node on pgdat
initialization.
Patch 6 is the core of the change. In order to make it easier to review
I have tried to keep it as minimalistic as possible and the large code
removal is moved to patch 9.
Patch 7 is a trivial follow up cleanup. Patch 8 fixes sparse warnings
and finally patch 9 removes the unused code.
I have tested the patches in kvm:
# qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
and then probed the additional memory by
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
Then I have used this simple script to probe the memory block by hand
# cat probe_memblock.sh
#!/bin/sh
BLOCK_NR=$1
echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
# for i in $(seq 10); do sh probe_memblock.sh $i; done
# grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Normal Movable
/sys/devices/system/memory/memory35/valid_zones:Normal Movable
/sys/devices/system/memory/memory36/valid_zones:Normal Movable
/sys/devices/system/memory/memory37/valid_zones:Normal Movable
/sys/devices/system/memory/memory38/valid_zones:Normal Movable
/sys/devices/system/memory/memory39/valid_zones:Normal Movable
The main difference to the original implementation is that all new
memblocks can be both online_kernel and online_movable initially
because there is no clash obviously. For the comparison the original
implementation would have
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal
/sys/devices/system/memory/memory35/valid_zones:Normal
/sys/devices/system/memory/memory36/valid_zones:Normal
/sys/devices/system/memory/memory37/valid_zones:Normal
/sys/devices/system/memory/memory38/valid_zones:Normal
/sys/devices/system/memory/memory39/valid_zones:Normal Movable
Now
# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
/sys/devices/system/memory/memory36/valid_zones:Movable
/sys/devices/system/memory/memory37/valid_zones:Movable
/sys/devices/system/memory/memory38/valid_zones:Movable
/sys/devices/system/memory/memory39/valid_zones:Movable
Block 33 can still be onlined both kernel and movable while all
the remaining blocks can only be onlined movable.
/proc/zoneinfo says
Node 0, zone Normal
pages free 0
min 0
low 0
high 0
spanned 0
present 0
--
Node 0, zone Movable
pages free 32753
min 85
low 117
high 149
spanned 32768
present 32768
Probing at a lower address will result in a new memblock (32)
which will still allow both Normal and Movable.
# sh probe_memblock.sh 0
# grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
/sys/devices/system/memory/memory32/valid_zones:Normal Movable
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
and online_kernel will properly convert it to the Normal zone
while 33 can still be onlined both ways.
# echo online_kernel > /sys/devices/system/memory/memory32/state
# grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable
/sys/devices/system/memory/memory35/valid_zones:Movable
/proc/zoneinfo will now tell
Node 0, zone Normal
pages free 65441
min 165
low 230
high 295
spanned 65536
present 65536
--
Node 0, zone Movable
pages free 32740
min 82
low 114
high 146
spanned 32768
present 32768
so both zones have one memblock spanned and present.
Onlining 39 should associate this block to the movable zone
# echo online > /sys/devices/system/memory/memory39/state
/proc/zoneinfo will now tell
Node 0, zone Normal
pages free 32765
min 80
low 112
high 144
spanned 32768
present 32768
--
Node 0, zone Movable
pages free 65501
min 160
low 225
high 290
spanned 196608
present 65536
so we will have a movable zone which spans 6 memblocks, 2 present and 4
representing a hole.
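(Each 128 MB memblock is 32768 4k pages, so spanned 196608 = 6 blocks and
present 65536 = 2 blocks.)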
Offlining both movable blocks will lead to a zone with no present
pages, which is the expected behavior I believe.
# echo offline > /sys/devices/system/memory/memory39/state
# echo offline > /sys/devices/system/memory/memory34/state
# grep -A6 "Movable\|Normal" /proc/zoneinfo
Node 0, zone Normal
pages free 32735
min 90
low 122
high 154
spanned 32768
present 32768
--
Node 0, zone Movable
pages free 0
min 0
low 0
high 0
spanned 196608
present 0
Any thoughts, complaints, suggestions?
As a bonus we will get a nice cleanup in the memory hotplug codebase:
arch/ia64/mm/init.c | 11 +-
arch/powerpc/mm/mem.c | 12 +-
arch/s390/mm/init.c | 32 +--
arch/sh/mm/init.c | 10 +-
arch/x86/mm/init_32.c | 7 +-
arch/x86/mm/init_64.c | 11 +-
drivers/base/memory.c | 74 ++++---
drivers/base/node.c | 58 ++----
include/linux/memory_hotplug.h | 19 +-
include/linux/mmzone.h | 16 +-
include/linux/node.h | 35 +++-
kernel/memremap.c | 6 +-
mm/memory_hotplug.c | 451 ++++++++++++++---------------------------
mm/page_alloc.c | 8 +-
mm/sparse.c | 3 +-
15 files changed, 284 insertions(+), 469 deletions(-)
Shortlog says:
Michal Hocko (9):
mm: remove return value from init_currently_empty_zone
mm, memory_hotplug: use node instead of zone in can_online_high_movable
mm: drop page_initialized check from get_nid_for_pfn
mm, memory_hotplug: get rid of is_zone_device_section
mm, memory_hotplug: split up register_one_node
mm, memory_hotplug: do not associate hotadded memory to zones until online
mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory
mm, memory_hotplug: fix the section mismatch warning
mm, memory_hotplug: remove unused cruft after memory hotplug rework
[1] http://lkml.kernel.org/r/20170330115454.32154-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170331073954.GF27098@dhcp22.suse.cz
[3] http://lkml.kernel.org/r/20170405081400.GE6035@dhcp22.suse.cz
[4] http://lkml.kernel.org/r/20170407121349.GB16392@dhcp22.suse.cz
[5] http://lkml.kernel.org/r/20170407182752.GA17852@redhat.com
* (no subject)
2017-04-10 11:03 [PATCH -v2 0/9] mm: make movable onlining suck less Michal Hocko
@ 2017-04-15 12:17 ` Michal Hocko
2017-04-17 5:47 ` your mail Joonsoo Kim
0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-15 12:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Mel Gorman, Vlastimil Babka, Andrea Arcangeli,
Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
Hi,
here are 3 more preparatory patches which I meant to send on Thursday but
forgot... After more thinking about pfn walkers I have realized that
the current code doesn't check offline holes in zones. From a quick
review that doesn't seem to be a problem currently. Pfn walkers can race
with memory offlining and with the original hotplug implementation those
offline pages can change the zone, but I wasn't able to find any serious
problem other than a small confusion. The new hotplug code will not have
any valid zone, though, so those code paths should check PageReserved
to rule out offline holes. I hope I have addressed all of them in these 3
patches. I would appreciate it if Vlastimil and Jonsoo double checked after
me.
* Re: your mail
2017-04-15 12:17 ` Michal Hocko
@ 2017-04-17 5:47 ` Joonsoo Kim
2017-04-17 8:15 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Joonsoo Kim @ 2017-04-17 5:47 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> Hi,
> here I 3 more preparatory patches which I meant to send on Thursday but
> forgot... After more thinking about pfn walkers I have realized that
> the current code doesn't check offline holes in zones. From a quick
> review that doesn't seem to be a problem currently. Pfn walkers can race
> with memory offlining and with the original hotplug impementation those
> offline pages can change the zone but I wasn't able to find any serious
> problem other than small confusion. The new hotplug code, will not have
> any valid zone, though so those code paths should check PageReserved
> to rule offline holes. I hope I have addressed all of them in these 3
> patches. I would appreciate if Vlastimil and Jonsoo double check after
> me.
Hello, Michal.
s/Jonsoo/Joonsoo. :)
I'm not sure that it's a good idea to add a PageReserved() check in pfn
walkers. First, this makes the struct page validity check a two-step
process, pfn_valid() and then PageReserved(). If we should not use the
struct page in this case, it's better for pfn_valid() to return false
rather than adding a separate check. Anyway, we need to fix more places
(all pfn walkers?) if we want to check validity in two steps.
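In code, the two-step check would look something like this (an illustrative
sketch only, not an existing helper; the loop body is hypothetical):

	/* illustrative only: a walker that needs fully initialized pages */
	static void walk_pfn_range(unsigned long start_pfn, unsigned long end_pfn)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			struct page *page;

			if (!pfn_valid(pfn))		/* step 1: memmap exists */
				continue;
			page = pfn_to_page(pfn);
			if (PageReserved(page))		/* step 2: skip offline holes */
				continue;
			/* only here can page->flags (zone, node, ...) be trusted */
		}
	}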
The other problem I found is that your change will make some
contiguous zones be considered non-contiguous. Memory allocated
by the memblock API is also marked as PageReserved. If we consider this
a hole, we will set such a zone as non-contiguous.
And, I guess that it's not enough to check PageReserved() in
pageblock_pfn_to_page() in order to skip these pages in compaction. If
holes are in the middle of the pageblock, pageblock_pfn_to_page()
cannot catch them and compaction will use the struct page for this hole.
Therefore, I think that making pfn_valid() return false for memory that
is not onlined yet is a better solution for this problem. I don't know the
implementation details of hotplug and I haven't seen your recent changes,
but we may defer memmap initialization until the zone is determined.
It would make pfn_valid() return false for the uninitialized range.
Thanks.
* Re: your mail
2017-04-17 5:47 ` your mail Joonsoo Kim
@ 2017-04-17 8:15 ` Michal Hocko
2017-04-20 1:27 ` Joonsoo Kim
0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-17 8:15 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > Hi,
> > here I 3 more preparatory patches which I meant to send on Thursday but
> > forgot... After more thinking about pfn walkers I have realized that
> > the current code doesn't check offline holes in zones. From a quick
> > review that doesn't seem to be a problem currently. Pfn walkers can race
> > with memory offlining and with the original hotplug impementation those
> > offline pages can change the zone but I wasn't able to find any serious
> > problem other than small confusion. The new hotplug code, will not have
> > any valid zone, though so those code paths should check PageReserved
> > to rule offline holes. I hope I have addressed all of them in these 3
> > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > me.
>
> Hello, Michal.
>
> s/Jonsoo/Joonsoo. :)
ups, sorry about that.
> I'm not sure that it's a good idea to add PageResereved() check in pfn
> walkers. First, this makes struct page validity check as two steps,
> pfn_valid() and then PageResereved().
Yes, those are two separate checks because semantically they are
different. Not all pfn walkers care about the online status.
> If we should not use struct page
> in this case, it's better to pfn_valid() returns false rather than
> adding a separate check. Anyway, we need to fix more places (all pfn
> walker?) if we want to check validity by two steps.
Which pfn walkers do you have in mind?
> The other problem I found is that your change will makes some
> contiguous zones to be considered as non-contiguous. Memory allocated
> by memblock API is also marked as PageResereved. If we consider this as
> a hole, we will set such a zone as non-contiguous.
Why would that be a problem? We shouldn't touch those pages anyway?
> And, I guess that it's not enough to check PageResereved() in
> pageblock_pfn_to_page() in order to skip these pages in compaction. If
> holes are in the middle of the pageblock, pageblock_pfn_to_page()
> cannot catch it and compaction will use struct page for this hole.
Yes, pageblock_pfn_to_page cannot catch it, and it wouldn't with the
current implementation anyway. So the implementation won't be any worse
than the current code. On the other hand, offline holes will always
fill the whole pageblock (assuming those are not spanning multiple
memblocks).
> Therefore, I think that making pfn_valid() return false for not
> onlined memory is a better solution for this problem. I don't know the
> implementation detail for hotplug and I don't see your recent change
> but we may defer memmap initialization until the zone is determined.
> It will make pfn_valid() return false for un-initialized range.
I am not really sure. pfn_valid is used in many contexts and its only
purpose is to tell whether pfn_to_page will return a valid struct page,
AFAIU.
I agree that having more checks is more error prone and we can add a
helper pfn_to_valid_page or something similar, but I believe we can do
that on top of the current hotplug rework. This would require a non-trivial
amount of changes and I believe that a missing check for the
offline holes is not critical - we would (ab)use the lowest zone, which
is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.
Thanks!
--
Michal Hocko
SUSE Labs
* Re: your mail
2017-04-17 8:15 ` Michal Hocko
@ 2017-04-20 1:27 ` Joonsoo Kim
2017-04-20 7:28 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Joonsoo Kim @ 2017-04-20 1:27 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> On Mon 17-04-17 14:47:20, Joonsoo Kim wrote:
> > On Sat, Apr 15, 2017 at 02:17:31PM +0200, Michal Hocko wrote:
> > > Hi,
> > > here I 3 more preparatory patches which I meant to send on Thursday but
> > > forgot... After more thinking about pfn walkers I have realized that
> > > the current code doesn't check offline holes in zones. From a quick
> > > review that doesn't seem to be a problem currently. Pfn walkers can race
> > > with memory offlining and with the original hotplug impementation those
> > > offline pages can change the zone but I wasn't able to find any serious
> > > problem other than small confusion. The new hotplug code, will not have
> > > any valid zone, though so those code paths should check PageReserved
> > > to rule offline holes. I hope I have addressed all of them in these 3
> > > patches. I would appreciate if Vlastimil and Jonsoo double check after
> > > me.
> >
> > Hello, Michal.
> >
> > s/Jonsoo/Joonsoo. :)
>
> ups, sorry about that.
>
> > I'm not sure that it's a good idea to add PageResereved() check in pfn
> > walkers. First, this makes struct page validity check as two steps,
> > pfn_valid() and then PageResereved().
>
> Yes, those are two separate checkes because semantically they are
> different. Not all pfn walkers do care about the online status.
If an offlined page has no valid information, reading information
about offlined pages is just wrong. So, all pfn walkers that read
information about the page should care about it.
I guess that many callers of pfn_valid() are in this category.
>
> > If we should not use struct page
> > in this case, it's better to pfn_valid() returns false rather than
> > adding a separate check. Anyway, we need to fix more places (all pfn
> > walker?) if we want to check validity by two steps.
>
> Which pfn walkers you have in mind?
For example, kpagecount_read() in fs/proc/page.c. I found it by
searching for pfn_valid().
> > The other problem I found is that your change will makes some
> > contiguous zones to be considered as non-contiguous. Memory allocated
> > by memblock API is also marked as PageResereved. If we consider this as
> > a hole, we will set such a zone as non-contiguous.
>
> Why would that be a problem? We shouldn't touch those pages anyway?
Skipping those pages in compaction is valid so there is no problem in this
case.
The problem I mentioned above is that adding a PageReserved() check in
__pageblock_pfn_to_page() invalidates the optimization done by
set_zone_contiguous(). In compaction, we need to get a valid struct
page and that requires a lot of work. There was a performance problem
reported due to this, so the set_zone_contiguous() optimization was added.
It checks whether the zone is contiguous at boot time. If the zone is
determined to be contiguous, we can easily get a valid struct page at
runtime without expensive checks.
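The fast path enabled by that flag looks roughly like this (simplified;
I believe the helper lives in mm/internal.h):

	static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
				unsigned long end_pfn, struct zone *zone)
	{
		if (zone->contiguous)		/* decided once by set_zone_contiguous() */
			return pfn_to_page(start_pfn);

		return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
	}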
Your patch tries to add PageReserved() to __pageblock_pfn_to_page(). That
would make zone->contiguous usually end up false, since memory
used by the memblock API is marked as PageReserved() and your patch regards
it as a hole. It invalidates the set_zone_contiguous() optimization and I
worry about it.
>
> > And, I guess that it's not enough to check PageResereved() in
> > pageblock_pfn_to_page() in order to skip these pages in compaction. If
> > holes are in the middle of the pageblock, pageblock_pfn_to_page()
> > cannot catch it and compaction will use struct page for this hole.
>
> Yes pageblock_pfn_to_page cannot catch it and it wouldn't with the
> current implementation anyway. So the implementation won't be any worse
> than with the current code. On the other hand offline holes will always
> fill the whole pageblock (assuming those are not spanning multiple
> memblocks).
>
> > Therefore, I think that making pfn_valid() return false for not
> > onlined memory is a better solution for this problem. I don't know the
> > implementation detail for hotplug and I don't see your recent change
> > but we may defer memmap initialization until the zone is determined.
> > It will make pfn_valid() return false for un-initialized range.
>
> I am not really sure. pfn_valid is used in many context and its only
> purpose is to tell whether pfn_to_page will return a valid struct page
> AFAIU.
>
> I agree that having more checks is more error prone and we can add a
> helper pfn_to_valid_page or something similar but I believe we can do
> that on top of the current hotplug rework. This would require a non
> trivial amount of changes and I believe that a lacking check for the
> offline holes is not critical - we would (ab)use the lowest zone which
> is similar to (ab)using ZONE_NORMAL/MOVABLE with the original code.
I'm not objecting to your hotplug rework. In fact, I don't know the
relationship between this work and the hotplug rework. I agree
with checking offline holes but I don't like the design and
implementation of it.
Let me clarify my desire(?) for this issue.
1. If pfn_valid() returns true, the struct page has valid information, at
least in flags (zone id, node id, flags, etc...). So, we can use them
without checking PageReserved().
2. pfn_valid() for offlined holes returns false. This can be easily
(?) implemented by manipulating SECTION_MAP_MASK in the hotplug code
(see the sketch below). I guess that there is no reason that pfn_valid()
returns true for offlined holes. If there is, please let me know.
3. We don't need to check PageReserved() in most of pfn walkers in
order to check offline holes.
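For reference, the current SPARSEMEM pfn_valid() only looks at the section
flag bits (simplified from include/linux/mmzone.h), so point 2 amounts to
keeping the relevant bit clear until the section is onlined:

	static inline int pfn_valid(unsigned long pfn)
	{
		if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
			return 0;
		return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
	}

	static inline int valid_section(struct mem_section *section)
	{
		return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
	}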
Thanks.
* Re: your mail
2017-04-20 1:27 ` Joonsoo Kim
@ 2017-04-20 7:28 ` Michal Hocko
2017-04-20 8:49 ` Michal Hocko
2017-04-21 4:38 ` Joonsoo Kim
0 siblings, 2 replies; 67+ messages in thread
From: Michal Hocko @ 2017-04-20 7:28 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
[...]
> > Which pfn walkers you have in mind?
>
> For example, kpagecount_read() in fs/proc/page.c. I searched it by
> using pfn_valid().
Yeah, I've checked that one and in fact this is a good example of the
case where you do not really care about holes. It just checks the page
count, which is valid information under any circumstances.
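For completeness, the relevant part of kpagecount_read() does roughly this
(paraphrased from fs/proc/page.c):

		if (pfn_valid(pfn))
			ppage = pfn_to_page(pfn);
		else
			ppage = NULL;

		if (!ppage || PageSlab(ppage))
			pcount = 0;
		else
			pcount = page_mapcount(ppage);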
> > > The other problem I found is that your change will makes some
> > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > by memblock API is also marked as PageResereved. If we consider this as
> > > a hole, we will set such a zone as non-contiguous.
> >
> > Why would that be a problem? We shouldn't touch those pages anyway?
>
> Skipping those pages in compaction are valid so no problem in this
> case.
>
> The problem I mentioned above is that adding PageReserved() check in
> __pageblock_pfn_to_page() invalidates optimization by
> set_zone_contiguous(). In compaction, we need to get a valid struct
> page and it requires a lot of work. There is performance problem
> report due to this so set_zone_contiguous() optimization is added. It
> checks if the zone is contiguous or not in boot time. If zone is
> determined as contiguous, we can easily get a valid struct page in
> runtime without expensive checks.
OK, I see. I've had some vague understanding and the clarification helps.
> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> woule make that zone->contiguous usually returns false since memory
> used by memblock API is marked as PageReserved() and your patch regard
> it as a hole. It invalidates set_zone_contiguous() optimization and I
> worry about it.
OK, fair enough. I didn't consider memblock allocations. I will rethink
this patch but there are essentially 3 options
- use a different criterion for the offline holes detection. I
have just realized we might do it by storing the online
information into the mem sections
- drop this patch
- move the PageReserved check down the chain into
isolate_freepages_block resp. isolate_migratepages_block
I would prefer 3 over 2 over 1. I definitely want to make this more
robust so 1 is preferable long term but I do not want this to be a
roadblock to the rest of the rework. Does that sound acceptable to you?
[..]
> Let me clarify my desire(?) for this issue.
>
> 1. If pfn_valid() returns true, struct page has valid information, at
> least, in flags (zone id, node id, flags, etc...). So, we can use them
> without checking PageResereved().
This is no longer true after my rework. Pages are associated with the
zone during _onlining_ rather than when they are physically hotplugged.
Basically only the nid is set properly. Strictly speaking this is the
case also without my rework because the zone might change during the online
phase, so you cannot assume it is correct even now. It just happens that
it more or less works just fine.
> 2. pfn_valid() for offlined holes returns false. This can be easily
> (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> guess that there is no reason that pfn_valid() returns true for
> offlined holes. If there is, please let me know.
There is some code which really expects that pfn_valid returns true iff
there is a struct page and it doesn't care about the online status.
E.g. the hotplug code itself, so no, we cannot change pfn_valid. What we can
do, though, is add pfn_to_online_page which would do the proper check.
I have already sent [1]. As noted above, we can (ab)use the remaining bit
in SECTION_MAP_MASK to detect offline pages more robustly.
> 3. We don't need to check PageReserved() in most of pfn walkers in
> order to check offline holes.
We still have to distinguish those who care about offline pages from
those who do not care about it.
Thanks!
--
Michal Hocko
SUSE Labs
* Re: your mail
2017-04-20 7:28 ` Michal Hocko
@ 2017-04-20 8:49 ` Michal Hocko
2017-04-20 11:56 ` Vlastimil Babka
2017-04-21 4:38 ` Joonsoo Kim
1 sibling, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-20 8:49 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
[...]
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
>
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
> - use a different criterion for the offline holes dection. I
> have just realized we might do it by storing the online
> information into the mem sections
> - drop this patch
> - move the PageReferenced check down the chain into
> isolate_freepages_block resp. isolate_migratepages_block
>
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?
So I've played with all three options just to see how the outcome would
look and it turned out that going with 1 will be easiest in the
end. What do you think about the following? It should be free of any
false positives. I have only compile tested it so far.
---
* Re: your mail
2017-04-20 8:49 ` Michal Hocko
@ 2017-04-20 11:56 ` Vlastimil Babka
2017-04-20 12:13 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Vlastimil Babka @ 2017-04-20 11:56 UTC (permalink / raw)
To: Michal Hocko, Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Andrea Arcangeli,
Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes, Daniel Kiper,
Igor Mammedov, Vitaly Kuznetsov, LKML
On 04/20/2017 10:49 AM, Michal Hocko wrote:
> On Thu 20-04-17 09:28:20, Michal Hocko wrote:
>> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> [...]
>>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
>>> woule make that zone->contiguous usually returns false since memory
>>> used by memblock API is marked as PageReserved() and your patch regard
>>> it as a hole. It invalidates set_zone_contiguous() optimization and I
>>> worry about it.
>>
>> OK, fair enough. I did't consider memblock allocations. I will rethink
>> this patch but there are essentially 3 options
>> - use a different criterion for the offline holes dection. I
>> have just realized we might do it by storing the online
>> information into the mem sections
>> - drop this patch
>> - move the PageReferenced check down the chain into
>> isolate_freepages_block resp. isolate_migratepages_block
>>
>> I would prefer 3 over 2 over 1. I definitely want to make this more
>> robust so 1 is preferable long term but I do not want this to be a
>> roadblock to the rest of the rework. Does that sound acceptable to you?
>
> So I've played with all three options just to see how the outcome would
> look like and it turned out that going with 1 will be easiest in the
> end. What do you think about the following? It should be free of any
> false positives. I have only compile tested it yet.
That looks fine, I can't say immediately if it's fully correct. I think you'll
need to bump SECTION_NID_SHIFT as well and make sure things still fit?
Otherwise it looks like nobody has needed a new section bit since 2005, so we
should be fine.
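(For context: the early nid shares section_mem_map with the flag bits,
roughly as in mm/sparse.c below, so a new flag bit at position 2 forces the
nid to start one bit higher.)

	static inline unsigned long sparse_encode_early_nid(int nid)
	{
		return (nid << SECTION_NID_SHIFT);
	}

	static inline int sparse_early_nid(struct mem_section *section)
	{
		return (section->section_mem_map >> SECTION_NID_SHIFT);
	}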
> ---
> From 747794c13c0e82b55b793a31cdbe1a84ee1c6920 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 13 Apr 2017 10:28:45 +0200
> Subject: [PATCH] mm: consider zone which is not fully populated to have holes
>
> __pageblock_pfn_to_page has two users currently, set_zone_contiguous
> which checks whether the given zone contains holes and
> pageblock_pfn_to_page which then carefully returns a first valid
> page from the given pfn range for the given zone. This doesn't handle
> zones which are not fully populated though. Memory pageblocks can be
> offlined or might not have been onlined yet. In such a case the zone
> should be considered to have holes otherwise pfn walkers can touch
> and play with offline pages.
>
> Current callers of pageblock_pfn_to_page in compaction seem to work
> properly right now because they only isolate PageBuddy
> (isolate_freepages_block) or PageLRU resp. __PageMovable
> (isolate_migratepages_block) which will be always false for these pages.
> It would be safer to skip these pages altogether, though.
>
> In order to do this patch adds a new memory section state
> (SECTION_IS_ONLINE) which is set in memory_present (during boot
> time) or in online_pages_range during the memory hotplug. Similarly
> offline_mem_sections clears the bit and it is called when the memory
> range is offlined.
>
> pfn_to_online_page helper is then added which checks the mem section and
> only returns a page if it is onlined already.
>
> Use the new helper in __pageblock_pfn_to_page and skip the whole page
> block in such a case.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/memory_hotplug.h | 21 ++++++++++++++++++++
> include/linux/mmzone.h | 20 ++++++++++++++++++-
> mm/memory_hotplug.c | 3 +++
> mm/page_alloc.c | 5 ++++-
> mm/sparse.c | 45 +++++++++++++++++++++++++++++++++++++++++-
> 5 files changed, 91 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 3c8cf86201c3..fc1c873504eb 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -14,6 +14,19 @@ struct memory_block;
> struct resource;
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +#define pfn_to_online_page(pfn) \
> +({ \
> + struct page *___page = NULL; \
> + \
> + if (online_section_nr(pfn_to_section_nr(pfn))) \
> + ___page = pfn_to_page(pfn); \
> + ___page; \
> +})
>
> /*
> * Types for free bootmem stored in page->lru.next. These have to be in
> @@ -203,6 +216,14 @@ extern void set_zone_contiguous(struct zone *zone);
> extern void clear_zone_contiguous(struct zone *zone);
>
> #else /* ! CONFIG_MEMORY_HOTPLUG */
> +#define pfn_to_online_page(pfn) \
> +({ \
> + struct page *___page = NULL; \
> + if (pfn_valid(pfn)) \
> + ___page = pfn_to_page(pfn); \
> + ___page; \
> + })
> +
> /*
> * Stub functions for when hotplug is off
> */
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 0fc121bbf4ff..cad16ac080f5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1143,7 +1143,8 @@ extern unsigned long usemap_size(void);
> */
> #define SECTION_MARKED_PRESENT (1UL<<0)
> #define SECTION_HAS_MEM_MAP (1UL<<1)
> -#define SECTION_MAP_LAST_BIT (1UL<<2)
> +#define SECTION_IS_ONLINE (1UL<<2)
> +#define SECTION_MAP_LAST_BIT (1UL<<3)
> #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
> #define SECTION_NID_SHIFT 2
>
> @@ -1174,6 +1175,23 @@ static inline int valid_section_nr(unsigned long nr)
> return valid_section(__nr_to_section(nr));
> }
>
> +static inline int online_section(struct mem_section *section)
> +{
> + return (section && (section->section_mem_map & SECTION_IS_ONLINE));
> +}
> +
> +static inline int online_section_nr(unsigned long nr)
> +{
> + return online_section(__nr_to_section(nr));
> +}
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
> +#endif
> +#endif
> +
> static inline struct mem_section *__pfn_to_section(unsigned long pfn)
> {
> return __nr_to_section(pfn_to_section_nr(pfn));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index caa58338d121..98f565c279bf 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -929,6 +929,9 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
> unsigned long i;
> unsigned long onlined_pages = *(unsigned long *)arg;
> struct page *page;
> +
> + online_mem_sections(start_pfn, start_pfn + nr_pages);
> +
> if (PageReserved(pfn_to_page(start_pfn)))
> for (i = 0; i < nr_pages; i++) {
> page = pfn_to_page(start_pfn + i);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d72d29a6ece..fa752de84eef 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1353,7 +1353,9 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
> if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
> return NULL;
>
> - start_page = pfn_to_page(start_pfn);
> + start_page = pfn_to_online_page(start_pfn);
> + if (!start_page)
> + return NULL;
>
> if (page_zone(start_page) != zone)
> return NULL;
> @@ -7686,6 +7688,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
> break;
> if (pfn == end_pfn)
> return;
> + offline_mem_sections(pfn, end_pfn);
> zone = page_zone(pfn_to_page(pfn));
> spin_lock_irqsave(&zone->lock, flags);
> pfn = start_pfn;
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 6903c8fc3085..79017f90d8fc 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -185,7 +185,8 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
> ms = __nr_to_section(section);
> if (!ms->section_mem_map)
> ms->section_mem_map = sparse_encode_early_nid(nid) |
> - SECTION_MARKED_PRESENT;
> + SECTION_MARKED_PRESENT |
> + SECTION_IS_ONLINE;
> }
> }
>
> @@ -590,6 +591,48 @@ void __init sparse_init(void)
> }
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> +
> +/* Mark all memory sections within the pfn range as online */
> +void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn;
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + unsigned long section_nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + /* onlining code should never touch invalid ranges */
> + if (WARN_ON(!valid_section_nr(section_nr)))
> + continue;
> +
> + ms = __nr_to_section(section_nr);
> + ms->section_mem_map |= SECTION_IS_ONLINE;
> + }
> +}
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/* Mark all memory sections within the pfn range as offline */
> +void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned long pfn;
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + unsigned long section_nr = pfn_to_section_nr(pfn);
> + struct mem_section *ms;
> +
> + /*
> + * TODO this needs some double checking. Offlining code makes
> + * sure to check pfn_valid but those checks might be just bogus
> + */
> + if (WARN_ON(!valid_section_nr(section_nr)))
> + continue;
> +
> + ms = __nr_to_section(section_nr);
> + ms->section_mem_map &= ~SECTION_IS_ONLINE;
> + }
> +}
> +#endif
> +
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
> {
>
* Re: your mail
2017-04-20 11:56 ` Vlastimil Babka
@ 2017-04-20 12:13 ` Michal Hocko
0 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2017-04-20 12:13 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Joonsoo Kim, linux-mm, Andrew Morton, Mel Gorman,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Thu 20-04-17 13:56:34, Vlastimil Babka wrote:
> On 04/20/2017 10:49 AM, Michal Hocko wrote:
> > On Thu 20-04-17 09:28:20, Michal Hocko wrote:
> >> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > [...]
> >>> Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> >>> woule make that zone->contiguous usually returns false since memory
> >>> used by memblock API is marked as PageReserved() and your patch regard
> >>> it as a hole. It invalidates set_zone_contiguous() optimization and I
> >>> worry about it.
> >>
> >> OK, fair enough. I did't consider memblock allocations. I will rethink
> >> this patch but there are essentially 3 options
> >> - use a different criterion for the offline holes dection. I
> >> have just realized we might do it by storing the online
> >> information into the mem sections
> >> - drop this patch
> >> - move the PageReferenced check down the chain into
> >> isolate_freepages_block resp. isolate_migratepages_block
> >>
> >> I would prefer 3 over 2 over 1. I definitely want to make this more
> >> robust so 1 is preferable long term but I do not want this to be a
> >> roadblock to the rest of the rework. Does that sound acceptable to you?
> >
> > So I've played with all three options just to see how the outcome would
> > look like and it turned out that going with 1 will be easiest in the
> > end. What do you think about the following? It should be free of any
> > false positives. I have only compile tested it yet.
>
> That looks fine, can't say immediately if fully correct. I think you'll
> need to bump SECTION_NID_SHIFT as well and make sure things still fit?
> Otherwise looks like nobody needed a new section bit since 2005, so we
> should be fine.
You are absolutely right. Thanks for spotting this! I have folded this
in
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 611ff869fa4d..c412e6a3a1e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,7 +1166,7 @@ extern unsigned long usemap_size(void);
#define SECTION_IS_ONLINE (1UL<<2)
#define SECTION_MAP_LAST_BIT (1UL<<3)
#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT 2
+#define SECTION_NID_SHIFT 3
static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
--
Michal Hocko
SUSE Labs
* Re: your mail
2017-04-20 7:28 ` Michal Hocko
2017-04-20 8:49 ` Michal Hocko
@ 2017-04-21 4:38 ` Joonsoo Kim
2017-04-21 7:16 ` Michal Hocko
1 sibling, 1 reply; 67+ messages in thread
From: Joonsoo Kim @ 2017-04-21 4:38 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> [...]
> > > Which pfn walkers you have in mind?
> >
> > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > using pfn_valid().
>
> Yeah, I've checked that one and in fact this is a good example of the
> case where you do not really care about holes. It just checks the page
> count which is a valid information under any circumstances.
I don't think so. First, it checks the page *map* count. Is it still valid
even if PageReserved() is set? What I'd like to ask with this example is
what information is valid if PageReserved() is set. Is there any
design document on this? I think that we need to define/document it first.
And, I hope that all the information in the flags field is valid in all
cases if pfn_valid() returns true. By design.
This makes all the existing pfn walkers happy since we don't need an
additional check for PageReserved().
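Concretely, the zone and node ids are packed into page->flags and decoded
roughly like this (simplified from include/linux/mm.h), which is why their
validity matters to pretty much every walker:

	static inline struct zone *page_zone(const struct page *page)
	{
		return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
	}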
>
> > > > The other problem I found is that your change will makes some
> > > > contiguous zones to be considered as non-contiguous. Memory allocated
> > > > by memblock API is also marked as PageResereved. If we consider this as
> > > > a hole, we will set such a zone as non-contiguous.
> > >
> > > Why would that be a problem? We shouldn't touch those pages anyway?
> >
> > Skipping those pages in compaction are valid so no problem in this
> > case.
> >
> > The problem I mentioned above is that adding PageReserved() check in
> > __pageblock_pfn_to_page() invalidates optimization by
> > set_zone_contiguous(). In compaction, we need to get a valid struct
> > page and it requires a lot of work. There is performance problem
> > report due to this so set_zone_contiguous() optimization is added. It
> > checks if the zone is contiguous or not in boot time. If zone is
> > determined as contiguous, we can easily get a valid struct page in
> > runtime without expensive checks.
>
> OK, I see. I've had some vague understading and the clarification helps.
>
> > Your patch try to add PageReserved() to __pageblock_pfn_to_page(). It
> > woule make that zone->contiguous usually returns false since memory
> > used by memblock API is marked as PageReserved() and your patch regard
> > it as a hole. It invalidates set_zone_contiguous() optimization and I
> > worry about it.
>
> OK, fair enough. I did't consider memblock allocations. I will rethink
> this patch but there are essentially 3 options
> - use a different criterion for the offline holes dection. I
> have just realized we might do it by storing the online
> information into the mem sections
> - drop this patch
> - move the PageReferenced check down the chain into
> isolate_freepages_block resp. isolate_migratepages_block
>
> I would prefer 3 over 2 over 1. I definitely want to make this more
> robust so 1 is preferable long term but I do not want this to be a
> roadblock to the rest of the rework. Does that sound acceptable to you?
I like #1 among the above options and I have already seen your patch for #1.
It's much better than your first attempt but I'm still not happy due
to the semantics of pfn_valid().
> [..]
> > Let me clarify my desire(?) for this issue.
> >
> > 1. If pfn_valid() returns true, struct page has valid information, at
> > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > without checking PageResereved().
>
> This is no longer true after my rework. Pages are associated with the
> zone during _onlining_ rather than when they are physically hotpluged.
If your rework makes the information valid during _onlining_, my
suggestion is to make pfn_valid() return false until onlining.
Callers of pfn_valid() expect that they can get valid information from
the struct page. There is no reason to access the struct page if they
can't get valid information from it. So, passing pfn_valid() should
guarantee that, at least, some kind of information is valid.
If pfn_valid() doesn't guarantee it, most of the pfn walkers should
check PageReserved() to make sure of the validity of the information from
the struct page.
> Basically only the nid is set properly. Strictly speaking this is the
> case also without my rework because the zone might change during online
> phase so you cannot assume it is correct even now. It just happens that
> it more or less works just fine.
>
> > 2. pfn_valid() for offlined holes returns false. This can be easily
> > (?) implemented by manipulating SECTION_MAP_MASK in hotplug code. I
> > guess that there is no reason that pfn_valid() returns true for
> > offlined holes. If there is, please let me know.
>
> There is some code which really expects that pfn_valid returns true iff
> there is a struct page and it doesn't care about the online status.
> E.g. hotplug code itself so no, we cannot change pfn_valid. What we can
> do though is to add pfn_to_online_page which would do the proper check.
> I have already sent [1]. As noted above we can (ab)use the remaining bit
> in SECTION_MAP_MASK to detect offline pages more robustly.
Some pfn_valid() callers in the hotplug code look wrong. They want to check
the section's validity rather than the pfn's validity. Others want to access
the struct page so they fit my assumption (?) for pfn_valid().
Therefore, we can change pfn_valid() to return false until online.
> > 3. We don't need to check PageReserved() in most of pfn walkers in
> > order to check offline holes.
>
> We still have to distinguish those who care about offline pages from
> those who do not care about it.
The hotplug code can distinguish those in another way, by using the new
section mask as you did in the new patch. If someone outside the hotplug
code does care about offline pages, it would be just for optimization rather
than correctness. I think that it's okay.
Thanks.
* Re: your mail
2017-04-21 4:38 ` Joonsoo Kim
@ 2017-04-21 7:16 ` Michal Hocko
2017-04-24 1:44 ` Joonsoo Kim
0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-21 7:16 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > [...]
> > > > Which pfn walkers you have in mind?
> > >
> > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > using pfn_valid().
> >
> > Yeah, I've checked that one and in fact this is a good example of the
> > case where you do not really care about holes. It just checks the page
> > count which is a valid information under any circumstances.
>
> I don't think so. First, it checks the page *map* count. Is it still valid
> even if PageReserved() is set?
I do not know about any user which would manipulate the page map count for
reserved pages. The core MM code doesn't.
> What I'd like to ask in this example is
> that what information is valid if PageReserved() is set. Is there any
> design document on this? I think that we need to define/document it first.
NO, it is not AFAIK.
[...]
> > OK, fair enough. I did't consider memblock allocations. I will rethink
> > this patch but there are essentially 3 options
> > - use a different criterion for the offline holes dection. I
> > have just realized we might do it by storing the online
> > information into the mem sections
> > - drop this patch
> > - move the PageReferenced check down the chain into
> > isolate_freepages_block resp. isolate_migratepages_block
> >
> > I would prefer 3 over 2 over 1. I definitely want to make this more
> > robust so 1 is preferable long term but I do not want this to be a
> > roadblock to the rest of the rework. Does that sound acceptable to you?
>
> I like #1 among of above options and I already see your patch for #1.
> It's much better than your first attempt but I'm still not happy due
> to the semantic of pfn_valid().
You are trying to change the semantics of something that has a well-defined
meaning. I disagree that we should change it. It might sound like a
simpler thing to do because pfn walkers will have to be checked, but what
you are proposing is conflating two different things together.
> > [..]
> > > Let me clarify my desire(?) for this issue.
> > >
> > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > without checking PageResereved().
> >
> > This is no longer true after my rework. Pages are associated with the
> > zone during _onlining_ rather than when they are physically hotpluged.
>
> If your rework make information valid during _onlining_, my
> suggestion is making pfn_valid() return false until onlining.
>
> Caller of pfn_valid() expects that they can get valid information from
> the struct page. There is no reason to access the struct page if they
> can't get valid information from it. So, passing pfn_valid() should
> guarantee that, at least, some kind of information is valid.
>
> If pfn_valid() doesn't guarantee it, most of the pfn walker should
> check PageResereved() to make sure that validity of information from
> the struct page.
This is true only for those walkers which really depend on the full
initialization. This is not the case for all of them. I do not see any
reason to introduce another _pfn_valid to just check whether there is a
struct page...
So please do not conflate those two different concepts together. I
believe that the most prominent pfn walkers should be covered now and
others can be evaluated later.
--
Michal Hocko
SUSE Labs
* Re: your mail
2017-04-21 7:16 ` Michal Hocko
@ 2017-04-24 1:44 ` Joonsoo Kim
2017-04-24 7:53 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Joonsoo Kim @ 2017-04-24 1:44 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Which pfn walkers you have in mind?
> > > >
> > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > using pfn_valid().
> > >
> > > Yeah, I've checked that one and in fact this is a good example of the
> > > case where you do not really care about holes. It just checks the page
> > > count which is a valid information under any circumstances.
> >
> > I don't think so. First, it checks the page *map* count. Is it still valid
> > even if PageReserved() is set?
>
> I do not know about any user which would manipulate page map count for
> referenced pages. The core MM code doesn't.
It's weird that we can get the *map* count without a PageReserved() check,
but we cannot get the zone information.
Zone information is more static than the map count.
It should be defined/documented now what information in
the struct page is valid even if PageReserved() is set. And then, we
need to fix all the things based on this design decision.
>
> > What I'd like to ask in this example is
> > that what information is valid if PageReserved() is set. Is there any
> > design document on this? I think that we need to define/document it first.
>
> NO, it is not AFAIK.
>
> [...]
> > > OK, fair enough. I did't consider memblock allocations. I will rethink
> > > this patch but there are essentially 3 options
> > > - use a different criterion for the offline holes dection. I
> > > have just realized we might do it by storing the online
> > > information into the mem sections
> > > - drop this patch
> > > - move the PageReferenced check down the chain into
> > > isolate_freepages_block resp. isolate_migratepages_block
> > >
> > > I would prefer 3 over 2 over 1. I definitely want to make this more
> > > robust so 1 is preferable long term but I do not want this to be a
> > > roadblock to the rest of the rework. Does that sound acceptable to you?
> >
> > I like #1 among of above options and I already see your patch for #1.
> > It's much better than your first attempt but I'm still not happy due
> > to the semantic of pfn_valid().
>
> You are trying to change a semantic of something that has a well defined
> meaning. I disagree that we should change it. It might sound like a
> simpler thing to do because pfn walkers will have to be checked but what
> you are proposing is conflating two different things together.
I don't think that *I* am trying to change the semantics of pfn_valid().
This would be the original semantics of pfn_valid():
"If pfn_valid() returns true, we can get a proper struct page and the
zone information."
That situation is now being changed by your *hotplug rework* patches:
"Even if pfn_valid() returns true, we cannot get the zone information
without a PageReserved() check, since the *zone is determined during
onlining* and pfn_valid() returns true right after adding the memory."
>
> > > [..]
> > > > Let me clarify my desire(?) for this issue.
> > > >
> > > > 1. If pfn_valid() returns true, struct page has valid information, at
> > > > least, in flags (zone id, node id, flags, etc...). So, we can use them
> > > > without checking PageResereved().
> > >
> > > This is no longer true after my rework. Pages are associated with the
> > > zone during _onlining_ rather than when they are physically hotpluged.
> >
> > If your rework make information valid during _onlining_, my
> > suggestion is making pfn_valid() return false until onlining.
> >
> > Caller of pfn_valid() expects that they can get valid information from
> > the struct page. There is no reason to access the struct page if they
> > can't get valid information from it. So, passing pfn_valid() should
> > guarantee that, at least, some kind of information is valid.
> >
> > If pfn_valid() doesn't guarantee it, most of the pfn walker should
> > check PageResereved() to make sure that validity of information from
> > the struct page.
>
> This is true only for those walkers which really depend on the full
> initialization. This is not the case for all of them. I do not see any
> reason to introduce another _pfn_valid to just check whether there is a
> struct page...
It's a really confusing concept that only some information is valid for
a *not* fully initialized struct page. Moreover, there is no document
describing what information is valid for this half-initialized struct page.
A better design would be to regard all the information as
invalid for a half-initialized struct page. In this case, it's natural
to make pfn_valid() return false for this half-initialized struct page.
>
> So please do not conflate those two different concepts together. I
> believe that the most prominent pfn walkers should be covered now and
> others can be evaluated later.
Even if the original semantics of pfn_valid() are not the ones I mentioned,
I think that the semantics I suggested are better.
Only the hotplug code needs to be changed and others don't need to be changed.
There is no overhead for others. What's the problem with this approach?
And, I'm not sure that you covered the most prominent pfn walkers.
Please see pagetypeinfo_showblockcount_print() in mm/vmstat.c.
As you admitted, the additional-check approach is really error-prone and
this example shows that.
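That walker loops pageblock by pageblock doing roughly the following
(an abridged sketch of mm/vmstat.c); after pfn_valid() it immediately
dereferences the struct page:

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		/* reads the migratetype bitmap for the block the page sits in */
		mtype = get_pageblock_migratetype(page);
		if (mtype < MIGRATE_TYPES)
			count[mtype]++;
	}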
Thanks.
* Re: your mail
2017-04-24 1:44 ` Joonsoo Kim
@ 2017-04-24 7:53 ` Michal Hocko
2017-04-25 2:50 ` Joonsoo Kim
0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-24 7:53 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > Which pfn walkers you have in mind?
> > > > >
> > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > using pfn_valid().
> > > >
> > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > case where you do not really care about holes. It just checks the page
> > > > count which is a valid information under any circumstances.
> > >
> > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > even if PageReserved() is set?
> >
> > I do not know about any user which would manipulate page map count for
> > referenced pages. The core MM code doesn't.
>
> That's weird that we can get *map* count without PageReserved() check,
> but we cannot get zone information.
> Zone information is more static information than map count.
As I've already pointed out the rework of the hotplug code is mainly
about postponing the zone initialization from the physical hot add to
the logical onlining. The zone is really not clear until that moment.
> It should be defined/documented in this time that what information in
> the struct page is valid even if PageReserved() is set. And then, we
> need to fix all the things based on this design decision.
Where would you suggest documenting this? We do have
Documentation/memory-hotplug.txt but it is not really specific about
struct page.
[...]
> > You are trying to change a semantic of something that has a well defined
> > meaning. I disagree that we should change it. It might sound like a
> > simpler thing to do because pfn walkers will have to be checked but what
> > you are proposing is conflating two different things together.
>
> I don't think that *I* try to change the semantic of pfn_valid().
> It would be original semantic of pfn_valid().
>
> "If pfn_valid() returns true, we can get proper struct page and the
> zone information,"
I do not see any guarantee about the zone information anywhere. In fact
this is not true with the original implementation as I've tried to
explain already. We do have new pages associated with a zone but that
association might change during the online phase. So you cannot really
rely on that information until the page is online. There is no real
change in that regards after my rework.
[...]
> > So please do not conflate those two different concepts together. I
> > believe that the most prominent pfn walkers should be covered now and
> > others can be evaluated later.
>
> Even if original pfn_valid()'s semantic is not the one that I mentioned,
> I think that suggested semantic from me is better.
> Only hotplug code need to be changed and others doesn't need to be changed.
> There is no overhead for others. What's the problem about this approach?
That this approach would require checking _every_ single pfn_valid user
in the kernel. That is beyond my time capacity and not really necessary,
because the current code already suffers from the same/similar class of
problems.
> And, I'm not sure that you covered the most prominent pfn walkers.
> Please see pagetypeinfo_showblockcount_print() in mm/vmstat.c.
I probably haven't (and will send a patch to fix this one - thanks for
pointing it out), but the point is that those are broken already and
they can be fixed in follow-up patches. If you change pfn_valid you
might break existing code in unexpected ways.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2017-04-24 7:53 ` Michal Hocko
@ 2017-04-25 2:50 ` Joonsoo Kim
2017-04-26 9:19 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Joonsoo Kim @ 2017-04-25 2:50 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > [...]
> > > > > > > Which pfn walkers you have in mind?
> > > > > >
> > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > using pfn_valid().
> > > > >
> > > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > > case where you do not really care about holes. It just checks the page
> > > > > count which is a valid information under any circumstances.
> > > >
> > > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > > even if PageReserved() is set?
> > >
> > > I do not know about any user which would manipulate page map count for
> > > referenced pages. The core MM code doesn't.
> >
> > That's weird that we can get *map* count without PageReserved() check,
> > but we cannot get zone information.
> > Zone information is more static information than map count.
>
> As I've already pointed out the rework of the hotplug code is mainly
> about postponing the zone initialization from the physical hot add to
> the logical onlining. The zone is really not clear until that moment.
>
> > It should be defined/documented in this time that what information in
> > the struct page is valid even if PageReserved() is set. And then, we
> > need to fix all the things based on this design decision.
>
> Where would you suggest documenting this? We do have
> Documentation/memory-hotplug.txt but it is not really specific about
> struct page.
pfn_valid() in include/linux/mmzone.h looks like the proper place.
>
> [...]
>
> > > You are trying to change a semantic of something that has a well defined
> > > meaning. I disagree that we should change it. It might sound like a
> > > simpler thing to do because pfn walkers will have to be checked but what
> > > you are proposing is conflating two different things together.
> >
> > I don't think that *I* try to change the semantic of pfn_valid().
> > It would be original semantic of pfn_valid().
> >
> > "If pfn_valid() returns true, we can get proper struct page and the
> > zone information,"
>
> I do not see any guarantee about the zone information anywhere. In fact
> this is not true with the original implementation as I've tried to
> explain already. We do have new pages associated with a zone but that
> association might change during the online phase. So you cannot really
> rely on that information until the page is online. There is no real
> change in that regards after my rework.
I know that what you did doesn't change things much. What I'm trying to
say is that the previous handling of pfn_valid() in hotplug is wrong.
Please do not assume that the hotplug implementation is correct and the
other pfn_valid() users are incorrect. There is no design document, so
I'm not sure which one is correct, but the assumption that a pfn_valid()
user can access the whole struct page information makes much sense to me.
So I hope you will fix the hotplug implementation rather than modifying
each pfn_valid() user.
>
> [...]
> > > So please do not conflate those two different concepts together. I
> > > believe that the most prominent pfn walkers should be covered now and
> > > others can be evaluated later.
> >
> > Even if original pfn_valid()'s semantic is not the one that I mentioned,
> > I think that suggested semantic from me is better.
> > Only hotplug code need to be changed and others doesn't need to be changed.
> > There is no overhead for others. What's the problem about this approach?
>
> That this would require to check _every_ single pfn_valid user in the
> kernel. That is beyond my time capacity and not really necessary because
> the current code already suffers from the same/similar class of
> problems.
I think that none of the pfn_valid() users consider the hole case.
Contrary to your expectation, if your way is taken, it requires checking
_every_ pfn_valid() user.
Thanks.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2017-04-25 2:50 ` Joonsoo Kim
@ 2017-04-26 9:19 ` Michal Hocko
2017-04-27 2:08 ` Joonsoo Kim
0 siblings, 1 reply; 67+ messages in thread
From: Michal Hocko @ 2017-04-26 9:19 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Tue 25-04-17 11:50:45, Joonsoo Kim wrote:
> On Mon, Apr 24, 2017 at 09:53:12AM +0200, Michal Hocko wrote:
> > On Mon 24-04-17 10:44:43, Joonsoo Kim wrote:
> > > On Fri, Apr 21, 2017 at 09:16:16AM +0200, Michal Hocko wrote:
> > > > On Fri 21-04-17 13:38:28, Joonsoo Kim wrote:
> > > > > On Thu, Apr 20, 2017 at 09:28:20AM +0200, Michal Hocko wrote:
> > > > > > On Thu 20-04-17 10:27:55, Joonsoo Kim wrote:
> > > > > > > On Mon, Apr 17, 2017 at 10:15:15AM +0200, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > Which pfn walkers you have in mind?
> > > > > > >
> > > > > > > For example, kpagecount_read() in fs/proc/page.c. I searched it by
> > > > > > > using pfn_valid().
> > > > > >
> > > > > > Yeah, I've checked that one and in fact this is a good example of the
> > > > > > case where you do not really care about holes. It just checks the page
> > > > > > count which is a valid information under any circumstances.
> > > > >
> > > > > I don't think so. First, it checks the page *map* count. Is it still valid
> > > > > even if PageReserved() is set?
> > > >
> > > > I do not know about any user which would manipulate page map count for
> > > > referenced pages. The core MM code doesn't.
> > >
> > > That's weird that we can get *map* count without PageReserved() check,
> > > but we cannot get zone information.
> > > Zone information is more static information than map count.
> >
> > As I've already pointed out the rework of the hotplug code is mainly
> > about postponing the zone initialization from the physical hot add to
> > the logical onlining. The zone is really not clear until that moment.
> >
> > > It should be defined/documented in this time that what information in
> > > the struct page is valid even if PageReserved() is set. And then, we
> > > need to fix all the things based on this design decision.
> >
> > Where would you suggest documenting this? We do have
> > Documentation/memory-hotplug.txt but it is not really specific about
> > struct page.
>
> pfn_valid() in include/linux/mmzone.h looks proper place.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c412e6a3a1e9..443258fcac93 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1288,10 +1288,14 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
#ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
/*
* pfn_valid() is meant to be able to tell if a given PFN has valid memmap
- * associated with it or not. In FLATMEM, it is expected that holes always
- * have valid memmap as long as there is valid PFNs either side of the hole.
- * In SPARSEMEM, it is assumed that a valid section has a memmap for the
- * entire section.
+ * associated with it or not. This means that a struct page exists for this
+ * pfn. The caller cannot assume the page is fully initialized though.
+ * pfn_to_online_page() should be used to make sure the struct page is fully
+ * initialized.
+ *
+ * In FLATMEM, it is expected that holes always have valid memmap as long as
+ * there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed
+ * that a valid section has a memmap for the entire section.
*
* However, an ARM, and maybe other embedded architectures in the future
* free memmap backing holes to save memory on the assumption the memmap is
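To make the intended contract concrete, here is a minimal, purely
illustrative sketch of a pfn walker that follows the rule described in
the comment above; the walker itself is hypothetical, and
pfn_to_online_page() is the helper the comment refers to:
#include <linux/mmzone.h>
#include <linux/memory_hotplug.h>
/*
 * Illustrative sketch only, not part of the series: pfn_valid() merely
 * says "a struct page exists"; pfn_to_online_page() additionally checks
 * that the memory is online, i.e. the struct page is fully initialized.
 */
static void example_pfn_walk(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;
	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;
		if (!pfn_valid(pfn))
			continue;	/* no memmap at all for this pfn */
		page = pfn_to_online_page(pfn);
		if (!page)
			continue;	/* memmap exists but not online yet */
		/* safe to inspect the zone, flags, counts, ... here */
	}
}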
> > [...]
> >
> > > > You are trying to change a semantic of something that has a well defined
> > > > meaning. I disagree that we should change it. It might sound like a
> > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > you are proposing is conflating two different things together.
> > >
> > > I don't think that *I* try to change the semantic of pfn_valid().
> > > It would be original semantic of pfn_valid().
> > >
> > > "If pfn_valid() returns true, we can get proper struct page and the
> > > zone information,"
> >
> > I do not see any guarantee about the zone information anywhere. In fact
> > this is not true with the original implementation as I've tried to
> > explain already. We do have new pages associated with a zone but that
> > association might change during the online phase. So you cannot really
> > rely on that information until the page is online. There is no real
> > change in that regards after my rework.
>
> I know that what you did doesn't change thing much. What I try to say
> is that previous implementation related to pfn_valid() in hotplug is
> wrong. Please do not assume that hotplug implementation is correct and
> other pfn_valid() users are incorrect. There is no design document so
> I'm not sure which one is correct but assumption that pfn_valid() user
> can access whole the struct page information makes much sense to me.
Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
still need pfn_valid to work for those pfns. Really, pfn_valid has a
different meaning than you would like it to have. Who knows how many
others like that are lurking there. I feel much more comfortable to go
and hunt already broken code and fix it rather than break something
unexpectedly.
--
Michal Hocko
SUSE Labs
^ permalink raw reply related [flat|nested] 67+ messages in thread
* Re: your mail
2017-04-26 9:19 ` Michal Hocko
@ 2017-04-27 2:08 ` Joonsoo Kim
2017-04-27 15:10 ` Michal Hocko
0 siblings, 1 reply; 67+ messages in thread
From: Joonsoo Kim @ 2017-04-27 2:08 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > [...]
> > >
> > > > > You are trying to change a semantic of something that has a well defined
> > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > > you are proposing is conflating two different things together.
> > > >
> > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > It would be original semantic of pfn_valid().
> > > >
> > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > zone information,"
> > >
> > > I do not see any guarantee about the zone information anywhere. In fact
> > > this is not true with the original implementation as I've tried to
> > > explain already. We do have new pages associated with a zone but that
> > > association might change during the online phase. So you cannot really
> > > rely on that information until the page is online. There is no real
> > > change in that regards after my rework.
> >
> > I know that what you did doesn't change thing much. What I try to say
> > is that previous implementation related to pfn_valid() in hotplug is
> > wrong. Please do not assume that hotplug implementation is correct and
> > other pfn_valid() users are incorrect. There is no design document so
> > I'm not sure which one is correct but assumption that pfn_valid() user
> > can access whole the struct page information makes much sense to me.
>
> Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> still need pfn_valid to work for those pfns. Really, pfn_valid has a
That's really a counterexample to your claim. Those pages require not
only a struct page but also other information, especially the zone index.
Callers check the zone index to know whether a page is a ZONE_DEVICE page
or not. So pfn_valid() for ZONE_DEVICE pages assumes that the struct page
has all the valid information. That matches my suggestion perfectly.
Online isn't the important issue here. The important point is the
condition under which pfn_valid() returns true. pfn_valid() for
ZONE_DEVICE returns true after arch_add_memory(), since all the struct
page information is fixed there. If the zone of hotplugged memory cannot
be fixed at that moment, you can defer it until all the information is
fixed (at onlining). That seems like a better semantic for pfn_valid()
to me.
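For reference, the zone-index test being referred to is essentially the
following; this is a simplified sketch of what the is_zone_device_page()
helper does, and the exact form differs between kernel versions:
#include <linux/mm.h>
#include <linux/mmzone.h>
/* Simplified sketch of the ZONE_DEVICE test discussed above. */
static inline bool page_is_zone_device_sketch(const struct page *page)
{
#ifdef CONFIG_ZONE_DEVICE
	return page_zonenum(page) == ZONE_DEVICE;
#else
	return false;
#endif
}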
> different meaning than you would like it to have. Who knows how many
> others like that are lurking there. I feel much more comfortable to go
> and hunt already broken code and fix it rather than break something
> unexpectedly.
I think I did my best to explain my reasoning. It seems that we cannot
agree with each other, so it would be better for some others to express
their opinions on this problem. I will stop this discussion from now on.
Thanks.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2017-04-27 2:08 ` Joonsoo Kim
@ 2017-04-27 15:10 ` Michal Hocko
0 siblings, 0 replies; 67+ messages in thread
From: Michal Hocko @ 2017-04-27 15:10 UTC (permalink / raw)
To: Joonsoo Kim
Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
qiuxishi, Kani Toshimitsu, slaoub, Andi Kleen, David Rientjes,
Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
On Thu 27-04-17 11:08:38, Joonsoo Kim wrote:
> On Wed, Apr 26, 2017 at 11:19:06AM +0200, Michal Hocko wrote:
> > > > [...]
> > > >
> > > > > > You are trying to change a semantic of something that has a well defined
> > > > > > meaning. I disagree that we should change it. It might sound like a
> > > > > > simpler thing to do because pfn walkers will have to be checked but what
> > > > > > you are proposing is conflating two different things together.
> > > > >
> > > > > I don't think that *I* try to change the semantic of pfn_valid().
> > > > > It would be original semantic of pfn_valid().
> > > > >
> > > > > "If pfn_valid() returns true, we can get proper struct page and the
> > > > > zone information,"
> > > >
> > > > I do not see any guarantee about the zone information anywhere. In fact
> > > > this is not true with the original implementation as I've tried to
> > > > explain already. We do have new pages associated with a zone but that
> > > > association might change during the online phase. So you cannot really
> > > > rely on that information until the page is online. There is no real
> > > > change in that regards after my rework.
> > >
> > > I know that what you did doesn't change thing much. What I try to say
> > > is that previous implementation related to pfn_valid() in hotplug is
> > > wrong. Please do not assume that hotplug implementation is correct and
> > > other pfn_valid() users are incorrect. There is no design document so
> > > I'm not sure which one is correct but assumption that pfn_valid() user
> > > can access whole the struct page information makes much sense to me.
> >
> > Not really. E.g. ZONE_DEVICE pages are never online AFAIK. I believe we
> > still need pfn_valid to work for those pfns. Really, pfn_valid has a
>
> It's really contrary example to your insist. They requires not only
> struct page but also other information, especially, the zone index.
> They checks zone idx to know whether this page is for ZONE_DEVICE or not.
Yes, and they guarantee this association holds, without memory onlining
though. This memory is never online for anybody who asks.
[...]
> I think that I did my best to explain my reasoning. It seems that we
> cannot agree with each other so it's better for some others to express
> their opinion to this problem. I will stop this discussion from now
> on.
I _do_ appreciate your feedback, and if the general consensus is to
modify pfn_valid I can go in that direction, but my gut feeling tells me
that conflating the "existing struct page" test and the "fully online
and initialized" one is the wrong thing to do.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 2012-10-04 16:50 Andrea Arcangeli
2012-10-04 18:17 ` your mail Christoph Lameter
0 siblings, 1 reply; 67+ messages in thread
From: Andrea Arcangeli @ 2012-10-04 16:50 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney
Subject: Re: [PATCH 29/33] autonuma: page_autonuma
Reply-To:
In-Reply-To: <0000013a2c223da2-632aa43e-21f8-4abd-a0ba-2e1b49881e3a-000000@email.amazonses.com>
Hi Christoph,
On Thu, Oct 04, 2012 at 02:16:14PM +0000, Christoph Lameter wrote:
> On Thu, 4 Oct 2012, Andrea Arcangeli wrote:
>
> > Move the autonuma_last_nid from the "struct page" to a separate
> > page_autonuma data structure allocated in the memsection (with
> > sparsemem) or in the pgdat (with flatmem).
>
> Note that there is an available word in struct page before the autonuma
> patches on x86_64 with CONFIG_HAVE_ALIGNED_STRUCT_PAGE.
>
> In fact the page_autonuma fills up the structure to nicely fit in one 64
> byte cacheline.
Good point indeed.
So we could drop page_autonuma by creating a CONFIG_SLUB=y dependency
(AUTONUMA wouldn't be available in the kernel config if SLAB=y, and it
also wouldn't be available on 32bit archs but the latter isn't a
problem).
I think it's a reasonable alternative to page_autonuma. Certainly it
looks more appealing than taking over 16 precious bits from
page->flags. There are still pros and cons. I'm neutral on it so more
comments would be welcome ;).
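For readers who don't have the series handy, the structure being
discussed follows the usual "shadow per-page array" pattern. The sketch
below is purely illustrative and uses hypothetical names; it is not code
from the autonuma patches:
#include <linux/mm.h>
/* One small record per page, kept outside struct page. */
struct page_autonuma_sketch {
	int autonuma_last_nid;	/* the field moved out of struct page */
};
/*
 * With sparsemem, one such array is allocated per memory section next to
 * the section's memmap; the lookup is plain index arithmetic.
 */
static struct page_autonuma_sketch *
lookup_page_autonuma_sketch(struct page_autonuma_sketch *section_base,
			    unsigned long section_start_pfn, struct page *page)
{
	return section_base + (page_to_pfn(page) - section_start_pfn);
}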
Andrea
PS: I randomly moved some people from Cc over to Bcc as I overflowed the
maximum header allowed on linux-kernel, oops!
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2012-10-04 16:50 Andrea Arcangeli
@ 2012-10-04 18:17 ` Christoph Lameter
0 siblings, 0 replies; 67+ messages in thread
From: Christoph Lameter @ 2012-10-04 18:17 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-kernel, linux-mm, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Ingo Molnar, Mel Gorman, Hugh Dickins,
Rik van Riel, Johannes Weiner, Hillf Danton, Andrew Jones,
Dan Smith, Thomas Gleixner, Paul Turner, Suresh Siddha,
Mike Galbraith, Paul E. McKenney
On Thu, 4 Oct 2012, Andrea Arcangeli wrote:
> So we could drop page_autonuma by creating a CONFIG_SLUB=y dependency
> (AUTONUMA wouldn't be available in the kernel config if SLAB=y, and it
> also wouldn't be available on 32bit archs but the latter isn't a
> problem).
Nope, it should depend on page struct alignment. Other kernel subsystems
may be depending on page struct alignment in the future (and some other
arches may already have that requirement).
^ permalink raw reply [flat|nested] 67+ messages in thread
[parent not found: <1131.86.55.168.2.1170690089.squirrel@mail.thinknet.ro>]
* Re: your mail
[not found] <1131.86.55.168.2.1170690089.squirrel@mail.thinknet.ro>
@ 2007-02-05 12:36 ` Joerg Roedel
0 siblings, 0 replies; 67+ messages in thread
From: Joerg Roedel @ 2007-02-05 12:36 UTC (permalink / raw)
To: logic; +Cc: linux-kernel, linux-mm
On Mon, Feb 05, 2007 at 05:41:29PM +0200, logic@thinknet.ro wrote:
> Good morning,
>
> I am experiencing a bug i think. I am running a 2.6.19.2 kernel on a 3Ghz
> Intel with HT activated, 1 gb ram, and noname motherboard. Here is the
> output of the hang:
Hmm, this seems to be the same issue as in [1] and [2]: a page that is
assumed to belong to the slab but is no longer marked as a slab page.
Could this be a bug in the memory management?
Joerg
[1] http://lkml.org/lkml/2007/2/4/77
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=406477
--
Joerg Roedel
Operating System Research Center
AMD Saxony LLC & Co. KG
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
@ 2003-01-24 5:54 Anoop J.
2003-01-24 6:28 ` David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Anoop J. @ 2003-01-24 5:54 UTC (permalink / raw)
To: linux-mm, linux-kernel
How is this different from a fully associative cache? It would be better
if you could explain it based on the address bits used.
Thanks
David Lang wrote:
>The idea of page coloring is based on the fact that common implementations
>of caching can't put any page in memory in any line in the cache (such an
>implementation is possible, but is more expensive to do so is not commonly
>done)
>
>With this implementation it means that if your program happens to use
>memory that cannot be mapped to half of the cache lines then effectivly
>the CPU cache is half it's rated size for your program. the next time your
>program runs it may get a more favorable memory allocation and be able to
>use all of the cache and therefor run faster.
>
>Page coloring is an attampt to take this into account when allocating
>memory to programs so that every program gets to use all of the cache.
>
>David Lang
>
>
> On Fri, 24 Jan 2003, Anoop J. wrote:
>
>>Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
>>From: Anoop J. <cs99001@nitc.ac.in>
>>To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
>>
>>
>>How does page coloring work. Iwant its mechanism not the implementation.
>>I went through some pages of W.L.Lynch's paper on cache and VM. Still not
>>able to grasp it .
>>
>>
>>Thanks in advance
>>
>>
>>
>>-
>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>the body of a message to majordomo@vger.kernel.org
>>More majordomo info at http://vger.kernel.org/majordomo-info.html
>>Please read the FAQ at http://www.tux.org/lkml/
>>
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 5:54 Anoop J.
@ 2003-01-24 6:28 ` David Lang
2003-01-24 8:51 ` Anoop J.
0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2003-01-24 6:28 UTC (permalink / raw)
To: Anoop J.; +Cc: linux-mm, linux-kernel
Implementing a fully associative cache eliminates the need for page
coloring, but it has to be implemented in hardware. If you don't have
fully associative caches in your hardware, page coloring helps avoid the
worst-case memory allocations.
From what I have seen of the attempts to implement it, the problem is
that the calculations needed to do page-colored allocations end up
costing enough that they result in a net loss compared to the old method.
David Lang
On Fri, 24 Jan 2003, Anoop J.
wrote:
> Date: Fri, 24 Jan 2003 11:24:24 +0530 (IST)
> From: Anoop J. <cs99001@nitc.ac.in>
> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
>
>
> How is this different from a fully associative cache .Would be better if u
> could deal it based on the address bits used
>
> Thanks
>
> David Lang wrote:
>
> >The idea of page coloring is based on the fact that common implementations
> >of caching can't put any page in memory in any line in the cache (such an
> >implementation is possible, but is more expensive to do so is not commonly
> >done)
> >
> >With this implementation it means that if your program happens to use
> >memory that cannot be mapped to half of the cache lines then effectivly
> >the CPU cache is half it's rated size for your program. the next time your
> >program runs it may get a more favorable memory allocation and be able to
> >use all of the cache and therefor run faster.
> >
> >Page coloring is an attampt to take this into account when allocating
> >memory to programs so that every program gets to use all of the cache.
> >
> >David Lang
> >
> >
> > On Fri, 24 Jan 2003, Anoop J. wrote:
> >
> >>Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
> >>From: Anoop J. <cs99001@nitc.ac.in>
> >>To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
> >>
> >>
> >>How does page coloring work. Iwant its mechanism not the implementation.
> >>I went through some pages of W.L.Lynch's paper on cache and VM. Still not
> >>able to grasp it .
> >>
> >>
> >>Thanks in advance
> >>
> >>
> >>
> >>-
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majordomo@vger.kernel.org
> >>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at http://www.tux.org/lkml/
> >>
> >
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 6:28 ` David Lang
@ 2003-01-24 8:51 ` Anoop J.
2003-01-24 8:48 ` David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Anoop J. @ 2003-01-24 8:51 UTC (permalink / raw)
To: david.lang; +Cc: linux-mm, linux-kernel
I read that the data coherency problem due to virtual indexing is avoided
through page coloring, and that it also gets the speed of physical
indexing. Can you elaborate on how this is possible?
Thanks
> implementing a fully associative cache eliminates the need for page
> coloring, but it has to be implemented in hardware. if you don't have
> fully associative caches in your hardware page coloring helps avoid the
> worst case memory allocations.
>
> from what I have seen on the attempts to implement it the problem is
> that the calculations needed to do page colored allocations end up
> costing enough that they end up with a net loss compared to the old
> method.
>
> David Lang
>
>
> On Fri, 24 Jan 2003, Anoop J.
> wrote:
>
>> Date: Fri, 24 Jan 2003 11:24:24 +0530 (IST)
>> From: Anoop J. <cs99001@nitc.ac.in>
>> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
>>
>>
>> How is this different from a fully associative cache .Would be better
>> if u could deal it based on the address bits used
>>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 8:51 ` Anoop J.
@ 2003-01-24 8:48 ` David Lang
2003-01-24 9:49 ` Anoop J.
0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2003-01-24 8:48 UTC (permalink / raw)
To: Anoop J.; +Cc: linux-mm, linux-kernel
I think this is a case of the same term being used for two different
purposes. I don't know the use you are referring to.
David Lang
On Fri, 24 Jan 2003, Anoop J. wrote:
> I read that the data coherency problem due to virtual indexing is avoided
> through page coloring and it has also got the speed of physical indexing
> can u just elaborate on how this is possible?
>
>
> Thanks
>
>
>
>
> > implementing a fully associative cache eliminates the need for page
> > coloring, but it has to be implemented in hardware. if you don't have
> > fully associative caches in your hardware page coloring helps avoid the
> > worst case memory allocations.
> >
> > from what I have seen on the attempts to implement it the problem is
> > that the calculations needed to do page colored allocations end up
> > costing enough that they end up with a net loss compared to the old
> > method.
> >
> > David Lang
> >
> >
> > On Fri, 24 Jan 2003, Anoop J.
> > wrote:
> >
> >> Date: Fri, 24 Jan 2003 11:24:24 +0530 (IST)
> >> From: Anoop J. <cs99001@nitc.ac.in>
> >> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
> >>
> >>
> >> How is this different from a fully associative cache .Would be better
> >> if u could deal it based on the address bits used
> >>
>
>
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 8:48 ` David Lang
@ 2003-01-24 9:49 ` Anoop J.
2003-01-24 19:14 ` David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Anoop J. @ 2003-01-24 9:49 UTC (permalink / raw)
To: david.lang; +Cc: linux-mm, linux-kernel
OK, I shall put it another way. Since virtual indexing is a
representation of the virtual memory, it is possible for multiple virtual
addresses to represent the same physical address, so the problem of
aliasing occurs in the cache. Does page coloring guarantee a unique
mapping of the physical address? If so, how is the mapping from virtual
to physical address done?
Thanks
> I think this is a case of the same tuerm being used for two different
> purposes. I don't know the use you are refering to.
>
> David Lang
>
>
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 9:49 ` Anoop J.
@ 2003-01-24 19:14 ` David Lang
2003-01-24 19:40 ` Maciej W. Rozycki
0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2003-01-24 19:14 UTC (permalink / raw)
To: Anoop J.; +Cc: linux-mm, linux-kernel
The cache never sees the virtual addresses; it operates exclusively on
the physical addresses, so the problem of aliasing never comes up.
Virtual-to-physical address mapping is all resolved before anything hits
the cache.
David Lang
On Fri, 24 Jan 2003, Anoop J. wrote:
> Date: Fri, 24 Jan 2003 15:19:16 +0530 (IST)
> From: Anoop J. <cs99001@nitc.ac.in>
> To: david.lang@digitalinsight.com
> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
> Subject: Re: your mail
>
> ok i shall put it in another way
> since virtual indexing is a representation of the virtual memory,
> it is possible for more multiple virtual addresses to represent the same
> physical address.So the problem of aliasing occurs in the cache.Does page
> coloring guarantee a unique mapping of physical address.If so how is the
> maping from virtual to physical address
>
>
>
> Thanks
>
>
>
> > I think this is a case of the same tuerm being used for two different
> > purposes. I don't know the use you are refering to.
> >
> > David Lang
> >
> >
> >
>
>
>
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 19:14 ` David Lang
@ 2003-01-24 19:40 ` Maciej W. Rozycki
0 siblings, 0 replies; 67+ messages in thread
From: Maciej W. Rozycki @ 2003-01-24 19:40 UTC (permalink / raw)
To: David Lang; +Cc: Anoop J., linux-mm, linux-kernel
On Fri, 24 Jan 2003, David Lang wrote:
> the cache never sees the virtual addresses, it operated excclusivly on the
> physical addresses so the problem of aliasing never comes up.
It depends on the implementation.
> virtual to physical addres mapping is all resolved before anything hits
> the cache.
It depends on the processor.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +
^ permalink raw reply [flat|nested] 67+ messages in thread
* (unknown),
@ 2003-01-24 5:08 Anoop J.
2003-01-24 5:11 ` your mail David Lang
0 siblings, 1 reply; 67+ messages in thread
From: Anoop J. @ 2003-01-24 5:08 UTC (permalink / raw)
To: linux-kernel, linux-mm
How does page coloring work? I want its mechanism, not the
implementation. I went through some pages of W. L. Lynch's paper on cache
and VM, and I'm still not able to grasp it.
Thanks in advance
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 5:08 (unknown), Anoop J.
@ 2003-01-24 5:11 ` David Lang
2003-01-24 6:06 ` John Alvord
0 siblings, 1 reply; 67+ messages in thread
From: David Lang @ 2003-01-24 5:11 UTC (permalink / raw)
To: Anoop J.; +Cc: linux-kernel, linux-mm
The idea of page coloring is based on the fact that common
implementations of caching can't put any page in memory into any line in
the cache (such an implementation is possible, but is more expensive, so
it is not commonly done).
With this implementation, if your program happens to use memory that
cannot be mapped to half of the cache lines, then effectively the CPU
cache is half its rated size for your program. The next time your program
runs it may get a more favorable memory allocation and be able to use all
of the cache and therefore run faster.
Page coloring is an attempt to take this into account when allocating
memory to programs so that every program gets to use all of the cache.
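A back-of-the-envelope sketch of the arithmetic behind this, with made-up
cache geometry (illustrative numbers, not any particular CPU):
/*
 * Example: a physically indexed, direct-mapped 1 MB cache with 4 KB pages.
 * The cache set is chosen by physical address bits, so every physical page
 * falls into one of CACHE_SZ / PAGE_SZ "colors"; two pages of the same
 * color always compete for the same cache lines.
 */
#define CACHE_SZ	(1UL << 20)		/* 1 MB, illustrative */
#define PAGE_SZ		(1UL << 12)		/* 4 KB pages */
#define NR_COLORS	(CACHE_SZ / PAGE_SZ)	/* 256 colors */
static unsigned long page_color(unsigned long phys_addr)
{
	return (phys_addr / PAGE_SZ) % NR_COLORS;
}
If the allocator happens to hand a program physical pages covering only
half of the possible page_color() values, only half of the cache sets can
ever hold that program's data; page coloring tries to spread each
program's pages evenly over all colors instead.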
David Lang
On Fri, 24 Jan 2003, Anoop J. wrote:
> Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
> From: Anoop J. <cs99001@nitc.ac.in>
> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
>
>
> How does page coloring work. Iwant its mechanism not the implementation.
> I went through some pages of W.L.Lynch's paper on cache and VM. Still not
> able to grasp it .
>
>
> Thanks in advance
>
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 5:11 ` your mail David Lang
@ 2003-01-24 6:06 ` John Alvord
2003-01-25 2:29 ` Jason Papadopoulos
0 siblings, 1 reply; 67+ messages in thread
From: John Alvord @ 2003-01-24 6:06 UTC (permalink / raw)
To: David Lang; +Cc: Anoop J., linux-kernel, linux-mm
The big challenge in Linux is that several serious attempts to add
page coloring have foundered on the shoals of "no benefit found". It
may be that the typical hardware Linux runs on just doesn't experience
the problem very much.
john
On Thu, 23 Jan 2003 21:11:10 -0800 (PST), David Lang
<david.lang@digitalinsight.com> wrote:
>The idea of page coloring is based on the fact that common implementations
>of caching can't put any page in memory in any line in the cache (such an
>implementation is possible, but is more expensive to do so is not commonly
>done)
>
>With this implementation it means that if your program happens to use
>memory that cannot be mapped to half of the cache lines then effectivly
>the CPU cache is half it's rated size for your program. the next time your
>program runs it may get a more favorable memory allocation and be able to
>use all of the cache and therefor run faster.
>
>Page coloring is an attampt to take this into account when allocating
>memory to programs so that every program gets to use all of the cache.
>
>David Lang
>
>
> On Fri, 24 Jan 2003, Anoop J. wrote:
>
>> Date: Fri, 24 Jan 2003 10:38:03 +0530 (IST)
>> From: Anoop J. <cs99001@nitc.ac.in>
>> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
>>
>>
>> How does page coloring work. Iwant its mechanism not the implementation.
>> I went through some pages of W.L.Lynch's paper on cache and VM. Still not
>> able to grasp it .
>>
>>
>> Thanks in advance
>>
>>
>>
>> -
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-24 6:06 ` John Alvord
@ 2003-01-25 2:29 ` Jason Papadopoulos
2003-01-25 2:26 ` Larry McVoy
0 siblings, 1 reply; 67+ messages in thread
From: Jason Papadopoulos @ 2003-01-25 2:29 UTC (permalink / raw)
To: linux-kernel, linux-mm
At 10:06 PM 1/23/03 -0800, John Alvord wrote:
>The big challenge in Linux is that several serious attempts to add
>page coloring have foundered on the shoals of "no benefit found". It
>may be that the typical hardware Linux runs on just doesn't experience
>the problem very much.
Another strike against page coloring is that it gives tremendous benefits
when caches are large and not very associative; when neither of those
holds, the benefits are much smaller. In the case of latter-day PCs,
neither is the case: the caches are very small and at least 8-way set
associative.
For the record, I finally got to try my own page coloring patch on a 1GHz
Athlon Thunderbird system with 256kB L2 cache. With the present patch, my
own number crunching benchmarks and a kernel compile don't show any benefit
at all, and lmbench is completely unchanged except for the mmap latency,
which is slightly worse. Hardly a compelling case for PCs!
Oh well. At least now I'll be able to port to 2.5 :)
jasonp
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-25 2:29 ` Jason Papadopoulos
@ 2003-01-25 2:26 ` Larry McVoy
2003-01-25 17:47 ` Eric W. Biederman
0 siblings, 1 reply; 67+ messages in thread
From: Larry McVoy @ 2003-01-25 2:26 UTC (permalink / raw)
To: Jason Papadopoulos; +Cc: linux-kernel, linux-mm
> For the record, I finally got to try my own page coloring patch on a 1GHz
> Athlon Thunderbird system with 256kB L2 cache. With the present patch, my
> own number crunching benchmarks and a kernel compile don't show any benefit
> at all, and lmbench is completely unchanged except for the mmap latency,
> which is slightly worse. Hardly a compelling case for PCs!
If it works correctly then the variability in lat_ctx should go away.
Try this
for p in 2 4 8 12 16 24 32 64
do for size in 0 2 4 8 16
do for i in 1 2 3 4 5 6 7 8 9 0
do lat_ctx -s$size $p
done
done
done
on both kernels, with and without the patch. The page coloring should
make the numbers rock steady; without it, they will bounce a lot.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-25 2:26 ` Larry McVoy
@ 2003-01-25 17:47 ` Eric W. Biederman
2003-01-25 23:10 ` Larry McVoy
0 siblings, 1 reply; 67+ messages in thread
From: Eric W. Biederman @ 2003-01-25 17:47 UTC (permalink / raw)
To: Larry McVoy; +Cc: Jason Papadopoulos, linux-kernel, linux-mm
Larry McVoy <lm@bitmover.com> writes:
> > For the record, I finally got to try my own page coloring patch on a 1GHz
> > Athlon Thunderbird system with 256kB L2 cache. With the present patch, my
> > own number crunching benchmarks and a kernel compile don't show any benefit
> > at all, and lmbench is completely unchanged except for the mmap latency,
> > which is slightly worse. Hardly a compelling case for PCs!
>
> If it works correctly then the variability in lat_ctx should go away.
> Try this
>
> for p in 2 4 8 12 16 24 32 64
> do for size in 0 2 4 8 16
> do for i in 1 2 3 4 5 6 7 8 9 0
> do lat_ctx -s$size $p
> done
> done
> done
>
> on both the with and without kernel. The page coloring should make the
> numbers rock steady, without it, they will bounce a lot.
In the same vein, I have seen some tremendous variability in the stream
benchmark. Under Linux I have gotten it to vary by as much as 100MB/sec
by running updatedb between runs. In one case it ran faster with updatedb
running in the background.
But at the same time, stream tends to be very steady if you have a quiet
machine and run it several times in a row, because it gets allocated
essentially the same memory every run.
So I do know the variables of cache contention have an effect on some
real programs. I have not yet tracked it down to see if cache coloring
could be a benefit. I suspect the buddy allocator actually comes quite
close most of the time, and tricks like allocating multiple pages at once
could improve that even more with very little effort, while reducing page
fault miss times.
I am wondering if there is any point in biasing page addresses between
processes so that processes are less likely to have a cache conflict,
i.e. process 1 address 0 % 16K == 0, process 2 address 0 % 16K == 4K.
Eric
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-25 17:47 ` Eric W. Biederman
@ 2003-01-25 23:10 ` Larry McVoy
2003-01-26 8:12 ` David S. Miller
0 siblings, 1 reply; 67+ messages in thread
From: Larry McVoy @ 2003-01-25 23:10 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Larry McVoy, Jason Papadopoulos, linux-kernel, linux-mm
> I am wondering if there is any point in biasing page addresses in between
> processes so that processes are less likely to have a cache conflict.
> i.e. process 1 address 0 %16K == 0, process 2 address 0 %16K == 4K
All good page coloring implementations do exactly that. The starting
index into the page buckets is based on the process id.
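A toy sketch of that policy, with hypothetical names and structure not
taken from any real patch: free pages are kept in per-color buckets, and
the preferred color for a fault is derived from the faulting virtual
address plus a pid-based offset.
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/sched.h>
#define NR_COLORS 256	/* illustrative */
struct colored_free_lists {
	struct list_head bucket[NR_COLORS];	/* free pages, one list per color */
};
static struct page *alloc_colored_page(struct colored_free_lists *fl,
				       struct task_struct *tsk,
				       unsigned long vaddr)
{
	unsigned int want = ((vaddr >> PAGE_SHIFT) + tsk->pid) % NR_COLORS;
	unsigned int i;
	/*
	 * Prefer the wanted color and fall back to the next non-empty bucket;
	 * a real allocator would also unlink the page and update counters.
	 */
	for (i = 0; i < NR_COLORS; i++) {
		unsigned int c = (want + i) % NR_COLORS;
		if (!list_empty(&fl->bucket[c]))
			return list_first_entry(&fl->bucket[c], struct page, lru);
	}
	return NULL;
}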
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2003-01-25 23:10 ` Larry McVoy
@ 2003-01-26 8:12 ` David S. Miller
0 siblings, 0 replies; 67+ messages in thread
From: David S. Miller @ 2003-01-26 8:12 UTC (permalink / raw)
To: Larry McVoy; +Cc: Eric W. Biederman, Jason Papadopoulos, linux-kernel, linux-mm
On Sat, 2003-01-25 at 15:10, Larry McVoy wrote:
> All good page coloring implementation do exactly that. The starting
> index into the page buckets is based on process id.
I think everyone interested in learning more about this
topic should go read the following papers, they were very
helpful when I was fiddling around in this area.
These papers, in turn, reference several others which are
good reads as well.
1) W. L. Lynch, B. K. Bray, and M. J. Flynn. "The effect of page
allocation on caches". In Micro-25 Conference Proceedings, pages
222-225, December 1992.
2) W. Lynch and M. Flynn. "Cache improvements through colored page
allocation". ACM Transactions on Computer Systems, 1993. Submitted
for review, 1992.
3) William L. Lynch. "The Interaction of Virtual Memory and Cache
Memory". PhD thesis, Stanford University, October
1993. CSL-TR-93-587.
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 2002-04-21 14:54 raciel
2002-04-21 19:12 ` your mail William Lee Irwin III
0 siblings, 1 reply; 67+ messages in thread
From: raciel @ 2002-04-21 14:54 UTC (permalink / raw)
To: linux-mm
Hello all :)
I have been trying to understand the rmap patch from Rik van Riel, but I
don't understand what the rmap patch does. Can somebody explain it to me,
or point me to a good site where I can find documentation?
Regards, Raciel
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2002-04-21 14:54 raciel
@ 2002-04-21 19:12 ` William Lee Irwin III
0 siblings, 0 replies; 67+ messages in thread
From: William Lee Irwin III @ 2002-04-21 19:12 UTC (permalink / raw)
To: raciel; +Cc: linux-mm
On Sun, Apr 21, 2002 at 02:54:00PM +0000, raciel wrote:
> Hello all :)
> I have been trying to understand the rmap patch from Rik Van Riel, but
> i dont undertand what the rmap patch do. Somebody can explain me or know
> a good site where i can get documentation?
> Regards Raciel
The largest piece of functionality is additional accounting that enables
the kernel to find all users of a given page. This is known as physical
scanning, and other OS's call similar functionality "pmap", "ptov" (for
physical to virtual), and "HAT" (Hardware Address Translation), though
the interfaces are just as different as the names.
There are two very important translation mechanisms:
(1) physical page to page table entry
(2) 3rd-level pagetable to address space (mm_struct)
(1) is accomplished by using a singly linked list of page table entry
addresses attached to a per-physical-page structure (struct page).
(2) is accomplished by reusing one of the fields of struct page for the
3rd-level pagetable to hold the pointer to the mm_struct.
To clarify how a physical page is handled: the page replacement algorithm
might decide that a given physical page is targeted for eviction. It then
calls try_to_unmap(), which traverses the list of 3rd-level pagetable
entries, rounds their addresses to obtain the physical page occupied by
each 3rd-level pagetable, and finds the struct page for that physical
page. It then tries to take the address space's pagetable lock; when it
gets it, it simply removes the entry from the pagetable, otherwise it
returns an error code saying "try again later" (SWAP_AGAIN).
All this requires is
(1) putting an entry onto the per-page list when entering a page
into pagetables
(2) pulling things off the per-page list when removing a page
from pagetables
(3) marking a 3rd-level pagetable's struct page with the mm_struct
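A heavily simplified sketch of the structures and the unmap walk
described above; field and helper names are approximate, not the actual
patch code:
/* One entry per pte that currently maps a given physical page. */
struct pte_chain_sketch {
	struct pte_chain_sketch *next;
	pte_t *ptep;
};
/*
 * (1)/(2): entries are added and removed as ptes are set up and torn down;
 * each struct page heads its own list of these.
 */
static int try_to_unmap_sketch(struct pte_chain_sketch *chain)
{
	struct pte_chain_sketch *pc;
	for (pc = chain; pc; pc = pc->next) {
		/* (3): the pagetable's own struct page remembers the mm_struct */
		struct page *pt_page = virt_to_page(pc->ptep);
		struct mm_struct *mm = pt_page_to_mm(pt_page);	/* hypothetical helper */
		if (!spin_trylock(&mm->page_table_lock))
			return SWAP_AGAIN;	/* "try again later" */
		/* clear *pc->ptep and move the page toward swap here */
		spin_unlock(&mm->page_table_lock);
	}
	return SWAP_SUCCESS;
}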
Cheers,
Bill
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 2002-01-02 14:20 mehul radheshyam choube
2002-01-03 16:40 ` your mail Rik van Riel
0 siblings, 1 reply; 67+ messages in thread
From: mehul radheshyam choube @ 2002-01-02 14:20 UTC (permalink / raw)
To: linux-mm
Hi friends,
I would like to do some development work for linux-mm.
Please guide me.
Mehul.
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 2001-08-04 11:10 Mahmoud Taghizadeh
2001-08-04 13:18 ` your mail Francois Romieu
0 siblings, 1 reply; 67+ messages in thread
From: Mahmoud Taghizadeh @ 2001-08-04 11:10 UTC (permalink / raw)
To: linux-mm
I am sorry for my stupid question!
Is there any Linux distribution without a memory management unit? I mean,
where paging is disabled and memory management is done only by
segmentation.
I am thankful in advance.
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 2001-06-08 1:36 jnn
2001-06-08 13:16 ` your mail Ralf Baechle
0 siblings, 1 reply; 67+ messages in thread
From: jnn @ 2001-06-08 1:36 UTC (permalink / raw)
To: linux-mm
Hi all,
can somebody please tell me where I can find some documentation about the
mechanism of cache flushing?
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 2000-09-04 12:01 Sahil
2000-09-04 15:35 ` your mail Rik van Riel
0 siblings, 1 reply; 67+ messages in thread
From: Sahil @ 2000-09-04 12:01 UTC (permalink / raw)
To: linux-mm
Dear friends,
I am a newbie to kernel programming.
Can anyone suggest some good readings for the same?
Shahil
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
@ 2000-03-28 8:19 pnilesh
2000-03-28 13:26 ` Stephen C. Tweedie
0 siblings, 1 reply; 67+ messages in thread
From: pnilesh @ 2000-03-28 8:19 UTC (permalink / raw)
To: Kanoj Sarcar; +Cc: linux-mm
No, if both processes have faulted in the page into their ptes, it will
be 2. The page count is normally the number of references from user
ptes, plus any long/short term holds kernel code establishes on the
page.
I was confused because Maurice Bach increases the region reference count
when a region (say, text) is shared among more than one process, not the
page reference count.
One more thing: if the process incurs a page fault on a text page, it
calls file_no_page(). From what you said, in this case it should
increment the page count, but nowhere in this function could I see the
page count getting incremented.
>
> Q When a page of a file is in page hash queue, does this page have page
> table entry in any process ?
Possibly, if the file is mmaped into some other process.
> Q Can this be discarded right away , if the need arises?
>
At the minimum, you need to write modified contents back to disk, if
the file page has not already been discarded.
The David Rusling book says that when reducing the page cache and buffer
cache, the page table entries are not modified and the pages can be
dropped directly.
Kanoj
> Nilesh Patel
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux.eu.org/Linux-MM/
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
2000-03-28 8:19 pnilesh
@ 2000-03-28 13:26 ` Stephen C. Tweedie
0 siblings, 0 replies; 67+ messages in thread
From: Stephen C. Tweedie @ 2000-03-28 13:26 UTC (permalink / raw)
To: pnilesh; +Cc: Kanoj Sarcar, linux-mm, Stephen Tweedie
Hi,
On Tue, Mar 28, 2000 at 01:49:04PM +0530, pnilesh@in.ibm.com wrote:
>
> No, if both processes have faulted in the page into their ptes, it will
> be 2.
3. The page cache counts as a reference.
> One more thing if the process ocurrs a page fault on text page it calls
> file_no_page()
> From what you said in this case it should increment the page count but in
> this function no where I could see the page count getting incremented.
It is done implicitly when filemap_nopage() looks for the page in the
page cache: __find_page() increments the reference count of any page
it finds before returning.
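In other words, the lookup itself hands back an elevated reference count,
roughly like this sketch (illustrative only, not the actual source of
that era; hash_lookup() is a hypothetical stand-in for the page-cache
hash search):
static struct page *find_page_sketch(struct address_space *mapping,
				     unsigned long index)
{
	struct page *page = hash_lookup(mapping, index);	/* hypothetical helper */
	if (page)
		get_page(page);	/* the caller now owns one extra reference */
	return page;
}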
> The David Rusling book says when reducing page cache and buffer cache the
> page table entries are not modified and the pages can be dropped directly.
Yes, but it checks the page reference count to make sure it is legal to
do so first.
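Continuing the toy model above (again only a sketch, not mm/ code): dropping a cached page is legal only once the cache holds the sole remaining reference, and dirty contents have to go back to disk first.

	/* Returns 1 if the page could be dropped, 0 if it is still in use. */
	static int try_to_drop_page(struct page *pg)
	{
		if (pg->count > 1)
			return 0;	/* still mapped or held elsewhere */
		if (pg->dirty) {
			/* write modified contents back to disk (stubbed out here) */
			pg->dirty = 0;
		}
		pg->count = 0;		/* release the page-cache reference */
		return 1;
	}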
--Stephen
^ permalink raw reply [flat|nested] 67+ messages in thread
* (no subject)
@ 1998-02-25 22:15 Rik van Riel
1998-02-25 22:48 ` your mail Linus Torvalds
0 siblings, 1 reply; 67+ messages in thread
From: Rik van Riel @ 1998-02-25 22:15 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Stephen C. Tweedie, linux-mm
Hi,
I've just come up with a very simple idea to limit
thrashing, and I'm asking you if you want it implemented
(there's some cost involved :-( ).
We could simply prohibit the VM subsystem from swapping
out pages which have been allocated less than one second
ago; this way the movement of pages becomes 'slower', and
thrashing might become somewhat less severe.
The cost involved is that we have to add a new entry
to the page_struct :-( and do some (relatively cheap)
bookkeeping on every page. Also, this might limit the
allocation rate of some programs, giving rise to
all sorts of new and exciting problems.
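A rough sketch of what the proposal would amount to, with hypothetical field and function names and in plain userspace C rather than real mm code: stamp every page at allocation time and have the swap-out path refuse pages younger than one second.

	#include <time.h>
	#include <stdbool.h>

	struct page {
		time_t alloc_time;	/* the extra field the proposal would add */
		/* ... existing fields ... */
	};

	static void stamp_on_alloc(struct page *pg)
	{
		pg->alloc_time = time(NULL);	/* cheap bookkeeping on every allocation */
	}

	static bool may_swap_out(const struct page *pg)
	{
		/* refuse to evict pages allocated less than one second ago */
		return time(NULL) - pg->alloc_time >= 1;
	}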
Rik.
+-----------------------------+------------------------------+
| For Linux mm-patches, go to | "I'm busy managing memory.." |
| my homepage (via LinuxHQ). | H.H.vanRiel@fys.ruu.nl |
| ...submissions welcome... | http://www.fys.ruu.nl/~riel/ |
+-----------------------------+------------------------------+
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
1998-02-25 22:15 Rik van Riel
@ 1998-02-25 22:48 ` Linus Torvalds
1998-02-25 23:26 ` Rik van Riel
0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 1998-02-25 22:48 UTC (permalink / raw)
To: Rik van Riel; +Cc: Stephen C. Tweedie, linux-mm
On Wed, 25 Feb 1998, Rik van Riel wrote:
>
> I've just come up with a very simple idea to limit
> thrashing, and I'm asking you if you want it implemented
> (there's some cost involved :-( ).
>
> We could simply prohibit the VM subsystem from swapping
> out pages which have been allocated less than one second
> ago, this way the movement of pages becomes 'slower', and
> thrashing might get somewhat less.
I'm _really_ nervous about "rate-based" limiting. Personally I think that
it only makes sense for things like networking, and even there it had
better be done by hardware.
The reason I dislike rate-based things is that it is _so_ hard to give any
kind of guarantees at all on the sanity of the thing. You can tweak the
rates, but they don't have any logic to them, and most importantly they
are very hard to make self-tweaking.
I tend to prefer a "balancing" approach to these problems: the important
part about balancing is that while it may not have some specific
well-defined behaviour that you can point your finger at ("will not page
out the same page that it paged in within 5 seconds"), the basic approach
is to write something that doesn't have any hard rules but that TENDS
towards a certain goal.
That way you get algorithms that you can be fairly confident work well in
the normal cases (because you test those normal cases), and because there
are no hard rules you also don't get strange "edges" when something
unexpected happens: performance may well degrade badly, but it degrades
_softly_ rather than in quantized jumps.
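For contrast with a hard rate rule, here is a toy example of that kind of balancing (entirely illustrative, with invented numbers, not from any kernel): each pass nudges a scan rate in proportion to how far the system is from a target, so behaviour drifts towards the goal and degrades gradually rather than hitting a fixed edge.

	#include <stdio.h>

	int main(void)
	{
		double free_pages = 200.0;	/* hypothetical current state */
		double target = 1000.0;		/* where we would like to end up */
		double scan_rate = 10.0;	/* pages reclaimed per pass */

		for (int pass = 0; pass < 5; pass++) {
			double error = (target - free_pages) / target;	/* how far off we are */
			scan_rate += scan_rate * error * 0.5;	/* soft correction, no hard cutoff */
			free_pages += scan_rate;	/* pretend reclaim frees that many pages */
			printf("pass %d: rate %.1f free %.1f\n", pass, scan_rate, free_pages);
		}
		return 0;
	}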
Linus
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: your mail
1998-02-25 22:48 ` your mail Linus Torvalds
@ 1998-02-25 23:26 ` Rik van Riel
0 siblings, 0 replies; 67+ messages in thread
From: Rik van Riel @ 1998-02-25 23:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Stephen C. Tweedie, linux-mm
On Wed, 25 Feb 1998, Linus Torvalds wrote:
> > We could simply prohibit the VM subsystem from swapping
> > out pages which have been allocated less than one second
> > ago, this way the movement of pages becomes 'slower', and
> > thrashing might get somewhat less.
>
> The reason I dislike rate-based things is that it is _so_ hard to give any
> kind of guarantees at all on the sanity of the thing. You can tweak the
> rates, but they don't have any logic to them, and most importantly they
> are very hard to make self-tweaking.
Yup, I don't like 'em either, but I proposed it anyway
in case any of you might like it :-)
Personally, I like balancing code too, if possible with
some balancing on the balancing code as well...
OK, we go with balancing code.
Rik.
+-----------------------------+------------------------------+
| For Linux mm-patches, go to | "I'm busy managing memory.." |
| my homepage (via LinuxHQ). | H.H.vanRiel@fys.ruu.nl |
| ...submissions welcome... | http://www.fys.ruu.nl/~riel/ |
+-----------------------------+------------------------------+
^ permalink raw reply [flat|nested] 67+ messages in thread
Thread overview: 67+ messages
2010-06-16 16:33 Jan Kara
2010-06-16 16:33 ` [PATCH 1/2] radix-tree: Implement function radix_tree_range_tag_if_tagged Jan Kara
2010-06-18 22:18 ` Andrew Morton
2010-06-21 12:09 ` Nick Piggin
2010-06-21 22:43 ` Jan Kara
2010-06-23 13:42 ` Jan Kara
2010-06-16 16:33 ` [PATCH 2/2] mm: Implement writeback livelock avoidance using page tagging Jan Kara
2010-06-18 22:21 ` Andrew Morton
2010-06-21 12:42 ` Jan Kara
2010-06-16 22:15 ` your mail Dave Chinner
2010-06-17 7:43 ` [PATCH 0/2 v4] Writeback livelock avoidance for data integrity writes Jan Kara
2010-06-18 6:11 ` Dave Chinner
2010-06-18 7:01 ` Nick Piggin
2010-06-17 9:11 ` Jan Kara
2010-06-22 2:59 ` your mail Wu Fengguang
2010-06-22 13:54 ` Jan Kara
2010-06-22 14:12 ` Wu Fengguang
-- strict thread matches above, loose matches on Subject: below --
2025-02-24 22:52 [PATCH v7 0/7] mseal system mappings jeffxu
2025-02-25 15:18 ` Lorenzo Stoakes
2025-02-26 0:12 ` Jeff Xu
2025-02-26 5:42 ` your mail Lorenzo Stoakes
2025-02-28 0:55 ` Jeff Xu
2025-02-28 9:35 ` Lorenzo Stoakes
2025-02-28 17:24 ` Jeff Xu
2025-02-28 17:30 ` Lorenzo Stoakes
2023-05-10 19:01 [PATCH] maple_tree: Fix a few documentation issues, Thomas Gleixner
2023-05-15 19:27 ` your mail Liam R. Howlett
2023-05-15 21:16 ` Thomas Gleixner
2023-05-16 22:47 ` Thomas Gleixner
2023-05-23 13:46 ` Liam R. Howlett
[not found] <20190225201635.4648-1-hannes@cmpxchg.org>
2019-02-26 23:49 ` Roman Gushchin
2017-04-10 11:03 [PATCH -v2 0/9] mm: make movable onlining suck less Michal Hocko
2017-04-15 12:17 ` Michal Hocko
2017-04-17 5:47 ` your mail Joonsoo Kim
2017-04-17 8:15 ` Michal Hocko
2017-04-20 1:27 ` Joonsoo Kim
2017-04-20 7:28 ` Michal Hocko
2017-04-20 8:49 ` Michal Hocko
2017-04-20 11:56 ` Vlastimil Babka
2017-04-20 12:13 ` Michal Hocko
2017-04-21 4:38 ` Joonsoo Kim
2017-04-21 7:16 ` Michal Hocko
2017-04-24 1:44 ` Joonsoo Kim
2017-04-24 7:53 ` Michal Hocko
2017-04-25 2:50 ` Joonsoo Kim
2017-04-26 9:19 ` Michal Hocko
2017-04-27 2:08 ` Joonsoo Kim
2017-04-27 15:10 ` Michal Hocko
2012-10-04 16:50 Andrea Arcangeli
2012-10-04 18:17 ` your mail Christoph Lameter
[not found] <1131.86.55.168.2.1170690089.squirrel@mail.thinknet.ro>
2007-02-05 12:36 ` Joerg Roedel
2003-01-24 5:54 Anoop J.
2003-01-24 6:28 ` David Lang
2003-01-24 8:51 ` Anoop J.
2003-01-24 8:48 ` David Lang
2003-01-24 9:49 ` Anoop J.
2003-01-24 19:14 ` David Lang
2003-01-24 19:40 ` Maciej W. Rozycki
2003-01-24 5:08 (unknown), Anoop J.
2003-01-24 5:11 ` your mail David Lang
2003-01-24 6:06 ` John Alvord
2003-01-25 2:29 ` Jason Papadopoulos
2003-01-25 2:26 ` Larry McVoy
2003-01-25 17:47 ` Eric W. Biederman
2003-01-25 23:10 ` Larry McVoy
2003-01-26 8:12 ` David S. Miller
2002-04-21 14:54 raciel
2002-04-21 19:12 ` your mail William Lee Irwin III
2002-01-02 14:20 mehul radheshyam choube
2002-01-03 16:40 ` your mail Rik van Riel
2001-08-04 11:10 Mahmoud Taghizadeh
2001-08-04 13:18 ` your mail Francois Romieu
2001-06-08 1:36 jnn
2001-06-08 13:16 ` your mail Ralf Baechle
2000-09-04 12:01 Sahil
2000-09-04 15:35 ` your mail Rik van Riel
2000-03-28 8:19 pnilesh
2000-03-28 13:26 ` Stephen C. Tweedie
1998-02-25 22:15 Rik van Riel
1998-02-25 22:48 ` your mail Linus Torvalds
1998-02-25 23:26 ` Rik van Riel