* [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL @ 2017-10-10 2:52 Wei Yang 2017-10-10 20:53 ` Andrew Morton 0 siblings, 1 reply; 6+ messages in thread From: Wei Yang @ 2017-10-10 2:52 UTC (permalink / raw) To: mawilcox, akpm; +Cc: linux-kernel, Wei Yang When parent is NULL, get_slot_offset() returns almost the address of slot. This is an invalid value for offset. One possible scenario happens on deleting #0 index, when it is the only one in tree. Current behavior doesn't harm the system, because the offset will not be used when parent is NULL in the following procedure or parent is checked before get_slot_offset() called. While it is still not safe to return an invalid offset. This patch returns 0 when parent is NULL in get_slot_offset(). Signed-off-by: Wei Yang <richard.weiyang@gmail.com> --- lib/radix-tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 898e87998417..f006f6928eda 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -119,7 +119,7 @@ bool is_sibling_entry(const struct radix_tree_node *parent, void *node) static inline unsigned long get_slot_offset(const struct radix_tree_node *parent, void __rcu **slot) { - return slot - parent->slots; + return parent ? (slot - parent->slots):0; } static unsigned int radix_tree_descend(const struct radix_tree_node *parent, -- 2.11.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL 2017-10-10 2:52 [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL Wei Yang @ 2017-10-10 20:53 ` Andrew Morton 2017-10-11 2:33 ` Wei Yang 0 siblings, 1 reply; 6+ messages in thread From: Andrew Morton @ 2017-10-10 20:53 UTC (permalink / raw) To: Wei Yang; +Cc: mawilcox, linux-kernel On Tue, 10 Oct 2017 10:52:01 +0800 Wei Yang <richard.weiyang@gmail.com> wrote: > When parent is NULL, get_slot_offset() returns almost the address of slot. > This is an invalid value for offset. > > One possible scenario happens on deleting #0 index, when it is the only one > in tree. > > Current behavior doesn't harm the system, because the offset will not be > used when parent is NULL in the following procedure or parent is checked > before get_slot_offset() called. While it is still not safe to return an > invalid offset. > > This patch returns 0 when parent is NULL in get_slot_offset(). > I'm confused. If parent=NULL, get_slot_offset() will crash the kernel. So why "Current behavior doesn't harm the system"? > --- a/lib/radix-tree.c > +++ b/lib/radix-tree.c > @@ -119,7 +119,7 @@ bool is_sibling_entry(const struct radix_tree_node *parent, void *node) > static inline unsigned long > get_slot_offset(const struct radix_tree_node *parent, void __rcu **slot) > { > - return slot - parent->slots; > + return parent ? (slot - parent->slots):0; > } > > static unsigned int radix_tree_descend(const struct radix_tree_node *parent, ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL 2017-10-10 20:53 ` Andrew Morton @ 2017-10-11 2:33 ` Wei Yang 2017-10-11 23:39 ` Andrew Morton 0 siblings, 1 reply; 6+ messages in thread From: Wei Yang @ 2017-10-11 2:33 UTC (permalink / raw) To: Andrew Morton; +Cc: Wei Yang, mawilcox, linux-kernel On Tue, Oct 10, 2017 at 01:53:11PM -0700, Andrew Morton wrote: >On Tue, 10 Oct 2017 10:52:01 +0800 Wei Yang <richard.weiyang@gmail.com> wrote: > >> When parent is NULL, get_slot_offset() returns almost the address of slot. >> This is an invalid value for offset. >> >> One possible scenario happens on deleting #0 index, when it is the only one >> in tree. >> >> Current behavior doesn't harm the system, because the offset will not be >> used when parent is NULL in the following procedure or parent is checked >> before get_slot_offset() called. While it is still not safe to return an >> invalid offset. >> >> This patch returns 0 when parent is NULL in get_slot_offset(). >> > >I'm confused. If parent=NULL, get_slot_offset() will crash the kernel. >So why "Current behavior doesn't harm the system"? > Andrew, I am confused at first glace too, while this may be a feature of the compiler. For example, the definition of offsetof(), which leverage the trick of the compiler. #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER) BTW, I have another question on the implementation of radix tree. #Question: why 6 is one of the choice of RADIX_TREE_MAP_SHIFT Currently, the shift of the tree is defined as below. #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) The index in the tree is with type unsigned long, which means the size is 32bits or 64bits. 32 % 4 = 0, 64 % 4 = 0 32 % 6 = 2, 64 % 6 = 4 This means when the tree shift is 6 and represents the max index, there would be some unused slots at the top level. So why we chose 6 instead of a 2^N? The original commit is 139e561660fe11e0fc35e142a800df3dd7d03e9d, which I don't see some reason for this choice. If possible, I suggest to change this value to a power of 2. Or maybe I missed some critical reason behind this. Thanks for your time and review. >> --- a/lib/radix-tree.c >> +++ b/lib/radix-tree.c >> @@ -119,7 +119,7 @@ bool is_sibling_entry(const struct radix_tree_node *parent, void *node) >> static inline unsigned long >> get_slot_offset(const struct radix_tree_node *parent, void __rcu **slot) >> { >> - return slot - parent->slots; >> + return parent ? (slot - parent->slots):0; >> } >> >> static unsigned int radix_tree_descend(const struct radix_tree_node *parent, -- Wei Yang Help you, Help me ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL 2017-10-11 2:33 ` Wei Yang @ 2017-10-11 23:39 ` Andrew Morton 2017-10-12 2:20 ` Wei Yang 0 siblings, 1 reply; 6+ messages in thread From: Andrew Morton @ 2017-10-11 23:39 UTC (permalink / raw) To: Wei Yang; +Cc: mawilcox, linux-kernel On Wed, 11 Oct 2017 10:33:39 +0800 Wei Yang <richard.weiyang@gmail.com> wrote: > On Tue, Oct 10, 2017 at 01:53:11PM -0700, Andrew Morton wrote: > >On Tue, 10 Oct 2017 10:52:01 +0800 Wei Yang <richard.weiyang@gmail.com> wrote: > > > >> When parent is NULL, get_slot_offset() returns almost the address of slot. > >> This is an invalid value for offset. > >> > >> One possible scenario happens on deleting #0 index, when it is the only one > >> in tree. > >> > >> Current behavior doesn't harm the system, because the offset will not be > >> used when parent is NULL in the following procedure or parent is checked > >> before get_slot_offset() called. While it is still not safe to return an > >> invalid offset. > >> > >> This patch returns 0 when parent is NULL in get_slot_offset(). > >> > > > >I'm confused. If parent=NULL, get_slot_offset() will crash the kernel. > >So why "Current behavior doesn't harm the system"? > > > > Andrew, > > I am confused at first glace too, while this may be a feature of the compiler. > > For example, the definition of offsetof(), which leverage the trick of the > compiler. > > #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER) Well dereferencing that address will still oops the kernel. Has this bug been seen in practice? If so, under what situation and on which architecture? > BTW, I have another question on the implementation of radix tree. > > #Question: why 6 is one of the choice of RADIX_TREE_MAP_SHIFT > > Currently, the shift of the tree is defined as below. > > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) > > The index in the tree is with type unsigned long, which means the size is > 32bits or 64bits. > > 32 % 4 = 0, 64 % 4 = 0 > > 32 % 6 = 2, 64 % 6 = 4 > > This means when the tree shift is 6 and represents the max index, there would > be some unused slots at the top level. > > So why we chose 6 instead of a 2^N? > > The original commit is 139e561660fe11e0fc35e142a800df3dd7d03e9d, which I don't > see some reason for this choice. > > If possible, I suggest to change this value to a power of 2. Or maybe I missed > some critical reason behind this. > I'm not sure I really see the problem. 1<<6 is still a power of 2. radix_tree_split_preload() is doing mod and div on shift distances which hurts my brain a bit. But CONFIG_BASE_SMALL=0 is what most people will be using so presumably the current code tests out OK. Although simple memory wastage would be hard to detect. Matthew has been here recently and perhaps can consider? ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL 2017-10-11 23:39 ` Andrew Morton @ 2017-10-12 2:20 ` Wei Yang 2017-10-13 15:40 ` Matthew Wilcox 0 siblings, 1 reply; 6+ messages in thread From: Wei Yang @ 2017-10-12 2:20 UTC (permalink / raw) To: Andrew Morton; +Cc: Wei Yang, mawilcox, linux-kernel On Wed, Oct 11, 2017 at 04:39:40PM -0700, Andrew Morton wrote: >On Wed, 11 Oct 2017 10:33:39 +0800 Wei Yang <richard.weiyang@gmail.com> wrote: > >> On Tue, Oct 10, 2017 at 01:53:11PM -0700, Andrew Morton wrote: >> >On Tue, 10 Oct 2017 10:52:01 +0800 Wei Yang <richard.weiyang@gmail.com> wrote: >> > >> >> When parent is NULL, get_slot_offset() returns almost the address of slot. >> >> This is an invalid value for offset. >> >> >> >> One possible scenario happens on deleting #0 index, when it is the only one >> >> in tree. >> >> >> >> Current behavior doesn't harm the system, because the offset will not be >> >> used when parent is NULL in the following procedure or parent is checked >> >> before get_slot_offset() called. While it is still not safe to return an >> >> invalid offset. >> >> >> >> This patch returns 0 when parent is NULL in get_slot_offset(). >> >> >> > >> >I'm confused. If parent=NULL, get_slot_offset() will crash the kernel. >> >So why "Current behavior doesn't harm the system"? >> > >> >> Andrew, >> >> I am confused at first glace too, while this may be a feature of the compiler. >> >> For example, the definition of offsetof(), which leverage the trick of the >> compiler. >> >> #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER) > >Well dereferencing that address will still oops the kernel. > I don't get your point why this would oops the kernel. It tries to get the offset of a field in structure. A test program in user space #include <stdio.h> struct A { int b; char n[8]; }; int main() { printf("%d\n", ((struct A*)0)->n); return 0; } This works fine on my machine. Do I missed your point? >Has this bug been seen in practice? If so, under what situation and on >which architecture? The NULL pointer of "parent" is a real case. I see this by adding a log in __radix_tree_delete(), which prints the value of "node" and show "node" is a NULL pointer. Here is the diff: @@ -1994,6 +1994,9 @@ static bool __radix_tree_delete(struct radix_tree_root *root, unsigned offset = get_slot_offset(node, slot); int tag; + if (!node) + printk(KERN_ERR"%s: node is %p\n", __func__, node); + if (is_idr(root))) I can't specify the exact case which leads to this, while it seems to happen a lot from the logs printed. > >> BTW, I have another question on the implementation of radix tree. >> >> #Question: why 6 is one of the choice of RADIX_TREE_MAP_SHIFT >> >> Currently, the shift of the tree is defined as below. >> >> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) >> >> The index in the tree is with type unsigned long, which means the size is >> 32bits or 64bits. >> >> 32 % 4 = 0, 64 % 4 = 0 >> >> 32 % 6 = 2, 64 % 6 = 4 >> >> This means when the tree shift is 6 and represents the max index, there would >> be some unused slots at the top level. >> >> So why we chose 6 instead of a 2^N? >> >> The original commit is 139e561660fe11e0fc35e142a800df3dd7d03e9d, which I don't >> see some reason for this choice. >> >> If possible, I suggest to change this value to a power of 2. Or maybe I missed >> some critical reason behind this. >> > >I'm not sure I really see the problem. 1<<6 is still a power of 2. >radix_tree_split_preload() is doing mod and div on shift distances >which hurts my brain a bit. But CONFIG_BASE_SMALL=0 is what most >people will be using so presumably the current code tests out OK. >Although simple memory wastage would be hard to detect. Matthew has >been here recently and perhaps can consider? Hmm, I may not state it clearly. What I mean is to set RADIX_TREE_MAP_SHIFT a power of 2, instead of the result of (1 << RADIX_TREE_MAP_SHIFT). Let me explain a simplified example. Assume the max index is 16 bits. Case 1, shift is 4 Each 4 bit forms a group and exactly 4 groups. |0123|4567|0123|4567| Case 2, shift is 6 Each 6 bit forms a group and left 3 groups with 2 empty slots at the beginning. | 0123|456701|234567| Radix tree is a little bit like the page table, which is divided into several levels and each level is referenced by a "slot" of bits. When the shift is a power of 2, the digit fits in exact "slots". This means there is no waste. When the shift is not, we will have several unused digits at the most significant part. This means several slot will never be used. This means on theory, in a tree with shift 6 | 0123|456701|234567| We can store an index with value ffffff ffffff ffffff While actually the largest value a 16 bits index could represents __ffff ffffff ffffff Since radix tree will not allocate memory when there is no corresponding index, this will just leave several unused slot at the top level. While I still prefer to fully utilize the whole tree instead of leave some never touch slot. And this can be achieved by set RADIX_TREE_MAP_SHIFT to a power of 2. So if there is no specific reason to use 6, I would like to form a patch to change it to 8. -- Wei Yang Help you, Help me ^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL 2017-10-12 2:20 ` Wei Yang @ 2017-10-13 15:40 ` Matthew Wilcox 0 siblings, 0 replies; 6+ messages in thread From: Matthew Wilcox @ 2017-10-13 15:40 UTC (permalink / raw) To: Wei Yang, Andrew Morton; +Cc: linux-kernel@vger.kernel.org From: Wei Yang [mailto:richard.weiyang@gmail.com] > On Wed, Oct 11, 2017 at 04:39:40PM -0700, Andrew Morton wrote: > >On Wed, 11 Oct 2017 10:33:39 +0800 Wei Yang > >> BTW, I have another question on the implementation of radix tree. > >> > >> #Question: why 6 is one of the choice of RADIX_TREE_MAP_SHIFT > >> > >> Currently, the shift of the tree is defined as below. > >> > >> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) > >> > >> The index in the tree is with type unsigned long, which means the size is > >> 32bits or 64bits. > >> > >> 32 % 4 = 0, 64 % 4 = 0 > >> > >> 32 % 6 = 2, 64 % 6 = 4 > >> > >> This means when the tree shift is 6 and represents the max index, there > would > >> be some unused slots at the top level. > >> > >> So why we chose 6 instead of a 2^N? > >> > >> The original commit is 139e561660fe11e0fc35e142a800df3dd7d03e9d, > which I don't > >> see some reason for this choice. > >> > >> If possible, I suggest to change this value to a power of 2. Or maybe I missed > >> some critical reason behind this. > >> > > > >I'm not sure I really see the problem. 1<<6 is still a power of 2. > >radix_tree_split_preload() is doing mod and div on shift distances > >which hurts my brain a bit. But CONFIG_BASE_SMALL=0 is what most > >people will be using so presumably the current code tests out OK. > >Although simple memory wastage would be hard to detect. Matthew has > >been here recently and perhaps can consider? > > Hmm, I may not state it clearly. > > What I mean is to set RADIX_TREE_MAP_SHIFT a power of 2, instead of the > result > of (1 << RADIX_TREE_MAP_SHIFT). > > Let me explain a simplified example. Assume the max index is 16 bits. > > Case 1, shift is 4 > > Each 4 bit forms a group and exactly 4 groups. > > |0123|4567|0123|4567| > > Case 2, shift is 6 > > Each 6 bit forms a group and left 3 groups with 2 empty slots at the > beginning. > > | 0123|456701|234567| > > Radix tree is a little bit like the page table, which is divided into several > levels and each level is referenced by a "slot" of bits. > > When the shift is a power of 2, the digit fits in exact "slots". This means > there is no waste. > > When the shift is not, we will have several unused digits at the most > significant part. This means several slot will never be used. > > This means on theory, in a tree with shift 6 > > | 0123|456701|234567| > > We can store an index with value > > ffffff ffffff ffffff > > While actually the largest value a 16 bits index could represents > > __ffff ffffff ffffff > > Since radix tree will not allocate memory when there is no corresponding > index, this will just leave several unused slot at the top level. Yes, if we have a tree with 2^64 pointers in it, we'll have 48 unused pointers in the top level leaf node. That's not a huge percentage of waste. > While I still prefer to fully utilize the whole tree instead of leave some > never touch slot. And this can be achieved by set RADIX_TREE_MAP_SHIFT to a > power of 2. > > So if there is no specific reason to use 6, I would like to form a patch to > change it to 8. The cost of that is in the slab allocator. With 64bit pointers and a 64-entry node, we use 576 bytes per node, fitting 7 nodes per page. With 256 entries per node, the node balloons to just over 2kB, and we get 3 nodes per pair of pages. This is massively less efficient than the inefficiency you’re complaining about. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-10-13 15:40 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-10-10 2:52 [PATCH] radix-tree: get_slot_offset() returns invalid offset when parent is NULL Wei Yang 2017-10-10 20:53 ` Andrew Morton 2017-10-11 2:33 ` Wei Yang 2017-10-11 23:39 ` Andrew Morton 2017-10-12 2:20 ` Wei Yang 2017-10-13 15:40 ` Matthew Wilcox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox