* semantics of rhashtable and sysvipc @ 2018-05-23 17:25 Davidlohr Bueso 2018-05-23 17:35 ` Davidlohr Bueso 2018-05-23 18:35 ` Linus Torvalds 0 siblings, 2 replies; 10+ messages in thread From: Davidlohr Bueso @ 2018-05-23 17:25 UTC (permalink / raw) To: akpm, torvalds; +Cc: manfred, guillaume.knispel, linux-api, linux-kernel Hi, In sysvipc we have an ids->tables_initialized regarding the rhashtable, introduced in 0cfb6aee70b (ipc: optimize semget/shmget/msgget for lots of keys). It's there, specifically, to prevent nil pointer dereferences, from using an uninitialized api. Considering how rhashtable_init() can fail (probably due to ENOMEM, if anything), this made the overall ipc initialization capable of failure as well. That alone is ugly, but fine, however I've spotted a few issues regarding the semantics of tables_initialized (however unlikely they may be): - There is inconsistency in what we return to userspace: ipc_addid() returns ENOSPC which is certainly _wrong_, while ipc_obtain_object_idr() returns EINVAL. - After we started using rhashtables, ipc_findkey() can return nil upon !tables_initialized, but the caller expects nil for when the ipc structure isn't found, and can therefore call into ipcget() callbacks. I see two possible fixes. The first is to return the proper error code if !tables_initialized, however, I'm not sure how we want to deal with the EINVAL cases when rhashtable_init() fails. Userspace has no reason to know about this. The ENOMEM case I guess makes sense ok. The second alternative would be to add a BUG_ON() if the initialization fails and we get rid of all the tables_initialized hack. I know Linus isn't fond of this, and in the past ipc has gotten rid of BUG_ON usage in ipc because of this. However I mention it because there are other core areas that do this (or call panic() straightaway). Ie in networking: sock_diag_init() and netlink_proto_init(). Thoughts? Thanks, Davidlohr ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-23 17:25 semantics of rhashtable and sysvipc Davidlohr Bueso @ 2018-05-23 17:35 ` Davidlohr Bueso 2018-05-23 18:35 ` Linus Torvalds 1 sibling, 0 replies; 10+ messages in thread From: Davidlohr Bueso @ 2018-05-23 17:35 UTC (permalink / raw) To: akpm, torvalds; +Cc: manfred, guillaume.knispel, linux-api, linux-kernel On Wed, 23 May 2018, Davidlohr Bueso wrote: >I see two possible fixes. I guess a third option would be to make the hashtable static, but I'm not against using rhashtables so I'm not really considering this. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-23 17:25 semantics of rhashtable and sysvipc Davidlohr Bueso 2018-05-23 17:35 ` Davidlohr Bueso @ 2018-05-23 18:35 ` Linus Torvalds 2018-05-23 18:31 ` Davidlohr Bueso 2018-05-24 17:07 ` Davidlohr Bueso 1 sibling, 2 replies; 10+ messages in thread From: Linus Torvalds @ 2018-05-23 18:35 UTC (permalink / raw) To: Davidlohr Bueso, Thomas Graf, Herbert Xu Cc: Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Wed, May 23, 2018 at 10:41 AM Davidlohr Bueso <dave@stgolabs.net> wrote: > The second alternative would be to add a BUG_ON() if the initialization fails > and we get rid of all the tables_initialized hack. I see absolutely no value in an early boot BUG_ON(). Either we know the allocation cannot fail - which is perfectly fine at bootup, and is a common pattern - or it can fail and we need to handle it. In neither case is the BUG_ON() appropriate. So I'm perfectly fine with getting rid of 'tables_initialized'. But no, not with a BUG_ON(). If you cannot guarantee that the allocation works (using __GFP_NOFAIL is ok, for example - but it only works with small allocations), then you need to handle the allocation failure. I refuse to see more of the shit-for-brains kind of "I can't be bothered to handle error cases" BUG_ON() stuff. And I also am not in the least interested in "this cannot possibly happen" BUG_ON() code. One option is to make rhashtable_alloc() shrink the allocation and try again if it fails, and then you *can* do __GFP_NOFAIL eventually. In fact, it can validly be argued that rhashtable_init() is just buggy as-is. The whole *point* olf that function is to size things appropriately, and returning -ENOMEM obviously means that it didn't do its job. Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-23 18:35 ` Linus Torvalds @ 2018-05-23 18:31 ` Davidlohr Bueso 2018-05-23 18:52 ` Linus Torvalds 2018-05-24 17:07 ` Davidlohr Bueso 1 sibling, 1 reply; 10+ messages in thread From: Davidlohr Bueso @ 2018-05-23 18:31 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Wed, 23 May 2018, Linus Torvalds wrote: >So I'm perfectly fine with getting rid of 'tables_initialized'. But no, not >with a BUG_ON(). > >If you cannot guarantee that the allocation works (using __GFP_NOFAIL is >ok, for example - but it only works with small allocations), then you need >to handle the allocation failure. Note that even if the allocation was guaranteed, there are still param validations and rhashtable_init() can return -EINVAL. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-23 18:31 ` Davidlohr Bueso @ 2018-05-23 18:52 ` Linus Torvalds 2018-05-23 18:52 ` Davidlohr Bueso 0 siblings, 1 reply; 10+ messages in thread From: Linus Torvalds @ 2018-05-23 18:52 UTC (permalink / raw) To: Davidlohr Bueso Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Wed, May 23, 2018 at 11:47 AM Davidlohr Bueso <dave@stgolabs.net> wrote: > Note that even if the allocation was guaranteed, there are still param validations > and rhashtable_init() can return -EINVAL. So? It's not going to happen, because you're not going to give garbage parameters. Why would you add a BUG_ON() for something that cannot happen? You might as well sprinkle them randomly in every damn place. And even if somebody screws up the parameters because they are being stupid, then SO WHAT? rhashtable_init() won't initialize the pointers, and we'll get a NULL pointer dereference. And hey, we'll probably get it later during boot, once the system is actually up and running, and that NULL pointer dereference might even get logged in the system logs now because the machine booted successfully, and mnaybe it will even get sent to a distro and debugged. So at what point was there _any_ advantage in doing a BUG_ON() for a crazy case? Really. Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-23 18:52 ` Linus Torvalds @ 2018-05-23 18:52 ` Davidlohr Bueso 0 siblings, 0 replies; 10+ messages in thread From: Davidlohr Bueso @ 2018-05-23 18:52 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Wed, 23 May 2018, Linus Torvalds wrote: >On Wed, May 23, 2018 at 11:47 AM Davidlohr Bueso <dave@stgolabs.net> wrote: > >> Note that even if the allocation was guaranteed, there are still param >validations >> and rhashtable_init() can return -EINVAL. > >So? > >It's not going to happen, because you're not going to give garbage >parameters. Maybe EINVAL could be replaced with WARN_ON(). That would grab the programmer's attention. > >Why would you add a BUG_ON() for something that cannot happen? You might as >well sprinkle them randomly in every damn place. Not suggesting this. Before I started the thread, I was actually thinking of ipc using ENOMEM only for rhashtable_init() filure considering the EINVAL case will never happen. > >And even if somebody screws up the parameters because they are being >stupid, then SO WHAT? rhashtable_init() won't initialize the pointers, and >we'll get a NULL pointer dereference. > >And hey, we'll probably get it later during boot, once the system is >actually up and running, and that NULL pointer dereference might even get >logged in the system logs now because the machine booted successfully, and >mnaybe it will even get sent to a distro and debugged. > >So at what point was there _any_ advantage in doing a BUG_ON() for a crazy >case? For the record, I'm not arguing in favor of BUG_ON(). ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-23 18:35 ` Linus Torvalds 2018-05-23 18:31 ` Davidlohr Bueso @ 2018-05-24 17:07 ` Davidlohr Bueso 2018-05-24 17:53 ` Linus Torvalds 1 sibling, 1 reply; 10+ messages in thread From: Davidlohr Bueso @ 2018-05-24 17:07 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Wed, 23 May 2018, Linus Torvalds wrote: >One option is to make rhashtable_alloc() shrink the allocation and try >again if it fails, and then you *can* do __GFP_NOFAIL eventually. The below attempts to implements this, along with converting the EINVAL cases to WARN_ON(). I've refactored bucket_table_alloc() to add a 'retry' param which always ends up doing GFP_KERNEL|__GFP_NOFAIL for both the tbl as well as alloc_bucket_spinlocks(). I've arbitrarily shrunk the size in half upon initial allocation failure. So we default from 64 to 32 buckets, and if the user is hinting at larger nelems, we simply disregard that and retry with 32. Similarly, smaller values are also cut in half. In addition, I think we can also get rid of explicitly passing the gfp flags to bucket_table_alloc() (we only do GFP_KERNEL or GFP_ATOMIC), and leave it up to __bucket_table_alloc() by adding a new bucket_table_alloc_noblock() or something for GFP_ATOMIC. > >In fact, it can validly be argued that rhashtable_init() is just buggy >as-is. The whole *point* olf that function is to size things appropriately, >and returning -ENOMEM obviously means that it didn't do its job. diff --git a/lib/rhashtable.c b/lib/rhashtable.c index 9427b5766134..e86a396aebcf 100644 --- a/lib/rhashtable.c +++ b/lib/rhashtable.c @@ -166,16 +166,21 @@ static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht, return tbl; } -static struct bucket_table *bucket_table_alloc(struct rhashtable *ht, - size_t nbuckets, - gfp_t gfp) +static struct bucket_table *__bucket_table_alloc(struct rhashtable *ht, + size_t nbuckets, + gfp_t gfp, bool retry) { struct bucket_table *tbl = NULL; size_t size, max_locks; int i; size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]); - if (gfp != GFP_KERNEL) + if (retry) { + gfp |= __GFP_NOFAIL; + tbl = kzalloc(size, gfp); + } + + else if (gfp != GFP_KERNEL) tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY); else tbl = kvzalloc(size, gfp); @@ -211,6 +216,20 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht, return tbl; } +static struct bucket_table *bucket_table_alloc(struct rhashtable *ht, + size_t nbuckets, + gfp_t gfp) +{ + return __bucket_table_alloc(ht, nbuckets, gfp, false); +} + +static struct bucket_table *bucket_table_alloc_retry(struct rhashtable *ht, + size_t nbuckets, + gfp_t gfp) +{ + return __bucket_table_alloc(ht, nbuckets, gfp, true); +} + static struct bucket_table *rhashtable_last_table(struct rhashtable *ht, struct bucket_table *tbl) { @@ -1024,12 +1043,11 @@ int rhashtable_init(struct rhashtable *ht, size = HASH_DEFAULT_SIZE; - if ((!params->key_len && !params->obj_hashfn) || - (params->obj_hashfn && !params->obj_cmpfn)) - return -EINVAL; + WARN_ON((!params->key_len && !params->obj_hashfn) || + (params->obj_hashfn && !params->obj_cmpfn)); - if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT)) - return -EINVAL; + WARN_ON(params->nulls_base && + params->nulls_base < (1U << RHT_BASE_SHIFT)); memset(ht, 0, sizeof(*ht)); mutex_init(&ht->mutex); @@ -1068,9 +1086,23 @@ int rhashtable_init(struct rhashtable *ht, } } + /* + * This is api initialization. We need to guarantee the initial + * rhashtable allocation. Upon failure, retry with a smaller size, + * otherwise we exhaust our options with __GFP_NOFAIL. + * + * The size of the table is shrunk to at least half the original + * value. Users that use large nelem_hint values are lowered to 32 + * buckets. + */ tbl = bucket_table_alloc(ht, size, GFP_KERNEL); - if (tbl == NULL) - return -ENOMEM; + if (unlikely(tbl == NULL)) { + size = min(size, HASH_DEFAULT_SIZE) / 2; + + tbl = bucket_table_alloc(ht, size, GFP_KERNEL); + if (tbl == NULL) + tbl = bucket_table_alloc_retry(ht, size, GFP_KERNEL); + } atomic_set(&ht->nelems, 0); ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-24 17:07 ` Davidlohr Bueso @ 2018-05-24 17:53 ` Linus Torvalds 2018-05-24 18:51 ` Davidlohr Bueso 0 siblings, 1 reply; 10+ messages in thread From: Linus Torvalds @ 2018-05-24 17:53 UTC (permalink / raw) To: Davidlohr Bueso Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Thu, May 24, 2018 at 10:23 AM Davidlohr Bueso <dave@stgolabs.net> wrote: > tbl = bucket_table_alloc(ht, size, GFP_KERNEL); > - if (tbl == NULL) > - return -ENOMEM; > + if (unlikely(tbl == NULL)) { > + size = min(size, HASH_DEFAULT_SIZE) / 2; > + > + tbl = bucket_table_alloc(ht, size, GFP_KERNEL); > + if (tbl == NULL) > + tbl = bucket_table_alloc_retry(ht, size, GFP_KERNEL); > + } This doesn't seem to be taking 'param->min_size' into account. I'm not sure that matters, but right now, if you have nelem_hint set and a min_size, the min_size is honored (if you have just min_size it's already ignored because the rhashtable always starts with HASH_DEFAULT_SIZE). So I could imagine that somebody uses it to guarantee something. The docs say that "min_size" is the minimum size for *shrinking* not for initializing, so I guess it's debatable. Also, wouldn't it make sense to make this all be a while loop? Or are you just depending on the knowledge that HASH_DEFAULT_SIZE / 2 is already guaranteed to be so small that there's no point? A comment to that effect would be good, perhaps. Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-24 17:53 ` Linus Torvalds @ 2018-05-24 18:51 ` Davidlohr Bueso 2018-05-24 19:17 ` Linus Torvalds 0 siblings, 1 reply; 10+ messages in thread From: Davidlohr Bueso @ 2018-05-24 18:51 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Thu, 24 May 2018, Linus Torvalds wrote: >This doesn't seem to be taking 'param->min_size' into account. It was in that rounded_hashtable_size() does, however, after more thought I think we can do better by taking it much more into account. > >I'm not sure that matters, but right now, if you have nelem_hint set and a >min_size, the min_size is honored (if you have just min_size it's already >ignored because the rhashtable always starts with HASH_DEFAULT_SIZE). So I >could imagine that somebody uses it to guarantee something. The docs say >that "min_size" is the minimum size for *shrinking* not for initializing, >so I guess it's debatable. > >Also, wouldn't it make sense to make this all be a while loop? Or are you >just depending on the knowledge that HASH_DEFAULT_SIZE / 2 is already >guaranteed to be so small that there's no point? A comment to that effect >would be good, perhaps. Yes, this is why I didn't loop. With the default size of 64 buckets, we allocate 640 + 128 = 768 bytes for the tbl and the lock array, respectively. By halving this, upon retrying, I was relying on it being to "small to fail". However, after how about the resize being based on HASH_MIN_SIZE instead of HASH_DEFAULT_SIZE? That way the initial table would be a _lot_ smaller and aid the allocator that much more; which is why we're here in the first place. Any performance costs of collisions would be completely unimportant in this scenario. Considering that some users set p.min_size to be rather large-ish (up to 1024 buckets afaict), we'd need the following: size = min(ht->p.min_size, HASH_MIN_SIZE); Which takes into account the min_size = max(ht->p.min_size, HASH_MIN_SIZE) which came before, thus p.min_size == 0 is already taken into account. Thanks, Davidlohr ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: semantics of rhashtable and sysvipc 2018-05-24 18:51 ` Davidlohr Bueso @ 2018-05-24 19:17 ` Linus Torvalds 0 siblings, 0 replies; 10+ messages in thread From: Linus Torvalds @ 2018-05-24 19:17 UTC (permalink / raw) To: Davidlohr Bueso Cc: Thomas Graf, Herbert Xu, Andrew Morton, Manfred Spraul, guillaume.knispel, Linux API, Linux Kernel Mailing List On Thu, May 24, 2018 at 12:08 PM Davidlohr Bueso <dave@stgolabs.net> wrote: > However, after how about the resize being based on HASH_MIN_SIZE instead of > HASH_DEFAULT_SIZE? I think that sounds reasonable. We wouldn't expect this to ever happen in practice, and as you say, if it *does* happen, the size of the hash array is the last of our problems. > Considering that some users set p.min_size to be rather large-ish (up to 1024 > buckets afaict), we'd need the following: > size = min(ht->p.min_size, HASH_MIN_SIZE); Bah, let's just go for simplicity, and just make it HASH_MIN_SIZE unconditionally, and just have a single fallback: if the first "normal" allocation fails, do one single unconditional allocation with HASH_MIN_SIZE and GFP_NOFAIL. I think that should work fine. Linus ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2018-05-24 19:17 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-05-23 17:25 semantics of rhashtable and sysvipc Davidlohr Bueso 2018-05-23 17:35 ` Davidlohr Bueso 2018-05-23 18:35 ` Linus Torvalds 2018-05-23 18:31 ` Davidlohr Bueso 2018-05-23 18:52 ` Linus Torvalds 2018-05-23 18:52 ` Davidlohr Bueso 2018-05-24 17:07 ` Davidlohr Bueso 2018-05-24 17:53 ` Linus Torvalds 2018-05-24 18:51 ` Davidlohr Bueso 2018-05-24 19:17 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).