* dev->destructor
@ 2003-04-30 6:26 David S. Miller
2003-04-30 16:33 ` dev->destructor Stephen Hemminger
2003-05-01 1:10 ` dev->destructor kuznet
0 siblings, 2 replies; 28+ messages in thread
From: David S. Miller @ 2003-04-30 6:26 UTC (permalink / raw)
To: shemminger; +Cc: netdev, kuznet
Stephen, you're right about the dev->destructor problem.
I misread your postings, and I'm very sorry about that.
We were talking about two different things, and admittedly
I had forgotten how some of this stuff works.
Alexey, currently dev->{open,close} are what does get/put
of device module reference.
However, device unregister can explode if dev->destructor is
present. Unlike in dev->destructor==NULL case, we do not
wait for remnant dev->refcnt to go away. Therefore we could
invoke dev->destructor() after module is unloaded.
I guess there are two ways to address this problem:
1) dev_get() gets module reference and dev_put() puts is.
Ugly, as this means dev_get() can fail, but this does
cover all the possible cases.
2) Make unregister_netdev() wait for refcount to reach 1
regardless of whether dev->destructor is NULL or not.
I don't like #1. Do you see some holes in #2?
As Stephen brought up, this also means we should do something
about that NETDEV_UNREGISTER code in dst_dev_event() :-(
^ permalink raw reply [flat|nested] 28+ messages in thread* Re: dev->destructor 2003-04-30 6:26 dev->destructor David S. Miller @ 2003-04-30 16:33 ` Stephen Hemminger 2003-05-01 1:10 ` dev->destructor kuznet 1 sibling, 0 replies; 28+ messages in thread From: Stephen Hemminger @ 2003-04-30 16:33 UTC (permalink / raw) To: David S. Miller; +Cc: netdev, kuznet On Tue, 29 Apr 2003 23:26:31 -0700 (PDT) "David S. Miller" <davem@redhat.com> wrote: > > Stephen, you're right about the dev->destructor problem. > I misread your postings, and I'm very sorry about that. > We were talking about two different things, and admittedly > I had forgotten how some of this stuff works. > > Alexey, currently dev->{open,close} are what does get/put > of device module reference. > > However, device unregister can explode if dev->destructor is > present. Unlike in dev->destructor==NULL case, we do not > wait for remnant dev->refcnt to go away. Therefore we could > invoke dev->destructor() after module is unloaded. > > I guess there are two ways to address this problem: > > 1) dev_get() gets module reference and dev_put() puts is. > Ugly, as this means dev_get() can fail, but this does > cover all the possible cases. > > 2) Make unregister_netdev() wait for refcount to reach 1 > regardless of whether dev->destructor is NULL or not. > > I don't like #1. Do you see some holes in #2? > > As Stephen brought up, this also means we should do something > about that NETDEV_UNREGISTER code in dst_dev_event() :-( There are other (not nice possibilities). A) Require driver have a wait queue per device and wait after unregister. Ugly and requires repeated work. B) Put wait queue and wakeup logic in netdevice/unregister_netdev. Adds more to already overloaded netdevice structure, but could cleanup existing polling stuff. C) Audit unregister notifier callbacks to ensure they all dev_put, all references to device. This works for normal IPv4 except for the dst cache. The following patch causes the dst cache to do this. --- linux-2.5/net/core/dst.c 2003-04-29 11:54:39.000000000 -0700 +++ linux-2.5-bridge/net/core/dst.c 2003-04-29 10:17:15.000000000 -0700 @@ -228,7 +228,7 @@ _race_ _condition_. */ if (event!=NETDEV_DOWN && - dev->destructor == NULL && + (dev->destructor == NULL || dev->owner) && dst->output == dst_blackhole) { dst->dev = &loopback_dev; dev_put(dev); D) Your option #2 only needs to be done if device is a module otherwise, can just let destructor run later. --- linux-2.5/net/core/dev.c 2003-04-29 11:54:39.000000000 -0700 +++ linux-2.5-bridge/net/core/dev.c 2003-04-30 09:28:34.000000000 -0700 @@ -2737,7 +2737,7 @@ free_divert_blk(dev); #endif - if (dev->destructor != NULL) { + if (dev->destructor != NULL && dev->owner == NULL) { #ifdef NET_REFCNT_DEBUG if (atomic_read(&dev->refcnt) != 1) printk(KERN_DEBUG "unregister_netdevice: holding %s " ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-04-30 6:26 dev->destructor David S. Miller 2003-04-30 16:33 ` dev->destructor Stephen Hemminger @ 2003-05-01 1:10 ` kuznet 2003-05-01 7:00 ` dev->destructor David S. Miller 1 sibling, 1 reply; 28+ messages in thread From: kuznet @ 2003-05-01 1:10 UTC (permalink / raw) To: David S. Miller; +Cc: shemminger, netdev Hello! > 1) dev_get() gets module reference and dev_put() puts is. > Ugly, as this means dev_get() can fail, but this does > cover all the possible cases. Seems, you eventually _really_ understood why I histerically moan about bogosity of modules and maybe ready to recongnize that it is not just histerical moaning in some part. :-) > 2) Make unregister_netdev() wait for refcount to reach 1 > regardless of whether dev->destructor is NULL or not. > > I don't like #1. Do you see some holes in #2? It is deadlock. > As Stephen brought up, this also means we should do something > about that NETDEV_UNREGISTER code in dst_dev_event() :-( It is just one and the simplest subcase of general situation. Module must not be unloaded while device is held, that's all. Also, it is not specific to netdevices. The same applies to each object in the stack, which is under control of a module. You may like 1), you may dislike it, but people who designed modules _really_ expect we must use this idiotic module_get() each time when referencing some object. Remember about sockets? Well, my alternatives are: 0) To make module unloading right, rather then preserve creepy scheme. 3) To prohibit unloading such modules and live in peace with code not crippled with midule_get(). I hear, hear your growling, but I feel enough comfortable with this and you may fell better too after comparing to 1), which is exactly the way how all this queer engine was designed. Alexey ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 1:10 ` dev->destructor kuznet @ 2003-05-01 7:00 ` David S. Miller 2003-05-01 12:01 ` dev->destructor Rusty Russell 2003-05-02 4:06 ` dev->destructor kuznet 0 siblings, 2 replies; 28+ messages in thread From: David S. Miller @ 2003-05-01 7:00 UTC (permalink / raw) To: kuznet; +Cc: shemminger, netdev, acme, rusty From: kuznet@ms2.inr.ac.ru Date: Thu, 1 May 2003 05:10:33 +0400 (MSD) [ Rusty, just skip down to "Ok ok ok!", it's something we've discussed before. Some of these problems are becomming so widespread that we need to implement a fix, I'll probably be the one to end up doing it... ] > 1) dev_get() gets module reference and dev_put() puts is. > Ugly, as this means dev_get() can fail, but this does > cover all the possible cases. Seems, you eventually _really_ understood why I histerically moan about bogosity of modules and maybe ready to recongnize that it is not just histerical moaning in some part. :-) Yes, I know, and we can talk about this until the cows come home... :-) > 2) Make unregister_netdev() wait for refcount to reach 1 > regardless of whether dev->destructor is NULL or not. > > I don't like #1. Do you see some holes in #2? It is deadlock. What exactly is this deadlock? Let me think... None of destructors kill reference to device object, and I mean none of them. It is why I thought the idea works. Also, holding RTNL semaphore does not block potential holders of device reference. Or does it? > As Stephen brought up, this also means we should do something > about that NETDEV_UNREGISTER code in dst_dev_event() :-( It is just one and the simplest subcase of general situation. Module must not be unloaded while device is held, that's all. Ok ok ok!!! Let us depict how it might work in your idealized module scheme ok? Your idea, as I understand it, is to add callback to module that freezes module then at some time in the future makes indication that module is clean and may be unloaded safely. The logic is that module knows about reference, internal attributes, etc. and thus can check for unloadability better than any simple refcounting system can. The best argument for this are things like ipv6 which are enormously complicated to load/unload safely. If we were to use the module refcounting system to make ipv6 unloading work cleanly, every other line of the ipv6 code would be some module get/put. :-) So the unload sequence is something like: /* Shutdown the module, so that no new references may * be created. */ rc = module->shutdown(module); if (rc) goto out_err; wait_event(&module->unload_waitq, module->unloadable); module->exit(); Then, using netdevice as an example, register_netdevice() would do something like: module_add_instance(dev->owner); which is simply: void module_add_instance(struct module *module) { if (module) atomic_inc(&module->instances); } The net device drivers add: module_shutdown(netdev_module_shutdown); And then netdev_module_shutdown() would go: int netdev_module_shutdown(struct module *module) { struct net_device *d, *d_next; rtnl_lock(); d_next = NULL; for (d = dev_base; d != NULL; d = d_next) { d_next = d->next; if (d->owner == module) { if (unregister_netdevice(d)) BUG(); /* Keep traversing, this module may drive * multiple device instances. */ } } rtnl_unlock(); return 0; } And, at final dev_put(), netdev_finish_unregister() is invoked, and we change it to look something like this: int netdev_finish_unregister(struct net_device *dev) { BUG_TRAP(!dev->ip_ptr); BUG_TRAP(!dev->ip6_ptr); BUG_TRAP(!dev->dn_ptr); if (!dev->deadbeaf) { printk(KERN_ERR "Freeing alive device %p, %s\n", dev, dev->name); return 0; } #ifdef NET_REFCNT_DEBUG printk(KERN_DEBUG "netdev_finish_unregister: %s%s.\n", dev->name, (dev->destructor != NULL)?"":", old style"); #endif if (dev->destructor) dev->destructor(dev); module_dec_instance(dev->owner); return 0; } We implement module_dev_instance as: void module_dec_instance(struct module *module) { if (module && atomic_dec_and_test(&module->instances)) module_is_unloadable(module); } Finally, we implement module_is_unloadable() which is simply: void module_is_unloadable(struct module *module) { if (module) { module->unloadable = 1; wake_up(&module->unload_waitq); } } Next, let us make socket example for "simple protocol". static int __init simple_proto_init(void) { int rc; rc = sock_register(&simple_proto_ops); if (rc) return rc; return 0; } static int __exit simple_proto_shutdown(struct module *module) { int rc; rc = sock_unregister(AF_SIMPLE); if (rc) return rc; return 0; } module_init(simple_proto_init); module_shutdown(simple_proto_shutdown); Now, where to place the code to mark module as unloadable? We could maintain state internally using: static atomic_t nr_simple_sockets; And as sockets are created/destroyed, we inc/dec this thing. When it is decreamented to zero, we can say module_is_unloadable(THIS_MODULE); Question is, where to make this? It cannot be in the module itself of course, that would create race condition similar to one which exists today making this exercise quite pointless :-) Alexey and I had some private conversations about this with Rusty, he agreed with our sentiments mostly, but he was concerned about unload semantics and some unfortunate side effects of two-stage unload. Specifically, consider: int example_module_shutdown(struct module *module) { int rc; rc = unregister_foo(&foo); if (rc) return rc; rc = unregister_bar(&bar); if (rc) { register_foo(&foo); return rc; } return 0; } Note the problem. If between the unregister_foo() and re-register of foo in the failure path, someone asks to open a "foo" it will fail. Those semantics absolutely stink. Rusty went on to describe that this is one of the reasons for the current "enable refcounts" scheme with try_module_get(). A single binary state enables all of the interfaces to start to succeed. ie. opening an object created by the module does not succeed until module->init() finishes successfully and thus try_module_get() starts to return success. I think he was trying to hint to us that some analogue needs to occur for shutdown as well. Ie. to rewrite my shutdown logic above: rc = stop_refcounts(mod); if (rc) goto out_err; if (mod->shutdown) { /* Shutdown the module, so that no new references may * be created. */ rc = mod->shutdown(mod); if (rc) { restart_refcounts(mod); goto out_err; } wait_event(&mod->unload_waitq, mod->unloadable); if (module_refcount(mod) != 0) BUG(); } restart_refcounts(mod); /* ... put here all the existing code in sys_delete_module() * ... dealing O_NONBLOCK etc. etc. */ module->exit(); module_free(mod); I am not even sure this is right. It's quite a tricky area and requires real brains to solve. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 7:00 ` dev->destructor David S. Miller @ 2003-05-01 12:01 ` Rusty Russell 2003-05-01 11:09 ` dev->destructor David S. Miller 2003-05-01 17:28 ` dev->destructor Arnaldo Carvalho de Melo 2003-05-02 4:06 ` dev->destructor kuznet 1 sibling, 2 replies; 28+ messages in thread From: Rusty Russell @ 2003-05-01 12:01 UTC (permalink / raw) To: David S. Miller; +Cc: kuznet, shemminger, netdev, acme, Werner Almesberger In message <20030501.000058.39187964.davem@redhat.com> you write: > From: kuznet@ms2.inr.ac.ru > Date: Thu, 1 May 2003 05:10:33 +0400 (MSD) > > [ Rusty, just skip down to "Ok ok ok!", it's something we've > discussed before. Some of these problems are becomming so > widespread that we need to implement a fix, I'll probably be > the one to end up doing it... ] > > > 1) dev_get() gets module reference and dev_put() puts is. > > Ugly, as this means dev_get() can fail, but this does > > cover all the possible cases. > > Seems, you eventually _really_ understood why I histerically moan > about bogosity of modules and maybe ready to recongnize that it is > not just histerical moaning in some part. :-) > > Yes, I know, and we can talk about this until the cows come > home... :-) I agree with Alexey. Modules are poor-man's microkernel: allowing them to be unloaded has always been a horror. But I failed to convince my collegues of this at the 2002 kernel summit, so I did the best with what we had. If I had my way, we would *never* remove modules (even on failed init: we might re-try init later, but never free the memory). But before we redesign module architecture from scratch, let's look at a solution with what we do have (assuming Linus takes my damn __module_get() patch some day, see below). There are 70 calls to dev_hold() in the kernel. The vast majority of them already have a reference, they just want another one: dev_hold() can do __module_get(). There are a few *sources* of devices: dev_get, dev_get_by*. These should check and fail, using "try_dev_hold()" or something. Unfortunately auditing all the __dev_get_by* is quite a task, since it's used very widely (and I think, sometime erroneously). Completely untested patch below other patch. I need more time to digest your proposal in detail, Dave. Expect reply w/in 24 hours. Thanks. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. Name: __module_get Author: Rusty Russell Status: Tested on 2.5.68-bk10 D: Introduces __module_get for places where we know we already hold D: a reference and ignoring the fact that the module is being "rmmod --wait"ed D: is simpler. diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .5001-linux-2.5.67-bk5/fs/filesystems.c .5001-linux-2.5.67-bk5.updated/fs/filesystems.c --- .5001-linux-2.5.67-bk5/fs/filesystems.c 2003-04-14 13:45:44.000000000 +1000 +++ .5001-linux-2.5.67-bk5.updated/fs/filesystems.c 2003-04-14 15:44:36.000000000 +1000 @@ -32,17 +32,7 @@ static rwlock_t file_systems_lock = RW_L /* WARNING: This can be used only if we _already_ own a reference */ void get_filesystem(struct file_system_type *fs) { - if (!try_module_get(fs->owner)) { -#ifdef CONFIG_MODULE_UNLOAD - unsigned int cpu = get_cpu(); - local_inc(&fs->owner->ref[cpu].count); - put_cpu(); -#else - /* Getting filesystem while it's starting up? We're - already supposed to have a reference. */ - BUG(); -#endif - } + __module_get(fs->owner); } void put_filesystem(struct file_system_type *fs) diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .5001-linux-2.5.67-bk5/include/linux/module.h .5001-linux-2.5.67-bk5.updated/include/linux/module.h --- .5001-linux-2.5.67-bk5/include/linux/module.h 2003-04-08 11:15:01.000000000 +1000 +++ .5001-linux-2.5.67-bk5.updated/include/linux/module.h 2003-04-14 15:45:15.000000000 +1000 @@ -255,6 +255,7 @@ struct module *module_text_address(unsig #ifdef CONFIG_MODULE_UNLOAD +unsigned int module_refcount(struct module *mod); void __symbol_put(const char *symbol); #define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x) void symbol_put_addr(void *addr); @@ -265,6 +266,17 @@ void symbol_put_addr(void *addr); #define local_dec(x) atomic_dec(x) #endif +/* Sometimes we know we already have a refcount, and it's easier not + to handle the error case (which only happens with rmmod --wait). */ +static inline void __module_get(struct module *module) +{ + if (module) { + BUG_ON(module_refcount(module) == 0); + local_inc(&module->ref[get_cpu()].count); + put_cpu(); + } +} + static inline int try_module_get(struct module *module) { int ret = 1; @@ -300,6 +317,9 @@ static inline int try_module_get(struct static inline void module_put(struct module *module) { } +static inline void __module_get(struct module *module) +{ +} #define symbol_put(x) do { } while(0) #define symbol_put_addr(p) do { } while(0) @@ -357,6 +377,10 @@ static inline struct module *module_text #define symbol_put(x) do { } while(0) #define symbol_put_addr(x) do { } while(0) +static inline void __module_get(struct module *module) +{ +} + static inline int try_module_get(struct module *module) { return 1; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .5001-linux-2.5.67-bk5/kernel/module.c .5001-linux-2.5.67-bk5.updated/kernel/module.c --- .5001-linux-2.5.67-bk5/kernel/module.c 2003-04-14 13:45:46.000000000 +1000 +++ .5001-linux-2.5.67-bk5.updated/kernel/module.c 2003-04-14 15:44:36.000000000 +1000 @@ -431,7 +431,7 @@ static inline void restart_refcounts(voi } #endif -static unsigned int module_refcount(struct module *mod) +unsigned int module_refcount(struct module *mod) { unsigned int i, total = 0; @@ -439,6 +439,7 @@ static unsigned int module_refcount(stru total += atomic_read(&mod->ref[i].count); return total; } +EXPORT_SYMBOL(module_refcount); /* This exists whether we can unload or not */ static void free_module(struct module *mod); ================ Name: try_dev_hold Author: Rusty Russell Status: Experimental Depends: Module/module_dup.patch.gz D: Make dev_hold() actually duplicate the module reference count, and D: introduce try_dev_hold() for where refcount may be zero. Some D: places seemed to use dev_hold() to initialize the device reference D: count to 1. D: D: Lots of fixmes caused by quick audit of __dev_get_by* and D: dev_getbyhwaddr. diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/drivers/net/shaper.c working-2.5.68-bk10-netdevice/drivers/net/shaper.c --- linux-2.5.68-bk10/drivers/net/shaper.c 2003-04-08 11:14:26.000000000 +1000 +++ working-2.5.68-bk10-netdevice/drivers/net/shaper.c 2003-05-01 21:21:46.000000000 +1000 @@ -526,6 +526,7 @@ static int shaper_neigh_setup_dev(struct static int shaper_attach(struct net_device *shdev, struct shaper *sh, struct net_device *dev) { + /* FIXME: No reference to dev --RR */ sh->dev = dev; sh->hard_start_xmit=dev->hard_start_xmit; sh->get_stats=dev->get_stats; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/drivers/net/wan/dlci.c working-2.5.68-bk10-netdevice/drivers/net/wan/dlci.c --- linux-2.5.68-bk10/drivers/net/wan/dlci.c 2003-03-18 05:01:46.000000000 +1100 +++ working-2.5.68-bk10-netdevice/drivers/net/wan/dlci.c 2003-05-01 21:20:31.000000000 +1000 @@ -457,6 +457,7 @@ int dlci_add(struct dlci_add *dlci) *(short *)(master->dev_addr) = dlci->dlci; dlp = (struct dlci_local *) master->priv; + /* FIXME: We have no reference to slave here. --RR */ dlp->slave = slave; flp = slave->priv; @@ -484,6 +485,7 @@ int dlci_del(struct dlci_add *dlci) /* validate slave device */ master = __dev_get_by_name(dlci->devname); + /* FIXME: No lock, no reference held to master. --RR */ if (!master) return(-ENODEV); diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/include/linux/netdevice.h working-2.5.68-bk10-netdevice/include/linux/netdevice.h --- linux-2.5.68-bk10/include/linux/netdevice.h 2003-04-08 11:15:01.000000000 +1000 +++ working-2.5.68-bk10-netdevice/include/linux/netdevice.h 2003-05-01 20:03:27.000000000 +1000 @@ -29,6 +29,7 @@ #include <linux/if_ether.h> #include <linux/if_packet.h> #include <linux/kobject.h> +#include <linux/module.h> #include <asm/atomic.h> #include <asm/cache.h> @@ -634,7 +635,25 @@ static inline void dev_put(struct net_de } #define __dev_put(dev) atomic_dec(&(dev)->refcnt) -#define dev_hold(dev) atomic_inc(&(dev)->refcnt) + +/* If you already have a reference, and are duplicating it. */ +#define dev_hold(dev) \ +do { \ + atomic_inc(&(dev)->refcnt); \ + __module_get((dev)->owner); \ +} while(0) + +/* If you need a new reference, or will be holding it for a long time. + If this returns 0, pretend dev doesn't exist (it's being removed now). */ +#define try_dev_hold(dev) \ +({ \ + int __ret = 1; \ + if (try_module_get((dev)->owner)) \ + atomic_inc(&(dev)->refcnt); \ + else \ + __ret = 0; \ + __ret; \ +}) /* Carrier loss detection, dial on demand. The functions netif_carrier_on * and _off may be called from IRQ context, but it is caller diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/802/tr.c working-2.5.68-bk10-netdevice/net/802/tr.c --- linux-2.5.68-bk10/net/802/tr.c 2003-02-07 19:21:54.000000000 +1100 +++ working-2.5.68-bk10-netdevice/net/802/tr.c 2003-05-01 21:24:46.000000000 +1000 @@ -479,6 +479,7 @@ static int rif_get_info(char *buffer,cha for(i=0;i < RIF_TABLE_SIZE;i++) { for(entry=rif_table[i];entry;entry=entry->next) { + /* FIXME: No lock, and no reference to dev. --RR */ struct net_device *dev = __dev_get_by_index(entry->iface); size=sprintf(buffer+len,"%s %02X:%02X:%02X:%02X:%02X:%02X %7li ", diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/appletalk/ddp.c working-2.5.68-bk10-netdevice/net/appletalk/ddp.c --- linux-2.5.68-bk10/net/appletalk/ddp.c 2003-05-01 09:29:34.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/appletalk/ddp.c 2003-05-01 21:28:03.000000000 +1000 @@ -920,6 +920,7 @@ static int atrtr_ioctl(unsigned int cmd, * space, isn't it? */ if (rt.rt_dev) { + /* FIXME: No lock, and no reference to dev --RR */ dev = __dev_get_by_name(rt.rt_dev); if (!dev) return -ENODEV; @@ -1217,6 +1218,7 @@ static __inline__ int is_ip_over_ddp(str static int handle_ip_over_ddp(struct sk_buff *skb) { + /* FIXME: No lock, and no reference held to dev. --RR */ struct net_device *dev = __dev_get_by_name("ipddp0"); struct net_device_stats *stats; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/core/dev.c working-2.5.68-bk10-netdevice/net/core/dev.c --- linux-2.5.68-bk10/net/core/dev.c 2003-05-01 09:29:35.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/core/dev.c 2003-05-01 20:19:24.000000000 +1000 @@ -440,8 +440,8 @@ struct net_device *dev_get_by_name(const read_lock(&dev_base_lock); dev = __dev_get_by_name(name); - if (dev) - dev_hold(dev); + if (dev && !try_dev_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -513,8 +513,8 @@ struct net_device *dev_get_by_index(int read_lock(&dev_base_lock); dev = __dev_get_by_index(ifindex); - if (dev) - dev_hold(dev); + if (dev && !try_dev_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -563,8 +563,8 @@ struct net_device * dev_get_by_flags(uns read_lock(&dev_base_lock); dev = __dev_get_by_flags(if_flags, mask); - if (dev) - dev_hold(dev); + if (dev && !try_dev_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -2611,7 +2611,7 @@ int register_netdevice(struct net_device dev_init_scheduler(dev); write_lock_bh(&dev_base_lock); *dp = dev; - dev_hold(dev); + atomic_set(&dev->refcnt, 1); dev->deadbeaf = 0; write_unlock_bh(&dev_base_lock); @@ -2899,7 +2899,7 @@ static int __init net_dev_init(void) #endif dev->xmit_lock_owner = -1; dev->iflink = -1; - dev_hold(dev); + atomic_set(&dev->refcnt, 1); /* * Allocate name. If the init() fails diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/core/pktgen.c working-2.5.68-bk10-netdevice/net/core/pktgen.c --- linux-2.5.68-bk10/net/core/pktgen.c 2003-05-01 09:29:35.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/core/pktgen.c 2003-05-01 21:31:45.000000000 +1000 @@ -226,21 +226,20 @@ static struct net_device *setup_inject(s { struct net_device *odev; - rtnl_lock(); - odev = __dev_get_by_name(info->outdev); + odev = dev_get(info->outdev); if (!odev) { sprintf(info->result, "No such netdevice: \"%s\"", info->outdev); - goto out_unlock; + return NULL; } if (odev->type != ARPHRD_ETHER) { sprintf(info->result, "Not ethernet device: \"%s\"", info->outdev); - goto out_unlock; + goto out_put; } if (!netif_running(odev)) { sprintf(info->result, "Device is down: \"%s\"", info->outdev); - goto out_unlock; + goto out_put; } /* Default to the interface's mac if not explicitly set. */ @@ -281,14 +280,11 @@ static struct net_device *setup_inject(s info->cur_daddr = info->daddr_min; info->cur_udp_dst = info->udp_dst_min; info->cur_udp_src = info->udp_src_min; - - atomic_inc(&odev->refcnt); - rtnl_unlock(); return odev; -out_unlock: - rtnl_unlock(); +out_put: + dev_put(odev); return NULL; } diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/decnet/dn_dev.c working-2.5.68-bk10-netdevice/net/decnet/dn_dev.c --- linux-2.5.68-bk10/net/decnet/dn_dev.c 2003-04-20 18:05:16.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/decnet/dn_dev.c 2003-05-01 21:35:29.000000000 +1000 @@ -312,9 +312,7 @@ struct net_device *dn_dev_get_default(vo read_lock(&dndev_lock); dev = decnet_default_device; if (dev) { - if (dev->dn_ptr) - dev_hold(dev); - else + if (!dev->dn_ptr || !try_dev_hold(dev)) dev = NULL; } read_unlock(&dndev_lock); @@ -584,6 +582,8 @@ int dn_dev_ioctl(unsigned int cmd, void goto done; } + /* FIXME: if cmd == SIOCGIFADDR, don't hold lock, and don't + have reference to dev. --RR */ if ((dn_db = dev->dn_ptr) != NULL) { for (ifap = &dn_db->ifa_list; (ifa=*ifap) != NULL; ifap = &ifa->ifa_next) if (strcmp(ifr->ifr_name, ifa->ifa_label) == 0) @@ -677,6 +677,7 @@ static int dn_dev_rtm_newaddr(struct sk_ if (rta[IFA_LOCAL-1] == NULL) return -EINVAL; + /* FIXME: Don't have lock, and don't hold reference to dev. --RR */ if ((dev = __dev_get_by_index(ifm->ifa_index)) == NULL) return -ENODEV; @@ -1189,9 +1190,10 @@ void dn_dev_up(struct net_device *dev) * configured ethernet card in the system. */ if (maybe_default) { - dev_hold(dev); - if (dn_dev_set_default(dev, 0)) - dev_put(dev); + if (try_dev_hold(dev)) { + if (dn_dev_set_default(dev, 0)) + dev_put(dev); + } } } diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/decnet/dn_fib.c working-2.5.68-bk10-netdevice/net/decnet/dn_fib.c --- linux-2.5.68-bk10/net/decnet/dn_fib.c 2003-04-20 18:05:16.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/decnet/dn_fib.c 2003-05-01 21:43:34.000000000 +1000 @@ -218,7 +218,9 @@ static int dn_fib_check_nh(const struct if (!(dev->flags&IFF_UP)) return -ENETDOWN; nh->nh_dev = dev; - atomic_inc(&dev->refcnt); + /* FIXME: Must hold lock, or use dev_get_by_index. + --RR */ + dev_hold(dev); nh->nh_scope = RT_SCOPE_LINK; return 0; } @@ -262,7 +264,7 @@ out: if (!(dev->flags&IFF_UP)) return -ENETDOWN; nh->nh_dev = dev; - atomic_inc(&nh->nh_dev->refcnt); + dev_hold(&nh->nh_dev); nh->nh_scope = RT_SCOPE_HOST; } diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/decnet/dn_route.c working-2.5.68-bk10-netdevice/net/decnet/dn_route.c --- linux-2.5.68-bk10/net/decnet/dn_route.c 2003-05-01 09:29:35.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/decnet/dn_route.c 2003-05-01 20:58:08.000000000 +1000 @@ -891,7 +891,9 @@ static int dn_route_output_slow(struct d read_unlock(&dev_base_lock); if (dev_out == NULL) goto out; - dev_hold(dev_out); + /* FIXME: Shouldn't this be inside the lock? --RR */ + if (!try_dev_hold(dev_out)) + goto out; source_ok: ; } @@ -960,8 +962,10 @@ source_ok: } else { dev_out = neigh->dev; } - dev_hold(dev_out); - goto select_source; + if (try_dev_hold(dev_out)) + goto select_source; + else + dev_out = NULL; } } } @@ -1035,7 +1039,10 @@ select_source: if (dev_out) dev_put(dev_out); dev_out = DN_FIB_RES_DEV(res); - dev_hold(dev_out); + if (!try_dev_hold(dev_out)) { + dev_out = NULL; + goto e_addr; + } fl.oif = dev_out->ifindex; gateway = DN_FIB_RES_GW(res); @@ -1231,7 +1238,8 @@ static int dn_route_input_slow(struct sk "No output device\n"); goto e_inval; } - dev_hold(out_dev); + if (!try_dev_hold(out_dev)) + goto e_inval; if (res.r) src_map = dn_fib_rules_policy(fl.fld_src, &res, &flags); @@ -1349,8 +1357,12 @@ make_route: rt->u.dst.input = dn_blackhole; } rt->rt_flags = flags; - if (rt->u.dst.dev) - dev_hold(rt->u.dst.dev); + if (rt->u.dst.dev) { + if (!try_dev_hold(rt->u.dst.dev)) { + dst_free(&rt->u.dst); + goto e_inval; + } + } if (dn_rt_set_next_hop(rt, &res)) goto e_neighbour; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/decnet/dn_rules.c working-2.5.68-bk10-netdevice/net/decnet/dn_rules.c --- linux-2.5.68-bk10/net/decnet/dn_rules.c 2003-04-20 18:05:16.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/decnet/dn_rules.c 2003-05-01 21:37:58.000000000 +1000 @@ -173,6 +173,7 @@ int dn_fib_rtm_newrule(struct sk_buff *s memcpy(new_r->r_ifname, RTA_DATA(rta[RTA_IIF-1]), IFNAMSIZ); new_r->r_ifname[IFNAMSIZ-1] = 0; new_r->r_ifindex = -1; + /* FIXME: Don't hold lock, and don't get reference. --RR */ dev = __dev_get_by_name(new_r->r_ifname); if (dev) new_r->r_ifindex = dev->ifindex; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv4/fib_frontend.c working-2.5.68-bk10-netdevice/net/ipv4/fib_frontend.c --- linux-2.5.68-bk10/net/ipv4/fib_frontend.c 2003-05-01 20:36:37.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/ipv4/fib_frontend.c 2003-01-02 12:30:47.000000000 +1100 @@ -115,8 +115,8 @@ struct net_device * ip_dev_find(u32 addr if (res.type != RTN_LOCAL) goto out; dev = FIB_RES_DEV(res); - if (dev && !try_dev_get(dev)) - dev = NULL; + if (dev) + atomic_inc(&dev->refcnt); out: fib_res_put(&res); diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv4/fib_semantics.c working-2.5.68-bk10-netdevice/net/ipv4/fib_semantics.c --- linux-2.5.68-bk10/net/ipv4/fib_semantics.c 2003-05-01 09:29:35.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/ipv4/fib_semantics.c 2003-05-01 21:39:20.000000000 +1000 @@ -405,6 +405,8 @@ static int fib_check_nh(const struct rtm return -ENODEV; if (!(dev->flags&IFF_UP)) return -ENETDOWN; + /* FIXME: Must hold lock, or use dev_get_by_index. + --RR */ nh->nh_dev = dev; atomic_inc(&dev->refcnt); nh->nh_scope = RT_SCOPE_LINK; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv4/ip_gre.c working-2.5.68-bk10-netdevice/net/ipv4/ip_gre.c --- linux-2.5.68-bk10/net/ipv4/ip_gre.c 2003-05-01 09:29:35.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/ipv4/ip_gre.c 2003-05-01 21:50:07.000000000 +1000 @@ -289,7 +289,7 @@ static struct ip_tunnel * ipgre_tunnel_l if (register_netdevice(dev) < 0) goto failed; - dev_hold(dev); + atomic_set(&dev->refcnt, 1); ipgre_tunnel_link(nt); /* Do not decrement MOD_USE_COUNT here. */ return nt; @@ -1205,6 +1205,7 @@ static int ipgre_tunnel_init(struct net_ } if (!tdev && tunnel->parms.link) + /* FIXME: Don't hold lock, don't grab reference. --RR */ tdev = __dev_get_by_index(tunnel->parms.link); if (tdev) { diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv4/ipip.c working-2.5.68-bk10-netdevice/net/ipv4/ipip.c --- linux-2.5.68-bk10/net/ipv4/ipip.c 2003-04-20 18:05:16.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/ipv4/ipip.c 2003-05-01 21:51:47.000000000 +1000 @@ -259,7 +259,7 @@ static struct ip_tunnel * ipip_tunnel_lo if (register_netdevice(dev) < 0) goto failed; - dev_hold(dev); + atomic_set(&dev->refcnt, 1); ipip_tunnel_link(nt); /* Do not decrement MOD_USE_COUNT here. */ return nt; @@ -841,6 +841,7 @@ static int ipip_tunnel_init(struct net_d } if (!tdev && tunnel->parms.link) + /* FIXME: Don't hold lock, don't grab reference. --RR */ tdev = __dev_get_by_index(tunnel->parms.link); if (tdev) { diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv4/ipmr.c working-2.5.68-bk10-netdevice/net/ipv4/ipmr.c --- linux-2.5.68-bk10/net/ipv4/ipmr.c 2003-03-25 12:17:32.000000000 +1100 +++ working-2.5.68-bk10-netdevice/net/ipv4/ipmr.c 2003-05-01 20:39:29.000000000 +1000 @@ -443,7 +443,6 @@ static int vif_add(struct vifctl *vifc, /* And finish update writing critical data */ write_lock_bh(&mrt_lock); - dev_hold(dev); v->dev=dev; #ifdef CONFIG_IP_PIMSM if (v->flags&VIFF_REGISTER) @@ -1441,8 +1440,8 @@ int pim_rcv_v1(struct sk_buff * skb) read_lock(&mrt_lock); if (reg_vif_num >= 0) reg_dev = vif_table[reg_vif_num].dev; - if (reg_dev) - dev_hold(reg_dev); + if (reg_dev && !try_dev_hold(reg_dev)) + reg_dev = NULL; read_unlock(&mrt_lock); if (reg_dev == NULL) { @@ -1508,8 +1507,8 @@ int pim_rcv(struct sk_buff * skb) read_lock(&mrt_lock); if (reg_vif_num >= 0) reg_dev = vif_table[reg_vif_num].dev; - if (reg_dev) - dev_hold(reg_dev); + if (reg_dev && !try_dev_hold(reg_dev)) + reg_dev = NULL; read_unlock(&mrt_lock); if (reg_dev == NULL) { diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv4/route.c working-2.5.68-bk10-netdevice/net/ipv4/route.c --- linux-2.5.68-bk10/net/ipv4/route.c 2003-05-01 09:29:35.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/ipv4/route.c 2003-05-01 20:41:28.000000000 +1000 @@ -1992,7 +1992,8 @@ int ip_route_output_slow(struct rtable * if (dev_out) dev_put(dev_out); dev_out = FIB_RES_DEV(res); - dev_hold(dev_out); + if (!try_dev_hold(dev_out)) + goto e_inval; fl.oif = dev_out->ifindex; make_route: diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/ipv6/sit.c working-2.5.68-bk10-netdevice/net/ipv6/sit.c --- linux-2.5.68-bk10/net/ipv6/sit.c 2003-04-20 18:05:17.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/ipv6/sit.c 2003-05-01 21:55:21.000000000 +1000 @@ -197,7 +197,7 @@ static struct ip_tunnel * ipip6_tunnel_l if (register_netdevice(dev) < 0) goto failed; - dev_hold(dev); + atomic_set(&dev->refcount, 1); ipip6_tunnel_link(nt); /* Do not decrement MOD_USE_COUNT here. */ return nt; @@ -778,6 +778,7 @@ static int ipip6_tunnel_init(struct net_ } if (!tdev && tunnel->parms.link) + /* FIXME: Don't hold lock, don't grab reference. --RR */ tdev = __dev_get_by_index(tunnel->parms.link); if (tdev) { diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/llc/af_llc.c working-2.5.68-bk10-netdevice/net/llc/af_llc.c --- linux-2.5.68-bk10/net/llc/af_llc.c 2003-05-01 09:29:36.000000000 +1000 +++ working-2.5.68-bk10-netdevice/net/llc/af_llc.c 2003-05-01 21:13:49.000000000 +1000 @@ -256,6 +256,7 @@ static int llc_ui_autobind(struct socket rc = -ENETUNREACH; if (!dev) goto out; + /* FIXME: We don't hold a reference to dev --RR */ llc->dev = dev; } /* bind to a specific sap, optional. */ @@ -419,6 +420,7 @@ static int llc_ui_connect(struct socket rtnl_unlock(); if (!dev) goto out; + /* FIXME: We don't hold a reference to dev --RR */ llc->dev = dev; } else dev = llc->dev; @@ -764,6 +766,7 @@ static int llc_ui_sendmsg(struct kiocb * dev = dev_getbyhwaddr(addr->sllc_arphrd, addr->sllc_smac); rtnl_unlock(); rc = -ENETUNREACH; + /* FIXME: We don't hold a reference to dev --RR */ if (!dev) goto release; } else diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/netrom/nr_route.c working-2.5.68-bk10-netdevice/net/netrom/nr_route.c --- linux-2.5.68-bk10/net/netrom/nr_route.c 2003-01-02 12:27:51.000000000 +1100 +++ working-2.5.68-bk10-netdevice/net/netrom/nr_route.c 2003-05-01 20:51:51.000000000 +1000 @@ -567,8 +567,8 @@ struct net_device *nr_dev_get(ax25_addre read_lock(&dev_base_lock); for (dev = dev_base; dev != NULL; dev = dev->next) { if ((dev->flags & IFF_UP) && dev->type == ARPHRD_NETROM && ax25cmp(addr, (ax25_address *)dev->dev_addr) == 0) { - dev_hold(dev); - goto out; + if (try_dev_hold(dev)) + goto out; } } out: diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.68-bk10/net/rose/rose_route.c working-2.5.68-bk10-netdevice/net/rose/rose_route.c --- linux-2.5.68-bk10/net/rose/rose_route.c 2003-03-18 12:21:41.000000000 +1100 +++ working-2.5.68-bk10-netdevice/net/rose/rose_route.c 2003-05-01 20:51:59.000000000 +1000 @@ -629,8 +629,8 @@ struct net_device *rose_dev_get(rose_add read_lock(&dev_base_lock); for (dev = dev_base; dev != NULL; dev = dev->next) { if ((dev->flags & IFF_UP) && dev->type == ARPHRD_ROSE && rosecmp(addr, (rose_address *)dev->dev_addr) == 0) { - dev_hold(dev); - goto out; + if (try_dev_hold(dev)) + goto out; } } out: ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 12:01 ` dev->destructor Rusty Russell @ 2003-05-01 11:09 ` David S. Miller 2003-05-01 17:51 ` dev->destructor Arnaldo Carvalho de Melo 2003-05-01 17:28 ` dev->destructor Arnaldo Carvalho de Melo 1 sibling, 1 reply; 28+ messages in thread From: David S. Miller @ 2003-05-01 11:09 UTC (permalink / raw) To: rusty; +Cc: kuznet, shemminger, netdev, acme, wa From: Rusty Russell <rusty@rustcorp.com.au> Date: Thu, 01 May 2003 22:01:19 +1000 There are 70 calls to dev_hold() in the kernel. The vast majority of them already have a reference, they just want another one: dev_hold() can do __module_get(). Rusty, this is precisely the what Alexey and myself want to avoid. On the surface, it looks fine, only 70 dev_get's in the kernel right? But think further... So you propose to add this kind of thing for every ARP entry, every route cache entry, every IPSEC policy, every socket, every struct sock, every networking dynamic object ever created? When we add SKB recycling, will we need to do a module get/put on every SKB alloc/free/clone/copy? I think this way lies insanity :) You may make the decision to eat this kind of overhead inside of netfilter, but Alexey and I do not accept this. I disagreed with Alexey initially, but now I truly see his wisdom. This networking device example is just the tip of the iceberg. We can continue to add bandaids across the kernel, instead of solving the real problem that modules need to manage their own removal. It is at the core of the reason the current module scheme has to be extended to let the module manage unloading. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 11:09 ` dev->destructor David S. Miller @ 2003-05-01 17:51 ` Arnaldo Carvalho de Melo 2003-05-01 16:55 ` dev->destructor David S. Miller 0 siblings, 1 reply; 28+ messages in thread From: Arnaldo Carvalho de Melo @ 2003-05-01 17:51 UTC (permalink / raw) To: David S. Miller; +Cc: rusty, kuznet, shemminger, netdev, wa Em Thu, May 01, 2003 at 04:09:35AM -0700, David S. Miller escreveu: > From: Rusty Russell <rusty@rustcorp.com.au> > Date: Thu, 01 May 2003 22:01:19 +1000 > > There are 70 calls to dev_hold() in the kernel. The vast majority of > them already have a reference, they just want another one: dev_hold() > can do __module_get(). > > Rusty, this is precisely the what Alexey and myself want to avoid. On > the surface, it looks fine, only 70 dev_get's in the kernel right? > But think further... > > So you propose to add this kind of thing for every ARP entry, every > route cache entry, every IPSEC policy, every socket, every struct > sock, every networking dynamic object ever created? ALERT: brainstorming and expecting for comments from the people who knows this better. Well, I think that because there are a graph of relationships here we perhaps can be safe by protecting just some of the higher level objects (e.g. struct sock, struct socket, struct net_device) while leaving some other lower level objects managed by those higher level ones, e.g. struct sk_buff managed by struct sock. This came to me while discussing the struct socket and struct sock module infrastructure with Max, specifically when net family modules (e.g. AF_INET) doesn't requires protecting for each and every struct socket created, as the protocol modules (e.g.: udp, raw, tcp) have to somehow register with the net family module and by just using one exported function (register_protocol type functions: register_pppox_proto, bt_sock_register, register_8022_client, register_snap_client, llc_sap_open, etc) makes the net family module/lower level protocol protected. So we need to have a graph of these relationships to see what we have to protect at a higher level, reducing the overhead of otherwise having to call try_module_get/__module_get & module_put on _all_ objects creation/use. - Arnaldo ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 17:51 ` dev->destructor Arnaldo Carvalho de Melo @ 2003-05-01 16:55 ` David S. Miller 0 siblings, 0 replies; 28+ messages in thread From: David S. Miller @ 2003-05-01 16:55 UTC (permalink / raw) To: acme; +Cc: rusty, kuznet, shemminger, netdev, wa From: Arnaldo Carvalho de Melo <acme@conectiva.com.br> Date: Thu, 1 May 2003 14:51:11 -0300 Well, I think that because there are a graph of relationships here we perhaps can be safe by protecting just some of the higher level objects (e.g. struct sock, struct socket, struct net_device) while leaving some other lower level objects managed by those higher level ones, e.g. struct sk_buff managed by struct sock. The graphs are unfortunately not completely connected. For example, sk_buff's can be sent not assosciated with any socket. Routing cache entries are not attached to any particular client, similar with ARP/neighbour entires, and sk_buff's in turn hold references to these things. See, long ago we used to not do proper reference counting on struct sock's. We used to rely on graphs of relationships and certain sock states to control destruction of these objects. The networking was riddled with obscure bugs because of this. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 12:01 ` dev->destructor Rusty Russell 2003-05-01 11:09 ` dev->destructor David S. Miller @ 2003-05-01 17:28 ` Arnaldo Carvalho de Melo 1 sibling, 0 replies; 28+ messages in thread From: Arnaldo Carvalho de Melo @ 2003-05-01 17:28 UTC (permalink / raw) To: Rusty Russell Cc: David S. Miller, kuznet, shemminger, netdev, Werner Almesberger Em Thu, May 01, 2003 at 10:01:19PM +1000, Rusty Russell escreveu: > But before we redesign module architecture from scratch, let's look at > a solution with what we do have (assuming Linus takes my damn > __module_get() patch some day, see below). Linus took the __module_get patch, I even used it in redesigning the way struct sock and struct socket are handled in response to Max Krasnyansky alternative patches > There are 70 calls to dev_hold() in the kernel. The vast majority of > them already have a reference, they just want another one: dev_hold() > can do __module_get(). yes > There are a few *sources* of devices: dev_get, dev_get_by*. These > should check and fail, using "try_dev_hold()" or something. > Unfortunately auditing all the __dev_get_by* is quite a task, since > it's used very widely (and I think, sometime erroneously). > > Completely untested patch below other patch. > > I need more time to digest your proposal in detail, Dave. Expect > reply w/in 24 hours. I'm digesting it as well 8) - Arnaldo ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-01 7:00 ` dev->destructor David S. Miller 2003-05-01 12:01 ` dev->destructor Rusty Russell @ 2003-05-02 4:06 ` kuznet 2003-05-02 5:25 ` dev->destructor Rusty Russell 1 sibling, 1 reply; 28+ messages in thread From: kuznet @ 2003-05-02 4:06 UTC (permalink / raw) To: David S. Miller; +Cc: shemminger, netdev, acme, rusty Hello! > None of destructors kill reference to device object, and I mean none > of them. It is why I thought the idea works. When you call unregister_something() you hold a reference to this something, and you have no idea how much of the references you hold. This is invisible when unregister_something() is called from a single place sort of cleanup_module(). > Also, holding RTNL semaphore does not block potential holders > of device reference. Or does it? When that branch with waiting inside unregister() exists you can't hold any reference when grabbing this semaphore. That's why dev_ioctl() takes the semaphore first. It is damnly inconvenient, fragile, et cetera and such bugs do exist. That's why unregister_netdev() is logically wrong function: it takes dev as argument, so any sane programmer would assume caller holds a reference. But he can't. So, call of the function is allowed only from contexts where device is presumed to be held, i.e. from cleanup_module() and from no other places. > Your idea, as I understand it, is to add callback to module that > freezes module then at some time in the future makes indication > that module is clean and may be unloaded safely. Yes, the description is mostly right. > Question is, where to make this? It cannot be in the module > itself of course, that would create race condition similar to > one which exists today making this exercise quite pointless :-) Probably, you should look at the most first module implementation. :-) Desite of it was horrible (like almost everything in kernel that days) it was logically correct. rmmod deleted module not depending on refcnt and module body was destroyed later, when refcnt reached zero. See? So, that cleanup_module() is replaced with shutdown() and a destructor is added to allow to cleanup something but memory, if it is necessary. And to handle the situation when we do not want to use module refcnt, a predicate to ask module for readiness to kill is added. I think it can be combined to destructor, so that for such modules destructor can return -EAGAIN. Well, when refcnt is zero you can try to destroy module and it might disagree. > Note the problem. If between the unregister_foo() and re-register of > foo in the failure path, someone asks to open a "foo" it will fail. > Those semantics absolutely stink. It is not a problem at all comparing to real ones. :-) > Rusty went on to describe that this is one of the reasons for the > current "enable refcounts" scheme with try_module_get(). Rusty forgot that crippling xxx_get() is million times more painful. :-) He also forgot that in 99% of cases there is a single registry and this registry must be self-consistent, so all the work is already done and module.c just invades area out of its competence. Anyway, this approach is legal and the simplest one, I understand this. It can be used optionally. Only, frankly speaking, I do not see any applications for this, because when more than one registry exists, module is surely so complicated that tracking references is painful enough to forget about this. Anyway, we always know number of sockets et al., so why to count them in more than one place? netdevices is the simplest example, but it shows the most didctively that all the ocurences of module_** there are illegal. We want to register/unregister them dynamically, we have to do all the job not depending on modules. We have to do our own refcounting. And incorrect design of modules only prevents to make final small step to make this right. Well, the key moment is that while device is registered, its module refcnt is not zero logically, but we can't unload the module in this case, so we have to do funny try_* each lookup. Alexey ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-02 4:06 ` dev->destructor kuznet @ 2003-05-02 5:25 ` Rusty Russell 2003-05-02 20:48 ` dev->destructor David S. Miller 0 siblings, 1 reply; 28+ messages in thread From: Rusty Russell @ 2003-05-02 5:25 UTC (permalink / raw) To: kuznet; +Cc: David S. Miller, shemminger, netdev, acme In message <200305020406.IAA10719@sex.inr.ac.ru> you write: > Hello! Hi Alexey! > It is damnly inconvenient, fragile, et cetera and such bugs do exist. > That's why unregister_netdev() is logically wrong function: it takes > dev as argument, so any sane programmer would assume caller holds > a reference. But he can't. So, call of the function is allowed > only from contexts where device is presumed to be held, i.e. from > cleanup_module() and from no other places. If this is true, I think you can use the module reference count only, and your code will be faster, too. I can prepare the patch for you later tonight, to see how it looks. > netdevices is the simplest example, but it shows the most didctively > that all the ocurences of module_** there are illegal. We want > to register/unregister them dynamically, we have to do all the job not > depending on modules. We have to do our own refcounting. And incorrect > design of modules only prevents to make final small step to make this > right. Well, the key moment is that while device is registered, its > module refcnt is not zero logically, but we can't unload the module > in this case, so we have to do funny try_* each lookup. Alexey, you are using a module but don't want to reference count it. I made module reference counts very cheap so you don't have to worry, but you still are trying to cheat 8) You want to be very tricky and count all ways into the module, instead. Clearly this is mathematically possible, but in practice very tricky. And all solutions I have seen which do this are ugly, and leave us with "remove may not succeed, it may hang forever, and you won't know, and you can't replace the module and need to reboot if it happens". 8( Better, I think, to make CONFIG_MODULE_UNLOAD=n, and make CONFIG_MODULE_FORCE_UNLOAD work even if CONFIG_MODULE_UNLOAD=n. Hope that clarifies? Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-02 5:25 ` dev->destructor Rusty Russell @ 2003-05-02 20:48 ` David S. Miller 2003-05-03 4:07 ` dev->destructor Rusty Russell 0 siblings, 1 reply; 28+ messages in thread From: David S. Miller @ 2003-05-02 20:48 UTC (permalink / raw) To: rusty; +Cc: kuznet, shemminger, netdev, acme From: Rusty Russell <rusty@rustcorp.com.au> Date: Fri, 02 May 2003 15:25:15 +1000 If this is true, I think you can use the module reference count only, and your code will be faster, too. I can prepare the patch for you later tonight, to see how it looks. And where do we get the counter from when dev->owner is NULL (ie. non-modular)? We need the reference counting regardless of whether the device is implemented statically in the kernel or modular. Do you propose to attach dummy struct module to non-modular case? I am curious... Alexey, you are using a module but don't want to reference count it. I made module reference counts very cheap so you don't have to worry, but you still are trying to cheat 8) Understood. I think even stronger part of Alexey's argument is that all of this "if (x->owner)" all over the place takes away some of the gains of compiling things statically into the kernel. Why extra branches all over the place? You want to be very tricky and count all ways into the module, instead. Clearly this is mathematically possible, but in practice very tricky. And all solutions I have seen which do this are ugly, and leave us with "remove may not succeed, it may hang forever, and you won't know, and you can't replace the module and need to reboot if it happens". 8( As long as I can Control-C rmmod when it waits like this, which would be the case, what is the problem? Also, not only is this mathematically possible it is DONE already. Hmmm, there seems to be massive disconnect here between what we understand here and what you appear to. Let me try to describe it in detail. All reasonable protocol code must do exactly this. Any module which does not properly keep track of the objects it is creating has problems bigger than proper module handling. It is not "very tricky", but rather "required". Look at it this way, when module kmalloc's something does it immediately forget about this? This seems to be what you suggest, and it is a dangerous way to think! No, rather, it remembers that it did this, either by setting '1' to refcount of this object, or attaching it to some hash table, list, tree, or other global data structure it maintains. Any time this object is attached somewhere else, reference count is incremented. Anytime it is detached or destroyed, refcount is decremented and final decrement to zero makes final killing of this object. It is ABCs of programming. :-) Apply this to every dynamic object created by a module, and the end result is that it makes the work of counting all internal references. Ergo, module refcounting is superfluous. Look, once external view into module (ie. socket operations, superblock ops, netdev registry) is removed, all that remains to reference object is exactly these objects. It is the only different part about modules vs. non-modules. After threading the networking and adding true refcounting to sockets I will never forget these rules. :-) Better, I think, to make CONFIG_MODULE_UNLOAD=n, and make CONFIG_MODULE_FORCE_UNLOAD work even if CONFIG_MODULE_UNLOAD=n. As much as I'd like to be able to accept that behavior, it's too much breakage. So many people periodically make rmmod attempts to unload unused modules, distributions even make this by default (or at least used to). Let's look at this aspect of behavior: 1) Some people think that -EBUSY return is unexpected. I fall into this category. 2) It is argued that some other people think the "wait until unloadable" behavior is unexpected. But nobody would be surprised if rmmod told them: ==================== Trying to unload %s, waiting for all references to go away. Perhaps you have programs running which still reference this module? (Hit Ctrl-C to interrupt unload to go and remove those references) ==================== nobody would ask what does this mean. :-) In fact, what IF rmmod was able to know it was unloading a filesystem and therefore could walk the mount list to find mounted instances of this filesystem and print that to the user in the rmmod message? Or for network protocols to print the socket list of sockets/routes/devices open to that module and even making 'lsof' to print process name/pid holding open such sockets? I bet even Linus himself would exclaim "wow, that's nice." Compare this to "-EBUSY". :-))))))))) And I want to mention that in some cases you have to "wait". The best example are TCP_TIME_WAIT sockets. Even after users downs all the interfaces, and closes all the sockets, these remnants must remain for their full life of 60 seconds. I really am concerned at both sides, both user observed behavior and kernel side correctness. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-02 20:48 ` dev->destructor David S. Miller @ 2003-05-03 4:07 ` Rusty Russell 2003-05-03 3:46 ` dev->destructor David S. Miller ` (3 more replies) 0 siblings, 4 replies; 28+ messages in thread From: Rusty Russell @ 2003-05-03 4:07 UTC (permalink / raw) To: David S. Miller; +Cc: kuznet, shemminger, netdev, acme In message <20030502.134804.78707298.davem@redhat.com> you write: > From: Rusty Russell <rusty@rustcorp.com.au> > Date: Fri, 02 May 2003 15:25:15 +1000 > > If this is true, I think you can use the module reference count only, > and your code will be faster, too. I can prepare the patch for you > later tonight, to see how it looks. > > And where do we get the counter from when dev->owner is NULL > (ie. non-modular)? We need the reference counting regardless of > whether the device is implemented statically in the kernel or modular. But Alexey said you can only call unregister_netdev from module unload, ie. if not a module, it can't be unloaded, hence no refcount needed. I wrote the above paragraph because I'm not sure if I understood Alexey correctly? > Alexey, you are using a module but don't want to reference count it. > I made module reference counts very cheap so you don't have to worry, > but you still are trying to cheat 8) > > Understood. > > I think even stronger part of Alexey's argument is that all of > this "if (x->owner)" all over the place takes away some of the > gains of compiling things statically into the kernel. Why extra > branches all over the place? Agreed. I have considered removing that, and making THIS_MODULE equal a dummy module struct for the core kernel. I think that would be a win (we've already established that the actual refcount is cheap, possibly cheaper than the branch in practice). BTW, we should look at making local_inc() etc. a first-class citizen: it has uses outside modules. > As long as I can Control-C rmmod when it waits like this, which would > be the case, what is the problem? If you can, and the device is still usable afterwards, it would be nirvana 8) [ Refcounting tutorial skipped: I went through the same pain with early conntrack code, and learned to refcount everything now 8) ] > Look, once external view into module (ie. socket operations, > superblock ops, netdev registry) is removed, all that remains to > reference object is exactly these objects. It is the only different > part about modules vs. non-modules. This argument applies to all objects. If you reference count everything which holds a reference to an object, you can infer the reference count of the object from the sum of reference counts of its referees. In practice, as you pointed out in an earlier mail (I think sockets were your example), doing this proves to be extremely painful. And we're feeling the pain now. The module functions, and all its data, are objects. For convenience, size, and speed, we don't reference count them separately. OK, so why doesn't, say, struct netdevice grab the module refcount on registration (as is logical), and drop it on unregistration? Because the module holds a reference to the struct device, so now you have a classic circular reference count problem: the module reference count will never fall to zero. That's why we grab a reference just before use. You can, of course, do two-stage module cleanup (ie. first stage with refcounts non-zero), but the user wants clean "remove or fail to remove" semantics, so half-way through the cleanup you realize it's still in use, and restore things: now you have the spurious failure problem where it was half-unregistered for a while, and AFAICT you have to rewrite 1400 modules's cleanup routines. I imagined schemes where the kernel would be basically stopped during module remove, so the half-remove and unremove would appear atomic. I shied away from implementing such a monster without deadlock, but it might be possible. Then we would truly have nirvana 8) > Trying to unload %s, waiting for all references to go away. > Perhaps you have programs running which still reference this > module? (Hit Ctrl-C to interrupt unload to go and remove > those references) > ==================== > > nobody would ask what does this mean. :-) Please, implement such a thing. I was unable to, without introducing spurious failure in the components, *and* rewriting every module_cleanup function 8( Hence rmmod does not wait by default, but says "module is busy". > And I want to mention that in some cases you have to "wait". The best > example are TCP_TIME_WAIT sockets. Even after users downs all the > interfaces, and closes all the sockets, these remnants must remain for > their full life of 60 seconds. Yes, certainly. The two-stage unload provided by rmmod --wait ensures that the reference count decreases to zero (it makes try_module_get fail). This was my desire if unloading security modules or some netfilter modules was going to be reasonable. > I really am concerned at both sides, both user observed behavior and > kernel side correctness. There are shades of correctness, too. Not jumping into a module which has gone away is probably the most important. Having finite unload time is also nice. Not having spurious failures because someone tried to unload a module is nice too. Roman Zippel suggested (among other things) that every unregister fail when busy, and that modules the reinitialize. This gives the spurious failure problem, and means rewriting every unregister interface in the kernel, every module cleanup function, and dealing with the case where you're cleaning up a failed initialization and your unregister failed. Adam Richter suggested the module or interface register with a central repository to say "this modules uses this interface", then you can atomically freeze all the module interfaces (say they all share a single lock) and see if it's in use, but you can't really call the module's cleanup routine, so you have to atomically deactivate all these registrations before dropping the lock, and that's the same as try_module_get(). As you know, I love radical change. But I want be make sure we're going to end up somewhere we like at the end of it, which is why I didn't do it. Anyway, putting the module loader in the kernel was enough to sate my appetite for change 8) Thanks! Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-03 4:07 ` dev->destructor Rusty Russell @ 2003-05-03 3:46 ` David S. Miller 2003-05-05 5:18 ` dev->destructor Rusty Russell 2003-05-03 4:00 ` dev->destructor David S. Miller ` (2 subsequent siblings) 3 siblings, 1 reply; 28+ messages in thread From: David S. Miller @ 2003-05-03 3:46 UTC (permalink / raw) To: rusty; +Cc: kuznet, shemminger, netdev, acme From: Rusty Russell <rusty@rustcorp.com.au> Date: Sat, 03 May 2003 14:07:41 +1000 I imagined schemes where the kernel would be basically stopped during module remove, so the half-remove and unremove would appear atomic. I shied away from implementing such a monster without deadlock, but it might be possible. Then we would truly have nirvana 8) I think this is an interesting area for exploration. This isn't a unique requirement BTW Rusty. IP conntrack rehashing in the presence of RCU would want something just like this now wouldn't it? Consider other applications, such as hot plug memory. I'm sure tons of other interesting examples could be imagined. So indeed, the key to nirvana would indeed be here :-) I think it can work Rusty, in short you create 1 freeze thread per cpu. You wake up all the freeze threads on non-local cpus, and they indicate their presence via some bitmask. Once the master cpu sees all non-local-cpu bits set in the bitmask, it begins the unload sequence, after the unload the cpu mask is cleared and this signals the freeze threads to break from their spin loops and schedule(). This means the local master cpu executes the unload sequence. It may sleep in order to yield to, for example, semaphore holders, it may also sleep to yield to kswapd and friends for the sake of memory allocation. I mean... consider all the situations and please try to find some hole in this. We can make all try_to_*() sleep at this time too... this in particular needs more thought. To make these freeze threads globally useful, we allow them to run atomicity commands. The two defined commands are "local_irq_*()" and "local_bh_*()", two bitmasks control this and the freeze threads check the bits in their spin loops. Do you see? Maybe... it is nearly Nirvana! :-))))) Our ability to implement this changes the rest of the conversation, so let us resolve this first. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-03 3:46 ` dev->destructor David S. Miller @ 2003-05-05 5:18 ` Rusty Russell 0 siblings, 0 replies; 28+ messages in thread From: Rusty Russell @ 2003-05-05 5:18 UTC (permalink / raw) To: David S. Miller; +Cc: kuznet, shemminger, netdev, acme In message <20030502.204628.35664814.davem@redhat.com> you write: > I think it can work Rusty, in short you create 1 freeze thread > per cpu. You wake up all the freeze threads on non-local cpus, > and they indicate their presence via some bitmask. This code is already in module.c. I'm glad you like it though 8). But we disable local irqs as well: this is what I call a "bogolock" (the read-side of a bogolock is prempt_disable()/preempt_enable(): you could temporarily disable preemption and force the scheduler to run every preempted thread, and remove this). > This means the local master cpu executes the unload sequence. It may > sleep in order to yield to, for example, semaphore holders, it may > also sleep to yield to kswapd and friends for the sake of memory > allocation. I mean... consider all the situations and please try to > find some hole in this. We can make all try_to_*() sleep at this > time too... this in particular needs more thought. Well, it's a big task. Holding interrupts disabled for unbounded time on CPUs needs to be thought about, but I think can be fixed. try_xxx can be called from interrupt context: you really want to get rid of interrupts, too... During previous discussions, I called this "return to primordial soup": back to like during init. Ideally, only userspace context (no interrupts, timers, bottom halves), and life is easy. > To make these freeze threads globally useful, we allow them to > run atomicity commands. The two defined commands are "local_irq_*()" > and "local_bh_*()", two bitmasks control this and the freeze threads > check the bits in their spin loops. Something like this? /* Tell all freeze threads to disable bottom halves. */ void global_bh_disable(void); void global_bh_enable(void); /* Tell all freeze threads to disable interrupts halves. */ void global_irq_disable(void); void global_irq_enable(void); > Do you see? Maybe... it is nearly Nirvana! :-))))) Yes, but I worry it might be an illusion 8) > Our ability to implement this changes the rest of the conversation, > so let us resolve this first. Yes, but it's a big IF. I think it might be easier to make all unregistrations runnable in interrupt context 8( Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-03 4:07 ` dev->destructor Rusty Russell 2003-05-03 3:46 ` dev->destructor David S. Miller @ 2003-05-03 4:00 ` David S. Miller 2003-05-05 16:08 ` dev->destructor Stephen Hemminger 2003-05-05 20:00 ` dev->destructor Stephen Hemminger 3 siblings, 0 replies; 28+ messages in thread From: David S. Miller @ 2003-05-03 4:00 UTC (permalink / raw) To: rusty; +Cc: kuznet, shemminger, netdev, acme From: Rusty Russell <rusty@rustcorp.com.au> Date: Sat, 03 May 2003 14:07:41 +1000 This argument applies to all objects. If you reference count everything which holds a reference to an object, you can infer the reference count of the object from the sum of reference counts of its referees. In practice, as you pointed out in an earlier mail (I think sockets were your example), doing this proves to be extremely painful. And we're feeling the pain now. Please ignore the example code I wrote in that email. Most of it is inconsistent and frankly garbage. :-) The "->can_unload()" check is actually simpler than we might initially suspect. Something like ipv6 might check: if (atomic_read(&inet6_sock_nr) == 0 && atomic_read(&inet6_dev_nr) == 0 && rt6_cache_empty()) return 1; return 0; Now, here is the important part! When this thing returns "1" the module.c code does this: call_rcu(&mod->rcu_head, mod->cleanup, NULL); This makes sure the guy who killed the last object has indeed left the module code. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-03 4:07 ` dev->destructor Rusty Russell 2003-05-03 3:46 ` dev->destructor David S. Miller 2003-05-03 4:00 ` dev->destructor David S. Miller @ 2003-05-05 16:08 ` Stephen Hemminger 2003-05-06 14:25 ` dev->destructor David S. Miller 2003-05-05 20:00 ` dev->destructor Stephen Hemminger 3 siblings, 1 reply; 28+ messages in thread From: Stephen Hemminger @ 2003-05-05 16:08 UTC (permalink / raw) To: Rusty Russell; +Cc: davem, kuznet, netdev, acme On Sat, 03 May 2003 14:07:41 +1000 Rusty Russell <rusty@rustcorp.com.au> wrote: > In message <20030502.134804.78707298.davem@redhat.com> you write: > > From: Rusty Russell <rusty@rustcorp.com.au> > > Date: Fri, 02 May 2003 15:25:15 +1000 > > > > If this is true, I think you can use the module reference count only, > > and your code will be faster, too. I can prepare the patch for you > > later tonight, to see how it looks. > > > > And where do we get the counter from when dev->owner is NULL > > (ie. non-modular)? We need the reference counting regardless of > > whether the device is implemented statically in the kernel or modular. > > But Alexey said you can only call unregister_netdev from module > unload, ie. if not a module, it can't be unloaded, hence no refcount > needed. I wrote the above paragraph because I'm not sure if I > understood Alexey correctly? There are several flavors of pseudo-network devices like bridging and VLAN that dynamically create/destroy netdev's even when they are not modules. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-05 16:08 ` dev->destructor Stephen Hemminger @ 2003-05-06 14:25 ` David S. Miller 2003-05-07 2:54 ` dev->destructor Rusty Russell 0 siblings, 1 reply; 28+ messages in thread From: David S. Miller @ 2003-05-06 14:25 UTC (permalink / raw) To: shemminger; +Cc: rusty, kuznet, netdev, acme From: Stephen Hemminger <shemminger@osdl.org> Date: Mon, 5 May 2003 09:08:20 -0700 On Sat, 03 May 2003 14:07:41 +1000 Rusty Russell <rusty@rustcorp.com.au> wrote: > But Alexey said you can only call unregister_netdev from module > unload, ie. if not a module, it can't be unloaded, hence no refcount > needed. I wrote the above paragraph because I'm not sure if I > understood Alexey correctly? There are several flavors of pseudo-network devices like bridging and VLAN that dynamically create/destroy netdev's even when they are not modules. I think you'll understand what Alexey/Rusty are saying better if you consider statically compiled kernel code as a module with an implicit non-zero reference count :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-06 14:25 ` dev->destructor David S. Miller @ 2003-05-07 2:54 ` Rusty Russell 0 siblings, 0 replies; 28+ messages in thread From: Rusty Russell @ 2003-05-07 2:54 UTC (permalink / raw) To: David S. Miller; +Cc: shemminger, rusty, kuznet, netdev, acme In message <20030506.072529.52888036.davem@redhat.com> you write: > From: Stephen Hemminger <shemminger@osdl.org> > Date: Mon, 5 May 2003 09:08:20 -0700 > > On Sat, 03 May 2003 14:07:41 +1000 > Rusty Russell <rusty@rustcorp.com.au> wrote: > > > But Alexey said you can only call unregister_netdev from module > > unload, ie. if not a module, it can't be unloaded, hence no refcount > > needed. I wrote the above paragraph because I'm not sure if I > > understood Alexey correctly? > > There are several flavors of pseudo-network devices like bridging > and VLAN that dynamically create/destroy netdev's even when they > are not modules. > > I think you'll understand what Alexey/Rusty are saying better > if you consider statically compiled kernel code as a module with > an implicit non-zero reference count :-) Yes, but his point is valid. We *do* want to destroy netdev's at random times, not just from module cleanup code. Hotplug, for example. So me saying "just rely on the owner refcnt" was wrong. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-03 4:07 ` dev->destructor Rusty Russell ` (2 preceding siblings ...) 2003-05-05 16:08 ` dev->destructor Stephen Hemminger @ 2003-05-05 20:00 ` Stephen Hemminger 2003-05-06 4:18 ` dev->destructor Rusty Russell 3 siblings, 1 reply; 28+ messages in thread From: Stephen Hemminger @ 2003-05-05 20:00 UTC (permalink / raw) To: Rusty Russell; +Cc: davem, kuznet, netdev, acme [-- Attachment #1: Type: text/plain, Size: 2385 bytes --] As an experiment, tried acquiring module ref count every time network device is ref counted. The result is discovering that there are cases in the Ethernet module init path where there is a call to dev_hold() without a previous explicit ref count. kernel BUG at include/linux/module.h:284! invalid operand: 0000 [#1] CPU: 0 EIP: 0060:[<c028fd02>] Not tainted EFLAGS: 00010246 EIP is at linkwatch_fire_event+0x170/0x1a3 eax: 00000000 ebx: c047fad0 ecx: 00000020 edx: f88c8100 esi: f88c7100 edi: f6fa7000 ebp: f6f15de4 esp: f6f15dc8 ds: 007b es: 007b ss: 0068 Process modprobe (pid: 408, threadinfo=f6f14000 task=f78f46a0) Stack: f88c7100 c03f7008 00000246 f6f14000 f6fa7000 00000000 fffc829b f6f15df8 f88c0ab9 f6fa7000 f6fa71e0 033002a8 f6f15e24 f88c00e0 f6fa71e0 00007148 c03e2e80 f6fa7320 c011eb46 ffffffef f6f15e3e f6fa71e0 fffc829b f6f15e50 Call Trace: [<f88c7100>] +0x0/0x1180 [e100] [<f88c0ab9>] e100_update_link_state+0x97/0xa2 [e100] [<f88c00e0>] e100_find_speed_duplex+0x20/0x26a [e100] [<c011eb46>] sys_sched_yield+0xc0/0xfe [<f88c07de>] e100_auto_neg+0x114/0x11c [e100] [<c01fa4f8>] __delay+0x14/0x18 [<f88c081d>] e100_phy_set_speed_duplex+0x37/0xa4 [e100] [<f88c0997>] e100_phy_init+0x69/0x78 [e100] [<f88ba1dc>] e100_hw_init+0x14/0x11e [e100] [<f88bc432>] e100_rd_pwa_no+0x32/0x40 [e100] [<f88ba058>] e100_init+0xf6/0x126 [e100] [<f88b9273>] e100_found1+0x1a9/0x42e [e100] [<f88c5b25>] e100_driver_version+0x0/0xb [e100] [<f88c5e40>] e100_driver+0x0/0xa0 [e100] [<c01fe884>] pci_device_probe+0x5a/0x68 [<f88c5b60>] e100_id_table+0x0/0x2e0 [e100] [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<c0241853>] bus_match+0x43/0x6e [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<c0241956>] driver_attach+0x5c/0x60 [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<c0241c2a>] bus_add_driver+0xb2/0xc8 [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<f88c7100>] +0x0/0x1180 [e100] [<c01fe99a>] pci_register_driver+0x46/0x56 [<f88c5e68>] e100_driver+0x28/0xa0 [e100] [<f880c015>] +0x15/0x3e [e100] [<f88c5e40>] e100_driver+0x0/0xa0 [e100] [<c013b034>] sys_init_module+0x1b0/0x292 all_call+0x7/0xb Code: 0f 0b 1c 01 97 5a 32 c0 e9 d9 fe ff ff c7 04 24 0c 00 00 00 ./ifup: line 91: 408 Segmentation fault modprobe $1 >/dev/null 2>&1 [-- Attachment #2: netdev-module.diff --] [-- Type: application/octet-stream, Size: 3843 bytes --] diff -urNp -X dontdiff linux-2.5/include/asm-i386/asm_offsets.h linux-2.5-dev/include/asm-i386/asm_offsets.h --- linux-2.5/include/asm-i386/asm_offsets.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.5-dev/include/asm-i386/asm_offsets.h 2003-05-05 10:11:50.000000000 -0700 @@ -0,0 +1,22 @@ +#ifndef __ASM_OFFSETS_H__ +#define __ASM_OFFSETS_H__ +/* + * DO NOT MODIFY. + * + * This file was generated by arch/i386/Makefile + * + */ + +#define SIGCONTEXT_eax 44 /* offsetof (struct sigcontext, eax) */ +#define SIGCONTEXT_ebx 32 /* offsetof (struct sigcontext, ebx) */ +#define SIGCONTEXT_ecx 40 /* offsetof (struct sigcontext, ecx) */ +#define SIGCONTEXT_edx 36 /* offsetof (struct sigcontext, edx) */ +#define SIGCONTEXT_esi 20 /* offsetof (struct sigcontext, esi) */ +#define SIGCONTEXT_edi 16 /* offsetof (struct sigcontext, edi) */ +#define SIGCONTEXT_ebp 24 /* offsetof (struct sigcontext, ebp) */ +#define SIGCONTEXT_esp 28 /* offsetof (struct sigcontext, esp) */ +#define SIGCONTEXT_eip 56 /* offsetof (struct sigcontext, eip) */ + +#define RT_SIGFRAME_sigcontext 164 /* offsetof (struct rt_sigframe, uc.uc_mcontext) */ + +#endif diff -urNp -X dontdiff linux-2.5/include/linux/netdevice.h linux-2.5-dev/include/linux/netdevice.h --- linux-2.5/include/linux/netdevice.h 2003-04-14 13:32:21.000000000 -0700 +++ linux-2.5-dev/include/linux/netdevice.h 2003-05-05 10:06:19.000000000 -0700 @@ -29,6 +29,7 @@ #include <linux/if_ether.h> #include <linux/if_packet.h> #include <linux/kobject.h> +#include <linux/module.h> #include <asm/atomic.h> #include <asm/cache.h> @@ -629,12 +630,32 @@ extern int netdev_finish_unregister(stru static inline void dev_put(struct net_device *dev) { + module_put(dev->owner); if (atomic_dec_and_test(&dev->refcnt)) netdev_finish_unregister(dev); } -#define __dev_put(dev) atomic_dec(&(dev)->refcnt) -#define dev_hold(dev) atomic_inc(&(dev)->refcnt) +static inline void __dev_put(struct net_device *dev) +{ + module_put(dev->owner); + atomic_dec(&dev->refcnt); +} + +static inline void dev_hold(struct net_device *dev) +{ + __module_get(dev->owner); + atomic_inc(&dev->refcnt); +} + +static inline int dev_try_hold(struct net_device *dev) +{ + int ret = 0; + if (try_module_get(dev->owner)){ + atomic_inc(&dev->refcnt); + ret = 1; + } + return ret; +} /* Carrier loss detection, dial on demand. The functions netif_carrier_on * and _off may be called from IRQ context, but it is caller diff -urNp -X dontdiff linux-2.5/net/core/dev.c linux-2.5-dev/net/core/dev.c --- linux-2.5/net/core/dev.c 2003-05-05 09:41:03.000000000 -0700 +++ linux-2.5-dev/net/core/dev.c 2003-05-05 10:06:20.000000000 -0700 @@ -1,3 +1,4 @@ +#define NET_REFCNT_DEBUG 1 /* * NET3 Protocol independent device support routines. * @@ -440,8 +441,8 @@ struct net_device *dev_get_by_name(const read_lock(&dev_base_lock); dev = __dev_get_by_name(name); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -513,8 +514,8 @@ struct net_device *dev_get_by_index(int read_lock(&dev_base_lock); dev = __dev_get_by_index(ifindex); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -563,8 +564,8 @@ struct net_device * dev_get_by_flags(uns read_lock(&dev_base_lock); dev = __dev_get_by_flags(if_flags, mask); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -2609,10 +2610,11 @@ int register_netdevice(struct net_device set_bit(__LINK_STATE_PRESENT, &dev->state); dev->next = NULL; + atomic_inc(&dev->refcnt); dev_init_scheduler(dev); write_lock_bh(&dev_base_lock); *dp = dev; - dev_hold(dev); + dev->deadbeaf = 0; write_unlock_bh(&dev_base_lock); ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-05 20:00 ` dev->destructor Stephen Hemminger @ 2003-05-06 4:18 ` Rusty Russell 2003-05-06 14:23 ` dev->destructor David S. Miller 2003-05-06 22:35 ` dev->destructor Stephen Hemminger 0 siblings, 2 replies; 28+ messages in thread From: Rusty Russell @ 2003-05-06 4:18 UTC (permalink / raw) To: Stephen Hemminger; +Cc: davem, kuznet, netdev, acme In message <20030505130050.4b9868bb.shemminger@osdl.org> you write: > As an experiment, tried acquiring module ref count every time network > device is ref counted. Brave man 8) > The result is discovering that there are cases in the Ethernet > module init path where there is a call to dev_hold() without a > previous explicit ref count. Well caught: this is in fact a false alarm. Coming, as we do, out of module_init(), we actually hold an implicit reference. It's logically consistent to make it implicit, and cuts out some code in the unload path. How's this? Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. Name: Hold Initial Reference To Module Author: Rusty Russell Status: Tested on 2.5.69 D: __module_get is theoretically allowed on module inside init, since D: we already hold an implicit reference. Currently this BUG()s: make D: the reference count explicit, which also simplifies delete path. diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.69/kernel/module.c working-2.5.69-module-init-ref/kernel/module.c --- linux-2.5.69/kernel/module.c 2003-05-05 12:37:13.000000000 +1000 +++ working-2.5.69-module-init-ref/kernel/module.c 2003-05-06 12:11:18.000000000 +1000 @@ -214,6 +214,8 @@ static void module_unload_init(struct mo INIT_LIST_HEAD(&mod->modules_which_use_me); for (i = 0; i < NR_CPUS; i++) atomic_set(&mod->ref[i].count, 0); + /* Hold reference count during initialization. */ + atomic_set(&mod->ref[smp_processor_id()].count, 1); /* Backwards compatibility macros put refcount during init. */ mod->waiter = current; } @@ -500,16 +502,6 @@ sys_delete_module(const char __user *nam goto out; } - /* Coming up? Allow force on stuck modules. */ - if (mod->state == MODULE_STATE_COMING) { - forced = try_force(flags); - if (!forced) { - /* This module can't be removed */ - ret = -EBUSY; - goto out; - } - } - /* If it has an init func, it must have an exit func to unload */ if ((mod->init != init_module && mod->exit == cleanup_module) || mod->unsafe) { @@ -1448,6 +1440,7 @@ sys_init_module(void __user *umod, printk(KERN_ERR "%s: module is now stuck!\n", mod->name); else { + module_put(mod); down(&module_mutex); free_module(mod); up(&module_mutex); @@ -1463,6 +1456,9 @@ sys_init_module(void __user *umod, mod->init_size = 0; up(&module_mutex); + /* Drop initial reference. */ + module_put(mod); + return 0; } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-06 4:18 ` dev->destructor Rusty Russell @ 2003-05-06 14:23 ` David S. Miller 2003-05-06 22:32 ` [PATCH 2.5.69] IPV4 should use dev_hold Stephen Hemminger 2003-05-07 2:50 ` dev->destructor Rusty Russell 2003-05-06 22:35 ` dev->destructor Stephen Hemminger 1 sibling, 2 replies; 28+ messages in thread From: David S. Miller @ 2003-05-06 14:23 UTC (permalink / raw) To: rusty; +Cc: shemminger, kuznet, netdev, acme From: Rusty Russell <rusty@rustcorp.com.au> Date: Tue, 06 May 2003 14:18:36 +1000 It's logically consistent to make it implicit, and cuts out some code in the unload path. How's this? This looks fine to me. How hard would it be to make this completely consistent in that no module code is ever invoked with modcount == 0? By this I mean keeping the implicit reference after modload succeeds, and then calling ->cleanup() is valid once the count drops to '1'. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2.5.69] IPV4 should use dev_hold 2003-05-06 14:23 ` dev->destructor David S. Miller @ 2003-05-06 22:32 ` Stephen Hemminger 2003-05-07 7:32 ` David S. Miller 2003-05-07 2:50 ` dev->destructor Rusty Russell 1 sibling, 1 reply; 28+ messages in thread From: Stephen Hemminger @ 2003-05-06 22:32 UTC (permalink / raw) To: David S. Miller; +Cc: rusty, kuznet, netdev, acme David, While debugging possible changes to dev refcounting, discovered a couple of places in IPV4 that were directly incrementing rather than using the inline dev_hold. Let's hide the implementation of net device ref counting so future module ref count fixes will be easier. diff -urN -X dontdiff linux-2.5/net/ipv4/fib_frontend.c linux-2.5-dev/net/ipv4/fib_frontend.c --- linux-2.5/net/ipv4/fib_frontend.c 2003-04-14 13:32:26.000000000 -0700 +++ linux-2.5-dev/net/ipv4/fib_frontend.c 2003-05-06 15:16:03.000000000 -0700 @@ -115,9 +115,9 @@ if (res.type != RTN_LOCAL) goto out; dev = FIB_RES_DEV(res); - if (dev) - atomic_inc(&dev->refcnt); + if (dev) + dev_hold(dev); out: fib_res_put(&res); return dev; diff -urN -X dontdiff linux-2.5/net/ipv4/fib_semantics.c linux-2.5-dev/net/ipv4/fib_semantics.c --- linux-2.5/net/ipv4/fib_semantics.c 2003-04-29 09:57:41.000000000 -0700 +++ linux-2.5-dev/net/ipv4/fib_semantics.c 2003-05-06 15:10:40.000000000 -0700 @@ -406,7 +406,7 @@ if (!(dev->flags&IFF_UP)) return -ENETDOWN; nh->nh_dev = dev; - atomic_inc(&dev->refcnt); + dev_hold(dev); nh->nh_scope = RT_SCOPE_LINK; return 0; } @@ -429,7 +429,7 @@ nh->nh_oif = FIB_RES_OIF(res); if ((nh->nh_dev = FIB_RES_DEV(res)) == NULL) goto out; - atomic_inc(&nh->nh_dev->refcnt); + dev_hold(nh->nh_dev); err = -ENETDOWN; if (!(nh->nh_dev->flags & IFF_UP)) goto out; @@ -451,7 +451,7 @@ return -ENETDOWN; } nh->nh_dev = in_dev->dev; - atomic_inc(&nh->nh_dev->refcnt); + dev_hold(nh->nh_dev); nh->nh_scope = RT_SCOPE_HOST; in_dev_put(in_dev); } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2.5.69] IPV4 should use dev_hold 2003-05-06 22:32 ` [PATCH 2.5.69] IPV4 should use dev_hold Stephen Hemminger @ 2003-05-07 7:32 ` David S. Miller 0 siblings, 0 replies; 28+ messages in thread From: David S. Miller @ 2003-05-07 7:32 UTC (permalink / raw) To: shemminger; +Cc: rusty, kuznet, netdev, acme From: Stephen Hemminger <shemminger@osdl.org> Date: Tue, 6 May 2003 15:32:34 -0700 While debugging possible changes to dev refcounting, discovered a couple of places in IPV4 that were directly incrementing rather than using the inline dev_hold. Let's hide the implementation of net device ref counting so future module ref count fixes will be easier. Applied, thanks. If you find any more, just send along another patch :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-06 14:23 ` dev->destructor David S. Miller 2003-05-06 22:32 ` [PATCH 2.5.69] IPV4 should use dev_hold Stephen Hemminger @ 2003-05-07 2:50 ` Rusty Russell 2003-05-07 3:58 ` dev->destructor David S. Miller 1 sibling, 1 reply; 28+ messages in thread From: Rusty Russell @ 2003-05-07 2:50 UTC (permalink / raw) To: David S. Miller; +Cc: shemminger, kuznet, netdev, acme In message <20030506.072338.39479306.davem@redhat.com> you write: > From: Rusty Russell <rusty@rustcorp.com.au> > Date: Tue, 06 May 2003 14:18:36 +1000 > > It's logically consistent to make it implicit, and cuts out some > code in the unload path. > > How's this? > > This looks fine to me. > > How hard would it be to make this completely consistent in that > no module code is ever invoked with modcount == 0? By this I mean > keeping the implicit reference after modload succeeds, and then > calling ->cleanup() is valid once the count drops to '1'. I had to think hard about this 8). There's nothing *wrong* with leaving the refcount at 1: it's just a matter of adjusting various checks for 0 to 1 (including the BUG() Stephen hit), and subbing 1 in /proc/modules... But that's a sideshow: you'd *still* want an extra refcount around the call to module_init(). Because it's the module state being MODULE_STATE_COMING which stops the module from being unloaded, ie. holds the extra reference count. Once init is finished, module state goes to MODULE_STATE_LIVE, and the reference must be dropped. Now, grabbing a similar reference when deleting a module might make sense, but it's actually kinda pointless when you look at the code. Hope that clarifies, Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-07 2:50 ` dev->destructor Rusty Russell @ 2003-05-07 3:58 ` David S. Miller 0 siblings, 0 replies; 28+ messages in thread From: David S. Miller @ 2003-05-07 3:58 UTC (permalink / raw) To: rusty; +Cc: shemminger, kuznet, netdev, acme From: Rusty Russell <rusty@rustcorp.com.au> Date: Wed, 07 May 2003 12:50:07 +1000 But that's a sideshow: you'd *still* want an extra refcount around the call to module_init(). ... Now, grabbing a similar reference when deleting a module might make sense, but it's actually kinda pointless when you look at the code. Indeed, forget my idea ;-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: dev->destructor 2003-05-06 4:18 ` dev->destructor Rusty Russell 2003-05-06 14:23 ` dev->destructor David S. Miller @ 2003-05-06 22:35 ` Stephen Hemminger 2003-05-06 23:51 ` [RFC] Experiment with dev and module ref counts Stephen Hemminger 1 sibling, 1 reply; 28+ messages in thread From: Stephen Hemminger @ 2003-05-06 22:35 UTC (permalink / raw) To: Rusty Russell; +Cc: davem, kuznet, netdev, acme On Tue, 06 May 2003 14:18:36 +1000 Rusty Russell <rusty@rustcorp.com.au> wrote: > In message <20030505130050.4b9868bb.shemminger@osdl.org> you write: > > As an experiment, tried acquiring module ref count every time network > > device is ref counted. > > Brave man 8) > > > The result is discovering that there are cases in the Ethernet > > module init path where there is a call to dev_hold() without a > > previous explicit ref count. > > Well caught: this is in fact a false alarm. Coming, as we do, out of > module_init(), we actually hold an implicit reference. > > It's logically consistent to make it implicit, and cuts out some code > in the unload path. > > How's this? > Rusty. Thanks, with that change and the following patches, the system does boot and correctly ref counts the modules. Still have problems on unregister and shutdown, but it is a start. diff -urN -X dontdiff linux-2.5/include/linux/netdevice.h linux-2.5-dev/include/linux/netdevice.h --- linux-2.5/include/linux/netdevice.h 2003-04-14 13:32:21.000000000 -0700 +++ linux-2.5-dev/include/linux/netdevice.h 2003-05-06 15:11:25.000000000 -0700 @@ -29,6 +29,7 @@ #include <linux/if_ether.h> #include <linux/if_packet.h> #include <linux/kobject.h> +#include <linux/module.h> #include <asm/atomic.h> #include <asm/cache.h> @@ -629,12 +630,32 @@ static inline void dev_put(struct net_device *dev) { + module_put(dev->owner); if (atomic_dec_and_test(&dev->refcnt)) netdev_finish_unregister(dev); } -#define __dev_put(dev) atomic_dec(&(dev)->refcnt) -#define dev_hold(dev) atomic_inc(&(dev)->refcnt) +static inline void __dev_put(struct net_device *dev) +{ + module_put(dev->owner); + atomic_dec(&dev->refcnt); +} + +static inline void dev_hold(struct net_device *dev) +{ + __module_get(dev->owner); + atomic_inc(&dev->refcnt); +} + +static inline int dev_try_hold(struct net_device *dev) +{ + int ret = 0; + if (try_module_get(dev->owner)){ + atomic_inc(&dev->refcnt); + ret = 1; + } + return ret; +} /* Carrier loss detection, dial on demand. The functions netif_carrier_on * and _off may be called from IRQ context, but it is caller diff -urN -X dontdiff linux-2.5/net/core/dev.c linux-2.5-dev/net/core/dev.c --- linux-2.5/net/core/dev.c 2003-05-05 09:41:03.000000000 -0700 +++ linux-2.5-dev/net/core/dev.c 2003-05-06 15:12:24.000000000 -0700 @@ -1,3 +1,4 @@ +#define NET_REFCNT_DEBUG 1 /* * NET3 Protocol independent device support routines. * @@ -440,8 +441,8 @@ read_lock(&dev_base_lock); dev = __dev_get_by_name(name); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -513,8 +514,8 @@ read_lock(&dev_base_lock); dev = __dev_get_by_index(ifindex); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -563,8 +564,8 @@ read_lock(&dev_base_lock); dev = __dev_get_by_flags(if_flags, mask); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -1312,7 +1313,9 @@ goto drop; enqueue: - dev_hold(skb->dev); + if (!dev_try_hold(skb->dev)) + goto drop; + __skb_queue_tail(&queue->input_pkt_queue, skb); #ifndef OFFLINE_SAMPLE get_sample_stats(this_cpu); @@ -1990,9 +1993,8 @@ ASSERT_RTNL(); if (master) { - if (old) + if (old || !dev_try_hold(master)) return -EBUSY; - dev_hold(master); } br_write_lock_bh(BR_NETPROTO_LOCK); @@ -2609,10 +2611,11 @@ set_bit(__LINK_STATE_PRESENT, &dev->state); dev->next = NULL; + atomic_inc(&dev->refcnt); dev_init_scheduler(dev); write_lock_bh(&dev_base_lock); *dp = dev; - dev_hold(dev); + dev->deadbeaf = 0; write_unlock_bh(&dev_base_lock); @@ -2900,7 +2903,11 @@ #endif dev->xmit_lock_owner = -1; dev->iflink = -1; - dev_hold(dev); + + if (!dev_try_hold(dev)) { + dev->deadbeaf = 1; + dp = &dev->next; + } /* * Allocate name. If the init() fails diff -urN -X dontdiff linux-2.5/net/ipv4/devinet.c linux-2.5-dev/net/ipv4/devinet.c --- linux-2.5/net/ipv4/devinet.c 2003-04-14 13:32:26.000000000 -0700 +++ linux-2.5-dev/net/ipv4/devinet.c 2003-05-06 15:16:03.000000000 -0700 @@ -559,6 +559,9 @@ if ((dev = __dev_get_by_name(ifr.ifr_name)) == NULL) goto done; + if (!dev_try_hold(dev)) + goto done; + if (colon) *colon = ':'; @@ -591,7 +594,7 @@ ret = -EADDRNOTAVAIL; if (!ifa && cmd != SIOCSIFADDR && cmd != SIOCSIFFLAGS) - goto done; + goto put; switch(cmd) { case SIOCGIFADDR: /* Get interface address */ @@ -700,12 +703,15 @@ } break; } +put: + dev_put(dev); done: rtnl_unlock(); dev_probe_unlock(); out: return ret; rarok: + dev_put(dev); rtnl_unlock(); dev_probe_unlock(); ret = copy_to_user(arg, &ifr, sizeof(struct ifreq)) ? -EFAULT : 0; ^ permalink raw reply [flat|nested] 28+ messages in thread
* [RFC] Experiment with dev and module ref counts 2003-05-06 22:35 ` dev->destructor Stephen Hemminger @ 2003-05-06 23:51 ` Stephen Hemminger 0 siblings, 0 replies; 28+ messages in thread From: Stephen Hemminger @ 2003-05-06 23:51 UTC (permalink / raw) To: rusty, davem, kuznet, acme; +Cc: netdev This patch ties network device refcounts and module ref counts together. It works for IPV4 and the dev->destructor problem is fixed, at least for the Ethernet bridge device. Unregister and shutdown work correctly, and modules can be removed when expected. BUT: lots of dev_hold usage would need more auditing. No idea what the performance impact is yet. diff -urNp -X dontdiff linux-2.5/include/linux/netdevice.h linux-2.5-dev/include/linux/netdevice.h --- linux-2.5/include/linux/netdevice.h 2003-04-14 13:32:21.000000000 -0700 +++ linux-2.5-dev/include/linux/netdevice.h 2003-05-06 15:11:25.000000000 -0700 @@ -29,6 +29,7 @@ #include <linux/if_ether.h> #include <linux/if_packet.h> #include <linux/kobject.h> +#include <linux/module.h> #include <asm/atomic.h> #include <asm/cache.h> @@ -629,12 +630,32 @@ extern int netdev_finish_unregister(stru static inline void dev_put(struct net_device *dev) { + module_put(dev->owner); if (atomic_dec_and_test(&dev->refcnt)) netdev_finish_unregister(dev); } -#define __dev_put(dev) atomic_dec(&(dev)->refcnt) -#define dev_hold(dev) atomic_inc(&(dev)->refcnt) +static inline void __dev_put(struct net_device *dev) +{ + module_put(dev->owner); + atomic_dec(&dev->refcnt); +} + +static inline void dev_hold(struct net_device *dev) +{ + __module_get(dev->owner); + atomic_inc(&dev->refcnt); +} + +static inline int dev_try_hold(struct net_device *dev) +{ + int ret = 0; + if (try_module_get(dev->owner)){ + atomic_inc(&dev->refcnt); + ret = 1; + } + return ret; +} /* Carrier loss detection, dial on demand. The functions netif_carrier_on * and _off may be called from IRQ context, but it is caller diff -urNp -X dontdiff linux-2.5/kernel/module.c linux-2.5-dev/kernel/module.c --- linux-2.5/kernel/module.c 2003-04-30 11:19:23.000000000 -0700 +++ linux-2.5-dev/kernel/module.c 2003-05-06 13:13:20.000000000 -0700 @@ -214,6 +214,8 @@ static void module_unload_init(struct mo INIT_LIST_HEAD(&mod->modules_which_use_me); for (i = 0; i < NR_CPUS; i++) atomic_set(&mod->ref[i].count, 0); + /* Hold reference count during initialization. */ + atomic_set(&mod->ref[smp_processor_id()].count, 1); /* Backwards compatibility macros put refcount during init. */ mod->waiter = current; } @@ -500,16 +502,6 @@ sys_delete_module(const char __user *nam goto out; } - /* Coming up? Allow force on stuck modules. */ - if (mod->state == MODULE_STATE_COMING) { - forced = try_force(flags); - if (!forced) { - /* This module can't be removed */ - ret = -EBUSY; - goto out; - } - } - /* If it has an init func, it must have an exit func to unload */ if ((mod->init != init_module && mod->exit == cleanup_module) || mod->unsafe) { @@ -1448,6 +1440,7 @@ sys_init_module(void __user *umod, printk(KERN_ERR "%s: module is now stuck!\n", mod->name); else { + module_put(mod); down(&module_mutex); free_module(mod); up(&module_mutex); @@ -1463,6 +1456,9 @@ sys_init_module(void __user *umod, mod->init_size = 0; up(&module_mutex); + /* Drop initial reference. */ + module_put(mod); + return 0; } diff -urNp -X dontdiff linux-2.5/net/core/dev.c linux-2.5-dev/net/core/dev.c --- linux-2.5/net/core/dev.c 2003-05-05 09:41:03.000000000 -0700 +++ linux-2.5-dev/net/core/dev.c 2003-05-06 16:07:15.000000000 -0700 @@ -1,3 +1,4 @@ +#define NET_REFCNT_DEBUG 1 /* * NET3 Protocol independent device support routines. * @@ -440,8 +441,8 @@ struct net_device *dev_get_by_name(const read_lock(&dev_base_lock); dev = __dev_get_by_name(name); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -513,8 +514,8 @@ struct net_device *dev_get_by_index(int read_lock(&dev_base_lock); dev = __dev_get_by_index(ifindex); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -563,8 +564,8 @@ struct net_device * dev_get_by_flags(uns read_lock(&dev_base_lock); dev = __dev_get_by_flags(if_flags, mask); - if (dev) - dev_hold(dev); + if (dev && !dev_try_hold(dev)) + dev = NULL; read_unlock(&dev_base_lock); return dev; } @@ -1312,7 +1313,9 @@ int netif_rx(struct sk_buff *skb) goto drop; enqueue: - dev_hold(skb->dev); + if (!dev_try_hold(skb->dev)) + goto drop; + __skb_queue_tail(&queue->input_pkt_queue, skb); #ifndef OFFLINE_SAMPLE get_sample_stats(this_cpu); @@ -1990,9 +1993,8 @@ int netdev_set_master(struct net_device ASSERT_RTNL(); if (master) { - if (old) + if (old || !dev_try_hold(master)) return -EBUSY; - dev_hold(master); } br_write_lock_bh(BR_NETPROTO_LOCK); @@ -2609,10 +2611,11 @@ int register_netdevice(struct net_device set_bit(__LINK_STATE_PRESENT, &dev->state); dev->next = NULL; + atomic_inc(&dev->refcnt); dev_init_scheduler(dev); write_lock_bh(&dev_base_lock); *dp = dev; - dev_hold(dev); + dev->deadbeaf = 0; write_unlock_bh(&dev_base_lock); @@ -2809,7 +2812,9 @@ int unregister_netdevice(struct net_devi } out: kobject_unregister(&dev->kobj); - dev_put(dev); + + atomic_dec(&dev->refcnt); + return 0; } @@ -2900,7 +2905,11 @@ static int __init net_dev_init(void) #endif dev->xmit_lock_owner = -1; dev->iflink = -1; - dev_hold(dev); + + if (!dev_try_hold(dev)) { + dev->deadbeaf = 1; + dp = &dev->next; + } /* * Allocate name. If the init() fails diff -urNp -X dontdiff linux-2.5/net/ipv4/devinet.c linux-2.5-dev/net/ipv4/devinet.c --- linux-2.5/net/ipv4/devinet.c 2003-04-14 13:32:26.000000000 -0700 +++ linux-2.5-dev/net/ipv4/devinet.c 2003-05-06 15:16:03.000000000 -0700 @@ -559,6 +559,9 @@ int devinet_ioctl(unsigned int cmd, void if ((dev = __dev_get_by_name(ifr.ifr_name)) == NULL) goto done; + if (!dev_try_hold(dev)) + goto done; + if (colon) *colon = ':'; @@ -591,7 +594,7 @@ int devinet_ioctl(unsigned int cmd, void ret = -EADDRNOTAVAIL; if (!ifa && cmd != SIOCSIFADDR && cmd != SIOCSIFFLAGS) - goto done; + goto put; switch(cmd) { case SIOCGIFADDR: /* Get interface address */ @@ -700,12 +703,15 @@ int devinet_ioctl(unsigned int cmd, void } break; } +put: + dev_put(dev); done: rtnl_unlock(); dev_probe_unlock(); out: return ret; rarok: + dev_put(dev); rtnl_unlock(); dev_probe_unlock(); ret = copy_to_user(arg, &ifr, sizeof(struct ifreq)) ? -EFAULT : 0; diff -urNp -X dontdiff linux-2.5/net/ipv4/fib_frontend.c linux-2.5-dev/net/ipv4/fib_frontend.c --- linux-2.5/net/ipv4/fib_frontend.c 2003-04-14 13:32:26.000000000 -0700 +++ linux-2.5-dev/net/ipv4/fib_frontend.c 2003-05-06 15:16:03.000000000 -0700 @@ -115,9 +115,9 @@ struct net_device * ip_dev_find(u32 addr if (res.type != RTN_LOCAL) goto out; dev = FIB_RES_DEV(res); - if (dev) - atomic_inc(&dev->refcnt); + if (dev) + dev_hold(dev); out: fib_res_put(&res); return dev; diff -urNp -X dontdiff linux-2.5/net/ipv4/fib_semantics.c linux-2.5-dev/net/ipv4/fib_semantics.c --- linux-2.5/net/ipv4/fib_semantics.c 2003-04-29 09:57:41.000000000 -0700 +++ linux-2.5-dev/net/ipv4/fib_semantics.c 2003-05-06 15:10:40.000000000 -0700 @@ -406,7 +406,7 @@ static int fib_check_nh(const struct rtm if (!(dev->flags&IFF_UP)) return -ENETDOWN; nh->nh_dev = dev; - atomic_inc(&dev->refcnt); + dev_hold(dev); nh->nh_scope = RT_SCOPE_LINK; return 0; } @@ -429,7 +429,7 @@ static int fib_check_nh(const struct rtm nh->nh_oif = FIB_RES_OIF(res); if ((nh->nh_dev = FIB_RES_DEV(res)) == NULL) goto out; - atomic_inc(&nh->nh_dev->refcnt); + dev_hold(nh->nh_dev); err = -ENETDOWN; if (!(nh->nh_dev->flags & IFF_UP)) goto out; @@ -451,7 +451,7 @@ out: return -ENETDOWN; } nh->nh_dev = in_dev->dev; - atomic_inc(&nh->nh_dev->refcnt); + dev_hold(nh->nh_dev); nh->nh_scope = RT_SCOPE_HOST; in_dev_put(in_dev); } ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2003-05-07 7:32 UTC | newest] Thread overview: 28+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2003-04-30 6:26 dev->destructor David S. Miller 2003-04-30 16:33 ` dev->destructor Stephen Hemminger 2003-05-01 1:10 ` dev->destructor kuznet 2003-05-01 7:00 ` dev->destructor David S. Miller 2003-05-01 12:01 ` dev->destructor Rusty Russell 2003-05-01 11:09 ` dev->destructor David S. Miller 2003-05-01 17:51 ` dev->destructor Arnaldo Carvalho de Melo 2003-05-01 16:55 ` dev->destructor David S. Miller 2003-05-01 17:28 ` dev->destructor Arnaldo Carvalho de Melo 2003-05-02 4:06 ` dev->destructor kuznet 2003-05-02 5:25 ` dev->destructor Rusty Russell 2003-05-02 20:48 ` dev->destructor David S. Miller 2003-05-03 4:07 ` dev->destructor Rusty Russell 2003-05-03 3:46 ` dev->destructor David S. Miller 2003-05-05 5:18 ` dev->destructor Rusty Russell 2003-05-03 4:00 ` dev->destructor David S. Miller 2003-05-05 16:08 ` dev->destructor Stephen Hemminger 2003-05-06 14:25 ` dev->destructor David S. Miller 2003-05-07 2:54 ` dev->destructor Rusty Russell 2003-05-05 20:00 ` dev->destructor Stephen Hemminger 2003-05-06 4:18 ` dev->destructor Rusty Russell 2003-05-06 14:23 ` dev->destructor David S. Miller 2003-05-06 22:32 ` [PATCH 2.5.69] IPV4 should use dev_hold Stephen Hemminger 2003-05-07 7:32 ` David S. Miller 2003-05-07 2:50 ` dev->destructor Rusty Russell 2003-05-07 3:58 ` dev->destructor David S. Miller 2003-05-06 22:35 ` dev->destructor Stephen Hemminger 2003-05-06 23:51 ` [RFC] Experiment with dev and module ref counts Stephen Hemminger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).