netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
To: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: santosh.shilimkar@oracle.com, davem@davemloft.net,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	rds-devel@oss.oracle.com, edumazet@google.com
Subject: Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
Date: Sat, 17 Mar 2018 10:15:07 -0400	[thread overview]
Message-ID: <20180317141507.GC873@oracle.com> (raw)
In-Reply-To: <82d6e954-8043-078c-266c-2f1ac992f864@virtuozzo.com>


I spent a long time staring at both v1 and v2 of your patch.

I understand the overall goal, but I am afraid to say that these
patches are complete hacks.

I was trying to understand why patchv1 blows with a null rtn in 
rds_tcp_init_net, but v2 does not, and the analysis is ugly.  

I'm going to put down the analysis here, and others can
decide if this sort of hack is a healthy solution for a scaling
issue (IMHO  it is not- we should get the real fix for the
scaling instead of using duck-tape-and-chewing-gum solutions)

What is happening in v1 is this:

1. Wnen I do "modprobe rds_tcp" in init_net, we end up doing the
   following in rds_tcp_init
       register_pernet_device(&rds_tcp_dev_ops);
       register_pernet_device(&rds_tcp_net_ops);
   Where rds_tcp_dev_ops has 
        .id = &rds_tcp_netid,
        .size = sizeof(struct rds_tcp_net),
   and rds_tcp_net_ops has 0 values for both of these.

2. So now pernet_list has &rds_tcp_net_ops as the first member of the
   pernet_list.

3. Now I do "ip netns create blue". As part of setup_net(), we walk
   the pernet_list and call the ->init of each member (through ops_init()). 
   So we'd hit rds_tcp_net_ops first. Since the id/size are 0, we'd
   skip the struct rds_tcp_net allocation, so rds_tcp_init_net would 
   find a null return from net_generic() and bomb.

The way I view it (and ymmv) the hack here is to call both
register_pernet_device and register_pernet_subsys: the kernel only
guarantees that calling *one* of register_pernet_* will ensure
that you can safely call net_generic() afterwards.

The v2 patch "works around" this by reordering the registration. 
So this time, init_net will set up the rds_tcp_net_ops as the second 
member, and the first memeber will be the pernet_operations struct 
that has non-zero id and size.

But then the unregistration (necessarily) works in the opposite order
you have to unregister_pernet_device first (so that interfaces are
quiesced) and then unregister_pernet_subsys() so that sockets can
be safely quiesced.

I dont think this type of hack makes the code cleaner, it just
make things much harder to understand, and completely brittle
for subsequent changes.

To solve the scaling problem why not just have a well-defined 
callback to modules when devices are quiesced, instead of 
overloading the pernet_device registration in this obscure way?

  parent reply	other threads:[~2018-03-17 14:15 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-16 12:38 [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL) Kirill Tkhai
2018-03-16 13:00 ` Sowmini Varadhan
2018-03-16 13:17   ` Kirill Tkhai
2018-03-16 13:53     ` Sowmini Varadhan
2018-03-16 14:36       ` Kirill Tkhai
2018-03-16 14:41         ` Kirill Tkhai
2018-03-16 17:29     ` Sowmini Varadhan
2018-03-16 18:14       ` Kirill Tkhai
2018-03-16 18:31         ` Sowmini Varadhan
2018-03-16 18:48           ` Kirill Tkhai
2018-03-16 18:53             ` Sowmini Varadhan
2018-03-17 14:15         ` Sowmini Varadhan [this message]
2018-03-17 21:13           ` Kirill Tkhai
2018-03-17 21:26           ` [rds-devel] " Sowmini Varadhan
2018-03-17 21:55             ` Kirill Tkhai
2018-03-18 20:45               ` Sowmini Varadhan
2018-03-19 10:08                 ` Kirill Tkhai
2018-03-20 11:37                 ` Håkon Bugge
2018-03-20 13:29                   ` Sowmini Varadhan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180317141507.GC873@oracle.com \
    --to=sowmini.varadhan@oracle.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=ktkhai@virtuozzo.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=rds-devel@oss.oracle.com \
    --cc=santosh.shilimkar@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).