* RFC(v2): Audit Kernel Container IDs
From: Richard Guy Briggs @ 2017-10-12 14:14 UTC (permalink / raw)
To: cgroups, Linux Containers, Linux API, Linux Audit, Linux FS Devel,
Linux Kernel, Linux Network Development
Cc: Simo Sorce, Carlos O'Donell, Aristeu Rozanski, David Howells,
Eric W. Biederman, Eric Paris, jlayton, Andy Lutomirski, mszeredi,
Paul Moore, Serge E. Hallyn, Steve Grubb, trondmy, Al Viro
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration. At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxilliary
record AUDIT_CONTAINER with a qualifying op=$action field.
Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.
Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.
Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all threads and children to the new containerID.
Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.
Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.
When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount dropping to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.
A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.
(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children into same container
- RGB
^ permalink raw reply
* Re: [RFC 1/3] devlink: Add config parameter get/set operations
From: Jiri Pirko @ 2017-10-12 14:15 UTC (permalink / raw)
To: Steve Lin; +Cc: netdev, jiri, davem, michael.chan, linux-pci, linville, gospo
In-Reply-To: <1507815262-33294-2-git-send-email-steven.lin1@broadcom.com>
Thu, Oct 12, 2017 at 03:34:20PM CEST, steven.lin1@broadcom.com wrote:
>+
>+ /* When config doesn't take effect until next reboot (config
>+ * just changed NVM which isn't read until boot, for example),
>+ * this attribute should be set by the driver.
>+ */
>+ DEVLINK_ATTR_RESTART_REQUIRED, /* u8 */
>+
I think it would be nice to return this information as a part of retply
to set message - extack
Also, we need to expose to the user the original value (currently being
used) and the new one (to be used after driver re-instatiation)
^ permalink raw reply
* Re: [PATCH net-next v2 1/1] bridge: return error code when deleting Vlan
From: David Ahern @ 2017-10-12 14:19 UTC (permalink / raw)
To: Roman Mashak, davem; +Cc: stephen, netdev
In-Reply-To: <1507816314-2896-1-git-send-email-mrv@mojatatu.com>
On 10/12/17 7:51 AM, Roman Mashak wrote:
> v2:
> Return err immediately if nbp_vlan_delete() fails (pointed by David Ahern)
>
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ---
> net/bridge/br_netlink.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index f0e8268..1efdd48 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -527,11 +527,13 @@ static int br_vlan_info(struct net_bridge *br, struct net_bridge_port *p,
>
> case RTM_DELLINK:
> if (p) {
> - nbp_vlan_delete(p, vinfo->vid);
> + err = nbp_vlan_delete(p, vinfo->vid);
> + if (err)
> + break;
I'm not sure a break is the right thing to do. Seems like you leave it
in a half configured state.
> if (vinfo->flags & BRIDGE_VLAN_INFO_MASTER)
> - br_vlan_delete(p->br, vinfo->vid);
> + err = br_vlan_delete(p->br, vinfo->vid);
> } else {
> - br_vlan_delete(br, vinfo->vid);
> + err = br_vlan_delete(br, vinfo->vid);
> }
> break;
> }
>
Why do you want to return the error code here? Walking the code paths
seems like ENOENT or err from switchdev_port_obj_del are the 2 error
possibilities.
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Roopa Prabhu @ 2017-10-12 14:35 UTC (permalink / raw)
To: Steve Lin
Cc: netdev@vger.kernel.org, Jiri Pirko, davem@davemloft.net,
michael.chan, linux-pci, John W. Linville, gospo
In-Reply-To: <1507815262-33294-1-git-send-email-steven.lin1@broadcom.com>
On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
> Adds a devlink command for getting & setting device configuration
> parameters, and enumerates a bunch of those parameters as devlink
> attributes. Also introduces an attribute that can be set by a
> driver to indicate that the config change doesn't take effect
> until the next restart (as in the case of the bnxt driver changes
> in this patchset, for which all the configuration changes affect NVM
> only, and aren't loaded until the next restart.)
>
> bnxt driver patches make use of these new devlink cmds/attributes.
>
> Steve Lin (3):
> devlink: Add config parameter get/set operations
> bnxt: Move generic devlink code to new file
> bnxt: Add devlink support for config get/set
>
Is the goal here to move all ethtool operations to devlink (I saw some
attrs related to speed etc). ?.
We do need to move ethtool attrs to netlink and devlink is a good
place (and of-course leave the current ethtool api around for backward
compatibility).
^ permalink raw reply
* Re: [RFC 1/3] devlink: Add config parameter get/set operations
From: Steve Lin @ 2017-10-12 14:37 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, jiri, David S . Miller, Michael Chan, linux-pci,
John Linville, Andy Gospodarek
In-Reply-To: <20171012140317.GC14672@nanopsycho>
Jiri,
Thanks for your feedback below and in your other response. I will
make some changes to address those issues and resubmit.
Thanks again!
Steve
On Thu, Oct 12, 2017 at 10:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Oct 12, 2017 at 03:34:20PM CEST, steven.lin1@broadcom.com wrote:
>>Add support for config parameter get/set commands. Initially used by
>>bnxt driver, but other drivers can use the same, or new, attributes.
>>The config_get() and config_set() operations operate as expected, but
>>note that the driver implementation of the config_set() operation can
>>indicate whether a restart is necessary for the setting to take
>>effect.
>>
>
> First of all, I like this approach.
>
> I would like to see this patch split into:
> 1) config-options infrastructure introduction
> 2) specific config options introductions - would be best to have it
> per-option. We need to make sure every option is very well described
> and explained usecases. This is needed in order vendors to share
> attributes among drivers.
>
> More nits inlined.
>
>
>>Signed-off-by: Steve Lin <steven.lin1@broadcom.com>
>>Acked-by: Andy Gospodarek <gospo@broadcom.com>
>>---
>> include/net/devlink.h | 4 +
>> include/uapi/linux/devlink.h | 108 ++++++++++++++++++++++
>> net/core/devlink.c | 207 +++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 319 insertions(+)
>>
>>diff --git a/include/net/devlink.h b/include/net/devlink.h
>>index b9654e1..952966c 100644
>>--- a/include/net/devlink.h
>>+++ b/include/net/devlink.h
>>@@ -270,6 +270,10 @@ struct devlink_ops {
>> int (*eswitch_inline_mode_set)(struct devlink *devlink, u8 inline_mode);
>> int (*eswitch_encap_mode_get)(struct devlink *devlink, u8 *p_encap_mode);
>> int (*eswitch_encap_mode_set)(struct devlink *devlink, u8 encap_mode);
>>+ int (*config_get)(struct devlink *devlink, enum devlink_attr attr,
>>+ u32 *value);
>>+ int (*config_set)(struct devlink *devlink, enum devlink_attr attr,
>>+ u32 value, u8 *restart_reqd);
>> };
>>
>> static inline void *devlink_priv(struct devlink *devlink)
>>diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
>>index 0cbca96..e959716 100644
>>--- a/include/uapi/linux/devlink.h
>>+++ b/include/uapi/linux/devlink.h
>>@@ -70,6 +70,9 @@ enum devlink_command {
>> DEVLINK_CMD_DPIPE_HEADERS_GET,
>> DEVLINK_CMD_DPIPE_TABLE_COUNTERS_SET,
>>
>>+ DEVLINK_CMD_CONFIG_GET,
>>+ DEVLINK_CMD_CONFIG_SET,
>>+
>> /* add new commands above here */
>> __DEVLINK_CMD_MAX,
>> DEVLINK_CMD_MAX = __DEVLINK_CMD_MAX - 1
>>@@ -124,6 +127,68 @@ enum devlink_eswitch_encap_mode {
>> DEVLINK_ESWITCH_ENCAP_MODE_BASIC,
>> };
>>
>>+enum devlink_dcbx_mode {
>>+ DEVLINK_DCBX_MODE_DISABLED,
>>+ DEVLINK_DCBX_MODE_IEEE,
>>+ DEVLINK_DCBX_MODE_CEE,
>>+ DEVLINK_DCBX_MODE_IEEE_CEE,
>>+};
>>+
>>+enum devlink_multifunc_mode {
>>+ DEVLINK_MULTIFUNC_MODE_ALLOWED, /* Ext switch activates MF */
>>+ DEVLINK_MULTIFUNC_MODE_FORCE_SINGFUNC,
>>+ DEVLINK_MULTIFUNC_MODE_NPAR10, /* NPAR 1.0 */
>>+ DEVLINK_MULTIFUNC_MODE_NPAR15, /* NPAR 1.5 */
>>+ DEVLINK_MULTIFUNC_MODE_NPAR20, /* NPAR 2.0 */
>>+};
>>+
>>+enum devlink_autoneg_protocol {
>>+ DEVLINK_AUTONEG_PROTOCOL_IEEE8023BY_BAM,
>>+ DEVLINK_AUTONEG_PROTOCOL_IEEE8023BY_CONSORTIUM,
>>+ DEVLINK_AUTONEG_PROTOCOL_IEEE8023BY,
>>+ DEVLINK_AUTONEG_PROTOCOL_BAM, /* Broadcom Autoneg Mode */
>>+ DEVLINK_AUTONEG_PROTOCOL_CONSORTIUM, /* Consortium Autoneg Mode */
>>+};
>>+
>>+enum devlink_pre_os_link_speed {
>>+ DEVLINK_PRE_OS_LINK_SPEED_AUTONEG,
>>+ DEVLINK_PRE_OS_LINK_SPEED_1G,
>>+ DEVLINK_PRE_OS_LINK_SPEED_10G,
>>+ DEVLINK_PRE_OS_LINK_SPEED_25G,
>>+ DEVLINK_PRE_OS_LINK_SPEED_40G,
>>+ DEVLINK_PRE_OS_LINK_SPEED_50G,
>>+ DEVLINK_PRE_OS_LINK_SPEED_100G,
>>+ DEVLINK_PRE_OS_LINK_SPEED_5G = 0xe,
>>+ DEVLINK_PRE_OS_LINK_SPEED_100M = 0xf,
>>+};
>>+
>>+enum devlink_mba_boot_type {
>>+ DEVLINK_MBA_BOOT_TYPE_AUTO_DETECT,
>>+ DEVLINK_MBA_BOOT_TYPE_BBS, /* BIOS Boot Specification */
>>+ DEVLINK_MBA_BOOT_TYPE_INTR18, /* Hook interrupt 0x18 */
>>+ DEVLINK_MBA_BOOT_TYPE_INTR19, /* Hook interrupt 0x19 */
>>+};
>>+
>>+enum devlink_mba_setup_hot_key {
>>+ DEVLINK_MBA_SETUP_HOT_KEY_CTRL_S,
>>+ DEVLINK_MBA_SETUP_HOT_KEY_CTRL_B,
>>+};
>>+
>>+enum devlink_mba_boot_protocol {
>>+ DEVLINK_MBA_BOOT_PROTOCOL_PXE,
>>+ DEVLINK_MBA_BOOT_PROTOCOL_ISCSI,
>>+ DEVLINK_MBA_BOOT_PROTOCOL_NONE = 0x7,
>>+};
>>+
>>+enum devlink_mba_link_speed {
>>+ DEVLINK_MBA_LINK_SPEED_AUTONEG,
>>+ DEVLINK_MBA_LINK_SPEED_1G,
>>+ DEVLINK_MBA_LINK_SPEED_10G,
>>+ DEVLINK_MBA_LINK_SPEED_25G,
>>+ DEVLINK_MBA_LINK_SPEED_40G,
>>+ DEVLINK_MBA_LINK_SPEED_50G,
>>+};
>>+
>> enum devlink_attr {
>> /* don't change the order or add anything between, this is ABI! */
>> DEVLINK_ATTR_UNSPEC,
>>@@ -202,6 +267,49 @@ enum devlink_attr {
>>
>> DEVLINK_ATTR_ESWITCH_ENCAP_MODE, /* u8 */
>>
>>+ /* Configuration Parameters */
>>+ DEVLINK_ATTR_SRIOV_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_NUM_VF_PER_PF, /* u32 */
>>+ DEVLINK_ATTR_MAX_NUM_PF_MSIX_VECT, /* u32 */
>>+ DEVLINK_ATTR_MSIX_VECTORS_PER_VF, /* u32 */
>>+ DEVLINK_ATTR_NPAR_NUM_PARTITIONS_PER_PORT, /* u32 */
>>+ DEVLINK_ATTR_NPAR_BW_IN_PERCENT, /* u8 */
>>+ DEVLINK_ATTR_NPAR_BW_RESERVATION, /* u8 */
>>+ DEVLINK_ATTR_NPAR_BW_RESERVATION_VALID, /* u8 */
>>+ DEVLINK_ATTR_NPAR_BW_LIMIT, /* u8 */
>>+ DEVLINK_ATTR_NPAR_BW_LIMIT_VALID, /* u8 */
>>+ DEVLINK_ATTR_DCBX_MODE, /* u8 */
>>+ DEVLINK_ATTR_RDMA_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_MULTIFUNC_MODE, /* u8 */
>>+ DEVLINK_ATTR_SECURE_NIC_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_IGNORE_ARI_CAPABILITY, /* u8 */
>>+ DEVLINK_ATTR_LLDP_NEAREST_BRIDGE_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_LLDP_NEAREST_NONTPMR_BRIDGE_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_PME_CAPABILITY_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_MAGIC_PACKET_WOL_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_EEE_PWR_SAVE_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_AUTONEG_PROTOCOL, /* u8 */
>>+ DEVLINK_ATTR_MEDIA_AUTO_DETECT, /* u8 */
>>+ DEVLINK_ATTR_PHY_SELECT, /* u8 */
>>+ DEVLINK_ATTR_PRE_OS_LINK_SPEED_D0, /* u8 */
>>+ DEVLINK_ATTR_PRE_OS_LINK_SPEED_D3, /* u8 */
>>+ DEVLINK_ATTR_MBA_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_MBA_BOOT_TYPE, /* u8 */
>>+ DEVLINK_ATTR_MBA_DELAY_TIME, /* u32 */
>>+ DEVLINK_ATTR_MBA_SETUP_HOT_KEY, /* u8 */
>>+ DEVLINK_ATTR_MBA_HIDE_SETUP_PROMPT, /* u8 */
>>+ DEVLINK_ATTR_MBA_BOOT_RETRY_COUNT, /* u32 */
>>+ DEVLINK_ATTR_MBA_VLAN_ENABLED, /* u8 */
>>+ DEVLINK_ATTR_MBA_VLAN_TAG, /* u16 */
>>+ DEVLINK_ATTR_MBA_BOOT_PROTOCOL, /* u8 */
>>+ DEVLINK_ATTR_MBA_LINK_SPEED, /* u8 */
>
> Okay, I think it is about the time we should start thinking about
> putting this new config attributes under nester attribute. What do you
> think?
>
>
>>+
>>+ /* When config doesn't take effect until next reboot (config
>>+ * just changed NVM which isn't read until boot, for example),
>>+ * this attribute should be set by the driver.
>>+ */
>>+ DEVLINK_ATTR_RESTART_REQUIRED, /* u8 */
>>+
>> /* add new attributes above here, update the policy in devlink.c */
>>
>> __DEVLINK_ATTR_MAX,
>>diff --git a/net/core/devlink.c b/net/core/devlink.c
>>index 7d430c1..701c84b 100644
>>--- a/net/core/devlink.c
>>+++ b/net/core/devlink.c
>>@@ -1566,6 +1566,164 @@ static int devlink_nl_cmd_eswitch_set_doit(struct sk_buff *skb,
>> return 0;
>> }
>>
>>+static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1];
>>+
>>+static int devlink_nl_config_fill(struct sk_buff *msg,
>>+ struct devlink *devlink,
>>+ enum devlink_command cmd,
>>+ struct genl_info *info)
>>+{
>>+ const struct devlink_ops *ops = devlink->ops;
>>+ void *hdr;
>>+ int err;
>>+ enum devlink_attr attr = -1;
>
> -1 is not a valid enum value. Better to just have bool to indicate attr
> was found or not.
>
>
>>+ u32 value;
>>+ int i;
>>+ struct nla_policy policy;
>>+ u8 restart_reqd;
>>+
>>+ if (!ops->config_get)
>>+ return -EOPNOTSUPP;
>>+
>>+ hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq,
>>+ &devlink_nl_family, 0, cmd);
>>+ if (!hdr) {
>>+ err = -EMSGSIZE;
>>+ goto nla_msg_failure;
>>+ }
>>+
>>+ err = devlink_nl_put_handle(msg, devlink);
>>+ if (err)
>>+ goto nla_put_failure;
>>+
>>+ for (i = 0; i < DEVLINK_ATTR_MAX; i++)
>>+ if (info->attrs[i])
>>+ attr = i;
>>+
>>+ if (attr == -1) {
>>+ /* Not found - invalid/unknown attribute? */
>>+ err = -EINVAL;
>>+ goto nla_put_failure;
>>+ }
>>+
>>+ policy = devlink_nl_policy[attr];
>>+
>>+ if (cmd == DEVLINK_CMD_CONFIG_GET) {
>>+ err = ops->config_get(devlink, attr, &value);
>>+
>>+ if (err)
>>+ goto nla_put_failure;
>>+
>>+ switch (policy.type) {
>>+ case NLA_U8:
>>+ err = nla_put_u8(msg, attr, value);
>>+ break;
>>+ case NLA_U16:
>>+ err = nla_put_u16(msg, attr, value);
>>+ break;
>>+ case NLA_U32:
>>+ err = nla_put_u32(msg, attr, value);
>>+ break;
>>+ default:
>>+ goto nla_put_failure;
>>+ }
>>+
>>+ if (err)
>>+ goto nla_put_failure;
>>+ } else {
>>+ /* Must be config_set command */
>>+ u8 *val8;
>>+ u16 *val16;
>>+ u32 *val32;
>>+
>>+ switch (policy.type) {
>>+ case NLA_U8:
>>+ val8 = nla_data(info->attrs[attr]);
>>+ value = *val8;
>>+ break;
>>+ case NLA_U16:
>>+ val16 = nla_data(info->attrs[attr]);
>>+ value = *val16;
>>+ break;
>>+ case NLA_U32:
>>+ val32 = nla_data(info->attrs[attr]);
>>+ value = *val32;
>>+ break;
>>+ default:
>>+ goto nla_put_failure;
>>+ }
>>+
>>+ err = ops->config_set(devlink, attr, value, &restart_reqd);
>>+ if (err)
>>+ goto nla_put_failure;
>>+
>>+ if (restart_reqd) {
>>+ err = nla_put_u8(msg, DEVLINK_ATTR_RESTART_REQUIRED,
>>+ restart_reqd);
>>+ if (err)
>>+ goto nla_put_failure;
>>+ }
>>+ }
>>+
>>+ genlmsg_end(msg, hdr);
>>+ return 0;
>>+
>>+nla_put_failure:
>>+ genlmsg_cancel(msg, hdr);
>>+nla_msg_failure:
>>+ return err;
>>+}
>>+
>>+static int devlink_nl_cmd_config_get_doit(struct sk_buff *skb,
>>+ struct genl_info *info)
>>+{
>>+ struct devlink *devlink = info->user_ptr[0];
>>+ struct sk_buff *msg;
>>+ int err;
>>+
>>+ if (!devlink->ops)
>>+ return -EOPNOTSUPP;
>>+
>>+ msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
>>+ if (!msg)
>>+ return -ENOMEM;
>>+
>>+ err = devlink_nl_config_fill(msg, devlink, DEVLINK_CMD_CONFIG_GET,
>>+ info);
>>+
>>+ if (err) {
>>+ nlmsg_free(msg);
>>+ return err;
>>+ }
>>+
>>+ return genlmsg_reply(msg, info);
>>+}
>>+
>>+static int devlink_nl_cmd_config_set_doit(struct sk_buff *skb,
>>+ struct genl_info *info)
>>+{
>>+ struct devlink *devlink = info->user_ptr[0];
>>+ struct sk_buff *msg;
>>+ int err;
>>+
>>+ if (!devlink->ops)
>>+ return -EOPNOTSUPP;
>>+
>>+ msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
>>+ if (!msg)
>>+ return -ENOMEM;
>>+
>>+ err = devlink_nl_config_fill(msg, devlink, DEVLINK_CMD_CONFIG_SET,
>>+ info);
>
> This is odd. You are using "fill" function for "set". It should be used
> for "get". But for set, it really sounds odd. Please split the
> devlink_nl_config_fill function into get and set parts.
>
> Also, get should dump all available options.
>
>
>>+
>>+ if (err) {
>>+ nlmsg_free(msg);
>>+ return err;
>>+ }
>>+
>>+ return genlmsg_reply(msg, info);
>>+}
>>+
>> int devlink_dpipe_match_put(struct sk_buff *skb,
>> struct devlink_dpipe_match *match)
>> {
>>@@ -2291,6 +2449,41 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
>> [DEVLINK_ATTR_ESWITCH_ENCAP_MODE] = { .type = NLA_U8 },
>> [DEVLINK_ATTR_DPIPE_TABLE_NAME] = { .type = NLA_NUL_STRING },
>> [DEVLINK_ATTR_DPIPE_TABLE_COUNTERS_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_SRIOV_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_NUM_VF_PER_PF] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MAX_NUM_PF_MSIX_VECT] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MSIX_VECTORS_PER_VF] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_NPAR_NUM_PARTITIONS_PER_PORT] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_NPAR_BW_IN_PERCENT] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_NPAR_BW_RESERVATION] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_NPAR_BW_RESERVATION_VALID] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_NPAR_BW_LIMIT] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_NPAR_BW_LIMIT_VALID] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_DCBX_MODE] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_RDMA_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_MULTIFUNC_MODE] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_SECURE_NIC_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_IGNORE_ARI_CAPABILITY] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_LLDP_NEAREST_BRIDGE_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_LLDP_NEAREST_NONTPMR_BRIDGE_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_PME_CAPABILITY_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_MAGIC_PACKET_WOL_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_EEE_PWR_SAVE_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_AUTONEG_PROTOCOL] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MEDIA_AUTO_DETECT] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_PHY_SELECT] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_PRE_OS_LINK_SPEED_D0] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_PRE_OS_LINK_SPEED_D3] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MBA_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_MBA_BOOT_TYPE] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MBA_DELAY_TIME] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MBA_SETUP_HOT_KEY] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MBA_HIDE_SETUP_PROMPT] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_MBA_BOOT_RETRY_COUNT] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MBA_VLAN_ENABLED] = { .type = NLA_U8 },
>>+ [DEVLINK_ATTR_MBA_VLAN_TAG] = { .type = NLA_U16 },
>>+ [DEVLINK_ATTR_MBA_BOOT_PROTOCOL] = { .type = NLA_U32 },
>>+ [DEVLINK_ATTR_MBA_LINK_SPEED] = { .type = NLA_U32 },
>> };
>>
>> static const struct genl_ops devlink_nl_ops[] = {
>>@@ -2451,6 +2644,20 @@ static const struct genl_ops devlink_nl_ops[] = {
>> .flags = GENL_ADMIN_PERM,
>> .internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
>> },
>>+ {
>>+ .cmd = DEVLINK_CMD_CONFIG_GET,
>>+ .doit = devlink_nl_cmd_config_get_doit,
>>+ .policy = devlink_nl_policy,
>>+ .flags = GENL_ADMIN_PERM,
>>+ .internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
>>+ },
>>+ {
>>+ .cmd = DEVLINK_CMD_CONFIG_SET,
>>+ .doit = devlink_nl_cmd_config_set_doit,
>>+ .policy = devlink_nl_policy,
>>+ .flags = GENL_ADMIN_PERM,
>>+ .internal_flags = DEVLINK_NL_FLAG_NEED_DEVLINK,
>>+ },
>> };
>>
>> static struct genl_family devlink_nl_family __ro_after_init = {
>>--
>>2.7.4
>>
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Jiri Pirko @ 2017-10-12 14:40 UTC (permalink / raw)
To: Roopa Prabhu
Cc: Steve Lin, netdev@vger.kernel.org, Jiri Pirko,
davem@davemloft.net, michael.chan, linux-pci, John W. Linville,
gospo
In-Reply-To: <CAJieiUjuKtmYDZ5-7UjYkWFJuMGu2gajhziLQKAZLTz6g63f2A@mail.gmail.com>
Thu, Oct 12, 2017 at 04:35:10PM CEST, roopa@cumulusnetworks.com wrote:
>On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
>> Adds a devlink command for getting & setting device configuration
>> parameters, and enumerates a bunch of those parameters as devlink
>> attributes. Also introduces an attribute that can be set by a
>> driver to indicate that the config change doesn't take effect
>> until the next restart (as in the case of the bnxt driver changes
>> in this patchset, for which all the configuration changes affect NVM
>> only, and aren't loaded until the next restart.)
>>
>> bnxt driver patches make use of these new devlink cmds/attributes.
>>
>> Steve Lin (3):
>> devlink: Add config parameter get/set operations
>> bnxt: Move generic devlink code to new file
>> bnxt: Add devlink support for config get/set
>>
>
>Is the goal here to move all ethtool operations to devlink (I saw some
>attrs related to speed etc). ?.
>We do need to move ethtool attrs to netlink and devlink is a good
>place (and of-course leave the current ethtool api around for backward
>compatibility).
We need to make sure we are not moving things to devlink which don't
belong there. All options that use "netdev" as a handle should go into
rtnetlink instead.
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Steve Lin @ 2017-10-12 14:45 UTC (permalink / raw)
To: Roopa Prabhu
Cc: netdev@vger.kernel.org, Jiri Pirko, davem@davemloft.net,
Michael Chan, linux-pci, John W. Linville, Andy Gospodarek
In-Reply-To: <CAJieiUjuKtmYDZ5-7UjYkWFJuMGu2gajhziLQKAZLTz6g63f2A@mail.gmail.com>
Hi Roopa,
The attributes added in this patchset are not really the same type as
ethtool - these are more device configuration type attributes. The
speeds you saw, for example, affect the pre-OS [i.e. PXE boot time]
configuration for a port, and aren't run-time speed changes on a given
netdev like ethtool configures. As Jiri mentioned, I will add some
comments to better describe each of the attributes.
So I don't think there's much duplication here with ethtool.
That said, there also shouldn't be anything in the patchset that would
preclude some future migration of ethtool settings to using devlink or
rtnetlink API.
Steve
On Thu, Oct 12, 2017 at 10:35 AM, Roopa Prabhu
<roopa@cumulusnetworks.com> wrote:
> On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
>> Adds a devlink command for getting & setting device configuration
>> parameters, and enumerates a bunch of those parameters as devlink
>> attributes. Also introduces an attribute that can be set by a
>> driver to indicate that the config change doesn't take effect
>> until the next restart (as in the case of the bnxt driver changes
>> in this patchset, for which all the configuration changes affect NVM
>> only, and aren't loaded until the next restart.)
>>
>> bnxt driver patches make use of these new devlink cmds/attributes.
>>
>> Steve Lin (3):
>> devlink: Add config parameter get/set operations
>> bnxt: Move generic devlink code to new file
>> bnxt: Add devlink support for config get/set
>>
>
> Is the goal here to move all ethtool operations to devlink (I saw some
> attrs related to speed etc). ?.
> We do need to move ethtool attrs to netlink and devlink is a good
> place (and of-course leave the current ethtool api around for backward
> compatibility).
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Roopa Prabhu @ 2017-10-12 14:46 UTC (permalink / raw)
To: Jiri Pirko
Cc: Steve Lin, netdev@vger.kernel.org, Jiri Pirko,
davem@davemloft.net, michael.chan, linux-pci, John W. Linville,
gospo
In-Reply-To: <20171012144032.GG14672@nanopsycho>
On Thu, Oct 12, 2017 at 7:40 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Oct 12, 2017 at 04:35:10PM CEST, roopa@cumulusnetworks.com wrote:
>>On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
>>> Adds a devlink command for getting & setting device configuration
>>> parameters, and enumerates a bunch of those parameters as devlink
>>> attributes. Also introduces an attribute that can be set by a
>>> driver to indicate that the config change doesn't take effect
>>> until the next restart (as in the case of the bnxt driver changes
>>> in this patchset, for which all the configuration changes affect NVM
>>> only, and aren't loaded until the next restart.)
>>>
>>> bnxt driver patches make use of these new devlink cmds/attributes.
>>>
>>> Steve Lin (3):
>>> devlink: Add config parameter get/set operations
>>> bnxt: Move generic devlink code to new file
>>> bnxt: Add devlink support for config get/set
>>>
>>
>>Is the goal here to move all ethtool operations to devlink (I saw some
>>attrs related to speed etc). ?.
>>We do need to move ethtool attrs to netlink and devlink is a good
>>place (and of-course leave the current ethtool api around for backward
>>compatibility).
>
> We need to make sure we are not moving things to devlink which don't
> belong there. All options that use "netdev" as a handle should go into
> rtnetlink instead.
>
Any reason you want to keep that restriction ?.
FWIS, devlink is a driver api just like ethtool is.
and ethtool needs to move to netlink soon...and It would be better to
not put the rtnl_lock burden on ethtool driver operations. Instead of
adding yet another driver api, extending devlink seems like a great
fit to me.
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Roopa Prabhu @ 2017-10-12 14:51 UTC (permalink / raw)
To: Steve Lin
Cc: netdev@vger.kernel.org, Jiri Pirko, davem@davemloft.net,
Michael Chan, linux-pci, John W. Linville, Andy Gospodarek
In-Reply-To: <CA+Jmh7GOo4K8gpyhQ8kceFtpNuASp89WYi0NR8AaoXncTCAksQ@mail.gmail.com>
On Thu, Oct 12, 2017 at 7:45 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
> Hi Roopa,
>
> The attributes added in this patchset are not really the same type as
> ethtool - these are more device configuration type attributes. The
> speeds you saw, for example, affect the pre-OS [i.e. PXE boot time]
> configuration for a port, and aren't run-time speed changes on a given
> netdev like ethtool configures. As Jiri mentioned, I will add some
> comments to better describe each of the attributes.
>
> So I don't think there's much duplication here with ethtool.
>
> That said, there also shouldn't be anything in the patchset that would
> preclude some future migration of ethtool settings to using devlink or
> rtnetlink API.
>
ok, ack, thanks for the clarification. Just trying to find a best
netlink place for future ethtool migration to netlink.
^ permalink raw reply
* Re: [RFC] Support for UNARP (RFC 1868)
From: Eric Dumazet @ 2017-10-12 14:57 UTC (permalink / raw)
To: Girish Moodalbail; +Cc: netdev, davem, kuznet
In-Reply-To: <1507782977-2443-1-git-send-email-girish.moodalbail@oracle.com>
On Wed, 2017-10-11 at 21:36 -0700, Girish Moodalbail wrote:
> Add support for UNARP, as detailed in the IETF RFC 1868 (ARP Extension -
> UNARP). The central idea here is for a node to announce that it is
> leaving the network and that all the nodes on the L2 broadcast domain to
> update their ARP tables accordingly (i.e., mark the neighbor entry state
> to FAILED). Even though the ARP timers on nodes would eventually mark
> such entries as FAILED it will be more robust if those entries gets
> marked FAILED sooner with the help from the host that is going away.
>
> Besides providing a solution for an usecase, as captured in RFC, of an
> IP address moving across a proxy server, this feature is even more
> important for certain use cases in the Cloud. Imagine a tenant who is
> bringing up and down VM instances for some workload of theirs. If these
> instances are part of a small subnet, then the new VM instances may be
> assigned the same IP address (since the subnet pool is small) but with a
> different MAC address. So, if there is a client which has a stale
> mapping of the IP address to the old MAC address, then that client will
> fail to communicate with the new VM instance for some time.
>
> Another usecase that comes to mind is that of the Live VM
> Migration. Imagine a client that is communicating with a VM. Now, let us
> migrate this VM to a destination machine. The IP address to MAC address
> mapping for a VM doesn't change after the Live Migration. However, there
> will be a small amount of time (till the VM sends gratuitous ARP from
> the destination machine) during which packets from a client will be
> forwarded to the source machine. This occurs because:
>
> - the ARP entry in the client is not invalidated yet and it continues
> to use the same MAC address and
>
> - the MAC address table of all of the intermediate switches between the
> client and the source machine are not updated yet for the MAC address
> move.
>
> This issue of forwarding the packets to wrong target could be avoided by
> sending UNARP packets from the source machine. This would invalidate the
> ARP entry on the client and forces it to resolve the IP address again by
> broadcasting an ARP request to the network. The VM on the destination
> machine would then respond back with an ARP response. The ARP response
> back from the VM should also clean up the MAC address table of the
> intermediate switches.
>
> The following changes implements the UNARP receive processing in the
> kernel. Once the changes are in the kernel, arping(8) program can be
> updated to send UNARP packets.
>
> Any Thoughts/Comments?
>
> Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
> ---
Hi Girish
Your description (or patch title) is misleading. You apparently
implement the receive side of the RFC.
And the RFC had Proxy ARP in mind.
What about security implications ? Will TCP flows be terminated, instead
of being smoothly migrated (TCP_REPAIR)
What about IPv6 ? Or maybe more abruptly, do we still need to add
features to IPv4 in 2017, 22 years after this RFC came ? ;)
Thanks.
^ permalink raw reply
* Re: [PATCH] netconsole: make config_item_type const
From: Bhumika Goyal @ 2017-10-12 15:04 UTC (permalink / raw)
To: Julia Lawall, David Miller, akpm, nab, Linux Kernel Mailing List,
netdev
Cc: Bhumika Goyal
In-Reply-To: <1507811352-9962-1-git-send-email-bhumirks@gmail.com>
On Thu, Oct 12, 2017 at 2:29 PM, Bhumika Goyal <bhumirks@gmail.com> wrote:
> This is a followup patch for: https://lkml.org/lkml/2017/10/11/375 and
> https://patchwork.kernel.org/patch/9999649/
>
> Make these structures const as they are either passed to the functions
> having the argument as const or stored as a reference in the "ci_type"
> const field of a config_item structure.
>
> Done using Coccienlle.
>
Actually, this patch is dependent on the patches in the links
https://lkml.org/lkml/2017/10/11/375 and
https://patchwork.kernel.org/patch/9999649/. Therefore, this patch
won't be correct unless the patches in these links gets applied.
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
> ---
> drivers/net/netconsole.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
> index 0e27920..be9aa36 100644
> --- a/drivers/net/netconsole.c
> +++ b/drivers/net/netconsole.c
> @@ -616,7 +616,7 @@ static void netconsole_target_release(struct config_item *item)
> .release = netconsole_target_release,
> };
>
> -static struct config_item_type netconsole_target_type = {
> +static const struct config_item_type netconsole_target_type = {
> .ct_attrs = netconsole_target_attrs,
> .ct_item_ops = &netconsole_target_item_ops,
> .ct_owner = THIS_MODULE,
> @@ -682,7 +682,7 @@ static void drop_netconsole_target(struct config_group *group,
> .drop_item = drop_netconsole_target,
> };
>
> -static struct config_item_type netconsole_subsys_type = {
> +static const struct config_item_type netconsole_subsys_type = {
> .ct_group_ops = &netconsole_subsys_group_ops,
> .ct_owner = THIS_MODULE,
> };
> --
> 1.9.1
>
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Jiri Pirko @ 2017-10-12 15:04 UTC (permalink / raw)
To: Roopa Prabhu
Cc: Steve Lin, netdev@vger.kernel.org, Jiri Pirko,
davem@davemloft.net, michael.chan, linux-pci, John W. Linville,
gospo
In-Reply-To: <CAJieiUiFmBr72yYP9VNtm0VBTNjJcA6Lj5nHJNBjDBC4moxE2A@mail.gmail.com>
Thu, Oct 12, 2017 at 04:46:24PM CEST, roopa@cumulusnetworks.com wrote:
>On Thu, Oct 12, 2017 at 7:40 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, Oct 12, 2017 at 04:35:10PM CEST, roopa@cumulusnetworks.com wrote:
>>>On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
>>>> Adds a devlink command for getting & setting device configuration
>>>> parameters, and enumerates a bunch of those parameters as devlink
>>>> attributes. Also introduces an attribute that can be set by a
>>>> driver to indicate that the config change doesn't take effect
>>>> until the next restart (as in the case of the bnxt driver changes
>>>> in this patchset, for which all the configuration changes affect NVM
>>>> only, and aren't loaded until the next restart.)
>>>>
>>>> bnxt driver patches make use of these new devlink cmds/attributes.
>>>>
>>>> Steve Lin (3):
>>>> devlink: Add config parameter get/set operations
>>>> bnxt: Move generic devlink code to new file
>>>> bnxt: Add devlink support for config get/set
>>>>
>>>
>>>Is the goal here to move all ethtool operations to devlink (I saw some
>>>attrs related to speed etc). ?.
>>>We do need to move ethtool attrs to netlink and devlink is a good
>>>place (and of-course leave the current ethtool api around for backward
>>>compatibility).
>>
>> We need to make sure we are not moving things to devlink which don't
>> belong there. All options that use "netdev" as a handle should go into
>> rtnetlink instead.
>>
>
>Any reason you want to keep that restriction ?.
>FWIS, devlink is a driver api just like ethtool is.
>and ethtool needs to move to netlink soon...and It would be better to
>not put the rtnl_lock burden on ethtool driver operations. Instead of
>adding yet another driver api, extending devlink seems like a great
>fit to me.
Hmm, the original purpose of devlink was to obtain iface for things that
could not use "netdev" as a handle. I try to stick with it as we already
have iface for things that could use "netdev" as a handle - rtnetlink.
Not sure we want to go this way and add "netdev"-handle things into
devlink. Thoughts?
^ permalink raw reply
* [PATCH net-next 0/2] net: stmmac: Improvements for multi-queuing and for AVB
From: Jose Abreu @ 2017-10-12 15:14 UTC (permalink / raw)
To: netdev
Cc: Jose Abreu, David S. Miller, Joao Pinto, Giuseppe Cavallaro,
Alexandre Torgue
Hi,
Two improvements for stmmac: First one corrects the available fifo size per queue, second one corrects enabling of AVB queues. More info in commit log.
Best regards,
Jose Miguel Abreu
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Jose Abreu (2):
net: stmmac: Use correct values in TQS/RQS fields
net: stmmac: Disable flow ctrl for RX AVB queues and really enable TX
AVB queues
drivers/net/ethernet/stmicro/stmmac/common.h | 5 +--
drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 2 ++
drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 27 ++++++++++------
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 39 +++++++++++++++++++----
4 files changed, 56 insertions(+), 17 deletions(-)
--
1.9.1
^ permalink raw reply
* [PATCH net-next 1/2] net: stmmac: Use correct values in TQS/RQS fields
From: Jose Abreu @ 2017-10-12 15:14 UTC (permalink / raw)
To: netdev
Cc: Jose Abreu, David S. Miller, Joao Pinto, Giuseppe Cavallaro,
Alexandre Torgue
In-Reply-To: <cover.1507820933.git.joabreu@synopsys.com>
Currently we are using all the available fifo size in RQS and
TQS fields. This will not work correctly in multi-queues IP's
because total fifo size must be splitted to the enabled queues.
Correct this by computing the available fifo size per queue and
setting the right value in TQS and RQS fields.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
drivers/net/ethernet/stmicro/stmmac/common.h | 3 ++-
drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 15 +++++++++------
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 22 ++++++++++++++++++++--
3 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index e82b4b7..c26c8a7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -443,7 +443,8 @@ struct stmmac_dma_ops {
int rxfifosz);
void (*dma_rx_mode)(void __iomem *ioaddr, int mode, u32 channel,
int fifosz);
- void (*dma_tx_mode)(void __iomem *ioaddr, int mode, u32 channel);
+ void (*dma_tx_mode)(void __iomem *ioaddr, int mode, u32 channel,
+ int fifosz);
/* To track extra statistic (if supported) */
void (*dma_diagnostic_fr) (void *data, struct stmmac_extra_stats *x,
void __iomem *ioaddr);
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index e84831e..898849b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -271,9 +271,10 @@ static void dwmac4_dma_rx_chan_op_mode(void __iomem *ioaddr, int mode,
}
static void dwmac4_dma_tx_chan_op_mode(void __iomem *ioaddr, int mode,
- u32 channel)
+ u32 channel, int fifosz)
{
u32 mtl_tx_op = readl(ioaddr + MTL_CHAN_TX_OP_MODE(channel));
+ unsigned int tqs = fifosz / 256 - 1;
if (mode == SF_DMA_MODE) {
pr_debug("GMAC: enable TX store and forward mode\n");
@@ -306,12 +307,14 @@ static void dwmac4_dma_tx_chan_op_mode(void __iomem *ioaddr, int mode,
* For an IP with DWC_EQOS_NUM_TXQ > 1, the fields TXQEN and TQS are R/W
* with reset values: TXQEN off, TQS 256 bytes.
*
- * Write the bits in both cases, since it will have no effect when RO.
- * For DWC_EQOS_NUM_TXQ > 1, the top bits in MTL_OP_MODE_TQS_MASK might
- * be RO, however, writing the whole TQS field will result in a value
- * equal to DWC_EQOS_TXFIFO_SIZE, just like for DWC_EQOS_NUM_TXQ == 1.
+ * TXQEN must be written for multi-channel operation and TQS must
+ * reflect the available fifo size per queue (total fifo size / number
+ * of enabled queues).
*/
- mtl_tx_op |= MTL_OP_MODE_TXQEN | MTL_OP_MODE_TQS_MASK;
+ mtl_tx_op |= MTL_OP_MODE_TXQEN;
+ mtl_tx_op &= ~MTL_OP_MODE_TQS_MASK;
+ mtl_tx_op |= tqs << MTL_OP_MODE_TQS_SHIFT;
+
writel(mtl_tx_op, ioaddr + MTL_CHAN_TX_OP_MODE(channel));
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index f41661a..edf245b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1750,12 +1750,19 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
u32 rx_channels_count = priv->plat->rx_queues_to_use;
u32 tx_channels_count = priv->plat->tx_queues_to_use;
int rxfifosz = priv->plat->rx_fifo_size;
+ int txfifosz = priv->plat->tx_fifo_size;
u32 txmode = 0;
u32 rxmode = 0;
u32 chan = 0;
if (rxfifosz == 0)
rxfifosz = priv->dma_cap.rx_fifo_size;
+ if (txfifosz == 0)
+ txfifosz = priv->dma_cap.tx_fifo_size;
+
+ /* Adjust for real per queue fifo size */
+ rxfifosz /= rx_channels_count;
+ txfifosz /= tx_channels_count;
if (priv->plat->force_thresh_dma_mode) {
txmode = tc;
@@ -1783,7 +1790,8 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
rxfifosz);
for (chan = 0; chan < tx_channels_count; chan++)
- priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan);
+ priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan,
+ txfifosz);
} else {
priv->hw->dma->dma_mode(priv->ioaddr, txmode, rxmode,
rxfifosz);
@@ -1946,15 +1954,25 @@ static void stmmac_tx_err(struct stmmac_priv *priv, u32 chan)
static void stmmac_set_dma_operation_mode(struct stmmac_priv *priv, u32 txmode,
u32 rxmode, u32 chan)
{
+ u32 rx_channels_count = priv->plat->rx_queues_to_use;
+ u32 tx_channels_count = priv->plat->tx_queues_to_use;
int rxfifosz = priv->plat->rx_fifo_size;
+ int txfifosz = priv->plat->tx_fifo_size;
if (rxfifosz == 0)
rxfifosz = priv->dma_cap.rx_fifo_size;
+ if (txfifosz == 0)
+ txfifosz = priv->dma_cap.tx_fifo_size;
+
+ /* Adjust for real per queue fifo size */
+ rxfifosz /= rx_channels_count;
+ txfifosz /= tx_channels_count;
if (priv->synopsys_id >= DWMAC_CORE_4_00) {
priv->hw->dma->dma_rx_mode(priv->ioaddr, rxmode, chan,
rxfifosz);
- priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan);
+ priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan,
+ txfifosz);
} else {
priv->hw->dma->dma_mode(priv->ioaddr, txmode, rxmode,
rxfifosz);
--
1.9.1
^ permalink raw reply related
* [PATCH net-next 2/2] net: stmmac: Disable flow ctrl for RX AVB queues and really enable TX AVB queues
From: Jose Abreu @ 2017-10-12 15:14 UTC (permalink / raw)
To: netdev
Cc: Jose Abreu, David S. Miller, Joao Pinto, Giuseppe Cavallaro,
Alexandre Torgue
In-Reply-To: <cover.1507820933.git.joabreu@synopsys.com>
Flow control must be disabled for AVB enabled queues and TX
AVB queues must be enabled by setting BIT(2) of TXQEN.
Correct this by passing the queue mode to DMA callbacks
and by checking in these functions wether we are in AVB
performing the necessary adjustments.
Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
drivers/net/ethernet/stmicro/stmmac/common.h | 4 ++--
drivers/net/ethernet/stmicro/stmmac/dwmac4.h | 2 ++
drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c | 16 +++++++++++-----
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 21 +++++++++++++++------
4 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index c26c8a7..e1e5ac0 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -442,9 +442,9 @@ struct stmmac_dma_ops {
void (*dma_mode)(void __iomem *ioaddr, int txmode, int rxmode,
int rxfifosz);
void (*dma_rx_mode)(void __iomem *ioaddr, int mode, u32 channel,
- int fifosz);
+ int fifosz, u8 qmode);
void (*dma_tx_mode)(void __iomem *ioaddr, int mode, u32 channel,
- int fifosz);
+ int fifosz, u8 qmode);
/* To track extra statistic (if supported) */
void (*dma_diagnostic_fr) (void *data, struct stmmac_extra_stats *x,
void __iomem *ioaddr);
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
index d74cedf..aeda3ab 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4.h
@@ -225,6 +225,8 @@ enum power_event {
#define MTL_CHAN_RX_DEBUG(x) (MTL_CHANX_BASE_ADDR(x) + 0x38)
#define MTL_OP_MODE_RSF BIT(5)
+#define MTL_OP_MODE_TXQEN_MASK GENMASK(3, 2)
+#define MTL_OP_MODE_TXQEN_AV BIT(2)
#define MTL_OP_MODE_TXQEN BIT(3)
#define MTL_OP_MODE_TSF BIT(1)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
index 898849b..572e96b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_dma.c
@@ -191,7 +191,7 @@ static void dwmac4_rx_watchdog(void __iomem *ioaddr, u32 riwt, u32 number_chan)
}
static void dwmac4_dma_rx_chan_op_mode(void __iomem *ioaddr, int mode,
- u32 channel, int fifosz)
+ u32 channel, int fifosz, u8 qmode)
{
unsigned int rqs = fifosz / 256 - 1;
u32 mtl_rx_op, mtl_rx_int;
@@ -218,8 +218,10 @@ static void dwmac4_dma_rx_chan_op_mode(void __iomem *ioaddr, int mode,
mtl_rx_op &= ~MTL_OP_MODE_RQS_MASK;
mtl_rx_op |= rqs << MTL_OP_MODE_RQS_SHIFT;
- /* enable flow control only if each channel gets 4 KiB or more FIFO */
- if (fifosz >= 4096) {
+ /* Enable flow control only if each channel gets 4 KiB or more FIFO and
+ * only if channel is not an AVB channel.
+ */
+ if ((fifosz >= 4096) && (qmode != MTL_QUEUE_AVB)) {
unsigned int rfd, rfa;
mtl_rx_op |= MTL_OP_MODE_EHFC;
@@ -271,7 +273,7 @@ static void dwmac4_dma_rx_chan_op_mode(void __iomem *ioaddr, int mode,
}
static void dwmac4_dma_tx_chan_op_mode(void __iomem *ioaddr, int mode,
- u32 channel, int fifosz)
+ u32 channel, int fifosz, u8 qmode)
{
u32 mtl_tx_op = readl(ioaddr + MTL_CHAN_TX_OP_MODE(channel));
unsigned int tqs = fifosz / 256 - 1;
@@ -311,7 +313,11 @@ static void dwmac4_dma_tx_chan_op_mode(void __iomem *ioaddr, int mode,
* reflect the available fifo size per queue (total fifo size / number
* of enabled queues).
*/
- mtl_tx_op |= MTL_OP_MODE_TXQEN;
+ mtl_tx_op &= ~MTL_OP_MODE_TXQEN_MASK;
+ if (mtl_tx_op != MTL_QUEUE_AVB)
+ mtl_tx_op |= MTL_OP_MODE_TXQEN;
+ else
+ mtl_tx_op |= MTL_OP_MODE_TXQEN_AV;
mtl_tx_op &= ~MTL_OP_MODE_TQS_MASK;
mtl_tx_op |= tqs << MTL_OP_MODE_TQS_SHIFT;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index edf245b..0e1b0a3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1754,6 +1754,7 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
u32 txmode = 0;
u32 rxmode = 0;
u32 chan = 0;
+ u8 qmode = 0;
if (rxfifosz == 0)
rxfifosz = priv->dma_cap.rx_fifo_size;
@@ -1785,13 +1786,19 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
/* configure all channels */
if (priv->synopsys_id >= DWMAC_CORE_4_00) {
- for (chan = 0; chan < rx_channels_count; chan++)
+ for (chan = 0; chan < rx_channels_count; chan++) {
+ qmode = priv->plat->rx_queues_cfg[chan].mode_to_use;
+
priv->hw->dma->dma_rx_mode(priv->ioaddr, rxmode, chan,
- rxfifosz);
+ rxfifosz, qmode);
+ }
+
+ for (chan = 0; chan < tx_channels_count; chan++) {
+ qmode = priv->plat->tx_queues_cfg[chan].mode_to_use;
- for (chan = 0; chan < tx_channels_count; chan++)
priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan,
- txfifosz);
+ txfifosz, qmode);
+ }
} else {
priv->hw->dma->dma_mode(priv->ioaddr, txmode, rxmode,
rxfifosz);
@@ -1954,6 +1961,8 @@ static void stmmac_tx_err(struct stmmac_priv *priv, u32 chan)
static void stmmac_set_dma_operation_mode(struct stmmac_priv *priv, u32 txmode,
u32 rxmode, u32 chan)
{
+ u8 rxqmode = priv->plat->rx_queues_cfg[chan].mode_to_use;
+ u8 txqmode = priv->plat->tx_queues_cfg[chan].mode_to_use;
u32 rx_channels_count = priv->plat->rx_queues_to_use;
u32 tx_channels_count = priv->plat->tx_queues_to_use;
int rxfifosz = priv->plat->rx_fifo_size;
@@ -1970,9 +1979,9 @@ static void stmmac_set_dma_operation_mode(struct stmmac_priv *priv, u32 txmode,
if (priv->synopsys_id >= DWMAC_CORE_4_00) {
priv->hw->dma->dma_rx_mode(priv->ioaddr, rxmode, chan,
- rxfifosz);
+ rxfifosz, rxqmode);
priv->hw->dma->dma_tx_mode(priv->ioaddr, txmode, chan,
- txfifosz);
+ txfifosz, txqmode);
} else {
priv->hw->dma->dma_mode(priv->ioaddr, txmode, rxmode,
rxfifosz);
--
1.9.1
^ permalink raw reply related
* Re: [Intel-wired-lan] [next-queue PATCH v6 2/5] mqprio: Implement select_queue class_ops
From: Alexander Duyck @ 2017-10-12 15:21 UTC (permalink / raw)
To: Vinicius Costa Gomes
Cc: Netdev, intel-wired-lan, rodney.cummings, andre.guedes,
Jiri Pirko, ivan.briano, Richard Cochran, henrik,
Jamal Hadi Salim, levipearson, boon.leong.ong, Cong Wang,
Jesus Sanchez-Palencia
In-Reply-To: <20171012005449.26533-3-vinicius.gomes@intel.com>
On Wed, Oct 11, 2017 at 5:54 PM, Vinicius Costa Gomes
<vinicius.gomes@intel.com> wrote:
> From: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>
> When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch
> the netdev_queue pointer that the current child qdisc is associated
> with before creating the new qdisc.
>
> Currently, when using mqprio as root qdisc, the kernel will end up
> getting the queue #0 pointer from the mqprio (root qdisc), which leaves
> any new child qdisc with a possibly wrong netdev_queue pointer.
>
> Implementing the Qdisc_class_ops select_queue() on mqprio fixes this
> issue and avoid an inconsistent state when child qdiscs are replaced.
>
> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
> ---
> net/sched/sch_mqprio.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index 6bcdfe6e7b63..8c042ae323e3 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -396,6 +396,12 @@ static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> }
> }
>
> +static struct netdev_queue *mqprio_select_queue(struct Qdisc *sch,
> + struct tcmsg *tcm)
> +{
> + return mqprio_queue_get(sch, TC_H_MIN(tcm->tcm_parent));
> +}
> +
So I was just comparing this against mq_selet_queue, and I was
wondering why we are willing to return NULL here instead of just
returning a pointer to the first Tx queue? I realize there is the fix
in the first patch but it seems like if we are going to go that route
then maybe we should update mq as well so that both of these qdiscs
behave the same way. Either this should work like mq, or mq should
work like this, but we shouldn't have them exposing different
behaviors.
> static const struct Qdisc_class_ops mqprio_class_ops = {
> .graft = mqprio_graft,
> .leaf = mqprio_leaf,
> @@ -403,6 +409,7 @@ static const struct Qdisc_class_ops mqprio_class_ops = {
> .walk = mqprio_walk,
> .dump = mqprio_dump_class,
> .dump_stats = mqprio_dump_class_stats,
> + .select_queue = mqprio_select_queue,
> };
>
> static struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
> --
> 2.14.2
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Roopa Prabhu @ 2017-10-12 15:31 UTC (permalink / raw)
To: Jiri Pirko
Cc: Steve Lin, netdev@vger.kernel.org, Jiri Pirko,
davem@davemloft.net, Michael Chan, linux-pci, John W. Linville,
Andy Gospodarek
In-Reply-To: <20171012150419.GI14672@nanopsycho>
On Thu, Oct 12, 2017 at 8:04 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Oct 12, 2017 at 04:46:24PM CEST, roopa@cumulusnetworks.com wrote:
>>On Thu, Oct 12, 2017 at 7:40 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, Oct 12, 2017 at 04:35:10PM CEST, roopa@cumulusnetworks.com wrote:
>>>>On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin <steven.lin1@broadcom.com> wrote:
>>>>> Adds a devlink command for getting & setting device configuration
>>>>> parameters, and enumerates a bunch of those parameters as devlink
>>>>> attributes. Also introduces an attribute that can be set by a
>>>>> driver to indicate that the config change doesn't take effect
>>>>> until the next restart (as in the case of the bnxt driver changes
>>>>> in this patchset, for which all the configuration changes affect NVM
>>>>> only, and aren't loaded until the next restart.)
>>>>>
>>>>> bnxt driver patches make use of these new devlink cmds/attributes.
>>>>>
>>>>> Steve Lin (3):
>>>>> devlink: Add config parameter get/set operations
>>>>> bnxt: Move generic devlink code to new file
>>>>> bnxt: Add devlink support for config get/set
>>>>>
>>>>
>>>>Is the goal here to move all ethtool operations to devlink (I saw some
>>>>attrs related to speed etc). ?.
>>>>We do need to move ethtool attrs to netlink and devlink is a good
>>>>place (and of-course leave the current ethtool api around for backward
>>>>compatibility).
>>>
>>> We need to make sure we are not moving things to devlink which don't
>>> belong there. All options that use "netdev" as a handle should go into
>>> rtnetlink instead.
>>>
>>
>>Any reason you want to keep that restriction ?.
>>FWIS, devlink is a driver api just like ethtool is.
>>and ethtool needs to move to netlink soon...and It would be better to
>>not put the rtnl_lock burden on ethtool driver operations. Instead of
>>adding yet another driver api, extending devlink seems like a great
>>fit to me.
>
> Hmm, the original purpose of devlink was to obtain iface for things that
> could not use "netdev" as a handle. I try to stick with it as we already
> have iface for things that could use "netdev" as a handle - rtnetlink.
>
> Not sure we want to go this way and add "netdev"-handle things into
> devlink. Thoughts?
>
Only motivation for me is to keep all driver/hw api in a single place.
and its high time ethtool moved to netlink. I would prefer it be out
of rtnetlink if we have a choice.
Moving some of the driver ops to rtnetlink and leaving the rest in
devlink can be a mess for drivers in the long run.
Maybe we can discuss this more at netdev2.2 ?
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Florian Fainelli @ 2017-10-12 15:43 UTC (permalink / raw)
To: Jiri Pirko, Roopa Prabhu
Cc: Steve Lin, netdev@vger.kernel.org, Jiri Pirko,
davem@davemloft.net, michael.chan, linux-pci, John W. Linville,
gospo
In-Reply-To: <20171012150419.GI14672@nanopsycho>
On October 12, 2017 8:04:19 AM PDT, Jiri Pirko <jiri@resnulli.us> wrote:
>Thu, Oct 12, 2017 at 04:46:24PM CEST, roopa@cumulusnetworks.com wrote:
>>On Thu, Oct 12, 2017 at 7:40 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, Oct 12, 2017 at 04:35:10PM CEST, roopa@cumulusnetworks.com
>wrote:
>>>>On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin
><steven.lin1@broadcom.com> wrote:
>>>>> Adds a devlink command for getting & setting device configuration
>>>>> parameters, and enumerates a bunch of those parameters as devlink
>>>>> attributes. Also introduces an attribute that can be set by a
>>>>> driver to indicate that the config change doesn't take effect
>>>>> until the next restart (as in the case of the bnxt driver changes
>>>>> in this patchset, for which all the configuration changes affect
>NVM
>>>>> only, and aren't loaded until the next restart.)
>>>>>
>>>>> bnxt driver patches make use of these new devlink cmds/attributes.
>>>>>
>>>>> Steve Lin (3):
>>>>> devlink: Add config parameter get/set operations
>>>>> bnxt: Move generic devlink code to new file
>>>>> bnxt: Add devlink support for config get/set
>>>>>
>>>>
>>>>Is the goal here to move all ethtool operations to devlink (I saw
>some
>>>>attrs related to speed etc). ?.
>>>>We do need to move ethtool attrs to netlink and devlink is a good
>>>>place (and of-course leave the current ethtool api around for
>backward
>>>>compatibility).
>>>
>>> We need to make sure we are not moving things to devlink which don't
>>> belong there. All options that use "netdev" as a handle should go
>into
>>> rtnetlink instead.
>>>
>>
>>Any reason you want to keep that restriction ?.
>>FWIS, devlink is a driver api just like ethtool is.
>>and ethtool needs to move to netlink soon...and It would be better to
>>not put the rtnl_lock burden on ethtool driver operations. Instead of
>>adding yet another driver api, extending devlink seems like a great
>>fit to me.
>
>Hmm, the original purpose of devlink was to obtain iface for things
>that
>could not use "netdev" as a handle. I try to stick with it as we
>already
>have iface for things that could use "netdev" as a handle - rtnetlink.
>
>Not sure we want to go this way and add "netdev"-handle things into
>devlink. Thoughts?
In the current situation where we have ethtool and devlink operating separately on different objects as entry points to the kernel, I agree with that design.
Once we move ethtool (or however we name its successor) over to netlink there is an opportunity for accessing objects that do and do not have a netdevice representor today (e.g: management ports on switches) with the same interface, and devlink could be used for that.
In terms of compatibility though we should have a pseudo generic layer that can take ethtool ioctl() and transform that into a netlink message and then call back down to drivers with the existing ethtool_ops that are implemented. It is reasonably simple to use coccinelle to update these ethtool_ops with possibly updated signatures to support netlink attributes and whatnot, but forcing drivers to quit doing ethtool_ops entitely and implement new sets of "ethtool over netlink" ops is a non starter IMHO.
--
Florian
^ permalink raw reply
* Re: BUG:af_packet fails to TX TSO frames
From: Anton Ivanov @ 2017-10-12 15:44 UTC (permalink / raw)
To: Willem de Bruijn; +Cc: Network Development, David Miller
In-Reply-To: <f046dbf4-0b46-2d7c-c82c-fe9f50c11f71@cambridgegreys.com>
Found it.
Two bugs canceling each other.
The bind sequence in: psock_txring_vnet.c is wrong.
It does the following addr.sll_protocol = htons(ETH_P_IP);
before calling bind.
If you set addr.sll_protocol to ETH_P_ALL where it should have been in
the first place the test program blows up with -ENOBUFS
I think what is happening is that this value is taken into account when
looking at "what should I use to segment it with" in skb_mac_gso_segment
which is invoked at the end of the verification chain which starts in
packet_direct_xmit in af_packet.c
I have not tried the other test cases like setting it to ETH_P_IP and
giving it IPv6 traffic or the opposite, but my guess is that these will
fail too if they need GSO to be applied.
A.
On 10/12/17 15:12, Anton Ivanov wrote:
>
>
> On 10/12/17 14:39, Willem de Bruijn wrote:
>>> If I produce a real vnet frame out of a live kernel frame using
>>> virtio_net_hdr_from_skb() and try to send it it fails on the check in
>>> af_packet, while succeeding for tap. If I remove the af_packet check
>>> the
>>> frame is accepted by the hardware too.
>>>
>>> If I produce it a synthetic frame + vnet header using the test
>>> program - it
>>> works. Go figure.
>> Besides looking at the raw frame bytes, also compare the setup
>> of virtio_net_header, as well as the tcp checksum field. The stack
>> expects the pseudo header to have already been calculated.
>
> I am feeding it a skb which is coming up in the tx routine of a User
> Mode Linux device which is marked as NETIF_F_HW_CSUM and SG - that
> results in a skb with csum-ed headers, body set up for CSUM_PARTIAL
> and multiple fragments (always at least 1 more frag besides the TCP
> head).
>
> That has everything in order as expected by virtio_net_hdr_from_skb
> and this is what I use to generate the vnet header. It works correctly
> for csum and GRO with af_packet and it works correctly for everything
> using a tap device. It fails only on GSO + af_packet TX.
>
> What I am doing is the same thing virtio_net does - it just takes the
> output of virtio_net_hdr_from_skb and does nothing more. There should
> be no need to do anything more :(
>
> It should just work.
>
> Unless there is a gremlin somewhere in the machinery and that gremlin
> needs some light to be flushed out.
>>
>>> I am going to continue digging into it.
>>>
>>> At the very least I now have a positive test case which uses the same
>>> semantics as my code so I have something to compare to.
>> Glad to hear that the test is helpful. I wrote it because I
>> have run into these exact same issues in the past.
>
> It is. I have changes ready for it so it also supports vector IO, need
> to finish fighting with it.
>
> A.
>
>>
>
--
Anton R. Ivanov
Cambridge Greys Limited, England and Wales company No 10273661
http://www.cambridgegreys.com/
^ permalink raw reply
* Re: RFC(v2): Audit Kernel Container IDs
From: Steve Grubb @ 2017-10-12 15:45 UTC (permalink / raw)
To: Richard Guy Briggs
Cc: cgroups, Linux Containers, Linux API, Linux Audit, Linux FS Devel,
Linux Kernel, Linux Network Development, Simo Sorce,
Carlos O'Donell, Aristeu Rozanski, David Howells,
Eric W. Biederman, Eric Paris, jlayton, Andy Lutomirski, mszeredi,
Paul Moore, Serge E. Hallyn, trondmy, Al Viro
In-Reply-To: <20171012141359.saqdtnodwmbz33b2@madcap2.tricolour.ca>
On Thursday, October 12, 2017 10:14:00 AM EDT Richard Guy Briggs wrote:
> Containers are a userspace concept. The kernel knows nothing of them.
>
> The Linux audit system needs a way to be able to track the container
> provenance of events and actions. Audit needs the kernel's help to do
> this.
>
> Since the concept of a container is entirely a userspace concept, a
> registration from the userspace container orchestration system initiates
> this. This will define a point in time and a set of resources
> associated with a particular container with an audit container ID.
The requirements for common criteria around containers should be very closely
modeled on the requirements for virtualization. It would be the container
manager that is responsible for logging the resource assignment events.
> The registration is a pseudo filesystem (proc, since PID tree already
> exists) write of a u8[16] UUID representing the container ID to a file
> representing a process that will become the first process in a new
> container. This write might place restrictions on mount namespaces
> required to define a container, or at least careful checking of
> namespaces in the kernel to verify permissions of the orchestrator so it
> can't change its own container ID. A bind mount of nsfs may be
> necessary in the container orchestrator's mntNS.
> Note: Use a 128-bit scalar rather than a string to make compares faster
> and simpler.
>
> Require a new CAP_CONTAINER_ADMIN to be able to carry out the
> registration.
Wouldn't CAP_AUDIT_WRITE be sufficient? After all, this is for auditing.
> At that time, record the target container's user-supplied
> container identifier along with the target container's first process
> (which may become the target container's "init" process) process ID
> (referenced from the initial PID namespace), all namespace IDs (in the
> form of a nsfs device number and inode number tuple) in a new auxilliary
> record AUDIT_CONTAINER with a qualifying op=$action field.
This would be in addition to the normal audit fields.
> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
>
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' task_struct.
>
> Mimic setns(2) and return an error if the process has already initiated
> threading or forked since this registration should happen before the
> process execution is started by the orchestrator and hence should not
> yet have any threads or children. If this is deemed overly restrictive,
> switch all threads and children to the new containerID.
>
> Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
>
> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable. Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.
>
> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
In the virtualization requirements, we only log removal of resources when
something is removed by intention. If the VM shuts down, the manager issues a
VIRT_CONTROL stop event and the user space utilities knows this means all
resources have been unassigned.
> Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.
>
> When a container ceases to exist because the last process in that
> container has exited and hence the last namespace has been destroyed and
> its refcount dropping to zero, log the fact.
> (This latter is likely needed for certification accountability.) A
> container object may need a list of processes and/or namespaces.
>
> A namespace cannot directly migrate from one container to another but
> could be assigned to a newly spawned container. A namespace can be
> moved from one container to another indirectly by having that namespace
> used in a second process in another container and then ending all the
> processes in the first container.
I'm thinking that there needs to be a clear delineation between what the
container manager is responsible for and what the kernel needs to do. The
kernel needs the registration system and to associate an identifier with
events inside the container.
But would the container manager be mostly responsible for auditing the events
described here:
https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events
Also, we can already audit exit, unshare, setns, and clone. If the kernel just
sticks the identifier on them, isn't that sufficient?
-Steve
> (v2)
> - switch from u64 to u128 UUID
> - switch from "signal" and "trigger" to "register"
> - restrict registration to single process or force all threads and children
> into same container
>
> - RGB
>
> --
> Richard Guy Briggs <rgb@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
^ permalink raw reply
* Re: [next-queue PATCH v6 3/5] net/sched: Introduce Credit Based Shaper (CBS) qdisc
From: Jesus Sanchez-Palencia @ 2017-10-12 15:45 UTC (permalink / raw)
To: Henrik Austad, Vinicius Costa Gomes
Cc: netdev, intel-wired-lan, jhs, xiyou.wangcong, jiri, andre.guedes,
ivan.briano, boon.leong.ong, richardcochran, levipearson,
rodney.cummings
In-Reply-To: <20171012075826.GA29396@sisyphus.home.austad.us>
Hi Henrik,
On 10/12/2017 12:58 AM, Henrik Austad wrote:
> On Wed, Oct 11, 2017 at 05:54:47PM -0700, Vinicius Costa Gomes wrote:
>> This queueing discipline implements the shaper algorithm defined by
>> the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
>>
>> It's primary usage is to apply some bandwidth reservation to user
>> defined traffic classes, which are mapped to different queues via the
>> mqprio qdisc.
>>
>> Only a simple software implementation is added for now.
>>
>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>> ---
>> include/uapi/linux/pkt_sched.h | 18 +++
>> net/sched/Kconfig | 11 ++
>> net/sched/Makefile | 1 +
>> net/sched/sch_cbs.c | 302 +++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 332 insertions(+)
>> create mode 100644 net/sched/sch_cbs.c
>>
>> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
>> index 099bf5528fed..41e349df4bf4 100644
>> --- a/include/uapi/linux/pkt_sched.h
>> +++ b/include/uapi/linux/pkt_sched.h
>> @@ -871,4 +871,22 @@ struct tc_pie_xstats {
>> __u32 maxq; /* maximum queue size */
>> __u32 ecn_mark; /* packets marked with ecn*/
>> };
>> +
>> +/* CBS */
>> +struct tc_cbs_qopt {
>> + __u8 offload;
>> + __s32 hicredit;
>> + __s32 locredit;
>> + __s32 idleslope;
>> + __s32 sendslope;
>> +};
>> +
>> +enum {
>> + TCA_CBS_UNSPEC,
>> + TCA_CBS_PARMS,
>> + __TCA_CBS_MAX,
>> +};
>> +
>> +#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
>> +
>> #endif
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index e70ed26485a2..c03d86a7775e 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -172,6 +172,17 @@ config NET_SCH_TBF
>> To compile this code as a module, choose M here: the
>> module will be called sch_tbf.
>>
>> +config NET_SCH_CBS
>> + tristate "Credit Based Shaper (CBS)"
>
> Any particular reason why the dependency to MQPRIO was dropped? I'm only
> asking because you added it in v1 of the series and then it disappeared in
> v2 and onwards.
The main reason is that currently there are no qdiscs that depend on any other
specifically. The dependency CBS had on MQPRIO came from how we were calculating
the queue index from the qdisc id inside sch_cbc.c . During the previous review
rounds we agreed on a fix for that and the dependency could be finally removed.
(...)
>> +static int cbs_change(struct Qdisc *sch, struct nlattr *opt)
>> +{
>> + struct cbs_sched_data *q = qdisc_priv(sch);
>> + struct net_device *dev = qdisc_dev(sch);
>> + struct nlattr *tb[TCA_CBS_MAX + 1];
>> + struct ethtool_link_ksettings ecmd;
>> + struct tc_cbs_qopt *qopt;
>> + s64 link_speed;
>> + int err;
>> +
>> + err = nla_parse_nested(tb, TCA_CBS_MAX, opt, cbs_policy, NULL);
>> + if (err < 0)
>> + return err;
>> +
>> + if (!tb[TCA_CBS_PARMS])
>> + return -EINVAL;
>> +
>> + qopt = nla_data(tb[TCA_CBS_PARMS]);
>> +
>> + if (qopt->offload)
>> + return -EOPNOTSUPP;
>> +
>> + if (!__ethtool_get_link_ksettings(dev, &ecmd))
>> + link_speed = ecmd.base.speed;
>> + else
>> + link_speed = SPEED_1000;
>> +
>> + q->port_rate = link_speed * 1000 * BYTES_PER_KBIT;
>> +
>> + q->enqueue = cbs_enqueue_soft;
>> + q->dequeue = cbs_dequeue_soft;
>
> How does one use the cbs_(en|de)queue instead of _soft()?
Judging from your comment on the other patch I believe you've got that
clarified. Please let us know if otherwise.
Thanks,
Jesus
^ permalink raw reply
* Re: BUG:af_packet fails to TX TSO frames
From: Anton Ivanov @ 2017-10-12 15:57 UTC (permalink / raw)
To: Willem de Bruijn; +Cc: Network Development, David Miller, user-mode-linux-devel
In-Reply-To: <e301fbfd-a283-caa1-5915-8be15677ed74@cambridgegreys.com>
Also confirmed using my UML patchset. This is inside the UML guest.
root@blinky:~# iperf -c 192.168.98.1
------------------------------------------------------------
Client connecting to 192.168.98.1, TCP port 5001
TCP window size: 414 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.98.146 port 36744 connected with 192.168.98.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 4.52 GBytes 3.89 Gbits/sec
This is GSO on a raw socket Virtual NIC to host. Normal speed without
GSO on same machine is ~ 500.
So for the time being I can turn off TSO for anything but v4 in the User
Mode Linux patchset (declare the guest NIC as TSOV4 capable only) and
enable it once the host kernel is fixed. Alternatively, there is the
rather ugly approach of using multiple FDs - one for v4, one for v6, one
for...
A.
On 10/12/17 16:44, Anton Ivanov wrote:
> Found it.
>
> Two bugs canceling each other.
> The bind sequence in: psock_txring_vnet.c is wrong.
>
> It does the following addr.sll_protocol = htons(ETH_P_IP);
> before calling bind.
>
> If you set addr.sll_protocol to ETH_P_ALL where it should have been in
> the first place the test program blows up with -ENOBUFS
>
> I think what is happening is that this value is taken into account
> when looking at "what should I use to segment it with" in
> skb_mac_gso_segment which is invoked at the end of the verification
> chain which starts in packet_direct_xmit in af_packet.c
>
> I have not tried the other test cases like setting it to ETH_P_IP and
> giving it IPv6 traffic or the opposite, but my guess is that these
> will fail too if they need GSO to be applied.
>
> A.
>
> On 10/12/17 15:12, Anton Ivanov wrote:
>>
>>
>> On 10/12/17 14:39, Willem de Bruijn wrote:
>>>> If I produce a real vnet frame out of a live kernel frame using
>>>> virtio_net_hdr_from_skb() and try to send it it fails on the check in
>>>> af_packet, while succeeding for tap. If I remove the af_packet
>>>> check the
>>>> frame is accepted by the hardware too.
>>>>
>>>> If I produce it a synthetic frame + vnet header using the test
>>>> program - it
>>>> works. Go figure.
>>> Besides looking at the raw frame bytes, also compare the setup
>>> of virtio_net_header, as well as the tcp checksum field. The stack
>>> expects the pseudo header to have already been calculated.
>>
>> I am feeding it a skb which is coming up in the tx routine of a User
>> Mode Linux device which is marked as NETIF_F_HW_CSUM and SG - that
>> results in a skb with csum-ed headers, body set up for CSUM_PARTIAL
>> and multiple fragments (always at least 1 more frag besides the TCP
>> head).
>>
>> That has everything in order as expected by virtio_net_hdr_from_skb
>> and this is what I use to generate the vnet header. It works
>> correctly for csum and GRO with af_packet and it works correctly for
>> everything using a tap device. It fails only on GSO + af_packet TX.
>>
>> What I am doing is the same thing virtio_net does - it just takes the
>> output of virtio_net_hdr_from_skb and does nothing more. There should
>> be no need to do anything more :(
>>
>> It should just work.
>>
>> Unless there is a gremlin somewhere in the machinery and that gremlin
>> needs some light to be flushed out.
>>>
>>>> I am going to continue digging into it.
>>>>
>>>> At the very least I now have a positive test case which uses the same
>>>> semantics as my code so I have something to compare to.
>>> Glad to hear that the test is helpful. I wrote it because I
>>> have run into these exact same issues in the past.
>>
>> It is. I have changes ready for it so it also supports vector IO,
>> need to finish fighting with it.
>>
>> A.
>>
>>>
>>
>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel
^ permalink raw reply
* Re: [RFC 0/3] Adding config get/set to devlink
From: Roopa Prabhu @ 2017-10-12 16:05 UTC (permalink / raw)
To: Florian Fainelli
Cc: Jiri Pirko, Steve Lin, netdev@vger.kernel.org, Jiri Pirko,
davem@davemloft.net, Michael Chan, linux-pci, John W. Linville,
Andy Gospodarek
In-Reply-To: <24E5DE7C-A401-48BF-BF80-673ACC38FBBE@gmail.com>
On Thu, Oct 12, 2017 at 8:43 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On October 12, 2017 8:04:19 AM PDT, Jiri Pirko <jiri@resnulli.us> wrote:
>>Thu, Oct 12, 2017 at 04:46:24PM CEST, roopa@cumulusnetworks.com wrote:
>>>On Thu, Oct 12, 2017 at 7:40 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Thu, Oct 12, 2017 at 04:35:10PM CEST, roopa@cumulusnetworks.com
>>wrote:
>>>>>On Thu, Oct 12, 2017 at 6:34 AM, Steve Lin
>><steven.lin1@broadcom.com> wrote:
>>>>>> Adds a devlink command for getting & setting device configuration
>>>>>> parameters, and enumerates a bunch of those parameters as devlink
>>>>>> attributes. Also introduces an attribute that can be set by a
>>>>>> driver to indicate that the config change doesn't take effect
>>>>>> until the next restart (as in the case of the bnxt driver changes
>>>>>> in this patchset, for which all the configuration changes affect
>>NVM
>>>>>> only, and aren't loaded until the next restart.)
>>>>>>
>>>>>> bnxt driver patches make use of these new devlink cmds/attributes.
>>>>>>
>>>>>> Steve Lin (3):
>>>>>> devlink: Add config parameter get/set operations
>>>>>> bnxt: Move generic devlink code to new file
>>>>>> bnxt: Add devlink support for config get/set
>>>>>>
>>>>>
>>>>>Is the goal here to move all ethtool operations to devlink (I saw
>>some
>>>>>attrs related to speed etc). ?.
>>>>>We do need to move ethtool attrs to netlink and devlink is a good
>>>>>place (and of-course leave the current ethtool api around for
>>backward
>>>>>compatibility).
>>>>
>>>> We need to make sure we are not moving things to devlink which don't
>>>> belong there. All options that use "netdev" as a handle should go
>>into
>>>> rtnetlink instead.
>>>>
>>>
>>>Any reason you want to keep that restriction ?.
>>>FWIS, devlink is a driver api just like ethtool is.
>>>and ethtool needs to move to netlink soon...and It would be better to
>>>not put the rtnl_lock burden on ethtool driver operations. Instead of
>>>adding yet another driver api, extending devlink seems like a great
>>>fit to me.
>>
>>Hmm, the original purpose of devlink was to obtain iface for things
>>that
>>could not use "netdev" as a handle. I try to stick with it as we
>>already
>>have iface for things that could use "netdev" as a handle - rtnetlink.
>>
>>Not sure we want to go this way and add "netdev"-handle things into
>>devlink. Thoughts?
>
> In the current situation where we have ethtool and devlink operating separately on different objects as entry points to the kernel, I agree with that design.
>
> Once we move ethtool (or however we name its successor) over to netlink there is an opportunity for accessing objects that do and do not have a netdevice representor today (e.g: management ports on switches) with the same interface, and devlink could be used for that.
>
> In terms of compatibility though we should have a pseudo generic layer that can take ethtool ioctl() and transform that into a netlink message and then call back down to drivers with the existing ethtool_ops that are implemented. It is reasonably simple to use coccinelle to update these ethtool_ops with possibly updated signatures to support netlink attributes and whatnot,
ack, that sounds like a good approach.
> but forcing drivers to quit doing ethtool_ops entitely and implement new sets of "ethtool over netlink" ops is a non starter IMHO.
correct, nobody disagrees with that point.
^ permalink raw reply
* Re: [Intel-wired-lan] [next-queue PATCH v6 2/5] mqprio: Implement select_queue class_ops
From: Jesus Sanchez-Palencia @ 2017-10-12 15:59 UTC (permalink / raw)
To: Alexander Duyck, Vinicius Costa Gomes
Cc: Netdev, intel-wired-lan, rodney.cummings, andre.guedes,
Jiri Pirko, ivan.briano, Richard Cochran, henrik,
Jamal Hadi Salim, levipearson, boon.leong.ong, Cong Wang
In-Reply-To: <CAKgT0Ufo1tG4GpSkGaWuUxqhzEUOT7wcBLuaZx_mZmEL+9SnXw@mail.gmail.com>
Hi Alex,
On 10/12/2017 08:21 AM, Alexander Duyck wrote:
> On Wed, Oct 11, 2017 at 5:54 PM, Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
>> From: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>>
>> When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch
>> the netdev_queue pointer that the current child qdisc is associated
>> with before creating the new qdisc.
>>
>> Currently, when using mqprio as root qdisc, the kernel will end up
>> getting the queue #0 pointer from the mqprio (root qdisc), which leaves
>> any new child qdisc with a possibly wrong netdev_queue pointer.
>>
>> Implementing the Qdisc_class_ops select_queue() on mqprio fixes this
>> issue and avoid an inconsistent state when child qdiscs are replaced.
>>
>> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>> ---
>> net/sched/sch_mqprio.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
>> index 6bcdfe6e7b63..8c042ae323e3 100644
>> --- a/net/sched/sch_mqprio.c
>> +++ b/net/sched/sch_mqprio.c
>> @@ -396,6 +396,12 @@ static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
>> }
>> }
>>
>> +static struct netdev_queue *mqprio_select_queue(struct Qdisc *sch,
>> + struct tcmsg *tcm)
>> +{
>> + return mqprio_queue_get(sch, TC_H_MIN(tcm->tcm_parent));
>> +}
>> +
>
> So I was just comparing this against mq_selet_queue, and I was
> wondering why we are willing to return NULL here instead of just
> returning a pointer to the first Tx queue? I realize there is the fix
> in the first patch but it seems like if we are going to go that route
> then maybe we should update mq as well so that both of these qdiscs
> behave the same way. Either this should work like mq, or mq should
> work like this, but we shouldn't have them exposing different
> behaviors.
This was brought up by Cong Wang during the review of our v2. Based on my
understanding, the point I've made is that for mqprio the inner qdiscs are
always 'related' to one of the Tx netdev_queues per design. Returning any other
queue as a fallback seemed like going against that to me.
I'm still inclined to say that we should keep this function as the patch is
proposing, thus either returning the correct netdev_queue for a given handle, or
NULL as a way to flag that something was 'wrong' with it. Returning queue #0 is
misleading in that sense, imho.
As for aligning mq_select_queue() with this approach, if my reasoning behind
mqprio is correct and also applies to mq, I would be happy to send that fix as
part our v7.
What do you think?
Thanks,
Jesus
^ permalink raw reply
* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
From: Stephen Hemminger @ 2017-10-12 16:07 UTC (permalink / raw)
To: Phil Sutter; +Cc: Michal Kubecek, Hangbin Liu, netdev, Hangbin Liu
In-Reply-To: <20171011111007.GA11332@orbyte.nwl.cc>
On Wed, 11 Oct 2017 13:10:07 +0200
Phil Sutter <phil@nwl.cc> wrote:
> On Tue, Oct 10, 2017 at 09:47:43AM -0700, Stephen Hemminger wrote:
> > On Tue, 10 Oct 2017 08:41:17 +0200
> > Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > > On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> > > > Hi Stephen,
> > > >
> > > > On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:
> > > > > On Thu, 28 Sep 2017 21:33:46 +0800
> > > > > Hangbin Liu <haliu@redhat.com> wrote:
> > > > >
> > > > > > From: Hangbin Liu <liuhangbin@gmail.com>
> > > > > >
> > > > > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > > > > iplink_get()"). After update, we will not need to double the buffer size
> > > > > > every time when VFs number increased.
> > > > > >
> > > > > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > > > > length parameter.
> > > > > >
> > > > > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > > > > answer to avoid overwrite data in nlh, because it may has more info after
> > > > > > nlh. also this will avoid nlh buffer not enough issue.
> > > > > >
> > > > > > We need to free answer after using.
> > > > > >
> > > > > > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > > > > > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > > > > > ---
> > > > >
> > > > > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > > > > Can only those places that need that be targeted?
> > > >
> > > > We could probably do that, by having a buffer on stack in __rtnl_talk()
> > > > which will be used instead of the allocated one if 'answer' is NULL. Or
> > > > maybe even introduce a dedicated API call for the dynamically allocated
> > > > receive buffer. But I really doubt that's feasible: AFAICT, that stack
> > > > buffer still needs to be reasonably sized since the reply might be
> > > > larger than the request (reusing the request buffer would be the most
> > > > simple way to tackle this), also there is support for extack which may
> > > > bloat the response to arbitrary size. Hangbin has shown in his benchmark
> > > > that the overhead of the second syscall is negligible, so why care about
> > > > that and increase code complexity even further?
> > > >
> > > > Not saying it's not possible, but I just doubt it's worth the effort.
> > >
> > > Agreed. Current code is based on the assumption that we can estimate the
> > > maximum reply length in advance and the reason for this series is that
> > > this assumption turned out to be wrong. I'm afraid that if we replace
> > > it by an assumption that we can estimate the maximum reply length for
> > > most requests with only few exceptions, it's only matter of time for us
> > > to be proven wrong again.
> > >
> > > Michal Kubecek
> > >
> >
> > For query responses, yes the response may be large. But for the common cases of
> > add address or add route, the response should just be ack or error.
>
> And with extack, error is comprised of the original request plus an
> arbitrarily sized error message, so we can't just reuse the request
> buffer and are back to "guessing" the right length again.
>
> To get an idea of what we're talking about, I wrote a simple benchmark
> which adds 256 * 254 (= 65024) addresses to an interface, then removes
> them again one by one and measured the time that takes for binaries with
> and without Hangbin's patches:
>
> OP Vanilla Hangbin Delta
> --------------------------------------------------------
> add real 2m16.244s real 2m27.964s +11.72s (108.6%)
> user 0m15.241s user 0m17.295s +2.054s (113.5%)
> sys 1m40.229s sys 1m48.239s +8.01s (108.0%)
>
> remove real 1m44.950s real 1m47.044s +2.094s (102.0%)
> user 0m13.899s user 0m14.723s +0.824s (105.9%)
> sys 1m30.798s sys 1m31.938s +1.140s (101.3%)
>
> So the overhead of the second syscall and dynamic memory allocation is
> less than 10% overall. Given the short time a single call to 'ip'
> typically takes, I don't think the difference is noticeable even in
> highly performance critical applications.
>
> Cheers, Phil
For a better benchmark, I generated 4 Million routes
then did:
# ip ---batch routes.txt
OP Vanilla Hangbin Delta
-----------------------------------------------------
add real 1:25.840 1:33.677 +9.13%
user 10.690 6.078 -56.85%
sys 1:00.920 1:13.109 +20.00%
remove real 2:29.881 2:25.872 -2.67%
user 12.862 7.942 -38.25%
sys 44.127 44.633 +1.15%
So the answer is addition is slower but deletion appears faster?
If I rerun the Vanilla test, get about the same times.
The slowdown won't impact me, but what about large scale users
like Cumulus.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox