Netdev List
 help / color / mirror / Atom feed
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-20 16:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180620170904-mutt-send-email-mst@kernel.org>

On Wed, 20 Jun 2018 17:11:59 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 20, 2018 at 11:53:59AM +0200, Cornelia Huck wrote:
> > On Tue, 19 Jun 2018 23:32:06 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:  
> > > > Sorry about dragging mainframes into this, but this will only work for
> > > > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > > > + virtio-net-ccw example again: The guest cannot find out that the two
> > > > belong together by checking some group ID, it has to either use the MAC
> > > > or some needs-to-be-architectured property.
> > > > 
> > > > Alternatively, we could propose that mechanism as pci-only, which means
> > > > we can rely on mechanisms that won't necessarily work on non-pci
> > > > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > > > through a network card anytime in the near future, due to the nature of
> > > > network cards currently in use on s390.)    
> > > 
> > > That's what it boils down to, yes.  If there's need to have this for
> > > non-pci devices, then we should put it in config space.
> > > Cornelia, what do you think?
> > >   
> > 
> > I think the only really useful config on s390 is the vfio-pci network
> > card coupled with a virtio-net-ccw device: Using an s390 network card
> > via vfio-ccw is out due to the nature of the s390 network cards, and
> > virtio-ccw is the default transport (virtio-pci is not supported on any
> > enterprise distro AFAIK).
> > 
> > For this, having a uuid in the config space could work (vfio-pci
> > devices have a config space by virtue of being pci devices, and
> > virtio-net-ccw devices have a config space by virtue of being virtio
> > devices -- ccw devices usually don't have that concept).  
> 
> OK so this calls for adding such a field generally (it's
> device agnostic right now).
> 
> How would you suggest doing that?

I hope that I'm not thoroughly confused at this point in time, so I'll
summarize my current understanding (also keep in mind that I haven't
looked at Venu's patches yet):

- The Linux guest initiates coupling from the virtio-net driver.
  Matching the other device is done via the MAC, and only pci devices
  are allowed for the failover device. (There does not seem to be any
  restriction on the transport of the virtio-net device.)
- The Linux guest virtio-net driver does not allow changing the MAC if
  standby has been negotiated (implying that the hypervisor needs to
  configure the correct MAC).
- In QEMU, we need to know which two devices (vfio-pci and virtio-net)
  go together, so that the virtio-net device gets the correct MAC. We
  also need the pairing so that we can make the vfio-pci device
  available once the guest has negotiated the standby feature.

We can tack the two devices together in QEMU by introducing new,
optional properties pointing from the virtio-net device to the vfio-pci
device (only offer standby if this is set) and the other way around
(don't make the device visible at the start if this is set). Problems:

- The admin needs to figure out the MAC by themselves and set it
  correctly. If this is incorrect, the vfio-pci device cannot be found
  in the guest. (Not sure how much of a problem this is in practice --
  and QEMU cannot figure out the MAC without poking at the vfio-pci
  device, and we probably want to avoid that.)
- This two-way pointing makes for interesting handing of the command
  line and when both devices are plugged later.

In any case, I'm not sure anymore why we'd want the extra uuid. Is
there any way QEMU (or libvirt) can figure it out without actually
looking at the vfio-pci device?

^ permalink raw reply

* [PATCH 4/5] ceph: use timespec64 for r_mtime
From: Arnd Bergmann @ 2018-06-20 15:50 UTC (permalink / raw)
  To: Yan Zheng, Sage Weil, Ilya Dryomov, Alex Elder
  Cc: Jens Axboe, Jan Kara, Martin K. Petersen, Arnd Bergmann, y2038,
	netdev, linux-kernel, Daniel Jordan, linux-block, ceph-devel,
	Jason Dillaman, David S. Miller
In-Reply-To: <20180620155101.57685-1-arnd@arndb.de>

The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/block/rbd.c             |  2 +-
 fs/ceph/addr.c                  | 12 ++++++------
 fs/ceph/file.c                  | 11 +++++------
 include/linux/ceph/osd_client.h |  6 +++---
 net/ceph/osd_client.c           |  8 ++++----
 5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index fa0729c1e776..356936333cd9 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1452,7 +1452,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
 	struct ceph_osd_request *osd_req = obj_request->osd_req;
 
 	osd_req->r_flags = CEPH_OSD_FLAG_WRITE;
-	ktime_get_real_ts(&osd_req->r_mtime);
+	ktime_get_real_ts64(&osd_req->r_mtime);
 	osd_req->r_data_offset = obj_request->ex.oe_off;
 }
 
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 292b3d72d725..d44d51e69e76 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -574,7 +574,7 @@ static u64 get_writepages_data_length(struct inode *inode,
  */
 static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 {
-	struct timespec ts;
+	struct timespec64 ts;
 	struct inode *inode;
 	struct ceph_inode_info *ci;
 	struct ceph_fs_client *fsc;
@@ -625,7 +625,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
 
 	set_page_writeback(page);
-	ts = timespec64_to_timespec(inode->i_mtime);
+	ts = inode->i_mtime;
 	err = ceph_osdc_writepages(&fsc->client->osdc, ceph_vino(inode),
 				   &ci->i_layout, snapc, page_off, len,
 				   ceph_wbc.truncate_seq,
@@ -1134,7 +1134,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 			pages = NULL;
 		}
 
-		req->r_mtime = timespec64_to_timespec(inode->i_mtime);
+		req->r_mtime = inode->i_mtime;
 		rc = ceph_osdc_start_request(&fsc->client->osdc, req, true);
 		BUG_ON(rc);
 		req = NULL;
@@ -1734,7 +1734,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
 		goto out;
 	}
 
-	req->r_mtime = timespec64_to_timespec(inode->i_mtime);
+	req->r_mtime = inode->i_mtime;
 	err = ceph_osdc_start_request(&fsc->client->osdc, req, false);
 	if (!err)
 		err = ceph_osdc_wait_request(&fsc->client->osdc, req);
@@ -1776,7 +1776,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page)
 			goto out_put;
 	}
 
-	req->r_mtime = timespec64_to_timespec(inode->i_mtime);
+	req->r_mtime = inode->i_mtime;
 	err = ceph_osdc_start_request(&fsc->client->osdc, req, false);
 	if (!err)
 		err = ceph_osdc_wait_request(&fsc->client->osdc, req);
@@ -1937,7 +1937,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci,
 				     0, false, true);
 	err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false);
 
-	wr_req->r_mtime = timespec64_to_timespec(ci->vfs_inode.i_mtime);
+	wr_req->r_mtime = ci->vfs_inode.i_mtime;
 	err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false);
 
 	if (!err)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ad0bed99b1d5..1795a8dc9a1e 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -721,7 +721,7 @@ struct ceph_aio_request {
 	struct list_head osd_reqs;
 	unsigned num_reqs;
 	atomic_t pending_reqs;
-	struct timespec mtime;
+	struct timespec64 mtime;
 	struct ceph_cap_flush *prealloc_cf;
 };
 
@@ -923,7 +923,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	int num_pages = 0;
 	int flags;
 	int ret;
-	struct timespec mtime = timespec64_to_timespec(current_time(inode));
+	struct timespec64 mtime = current_time(inode);
 	size_t count = iov_iter_count(iter);
 	loff_t pos = iocb->ki_pos;
 	bool write = iov_iter_rw(iter) == WRITE;
@@ -1013,7 +1013,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 			truncate_inode_pages_range(inode->i_mapping, pos,
 					(pos+len) | (PAGE_SIZE - 1));
 
-			req->r_mtime = mtime;
+			req->r_mtime = current_time(inode);
 		}
 
 		osd_req_op_extent_osd_data_bvecs(req, 0, bvecs, num_pages, len);
@@ -1131,7 +1131,6 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 	int flags;
 	int ret;
 	bool check_caps = false;
-	struct timespec mtime = timespec64_to_timespec(current_time(inode));
 	size_t count = iov_iter_count(from);
 
 	if (ceph_snap(file_inode(file)) != CEPH_NOSNAP)
@@ -1201,7 +1200,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos,
 		osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0,
 						false, true);
 
-		req->r_mtime = mtime;
+		req->r_mtime = current_time(inode);
 		ret = ceph_osdc_start_request(&fsc->client->osdc, req, false);
 		if (!ret)
 			ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
@@ -1663,7 +1662,7 @@ static int ceph_zero_partial_object(struct inode *inode,
 		goto out;
 	}
 
-	req->r_mtime = timespec64_to_timespec(inode->i_mtime);
+	req->r_mtime = inode->i_mtime;
 	ret = ceph_osdc_start_request(&fsc->client->osdc, req, false);
 	if (!ret) {
 		ret = ceph_osdc_wait_request(&fsc->client->osdc, req);
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 0d6ee04b4c41..2e6611c1e9a0 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -199,7 +199,7 @@ struct ceph_osd_request {
 	/* set by submitter */
 	u64 r_snapid;                         /* for reads, CEPH_NOSNAP o/w */
 	struct ceph_snap_context *r_snapc;    /* for writes */
-	struct timespec r_mtime;              /* ditto */
+	struct timespec64 r_mtime;            /* ditto */
 	u64 r_data_offset;                    /* ditto */
 	bool r_linger;                        /* don't resend on failure */
 
@@ -253,7 +253,7 @@ struct ceph_osd_linger_request {
 	struct ceph_osd_request_target t;
 	u32 map_dne_bound;
 
-	struct timespec mtime;
+	struct timespec64 mtime;
 
 	struct kref kref;
 	struct mutex lock;
@@ -508,7 +508,7 @@ extern int ceph_osdc_writepages(struct ceph_osd_client *osdc,
 				struct ceph_snap_context *sc,
 				u64 off, u64 len,
 				u32 truncate_seq, u64 truncate_size,
-				struct timespec *mtime,
+				struct timespec64 *mtime,
 				struct page **pages, int nr_pages);
 
 /* watch/notify */
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index a00c74f1154e..a87a021ca9d0 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1978,7 +1978,7 @@ static void encode_request_partial(struct ceph_osd_request *req,
 	p += sizeof(struct ceph_blkin_trace_info);
 
 	ceph_encode_32(&p, 0); /* client_inc, always 0 */
-	ceph_encode_timespec(p, &req->r_mtime);
+	ceph_encode_timespec64(p, &req->r_mtime);
 	p += sizeof(struct ceph_timespec);
 
 	encode_oloc(&p, end, &req->r_t.target_oloc);
@@ -4512,7 +4512,7 @@ ceph_osdc_watch(struct ceph_osd_client *osdc,
 	ceph_oid_copy(&lreq->t.base_oid, oid);
 	ceph_oloc_copy(&lreq->t.base_oloc, oloc);
 	lreq->t.flags = CEPH_OSD_FLAG_WRITE;
-	ktime_get_real_ts(&lreq->mtime);
+	ktime_get_real_ts64(&lreq->mtime);
 
 	lreq->reg_req = alloc_linger_request(lreq);
 	if (!lreq->reg_req) {
@@ -4570,7 +4570,7 @@ int ceph_osdc_unwatch(struct ceph_osd_client *osdc,
 	ceph_oid_copy(&req->r_base_oid, &lreq->t.base_oid);
 	ceph_oloc_copy(&req->r_base_oloc, &lreq->t.base_oloc);
 	req->r_flags = CEPH_OSD_FLAG_WRITE;
-	ktime_get_real_ts(&req->r_mtime);
+	ktime_get_real_ts64(&req->r_mtime);
 	osd_req_op_watch_init(req, 0, lreq->linger_id,
 			      CEPH_OSD_WATCH_OP_UNWATCH);
 
@@ -5136,7 +5136,7 @@ int ceph_osdc_writepages(struct ceph_osd_client *osdc, struct ceph_vino vino,
 			 struct ceph_snap_context *snapc,
 			 u64 off, u64 len,
 			 u32 truncate_seq, u64 truncate_size,
-			 struct timespec *mtime,
+			 struct timespec64 *mtime,
 			 struct page **pages, int num_pages)
 {
 	struct ceph_osd_request *req;
-- 
2.9.0

_______________________________________________
Y2038 mailing list
Y2038@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/y2038

^ permalink raw reply related

* [PATCH 1/5] ceph: use timespec64 in for keepalive
From: Arnd Bergmann @ 2018-06-20 15:50 UTC (permalink / raw)
  To: Yan Zheng, Sage Weil, Ilya Dryomov, David S. Miller
  Cc: y2038, ceph-devel, Arnd Bergmann, linux-kernel, netdev

ceph_con_keepalive_expired() is the last user of timespec_add() and some
of the last uses of ktime_get_real_ts().  Replacing this with timespec64
based interfaces  lets us remove that deprecated API.

I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
here that take timespec64 structures and convert to/from ceph_timespec,
which is defined to have an unsigned 32-bit tv_sec member. This extends
the range of valid times to year 2106, avoiding the year 2038 overflow.

The ceph file system portion still uses the old functions for inode
timestamps, this will be done separately after the VFS layer is converted.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 include/linux/ceph/decode.h    | 20 +++++++++++++++++++-
 include/linux/ceph/messenger.h |  2 +-
 net/ceph/auth_x.c              | 14 +++++++-------
 net/ceph/auth_x.h              |  2 +-
 net/ceph/cls_lock_client.c     |  4 ++--
 net/ceph/messenger.c           | 20 ++++++++++----------
 6 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
index d143ac8879c6..45efe09fb7b4 100644
--- a/include/linux/ceph/decode.h
+++ b/include/linux/ceph/decode.h
@@ -194,8 +194,26 @@ ceph_decode_skip_n(p, end, sizeof(u8), bad)
 	} while (0)
 
 /*
- * struct ceph_timespec <-> struct timespec
+ * struct ceph_timespec <-> struct timespec64
  */
+static inline void ceph_decode_timespec64(struct timespec64 *ts,
+					const struct ceph_timespec *tv)
+{
+	/*
+	 * this will still overflow in year 2106. We could extend
+	 * the protocol to steal two more bits from tv_nsec to
+	 * add three more 136 year epochs after that the way ext4
+	 * does if necessary.
+	 */
+	ts->tv_sec = (time64_t)le32_to_cpu(tv->tv_sec);
+	ts->tv_nsec = (long)le32_to_cpu(tv->tv_nsec);
+}
+static inline void ceph_encode_timespec64(struct ceph_timespec *tv,
+					const struct timespec64 *ts)
+{
+	tv->tv_sec = cpu_to_le32((u32)ts->tv_sec);
+	tv->tv_nsec = cpu_to_le32((u32)ts->tv_nsec);
+}
 static inline void ceph_decode_timespec(struct timespec *ts,
 					const struct ceph_timespec *tv)
 {
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index c7dfcb8a1fb2..a718b877c597 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -330,7 +330,7 @@ struct ceph_connection {
 	int in_base_pos;     /* bytes read */
 	__le64 in_temp_ack;  /* for reading an ack */
 
-	struct timespec last_keepalive_ack; /* keepalive2 ack stamp */
+	struct timespec64 last_keepalive_ack; /* keepalive2 ack stamp */
 
 	struct delayed_work work;	    /* send|recv work */
 	unsigned long       delay;          /* current delay interval */
diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index 2f4a1baf5f52..b05c3a540a5a 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -149,12 +149,12 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 	void *dp, *dend;
 	int dlen;
 	char is_enc;
-	struct timespec validity;
+	struct timespec64 validity;
 	void *tp, *tpend;
 	void **ptp;
 	struct ceph_crypto_key new_session_key = { 0 };
 	struct ceph_buffer *new_ticket_blob;
-	unsigned long new_expires, new_renew_after;
+	time64_t new_expires, new_renew_after;
 	u64 new_secret_id;
 	int ret;
 
@@ -189,11 +189,11 @@ static int process_one_ticket(struct ceph_auth_client *ac,
 	if (ret)
 		goto out;
 
-	ceph_decode_timespec(&validity, dp);
+	ceph_decode_timespec64(&validity, dp);
 	dp += sizeof(struct ceph_timespec);
-	new_expires = get_seconds() + validity.tv_sec;
+	new_expires = ktime_get_real_seconds() + validity.tv_sec;
 	new_renew_after = new_expires - (validity.tv_sec / 4);
-	dout(" expires=%lu renew_after=%lu\n", new_expires,
+	dout(" expires=%llu renew_after=%llu\n", new_expires,
 	     new_renew_after);
 
 	/* ticket blob for service */
@@ -385,13 +385,13 @@ static bool need_key(struct ceph_x_ticket_handler *th)
 	if (!th->have_key)
 		return true;
 
-	return get_seconds() >= th->renew_after;
+	return ktime_get_real_seconds() >= th->renew_after;
 }
 
 static bool have_key(struct ceph_x_ticket_handler *th)
 {
 	if (th->have_key) {
-		if (get_seconds() >= th->expires)
+		if (ktime_get_real_seconds() >= th->expires)
 			th->have_key = false;
 	}
 
diff --git a/net/ceph/auth_x.h b/net/ceph/auth_x.h
index 454cb54568af..57ba99f7736f 100644
--- a/net/ceph/auth_x.h
+++ b/net/ceph/auth_x.h
@@ -22,7 +22,7 @@ struct ceph_x_ticket_handler {
 	u64 secret_id;
 	struct ceph_buffer *ticket_blob;
 
-	unsigned long renew_after, expires;
+	time64_t renew_after, expires;
 };
 
 #define CEPHX_AU_ENC_BUF_LEN	128  /* big enough for encrypted blob */
diff --git a/net/ceph/cls_lock_client.c b/net/ceph/cls_lock_client.c
index 8d2032b2f225..2105a6eaa66c 100644
--- a/net/ceph/cls_lock_client.c
+++ b/net/ceph/cls_lock_client.c
@@ -32,7 +32,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
 	int desc_len = strlen(desc);
 	void *p, *end;
 	struct page *lock_op_page;
-	struct timespec mtime;
+	struct timespec64 mtime;
 	int ret;
 
 	lock_op_buf_size = name_len + sizeof(__le32) +
@@ -63,7 +63,7 @@ int ceph_cls_lock(struct ceph_osd_client *osdc,
 	ceph_encode_string(&p, end, desc, desc_len);
 	/* only support infinite duration */
 	memset(&mtime, 0, sizeof(mtime));
-	ceph_encode_timespec(p, &mtime);
+	ceph_encode_timespec64(p, &mtime);
 	p += sizeof(struct ceph_timespec);
 	ceph_encode_8(&p, flags);
 
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index c6413c360771..3f6336248509 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1417,11 +1417,11 @@ static void prepare_write_keepalive(struct ceph_connection *con)
 	dout("prepare_write_keepalive %p\n", con);
 	con_out_kvec_reset(con);
 	if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
-		struct timespec now;
+		struct timespec64 now;
 
-		ktime_get_real_ts(&now);
+		ktime_get_real_ts64(&now);
 		con_out_kvec_add(con, sizeof(tag_keepalive2), &tag_keepalive2);
-		ceph_encode_timespec(&con->out_temp_keepalive2, &now);
+		ceph_encode_timespec64(&con->out_temp_keepalive2, &now);
 		con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
 				 &con->out_temp_keepalive2);
 	} else {
@@ -2555,7 +2555,7 @@ static int read_keepalive_ack(struct ceph_connection *con)
 	int ret = read_partial(con, size, size, &ceph_ts);
 	if (ret <= 0)
 		return ret;
-	ceph_decode_timespec(&con->last_keepalive_ack, &ceph_ts);
+	ceph_decode_timespec64(&con->last_keepalive_ack, &ceph_ts);
 	prepare_read_tag(con);
 	return 1;
 }
@@ -3223,12 +3223,12 @@ bool ceph_con_keepalive_expired(struct ceph_connection *con,
 {
 	if (interval > 0 &&
 	    (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2)) {
-		struct timespec now;
-		struct timespec ts;
-		ktime_get_real_ts(&now);
-		jiffies_to_timespec(interval, &ts);
-		ts = timespec_add(con->last_keepalive_ack, ts);
-		return timespec_compare(&now, &ts) >= 0;
+		struct timespec64 now;
+		struct timespec64 ts;
+		ktime_get_real_ts64(&now);
+		jiffies_to_timespec64(interval, &ts);
+		ts = timespec64_add(con->last_keepalive_ack, ts);
+		return timespec64_compare(&now, &ts) >= 0;
 	}
 	return false;
 }
-- 
2.9.0

^ permalink raw reply related

* [PATCH net] sctp: fix erroneous inc of snmp SctpFragUsrMsgs
From: Marcelo Ricardo Leitner @ 2018-06-20 15:47 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich

Currently it is incrementing SctpFragUsrMsgs when the user message size
is of the exactly same size as the maximum fragment size, which is wrong.

The fix is to increment it only when user message is bigger than the
maximum fragment size.

Fixes: bfd2e4b8734d ("sctp: refactor sctp_datamsg_from_user")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/chunk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 79daa98208c391c780440144d69bc7be875c3476..bfb9f812e2ef9fa605b08dc1f534781573c3abf8 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -237,7 +237,9 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
 	/* Account for a different sized first fragment */
 	if (msg_len >= first_len) {
 		msg->can_delay = 0;
-		SCTP_INC_STATS(sock_net(asoc->base.sk), SCTP_MIB_FRAGUSRMSGS);
+		if (msg_len > first_len)
+			SCTP_INC_STATS(sock_net(asoc->base.sk),
+				       SCTP_MIB_FRAGUSRMSGS);
 	} else {
 		/* Which may be the only one... */
 		first_len = msg_len;
-- 
2.14.4

^ permalink raw reply related

* Re: [Y2038] [PATCH] ceph: use ktime_get_real_seconds()
From: Arnd Bergmann @ 2018-06-20 15:47 UTC (permalink / raw)
  To: Allen Pais
  Cc: Networking, y2038 Mailman List, ceph-devel, David Miller,
	Linux Kernel Mailing List
In-Reply-To: <1529488801-22093-1-git-send-email-allen.lkml@gmail.com>

On Wed, Jun 20, 2018 at 12:00 PM, Allen Pais <allen.lkml@gmail.com> wrote:
> Use ktime_get_real_seconds() as get_seconds() is deprecated.
>
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>

I have done a similar patch and will post it soon along with the rest of the
ceph y2038 series. Please have a look at "ceph: use timespec64 in for
keepalive" and comment if you see something that I missed.

       Arnd

^ permalink raw reply

* Re: Route fallback issue
From: Grant Taylor @ 2018-06-20 15:38 UTC (permalink / raw)
  To: netdev; +Cc: lartc
In-Reply-To: <e19de6ae-536c-7dc5-22e7-62ee7d077c67@spamtrap.tnetconsulting.net>

[-- Attachment #1: Type: text/plain, Size: 1193 bytes --]

On 06/20/2018 09:18 AM, Grant Taylor wrote:
> Where can I find more information on ignore_routes_with_linkdown?  I 
> don't see it listed in $Kernel/Documentation/networking/ip-sysctl.txt. 
> (I do see fib_multipath_use_neigh documented there in.)

I'm specifically interested in if ignore_routes_with_linkdown and / or 
fib_multipath_use_neigh will cause Linux to fall back to an alternate 
(higher metric) route if the link is still up but the neighbor is not 
accessible across it.

           +-------+
           | Linux |
           +---+---+
               |
+-----+   +---+----+   +-----+
| R 1 +---+ Switch +---+ R 2 |
+-----+   +--------+   +-----+

A typical scenario is where Linux is connected to a DSL or Cable modem 
where the physical link stays up even if the neighbor R {1,2} goes 
offline.  It's not possible to rely on the local link (MII) status to 
determine that a neighbor is not reachable.  I.e. R 1 going away like below.

           +-------+
           | Linux |
           +---+---+
               |
+-----+   +---+----+   +-----+
| R 1 X   X Switch +---+ R 2 |
+-----+   +--------+   +-----+



-- 
Grant. . . .
unix || die


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3982 bytes --]

^ permalink raw reply

* Re: Incomplete fix for be9c798 / 7b2ee50 hv_netvsc: common detach logic?
From: Thomas Walker @ 2018-06-20 15:26 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Greg KH, devel, Stephen Hemminger, netdev
In-Reply-To: <20180619161440.2bc69506@xeon-e3>

On Tue, Jun 19, 2018 at 04:14:40PM -0700, Stephen Hemminger wrote:
> On Wed, 20 Jun 2018 08:01:50 +0900
> Greg KH <greg@kroah.com> wrote:
> 
> > On Tue, Jun 19, 2018 at 03:19:41PM -0400, Thomas Walker wrote:
> > > Upon updating some internal kernels from 4.14.18 to 4.14.49, our hyper-v guests failed to bring up network interfaces on boot, logging "A link change request failed with some changes committed already. Interface eth0 may have been left with an inconsistent configuration, please check."  Running 'ifconfig eth0 up' appears to fix the problem temporarily so I went about bisecting which landed on:
> > > 
> > > commit be9c798d0d13ae609a91177323ac816545c39d28
> > > Author: Stephen Hemminger <stephen@networkplumber.org>
> > > Date:   Mon May 14 15:32:18 2018 -0700
> > > 
> > >     hv_netvsc: common detach logic
> > >     
> > >     [ Commit 7b2ee50c0cd513a176a26a71f2989facdd75bfea upstream. ]
> > >     
> > >     Make common function for detaching internals of device
> > >     during changes to MTU and RSS. Make sure no more packets
> > >     are transmitted and all packets have been received before
> > >     doing device teardown.
> > > 
> > >     Change the wait logic to be common and use usleep_range().
> > > 
> > >     Changes transmit enabling logic so that transmit queues are disabled
> > >     during the period when lower device is being changed. And enabled
> > >     only after sub channels are setup. This avoids issue where it could
> > >     be that a packet was being sent while subchannel was not initialized.
> > > 
> > >     Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
> > >     Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > 
> > > 
> > > The removal of which does indeed fix the problem.  That led me to:
> > > 
> > > commit 52acf73b6e9a6962045feb2ba5a8921da2201915
> > > Author: Dexuan Cui <decui@microsoft.com>
> > > Date:   Wed Jun 6 21:32:51 2018 +0000
> > > 
> > >     hv_netvsc: Fix a network regression after ifdown/ifup
> > >     
> > >     Recently people reported the NIC stops working after
> > >     "ifdown eth0; ifup eth0". It turns out in this case the TX queues are not
> > >     enabled, after the refactoring of the common detach logic: when the NIC
> > >     has sub-channels, usually we enable all the TX queues after all
> > >     sub-channels are set up: see rndis_set_subchannel() ->
> > >     netif_device_attach(), but in the case of "ifdown eth0; ifup eth0" where
> > >     the number of channels doesn't change, we also must make sure the TX queues
> > >     are enabled. The patch fixes the regression.
> > >     
> > >     Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
> > >     Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > >     Cc: Stephen Hemminger <sthemmin@microsoft.com>
> > >     Cc: K. Y. Srinivasan <kys@microsoft.com>
> > >     Cc: Haiyang Zhang <haiyangz@microsoft.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > 
> > > Which sounded very promising, but does not seem to fully fix the issue.  Doing some more digging, I was able to determine that the message coincides with 'ip link set dev eth0 mtu 1300 up' very shortly (>~1 second) after the hv_netvsc driver loads.  If I delay the mtu change until well after the driver loads, everything works fine.  If I unload hv_netvsc and then reload it and apply the mtu change immediately, the failure re-occurs.  So something is racy here, and the above doesn't entirely address it.
> > > 
> > > I'm happy to test out any suggested patches and/or do additional debugging if anyone has any suggestions.
> > > 
> > > (oh, and I did also try 4.18-rc1 and the problem still persists)  
> > 
> > Always cc: the authors of the patch you are having problems with, along
> > with the mailing list for the networking subsystem, so that the right
> > people know to look at this.
> 
> How are you changing the MTU? and starting the device.
> When MTU changes the device has to reconnect with the host. This takes a small amount of time
> and no changes to device state are possible then.
> 
> If MTU change happens and at the same time some other script tries to bring up the connection,
> then that script will get an error.

Simply setting 'mtu 1300' in /etc/network/interfaces (Debian userland) which results in ifup executing:

ip addr add 192.168.1.100/255.255.255.0 broadcast 192.168.1.255 dev eth0 label eth0
ip link set dev eth0 mtu 1300 up                                               
ip route add default via 192.168.1.1 dev eth0                               

Running those commands standalone (outside of ifup) reproduces the problem as well.  Trying lots of different permutations of loading the driver and different commands with different delays, my initial suspicion that this was somehow related to proximity to the driver load appears incorrect.  The only thing that seems to matter is setting the mtu and link state in one go.  If I separate the 'ip link' command into two, 'ip link set dev eth0 up' and 'ip link set dev eth0 mtu 1300', in either order (or if either link is already up or mtu is already 1300), then there is no error.  

(I should note that the above testing is with 4.14.49 with 52acf73)

^ permalink raw reply

* Re: Route fallback issue
From: Grant Taylor @ 2018-06-20 15:18 UTC (permalink / raw)
  To: netdev; +Cc: lartc
In-Reply-To: <7e001d45-f692-7d93-0390-f0e6077bf724@cumulusnetworks.com>

[-- Attachment #1: Type: text/plain, Size: 441 bytes --]

On 06/20/2018 07:48 AM, David Ahern wrote:
> See the ignore_routes_with_linkdown and fib_multipath_use_neigh sysctl 
> settings.

Where can I find more information on ignore_routes_with_linkdown?  I 
don't see it listed in $Kernel/Documentation/networking/ip-sysctl.txt. 
(I do see fib_multipath_use_neigh documented there in.)

> A feature is in the works to have fallback nexthops.

O.o?



-- 
Grant. . . .
unix || die


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3982 bytes --]

^ permalink raw reply

* KASAN: use-after-free Read in __dev_map_entry_free
From: syzbot @ 2018-06-20 15:19 UTC (permalink / raw)
  To: ast, daniel, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    f0dc7f9c6dd9 Merge git://git.kernel.org/pub/scm/linux/kern..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=15ad7d10400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=fa9c20c48788d1c1
dashboard link: https://syzkaller.appspot.com/bug?extid=457d3e2ffbcf31aee5c0
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1195225f800000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=171a7ce4400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365  
[inline]
BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300  
kernel/bpf/devmap.c:379
Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18

CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
  dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
  __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
  __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
  rcu_do_batch kernel/rcu/tree.c:2558 [inline]
  invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
  __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
  rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
  run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
  smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Allocated by task 6675:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
  kmalloc include/linux/slab.h:513 [inline]
  kzalloc include/linux/slab.h:706 [inline]
  dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
  find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
  map_create+0x393/0x1010 kernel/bpf/syscall.c:453
  __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
  __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 26:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kfree+0xd9/0x260 mm/slab.c:3813
  dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
  bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
  process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
  worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

The buggy address belongs to the object at ffff8801b8da37c0
  which belongs to the cache kmalloc-512 of size 512
The buggy address is located 264 bytes inside of
  512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
The buggy address belongs to the page:
page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940  
index:0xffff8801b8da3540
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
  ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                               ^
  ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH v4 net-next] net:sched: add action inheritdsfield to skbedit
From: Marcelo Ricardo Leitner @ 2018-06-20 14:45 UTC (permalink / raw)
  To: Fu, Qiaobin
  Cc: davem@davemloft.net, netdev@vger.kernel.org, jhs@mojatatu.com,
	Michel Machado, xiyou.wangcong@gmail.com
In-Reply-To: <E5616B71-959E-43FE-9096-BF523948ABB2@bu.edu>

On Tue, Jun 12, 2018 at 03:42:55PM +0000, Fu, Qiaobin wrote:
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
> 
> v4:
> *Not allow setting flags other than the expected ones.
> 
> *Allow dumping the pure flags.
> 
> Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
> 
> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
> Reviewed-by: Michel Machado <michel@digirati.com.br>

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
> 
> Note that the motivation for this patch is found in the following discussion:
> https://www.spinics.net/lists/netdev/msg501061.html
> ---
> diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h
> index fbcfe27a4e6c..6de6071ebed6 100644
> --- a/include/uapi/linux/tc_act/tc_skbedit.h
> +++ b/include/uapi/linux/tc_act/tc_skbedit.h
> @@ -30,6 +30,7 @@
>  #define SKBEDIT_F_MARK			0x4
>  #define SKBEDIT_F_PTYPE			0x8
>  #define SKBEDIT_F_MASK			0x10
> +#define SKBEDIT_F_INHERITDSFIELD	0x20
>  
>  struct tc_skbedit {
>  	tc_gen;
> @@ -45,6 +46,7 @@ enum {
>  	TCA_SKBEDIT_PAD,
>  	TCA_SKBEDIT_PTYPE,
>  	TCA_SKBEDIT_MASK,
> +	TCA_SKBEDIT_FLAGS,
>  	__TCA_SKBEDIT_MAX
>  };
>  #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1)
> diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
> index 6138d1d71900..9adbcfa3f5fe 100644
> --- a/net/sched/act_skbedit.c
> +++ b/net/sched/act_skbedit.c
> @@ -23,6 +23,9 @@
>  #include <linux/rtnetlink.h>
>  #include <net/netlink.h>
>  #include <net/pkt_sched.h>
> +#include <net/ip.h>
> +#include <net/ipv6.h>
> +#include <net/dsfield.h>
>  
>  #include <linux/tc_act/tc_skbedit.h>
>  #include <net/tc_act/tc_skbedit.h>
> @@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a,
>  
>  	if (d->flags & SKBEDIT_F_PRIORITY)
>  		skb->priority = d->priority;
> +	if (d->flags & SKBEDIT_F_INHERITDSFIELD) {
> +		int wlen = skb_network_offset(skb);
> +
> +		switch (tc_skb_protocol(skb)) {
> +		case htons(ETH_P_IP):
> +			wlen += sizeof(struct iphdr);
> +			if (!pskb_may_pull(skb, wlen))
> +				goto err;
> +			skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
> +			break;
> +
> +		case htons(ETH_P_IPV6):
> +			wlen += sizeof(struct ipv6hdr);
> +			if (!pskb_may_pull(skb, wlen))
> +				goto err;
> +			skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
> +			break;
> +		}
> +	}
>  	if (d->flags & SKBEDIT_F_QUEUE_MAPPING &&
>  	    skb->dev->real_num_tx_queues > d->queue_mapping)
>  		skb_set_queue_mapping(skb, d->queue_mapping);
> @@ -53,6 +75,10 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a,
>  
>  	spin_unlock(&d->tcf_lock);
>  	return d->tcf_action;
> +
> +err:
> +	spin_unlock(&d->tcf_lock);
> +	return TC_ACT_SHOT;
>  }
>  
>  static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
> @@ -62,6 +88,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
>  	[TCA_SKBEDIT_MARK]		= { .len = sizeof(u32) },
>  	[TCA_SKBEDIT_PTYPE]		= { .len = sizeof(u16) },
>  	[TCA_SKBEDIT_MASK]		= { .len = sizeof(u32) },
> +	[TCA_SKBEDIT_FLAGS]		= { .len = sizeof(u64) },
>  };
>  
>  static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
> @@ -73,6 +100,7 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
>  	struct tc_skbedit *parm;
>  	struct tcf_skbedit *d;
>  	u32 flags = 0, *priority = NULL, *mark = NULL, *mask = NULL;
> +	u64 *pure_flags = NULL;
>  	u16 *queue_mapping = NULL, *ptype = NULL;
>  	bool exists = false;
>  	int ret = 0, err;
> @@ -114,6 +142,12 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,
>  		mask = nla_data(tb[TCA_SKBEDIT_MASK]);
>  	}
>  
> +	if (tb[TCA_SKBEDIT_FLAGS] != NULL) {
> +		pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]);
> +		if (*pure_flags & SKBEDIT_F_INHERITDSFIELD)
> +			flags |= SKBEDIT_F_INHERITDSFIELD;
> +	}
> +
>  	parm = nla_data(tb[TCA_SKBEDIT_PARMS]);
>  
>  	exists = tcf_idr_check(tn, parm->index, a, bind);
> @@ -178,6 +212,7 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a,
>  		.action  = d->tcf_action,
>  	};
>  	struct tcf_t t;
> +	u64 pure_flags = 0;
>  
>  	if (nla_put(skb, TCA_SKBEDIT_PARMS, sizeof(opt), &opt))
>  		goto nla_put_failure;
> @@ -196,6 +231,11 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a,
>  	if ((d->flags & SKBEDIT_F_MASK) &&
>  	    nla_put_u32(skb, TCA_SKBEDIT_MASK, d->mask))
>  		goto nla_put_failure;
> +	if (d->flags & SKBEDIT_F_INHERITDSFIELD)
> +		pure_flags |= SKBEDIT_F_INHERITDSFIELD;
> +	if (pure_flags != 0 &&
> +	    nla_put(skb, TCA_SKBEDIT_FLAGS, sizeof(pure_flags), &pure_flags))
> +		goto nla_put_failure;
>  
>  	tcf_tm_dump(&t, &d->tcf_tm);
>  	if (nla_put_64bit(skb, TCA_SKBEDIT_TM, sizeof(t), &t, TCA_SKBEDIT_PAD))

^ permalink raw reply

* Re: [PATCH] ceph: use ktime_get_real_seconds()
From: Allen @ 2018-06-20 14:42 UTC (permalink / raw)
  To: idryomov; +Cc: y2038, netdev, ceph-devel, David Miller, linux-kernel
In-Reply-To: <CAOi1vP9Aa9xb2RD7ahE=jOj8nvOX936oVyp6rkKHV8tknF_1Kg@mail.gmail.com>

> >
> > Use ktime_get_real_seconds() as get_seconds() is deprecated.
> >
> > Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> > ---
> >  net/ceph/auth_x.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
> > index 2f4a1ba..99561c2 100644
> > --- a/net/ceph/auth_x.c
> > +++ b/net/ceph/auth_x.c
> > @@ -154,7 +154,7 @@ static int process_one_ticket(struct ceph_auth_client *ac,
> >         void **ptp;
> >         struct ceph_crypto_key new_session_key = { 0 };
> >         struct ceph_buffer *new_ticket_blob;
> > -       unsigned long new_expires, new_renew_after;
> > +       u32 new_expires, new_renew_after;
> >         u64 new_secret_id;
> >         int ret;
> >
> > @@ -191,9 +191,9 @@ static int process_one_ticket(struct ceph_auth_client *ac,
> >
> >         ceph_decode_timespec(&validity, dp);
> >         dp += sizeof(struct ceph_timespec);
> > -       new_expires = get_seconds() + validity.tv_sec;
> > +       new_expires = (u32)ktime_get_real_seconds() + validity.tv_sec;
>
> Why the change to u32 and this cast?  If the type has to change,
> wouldn't time64_t make more sense?  ktime_get_real_seconds() returns
> time64_t, after all.
>

 Right. I'll drop the cast and send out V2.

- Allen
_______________________________________________
Y2038 mailing list
Y2038@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/y2038

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-20 14:34 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Samudrala, Sridhar, Alexander Duyck, virtio-dev, aaron.f.brown,
	Jiri Pirko, Michael S. Tsirkin, Jakub Kicinski, Netdev,
	qemu-devel, virtualization, konrad.wilk, boris.ostrovsky,
	Joao Martins, Venu Busireddy, vijay.balakrishna
In-Reply-To: <CADGSJ20wrF418-WVev6OjaeShZWE488uxRwgaZmsjOHZr8jyOw@mail.gmail.com>

On Tue, 19 Jun 2018 13:09:14 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Jun 19, 2018 at 3:54 AM, Cornelia Huck <cohuck@redhat.com> wrote:
> > On Fri, 15 Jun 2018 10:06:07 -0700
> > Siwei Liu <loseweigh@gmail.com> wrote:
> >  
> >> On Fri, Jun 15, 2018 at 4:48 AM, Cornelia Huck <cohuck@redhat.com> wrote:  
> >> > On Thu, 14 Jun 2018 18:57:11 -0700
> >> > Siwei Liu <loseweigh@gmail.com> wrote:

> >> > I'm a bit confused here. What, exactly, ties the two devices together?  
> >>
> >> The group UUID. Since QEMU VFIO dvice does not have insight of MAC
> >> address (which it doesn't have to), the association between VFIO
> >> passthrough and standby must be specificed for QEMU to understand the
> >> relationship with this model. Note, standby feature is no longer
> >> required to be exposed under this model.  
> >
> > Isn't that a bit limiting, though?
> >
> > With this model, you can probably tie a vfio-pci device and a
> > virtio-net-pci device together. But this will fail if you have
> > different transports: Consider tying together a vfio-pci device and a
> > virtio-net-ccw device on s390, for example. The standby feature bit is
> > on the virtio-net level and should not have any dependency on the
> > transport used.  
> 
> Probably we'd limit the support for grouping to virtio-net-pci device
> and vfio-pci device only. For virtio-net-pci, as you might see with
> Venu's patch, we store the group UUID on the config space of
> virtio-pci, which is only applicable to PCI transport.
> 
> If virtio-net-ccw needs to support the same, I think similar grouping
> interface should be defined on the VirtIO CCW transport. I think the
> current implementation of the Linux failover driver assumes that it's
> SR-IOV VF with same MAC address which the virtio-net-pci needs to pair
> with, and that the PV path is on same PF without needing to update
> network of the port-MAC association change. If we need to extend the
> grouping mechanism to virtio-net-ccw, it has to pass such failover
> mode to virtio driver specifically through some other option I guess.

Hm, I've just spent some time reading the Linux failover code and I did
not really find much pci-related magic in there (other than checking
for a pci device in net_failover_slave_pre_register). We also seem to
look for a matching device by MAC only. What magic am I missing?

Is the look-for-uuid handling supposed to happen in the host only?

> >> > If libvirt already has the knowledge that it should manage the two as a
> >> > couple, why do we need the group id (or something else for other
> >> > architectures)? (Maybe I'm simply missing something because I'm not
> >> > that familiar with pci.)  
> >>
> >> The idea is to have QEMU control the visibility and enumeration order
> >> of the passthrough VFIO for the failover scenario. Hotplug can be one
> >> way to achieve it, and perhaps there's other way around also. The
> >> group ID is not just for QEMU to couple devices, it's also helpful to
> >> guest too as grouping using MAC address is just not safe.  
> >
> > Sorry about dragging mainframes into this, but this will only work for
> > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > + virtio-net-ccw example again: The guest cannot find out that the two
> > belong together by checking some group ID, it has to either use the MAC
> > or some needs-to-be-architectured property.
> >
> > Alternatively, we could propose that mechanism as pci-only, which means
> > we can rely on mechanisms that won't necessarily work on non-pci
> > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > through a network card anytime in the near future, due to the nature of
> > network cards currently in use on s390.)  
> 
> Yes, let's do this just for PCI transport (homogenous) for now.

But why? Using pci for passthrough to make things easier (and because
there's not really a use case), sure. But I really don't want to
restrict this to virtio-pci only.

> >> In the model of (b), I think it essentially turns hotplug to one of
> >> mechanisms for QEMU to control the visibility. The libvirt can still
> >> manage the hotplug of individual devices during live migration or in
> >> normal situation to hot add/remove devices. Though the visibility of
> >> the VFIO is under the controll of QEMU, and it's possible that the hot
> >> add/remove request does not involve actual hot plug activity in guest
> >> at all.  
> >
> > That depends on how you model visibility, I guess. You'll probably want
> > to stop traffic flowing through one or the other of the cards; would
> > link down or similar be enough for the virtio device?  
> 
> I'm not sure if it is a good idea. The guest user will see two devices
> with same MAC but one of them is down. Do you expect user to use it or
> not? And since the guest is going to be migrated, we need to unplug a
> broken VF from guest before migrating, why do we bother plugging in
> this useless VF at the first place?

I was thinking about using hotunplugging only over migration and doing
the link up only after feature negotiation has finished, but that is
probably too complicated. Let's stick to hotplug for simplicity's sake.

^ permalink raw reply

* Re: [PATCH] bpfilter: fix user mode helper cross compilation
From: Stefano Brivio @ 2018-06-20 14:19 UTC (permalink / raw)
  To: Matteo Croce; +Cc: netdev
In-Reply-To: <20180620140434.18139-1-mcroce@redhat.com>

On Wed, 20 Jun 2018 16:04:34 +0200
Matteo Croce <mcroce@redhat.com> wrote:

> Use $(OBJDUMP) instead of literal 'objdump' to avoid
> using host toolchain when cross compiling.
> 
> Fixes: 421780fd4983 ("bpfilter: fix build error")
> Signed-off-by: Matteo Croce <mcroce@redhat.com>

Reported-by: Stefano Brivio <sbrivio@redhat.com>

-- 
Stefano

^ permalink raw reply

* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Michael S. Tsirkin @ 2018-06-20 14:11 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, konrad.wilk,
	Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
	Siwei Liu, Venu Busireddy, Netdev, boris.ostrovsky, aaron.f.brown,
	Joao Martins
In-Reply-To: <20180620115359.1a3bf6fb.cohuck@redhat.com>

On Wed, Jun 20, 2018 at 11:53:59AM +0200, Cornelia Huck wrote:
> On Tue, 19 Jun 2018 23:32:06 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Jun 19, 2018 at 12:54:53PM +0200, Cornelia Huck wrote:
> > > Sorry about dragging mainframes into this, but this will only work for
> > > homogenous device coupling, not for heterogenous. Consider my vfio-pci
> > > + virtio-net-ccw example again: The guest cannot find out that the two
> > > belong together by checking some group ID, it has to either use the MAC
> > > or some needs-to-be-architectured property.
> > > 
> > > Alternatively, we could propose that mechanism as pci-only, which means
> > > we can rely on mechanisms that won't necessarily work on non-pci
> > > transports. (FWIW, I don't see a use case for using vfio-ccw to pass
> > > through a network card anytime in the near future, due to the nature of
> > > network cards currently in use on s390.)  
> > 
> > That's what it boils down to, yes.  If there's need to have this for
> > non-pci devices, then we should put it in config space.
> > Cornelia, what do you think?
> > 
> 
> I think the only really useful config on s390 is the vfio-pci network
> card coupled with a virtio-net-ccw device: Using an s390 network card
> via vfio-ccw is out due to the nature of the s390 network cards, and
> virtio-ccw is the default transport (virtio-pci is not supported on any
> enterprise distro AFAIK).
> 
> For this, having a uuid in the config space could work (vfio-pci
> devices have a config space by virtue of being pci devices, and
> virtio-net-ccw devices have a config space by virtue of being virtio
> devices -- ccw devices usually don't have that concept).

OK so this calls for adding such a field generally (it's
device agnostic right now).

How would you suggest doing that?


-- 
MST

^ permalink raw reply

* Re: [PATCH] bpfilter: fix build error
From: Matteo Croce @ 2018-06-20 14:07 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: netdev, Alexei Starovoitov
In-Reply-To: <20180620143926.50720aec@elisabeth>

On Wed, Jun 20, 2018 at 12:39 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 19 Jun 2018 17:16:20 +0200
> Matteo Croce <mcroce@redhat.com> wrote:
>
> > bpfilter Makefile assumes that the system locale is en_US, and the
> > parsing of objdump output fails.
> > Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
> > only 2 processes instead of 7.
> >
> > Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
> > Signed-off-by: Matteo Croce <mcroce@redhat.com>
> > ---
> >  net/bpfilter/Makefile | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> > index e0bbe7583e58..dd86b022eff0 100644
> > --- a/net/bpfilter/Makefile
> > +++ b/net/bpfilter/Makefile
> > @@ -21,8 +21,10 @@ endif
> >  # which bpfilter_kern.c passes further into umh blob loader at run-time
> >  quiet_cmd_copy_umh = GEN $@
> >        cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> > -      $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
> > -      -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> > +      $(OBJCOPY) -I binary \
> > +          `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \
>
> Why do you use objdump instead of $(OBJDUMP) now? I guess this might
> cause issues if you're cross-compiling.
>
> --
> Stefano

Right, I've sent a proper fix.

Thanks,
-- 
Matteo Croce
per aspera ad upstream

^ permalink raw reply

* [PATCH] bpfilter: fix user mode helper cross compilation
From: Matteo Croce @ 2018-06-20 14:04 UTC (permalink / raw)
  To: netdev

Use $(OBJDUMP) instead of literal 'objdump' to avoid
using host toolchain when cross compiling.

Fixes: 421780fd4983 ("bpfilter: fix build error")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 net/bpfilter/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index dd86b022eff0..051dc18b8ccb 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -22,7 +22,7 @@ endif
 quiet_cmd_copy_umh = GEN $@
       cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
       $(OBJCOPY) -I binary \
-          `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \
+          `LC_ALL=C $(OBJDUMP) -f net/bpfilter/bpfilter_umh \
           |awk -F' |,' '/file format/{print "-O",$$NF} \
           /^architecture:/{print "-B",$$2}'` \
       --rename-section .data=.init.rodata $< $@
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH] net: macb: Fix ptp time adjustment for large negative delta
From: Nicolas Ferre @ 2018-06-20 14:04 UTC (permalink / raw)
  To: Harini Katakam, davem
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	Claudiu Beznea - M18063
In-Reply-To: <1529494460-4689-1-git-send-email-harini.katakam@xilinx.com>

On 20/06/2018 at 13:34, Harini Katakam wrote:
> When delta passed to gem_ptp_adjtime is negative, the sign is
> maintained in the ns_to_timespec64 conversion. Hence timespec_add
> should be used directly. timespec_sub will just subtract the negative
> value thus increasing the time difference.
> 
> Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>

Makes sense:
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>

> ---
>   drivers/net/ethernet/cadence/macb_ptp.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_ptp.c b/drivers/net/ethernet/cadence/macb_ptp.c
> index 2220c77..6788351 100644
> --- a/drivers/net/ethernet/cadence/macb_ptp.c
> +++ b/drivers/net/ethernet/cadence/macb_ptp.c
> @@ -170,10 +170,7 @@ static int gem_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
>   
>   	if (delta > TSU_NSEC_MAX_VAL) {
>   		gem_tsu_get_time(&bp->ptp_clock_info, &now);
> -		if (sign)
> -			now = timespec64_sub(now, then);
> -		else
> -			now = timespec64_add(now, then);
> +		now = timespec64_add(now, then);
>   
>   		gem_tsu_set_time(&bp->ptp_clock_info,
>   				 (const struct timespec64 *)&now);
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver
From: Ivan Vecera @ 2018-06-20 13:54 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: Jiri Pirko, Grygorii Strashko, netdev, ivan.khoronzhuk, nsekhar,
	andrew, f.fainelli, francois.ozog, yogeshs, spatton, Jose.Abreu
In-Reply-To: <20180620125929.GA8582@apalos>

On 20.6.2018 14:59, Ilias Apalodimas wrote:
> On Wed, Jun 20, 2018 at 02:53:47PM +0200, Ivan Vecera wrote:
>> On 20.6.2018 09:08, Jiri Pirko wrote:
>>> Tue, Jun 19, 2018 at 01:19:00AM CEST, grygorii.strashko@ti.com wrote:
>>>>
>>>>
>>>> On 06/14/2018 06:43 AM, Ilias Apalodimas wrote:
>>>>> On Thu, Jun 14, 2018 at 01:39:58PM +0200, Jiri Pirko wrote:
>>>>>> Thu, Jun 14, 2018 at 01:34:04PM CEST, ilias.apalodimas@linaro.org wrote:
>>>>>>> On Thu, Jun 14, 2018 at 01:30:28PM +0200, Jiri Pirko wrote:
>>>>>>>> Thu, Jun 14, 2018 at 01:11:30PM CEST, ilias.apalodimas@linaro.org wrote:
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
>>>>>>>>> 	if (of_property_read_bool(node, "dual_emac"))
>>>>>>>>> 		data->switch_mode = CPSW_DUAL_EMAC;
>>>>>>>>>
>>>>>>>>> +	/* switchdev overrides DTS */
>>>>>>>>> +	if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
>>>>>>>>> +		data->switch_mode = CPSW_SWITCHDEV;
>>>>>>>>
>>>>>>>> So you force CPSW_SWITCHDEV mode if the CONFIG_TI_CPSW_SWITCHDEV is
>>>>>>>> enabled. That does not sound right. I think that user should tell what
>>>>>>>> mode does he want regardless what the kernel config is.
>>>>>>> We discussed this during the V1 of the RFC. Yes it doesn't seem good, but the
>>>>>>> device currently configures the modes using DTS (which is not correct). I choose
>>>>>>> the .config due to that. I can't think of anything better, but i am open to
>>>>>>> suggestions
>>>>>>
>>>>>> Agreed that DTS does fit as well. I think that this might be a job for
>>>>>> devlink parameters (patchset is going to be sent upstream next week).
>>>>>> You do have 1 bus address for the whole device (both ports), right?
>>>>>>
>>>>> Yes devlink sounds reasonable. I thyink there's only one bus for it, but then
>>>>> again i am far from an expert on the hardware interrnals. Grygorii can correct
>>>>> me if i am wrong.
>>>>
>>>> Devlink and NFS boot are not compatible as per my understanding, so .. 
>>>
>>> ? Why do you say so?
>>
>> Why aren't they compatible?? You can have devlink binary in initramfs and
>> configure the ASIC and ports via devlink batch file - prior mounting NFS root.
> This is doable but, are we taking into consideration device reconfiguration 
> (which was the reason for creating the minimal filesystem) or are we just 
> talking about device booting/initial config?
Devlink calls should be placed in setup.sh

I.

^ permalink raw reply

* [PATCH v1 1/1] VSOCK: fix loopback on big-endian systems
From: Claudio Imbrenda @ 2018-06-20 13:51 UTC (permalink / raw)
  To: stefanha
  Cc: davem, jhansen, cavery, borntraeger, fiuczy, linux-kernel, netdev

The dst_cid and src_cid are 64 bits, therefore 64 bit accessors should be
used, and in fact in virtio_transport_common.c only 64 bit accessors are
used. Using 32 bit accessors for 64 bit values breaks big endian systems.

This patch fixes a wrong use of le32_to_cpu in virtio_transport_send_pkt.

Fixes: b9116823189e85ccf384 ("VSOCK: add loopback to virtio_transport")

Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
---
 net/vmw_vsock/virtio_transport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 8e03bd3..5d3cce9 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -201,7 +201,7 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
 		return -ENODEV;
 	}
 
-	if (le32_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid)
+	if (le64_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid)
 		return virtio_transport_send_pkt_loopback(vsock, pkt);
 
 	if (pkt->reply)
-- 
2.7.4

^ permalink raw reply related

* Re: Route fallback issue
From: David Ahern @ 2018-06-20 13:48 UTC (permalink / raw)
  To: Akshat Kakkar, netdev; +Cc: cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <CAA5aLPiPBTEp4eMs9T6xA3dkOPry3H=oD2svo3Ybon3dEgh3gA@mail.gmail.com>

On 6/20/18 2:26 AM, Akshat Kakkar wrote:
> Hi netdev community,
> 
> I have 2 interfaces
> eno1 : 192.168.1.10/24
> eno2 : 192.168.2.10/24
> 
> I added routes as
> 172.16.0.0/12 via 192.168.1.254 metric 1
> 172.16.0.0/12 via 192.168.2.254 metric 2
> 
> My intention : All traffic to 172.16.0.0/12 should go thru eno1. If
> 192.168.1.254 is not reachable (no arp entry or link down), then it
> should fall back to eno2.

See the ignore_routes_with_linkdown and fib_multipath_use_neigh sysctl
settings.


> On Wed, Jun 20, 2018 at 1:49 PM, Erik Auerswald
> <auerswal@unix-ag.uni-kl.de> wrote:
>> Hi,
>>
>> I have usually used the "replace" keyword of iproute2 for similar
>> purposes. I would suggest a script as well, run via cron unless 1 minute
>> failover times are not acceptable. The logic could be as follows:
>>
>> if ping -c1 $PRIMARY_NH >/dev/null 2>&1; then
>>   ip route replace $PREFIX via $PRIMARY_NH
>> elif ping -c1 $SECONDARY_NH >/dev/null 2>&1; then
>>   ip route replace $PREFIX via $SECONDARY_NH
>> else
>>   ip route del $PREFIX
>> fi
>>
>> Alternatively, one could look into a routing daemon that supports static
>> routing (Zebra/Quagga/FRRouting, BIRD, ...) and check if that supports
>> some form of next-hop tracking or at least removes static routes with
>> unreachable next-hops as one would expect from experience with dedicated
>> networking devices.

A feature is in the works to have fallback nexthops.


>>
>> IMHO static route handling as done by the Linux kernel does not seem
>> useful for networking devices. I have even had bad experiences with
>> Arista switches and static routing because they relied too much on the
>> Linux kernel (probably still do).

Useful how? what did not work as expected?

Do not confuse Arista's NOS with Linux's capabilities or any NOS truly
based on Linux and using a modern kernel. A lot of work has been put
into bringing Linux up to par with NOS features. If something is not
working, demonstrate the problem on the latest kernel and inquire if
someone is working on it.

^ permalink raw reply

* [PATCH] net: vhost: improve performance when enable busyloop
From: Tonghao Zhang @ 2018-06-20 13:28 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, Tonghao Zhang, virtualization

This patch improves the guest receive performance from
host. On the handle_tx side, we poll the sock receive
queue at the same time. handle_rx do that in the same way.

we set the poll-us=100 us and use the iperf3 to test
its throughput. The iperf3 command is shown as below.

iperf3 -s -D
iperf3 -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400 --bandwidth 100000M

* With the patch:    21.1 Gbits/sec
* Without the patch: 12.7 Gbits/sec

Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
---
 drivers/vhost/net.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e7cf7d2..9364ede 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -429,22 +429,43 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 	return vhost_poll_start(poll, sock->file);
 }
 
+static int sk_has_rx_data(struct sock *sk);
+
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct vhost_virtqueue *vq,
 				    struct iovec iov[], unsigned int iov_size,
 				    unsigned int *out_num, unsigned int *in_num)
 {
 	unsigned long uninitialized_var(endtime);
+	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_virtqueue *rvq = &nvq->vq;
+	struct socket *sock = rvq->private_data;
+
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				  out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
+		mutex_lock_nested(&rvq->mutex, 1);
+
+		vhost_disable_notify(&net->dev, rvq);
+
 		preempt_disable();
 		endtime = busy_clock() + vq->busyloop_timeout;
 		while (vhost_can_busy_poll(vq->dev, endtime) &&
+		       !(sock && sk_has_rx_data(sock->sk)) &&
 		       vhost_vq_avail_empty(vq->dev, vq))
 			cpu_relax();
 		preempt_enable();
+
+		if (sock && sk_has_rx_data(sock->sk))
+			vhost_poll_queue(&rvq->poll);
+		else if (unlikely(vhost_enable_notify(&net->dev, rvq))) {
+			vhost_disable_notify(&net->dev, rvq);
+			vhost_poll_queue(&rvq->poll);
+		}
+
+		mutex_unlock(&rvq->mutex);
+
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net 1/5] net sched actions: fix coding style in pedit action
From: Roman Mashak @ 2018-06-20 13:25 UTC (permalink / raw)
  To: Davide Caratti; +Cc: davem, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <fccb4f8c0e3070c289c4c110786ac7071f4e61cf.camel@redhat.com>

Davide Caratti <dcaratti@redhat.com> writes:

> On Tue, 2018-06-19 at 12:56 -0400, Roman Mashak wrote:
>> Fix coding style issues in tc pedit action detected by the
>> checkpatch script.
>> 
>> Signed-off-by: Roman Mashak <mrv@mojatatu.com>
> ...
>
>> ---
>> @@ -316,16 +318,15 @@ static int tcf_pedit(struct sk_buff *skb, const struct tc_action *a,
>>  						hoffset + tkey->at);
>>  					goto bad;
>>  				}
>> -				d = skb_header_pointer(skb, hoffset + tkey->at, 1,
>> -						       &_d);
>> +				d = skb_header_pointer(skb, hoffset + tkey->at,
>> +						       1, &_d);
>> 				if (!d)
>>  					goto bad;
>>  				offset += (*d & tkey->offmask) >> tkey->shift;
>>  			}
>
> hello Roman,
>
> nit: while we are here, what about changing the declaration of _d and *d
> to u8, so that the bitwise operation is done on unsigned?

Yes makes sense, I will send v2 in net-next once opened. Thanks Davide.

> BTW: the patch (and the series) looks ok, but I guess it will better
> target net-next when the branch reopens

^ permalink raw reply

* Re: [PATCH V2] brcmfmac: stop watchdog before detach and free everything
From: Andy Shevchenko @ 2018-06-20 13:15 UTC (permalink / raw)
  To: Michael Trimarchi
  Cc: Arend van Spriel, Franky Lin, Hante Meuleman, Chi-Hsien Lin,
	Wright Feng, Kalle Valo, David S. Miller, Pieter-Paul Giesberts,
	Ian Molton, open list:TI WILINK WIRELES...,
	open list:BROADCOM BRCM80211 IEEE802.11n WIRELESS DRIVER,
	brcm80211-dev-list, netdev, Linux Kernel Mailing List
In-Reply-To: <20180530090633.GA15390@panicking>

On Wed, May 30, 2018 at 12:06 PM, Michael Trimarchi
<michael@amarulasolutions.com> wrote:
> Using built-in in kernel image without a firmware in filesystem
> or in the kernel image can lead to a kernel NULL pointer deference.
> Watchdog need to be stopped in brcmf_sdio_remove
>
> The system is going down NOW!
> [ 1348.110759] Unable to handle kernel NULL pointer dereference at virtual address 000002f8
> Sent SIGTERM to all processes
> [ 1348.121412] Mem abort info:
> [ 1348.126962]   ESR = 0x96000004
> [ 1348.130023]   Exception class = DABT (current EL), IL = 32 bits
> [ 1348.135948]   SET = 0, FnV = 0
> [ 1348.138997]   EA = 0, S1PTW = 0
> [ 1348.142154] Data abort info:
> [ 1348.145045]   ISV = 0, ISS = 0x00000004
> [ 1348.148884]   CM = 0, WnR = 0
> [ 1348.151861] user pgtable: 4k pages, 48-bit VAs, pgdp = (____ptrval____)
> [ 1348.158475] [00000000000002f8] pgd=0000000000000000
> [ 1348.163364] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [ 1348.168927] Modules linked in: ipv6
> [ 1348.172421] CPU: 3 PID: 1421 Comm: brcmf_wdog/mmc0 Not tainted 4.17.0-rc5-next-20180517 #18
> [ 1348.180757] Hardware name: Amarula A64-Relic (DT)
> [ 1348.185455] pstate: 60000005 (nZCv daif -PAN -UAO)
> [ 1348.190251] pc : brcmf_sdiod_freezer_count+0x0/0x20
> [ 1348.195124] lr : brcmf_sdio_watchdog_thread+0x64/0x290
> [ 1348.200253] sp : ffff00000b85be30
> [ 1348.203561] x29: ffff00000b85be30 x28: 0000000000000000
> [ 1348.208868] x27: ffff00000b6cb918 x26: ffff80003b990638
> [ 1348.214176] x25: ffff0000087b1a20 x24: ffff80003b94f800
> [ 1348.219483] x23: ffff000008e620c8 x22: ffff000008f0b660
> [ 1348.224790] x21: ffff000008c6a858 x20: 00000000fffffe00
> [ 1348.230097] x19: ffff80003b94f800 x18: 0000000000000001
> [ 1348.235404] x17: 0000ffffab2e8a74 x16: ffff0000080d7de8
> [ 1348.240711] x15: 0000000000000000 x14: 0000000000000400
> [ 1348.246018] x13: 0000000000000400 x12: 0000000000000001
> [ 1348.251324] x11: 00000000000002c4 x10: 0000000000000a10
> [ 1348.256631] x9 : ffff00000b85bc40 x8 : ffff80003be11870
> [ 1348.261937] x7 : ffff80003dfc7308 x6 : 000000078ff08b55
> [ 1348.267243] x5 : 00000139e1058400 x4 : 0000000000000000
> [ 1348.272550] x3 : dead000000000100 x2 : 958f2788d6618100
> [ 1348.277856] x1 : 00000000fffffe00 x0 : 0000000000000000
>

It was a 100% (or so) reproducible on Intel Edison with vanilla kernel
and linux-firmware package, while calibration file (*.txt) is not in
distribution.

The system is going down NOW!
Sent SIGTERM to all processes
[  118.695900] PGD 800000003bc3f067 P4D 800000003bc3f067 PUD 3bc5b067 PMD 0
[  118.695935] Oops: 0002 [#1] SMP PTI
[  118.711905] CPU: 1 PID: 1255 Comm: brcmf_wdog/mmc2 Not tainted
4.18.0-rc1-next-20180618+ #59
[  118.720354] Hardware name: Intel Corporation Merrifield/BODEGA BAY,
BIOS 542 2015.01.21:18.19.48
[  118.729181] RIP: 0010:brcmf_sdiod_freezer_count+0x7/0x10 [brcmfmac]

After this patch applied problem was gone

Tested-by: Andy Shevchenko <andy.shevchenko@gmail.com>

P.S. Sorry it took a bit longer.

> Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
> ---
>  drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> index 412a05b..061f69d 100644
> --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> @@ -4294,6 +4294,13 @@ void brcmf_sdio_remove(struct brcmf_sdio *bus)
>         brcmf_dbg(TRACE, "Enter\n");
>
>         if (bus) {
> +               /* Stop watchdog task */
> +               if (bus->watchdog_tsk) {
> +                       send_sig(SIGTERM, bus->watchdog_tsk, 1);
> +                       kthread_stop(bus->watchdog_tsk);
> +                       bus->watchdog_tsk = NULL;
> +               }
> +
>                 /* De-register interrupt handler */
>                 brcmf_sdiod_intr_unregister(bus->sdiodev);
>
> --
> 2.7.4



-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* [PATCH] strparser: Don't schedule in workqueue in paused state
From: Vakul Garg @ 2018-06-20 13:04 UTC (permalink / raw)
  To: davem
  Cc: doronrk, tom, john.fastabend, davejwatson, netdev, ebiggers,
	linux-kernel, Vakul Garg

In function strp_data_ready(), it is useless to call queue_work if
the state of strparser is already paused. The state checking should
be done before calling queue_work. The change reduces the context
switches and improves the ktls-rx throughput by approx 20% (measured
on cortex-a53 based platform).

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
---
 net/strparser/strparser.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index 1a9695183599..373836615c57 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -392,7 +392,7 @@ static int strp_read_sock(struct strparser *strp)
 /* Lower sock lock held */
 void strp_data_ready(struct strparser *strp)
 {
-	if (unlikely(strp->stopped))
+	if (unlikely(strp->stopped) || strp->paused)
 		return;
 
 	/* This check is needed to synchronize with do_strp_work.
@@ -407,9 +407,6 @@ void strp_data_ready(struct strparser *strp)
 		return;
 	}
 
-	if (strp->paused)
-		return;
-
 	if (strp->need_bytes) {
 		if (strp_peek_len(strp) < strp->need_bytes)
 			return;
-- 
2.13.6

^ permalink raw reply related

* Re: [RFC v2, net-next, PATCH 4/4] net/cpsw_switchdev: add switchdev mode of operation on cpsw driver
From: Ilias Apalodimas @ 2018-06-20 12:59 UTC (permalink / raw)
  To: Ivan Vecera
  Cc: Jiri Pirko, Grygorii Strashko, netdev, ivan.khoronzhuk, nsekhar,
	andrew, f.fainelli, francois.ozog, yogeshs, spatton, Jose.Abreu
In-Reply-To: <77627ca3-3e24-ea0b-7ba6-55f1f2a1114e@redhat.com>

On Wed, Jun 20, 2018 at 02:53:47PM +0200, Ivan Vecera wrote:
> On 20.6.2018 09:08, Jiri Pirko wrote:
> > Tue, Jun 19, 2018 at 01:19:00AM CEST, grygorii.strashko@ti.com wrote:
> >>
> >>
> >> On 06/14/2018 06:43 AM, Ilias Apalodimas wrote:
> >>> On Thu, Jun 14, 2018 at 01:39:58PM +0200, Jiri Pirko wrote:
> >>>> Thu, Jun 14, 2018 at 01:34:04PM CEST, ilias.apalodimas@linaro.org wrote:
> >>>>> On Thu, Jun 14, 2018 at 01:30:28PM +0200, Jiri Pirko wrote:
> >>>>>> Thu, Jun 14, 2018 at 01:11:30PM CEST, ilias.apalodimas@linaro.org wrote:
> >>>>>>
> >>>>>> [...]
> >>>>>>
> >>>>>>> @@ -2711,6 +2789,10 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
> >>>>>>> 	if (of_property_read_bool(node, "dual_emac"))
> >>>>>>> 		data->switch_mode = CPSW_DUAL_EMAC;
> >>>>>>>
> >>>>>>> +	/* switchdev overrides DTS */
> >>>>>>> +	if (IS_ENABLED(CONFIG_TI_CPSW_SWITCHDEV))
> >>>>>>> +		data->switch_mode = CPSW_SWITCHDEV;
> >>>>>>
> >>>>>> So you force CPSW_SWITCHDEV mode if the CONFIG_TI_CPSW_SWITCHDEV is
> >>>>>> enabled. That does not sound right. I think that user should tell what
> >>>>>> mode does he want regardless what the kernel config is.
> >>>>> We discussed this during the V1 of the RFC. Yes it doesn't seem good, but the
> >>>>> device currently configures the modes using DTS (which is not correct). I choose
> >>>>> the .config due to that. I can't think of anything better, but i am open to
> >>>>> suggestions
> >>>>
> >>>> Agreed that DTS does fit as well. I think that this might be a job for
> >>>> devlink parameters (patchset is going to be sent upstream next week).
> >>>> You do have 1 bus address for the whole device (both ports), right?
> >>>>
> >>> Yes devlink sounds reasonable. I thyink there's only one bus for it, but then
> >>> again i am far from an expert on the hardware interrnals. Grygorii can correct
> >>> me if i am wrong.
> >>
> >> Devlink and NFS boot are not compatible as per my understanding, so .. 
> > 
> > ? Why do you say so?
> 
> Why aren't they compatible?? You can have devlink binary in initramfs and
> configure the ASIC and ports via devlink batch file - prior mounting NFS root.
This is doable but, are we taking into consideration device reconfiguration 
(which was the reason for creating the minimal filesystem) or are we just 
talking about device booting/initial config?
> 
> >>
> >> Again, current driver, as is, supports NFS boot in all modes
> >> (how good is the cur driver is question which out of scope of this discussion).
> >>
> >> And we discussed this in RFC v1 - driver mode selection will be changed 
> >> if we will proceed and it will be new driver for proper switch support.
> >>
> >> Not sure about about Devlink (need to study it and we never got any requests from end
> >> users for this as I know), and I'd like to note (again) that this is embedded 
> >> (industrial/automotive etc), so everything preferred to be simple, fast and
> >> preferably configured statically (in most of the cases end users what boot time 
> >> configuration).
> > 
> > You need to study it indeed.
> > 
> >>
> >> -- 
> >> regards,
> >> -grygorii
> 

Regards
Ilias

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox