* [PATCH v2 0/7] assorted v5.x related fixups
@ 2021-03-27 10:19 Philippe Gerum
2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum
` (7 more replies)
0 siblings, 8 replies; 27+ messages in thread
From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw)
To: xenomai
From: Philippe Gerum <rpm@xenomai.org>
More wrappers and other changes required to build for the kernel v5.x
series.
Philippe Gerum (7):
cobalt/memory: fix __vmalloc() calls
cobalt/debug: switch to mmap_lock interface
cobalt/kernel: convert to proc_ops
cobalt/debug: prefer dump_stack() to show_stack()
drivers/net: wrap csum_partial_copy_nocheck()
drivers/net: icmp: remove variable-length array
drivers/net: cfg: fix config file load up
kernel/cobalt/debug.c | 6 +-
kernel/cobalt/heap.c | 3 +-
.../include/asm-generic/xenomai/wrappers.h | 59 +++++++++++++++
kernel/cobalt/posix/memory.c | 7 +-
kernel/cobalt/vfile.c | 26 +++----
kernel/drivers/analogy/device.c | 12 +--
kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 +--
kernel/drivers/can/rtcan_module.c | 66 ++++++++---------
.../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 +--
kernel/drivers/net/stack/ipv4/icmp.c | 34 ++++++---
kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c | 74 +++++++++----------
kernel/drivers/net/stack/rtskb.c | 3 +-
12 files changed, 189 insertions(+), 125 deletions(-)
--
2.29.2
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface Philippe Gerum ` (6 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> Since kernel v5.8, __vmalloc() does not take protection bits as PROT_KERNEL is now wired in. Therefore we cannot disable the cache for the UMM segment via the allocation call directly anymore. This said, we don't support any CPU architecture exhibiting cache aliasing braindamage anymore either (was armv4/v5), so let's convert to the new __vmalloc() call format without bothering for cache settings. Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- kernel/cobalt/heap.c | 3 ++- kernel/cobalt/include/asm-generic/xenomai/wrappers.h | 6 ++++++ kernel/cobalt/posix/memory.c | 7 ++++--- 3 files changed, 12 insertions(+), 4 deletions(-) diff --git a/kernel/cobalt/heap.c b/kernel/cobalt/heap.c index 9e184d4c3..d3f43729e 100644 --- a/kernel/cobalt/heap.c +++ b/kernel/cobalt/heap.c @@ -28,6 +28,7 @@ #include <cobalt/kernel/heap.h> #include <cobalt/kernel/vfile.h> #include <cobalt/kernel/ancillaries.h> +#include <asm/xenomai/wrappers.h> /** * @ingroup cobalt_core @@ -849,7 +850,7 @@ void *xnheap_vmalloc(size_t size) * software on a 32bit system had to be wrong in the first * place anyway. */ - return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL); + return vmalloc_kernel(size, 0); } EXPORT_SYMBOL_GPL(xnheap_vmalloc); diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h index 9e8733b47..ac4e95aa0 100644 --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h @@ -176,4 +176,10 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, #define old_timeval32 compat_timeval #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) +#define vmalloc_kernel(__size, __flags) __vmalloc(__size, GFP_KERNEL|__flags, PAGE_KERNEL) +#else +#define vmalloc_kernel(__size, __flags) __vmalloc(__size, GFP_KERNEL|__flags) +#endif + #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */ diff --git a/kernel/cobalt/posix/memory.c b/kernel/cobalt/posix/memory.c index 1d888a219..fc88e264c 100644 --- a/kernel/cobalt/posix/memory.c +++ b/kernel/cobalt/posix/memory.c @@ -320,10 +320,11 @@ int cobalt_umm_init(struct cobalt_umm *umm, u32 size, secondary_mode_only(); + /* We don't support CPUs with VIVT caches and the like. */ + BUG_ON(xnarch_cache_aliasing()); + size = PAGE_ALIGN(size); - basemem = __vmalloc(size, GFP_KERNEL|__GFP_ZERO, - xnarch_cache_aliasing() ? - pgprot_noncached(PAGE_KERNEL) : PAGE_KERNEL); + basemem = vmalloc_kernel(size, __GFP_ZERO); if (basemem == NULL) return -ENOMEM; -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum ` (5 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- kernel/cobalt/debug.c | 4 ++-- kernel/cobalt/include/asm-generic/xenomai/wrappers.h | 7 +++++++ 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/kernel/cobalt/debug.c b/kernel/cobalt/debug.c index 1e9edda99..f97144aeb 100644 --- a/kernel/cobalt/debug.c +++ b/kernel/cobalt/debug.c @@ -239,7 +239,7 @@ void xndebug_trace_relax(int nr, unsigned long *backtrace, memset(&spot, 0, sizeof(spot)); mm = get_task_mm(current); - down_read(&mm->mmap_sem); + mmap_read_lock(mm); for (n = 0, depth = 0; n < nr; n++) { pc = backtrace[n]; @@ -278,7 +278,7 @@ void xndebug_trace_relax(int nr, unsigned long *backtrace, depth++; } - up_read(&mm->mmap_sem); + mmap_read_unlock(mm); mmput(mm); free_page((unsigned long)tmp); diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h index ac4e95aa0..cd22a8db5 100644 --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h @@ -176,6 +176,13 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, #define old_timeval32 compat_timeval #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) +#define mmap_read_lock(__mm) down_read(&mm->mmap_sem) +#define mmap_read_unlock(__mm) up_read(&mm->mmap_sem) +#define mmap_write_lock(__mm) down_write(&mm->mmap_sem) +#define mmap_write_unlock(__mm) up_write(&mm->mmap_sem) +#endif + #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) #define vmalloc_kernel(__size, __flags) __vmalloc(__size, GFP_KERNEL|__flags, PAGE_KERNEL) #else -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 3/7] cobalt/kernel: convert to proc_ops 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-04-07 9:52 ` Jan Kiszka 2021-03-27 10:19 ` [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() Philippe Gerum ` (4 subsequent siblings) 7 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- .../include/asm-generic/xenomai/wrappers.h | 20 ++++++ kernel/cobalt/vfile.c | 26 ++++---- kernel/drivers/analogy/device.c | 12 ++-- kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 ++-- kernel/drivers/can/rtcan_module.c | 66 +++++++++---------- .../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 ++-- 6 files changed, 80 insertions(+), 68 deletions(-) diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h index cd22a8db5..cc0cb0896 100644 --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, #define __kernel_old_timeval timeval #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0) +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ + struct file_operations __name = { \ + .open = (__open), \ + .release = (__release), \ + .read = (__read), \ + .write = (__write), \ + .llseek = seq_lseek, \ +} +#else +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ + struct proc_ops __name = { \ + .proc_open = (__open), \ + .proc_release = (__release), \ + .proc_read = (__read), \ + .proc_write = (__write), \ + .proc_lseek = seq_lseek, \ +} +#endif + #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) #define old_timespec32 compat_timespec #define old_itimerspec32 compat_itimerspec diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c index f65d46ddf..e9e10ce8d 100644 --- a/kernel/cobalt/vfile.c +++ b/kernel/cobalt/vfile.c @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf, return ret; } -static struct file_operations vfile_snapshot_fops = { - .open = vfile_snapshot_open, - .read = seq_read, - .write = vfile_snapshot_write, - .llseek = seq_lseek, - .release = vfile_snapshot_release, -}; +static const DEFINE_PROC_OPS(vfile_snapshot_fops, + vfile_snapshot_open, + vfile_snapshot_release, + seq_read, + vfile_snapshot_write +); /** * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent) @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf, return ret; } -static struct file_operations vfile_regular_fops = { - .open = vfile_regular_open, - .read = seq_read, - .write = vfile_regular_write, - .llseek = seq_lseek, - .release = vfile_regular_release, -}; +static const DEFINE_PROC_OPS(vfile_regular_fops, + vfile_regular_open, + vfile_regular_release, + seq_read, + vfile_regular_write +); /** * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent) diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c index 160dcf547..6ed588708 100644 --- a/kernel/drivers/analogy/device.c +++ b/kernel/drivers/analogy/device.c @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file) return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode)); } -static const struct file_operations a4l_proc_transfer_ops = { - .open = a4l_proc_transfer_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops, + a4l_proc_transfer_open, + single_release, + seq_read, + NULL +); int a4l_proc_attach(struct a4l_device_context * cxt) { diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c index 6b54ad4c7..732a02765 100644 --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file) return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode)); } -static const struct file_operations rtcan_mscan_proc_regs_ops = { - .open = rtcan_mscan_proc_regs_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops, + rtcan_mscan_proc_regs_open, + single_elease, + seq_read, + NULL +); int rtcan_mscan_create_proc(struct rtcan_device* dev) { diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c index fbc5c35e4..7f3d4c395 100644 --- a/kernel/drivers/can/rtcan_module.c +++ b/kernel/drivers/can/rtcan_module.c @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file) return single_open(file, rtcan_read_proc_devices, NULL); } -static const struct file_operations rtcan_proc_devices_ops = { - .open = rtcan_proc_devices_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops, + rtcan_proc_devices_open, + single_release, + seq_read, + NULL +); static int rtcan_read_proc_sockets(struct seq_file *p, void *data) { @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file) return single_open(file, rtcan_read_proc_sockets, NULL); } -static const struct file_operations rtcan_proc_sockets_ops = { - .open = rtcan_proc_sockets_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops, + rtcan_proc_sockets_open, + single_release, + seq_read, + NULL +); static int rtcan_read_proc_info(struct seq_file *p, void *data) { @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file) return single_open(file, rtcan_read_proc_info, PDE_DATA(inode)); } -static const struct file_operations rtcan_proc_info_ops = { - .open = rtcan_proc_info_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - - +static const DEFINE_PROC_OPS(rtcan_proc_info_ops, + rtcan_proc_info_open, + single_release, + seq_read, + NULL +); static int rtcan_read_proc_filter(struct seq_file *p, void *data) { @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file) return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode)); } -static const struct file_operations rtcan_proc_filter_ops = { - .open = rtcan_proc_filter_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - - +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops, + rtcan_proc_filter_open, + single_release, + seq_read, + NULL +); static int rtcan_read_proc_version(struct seq_file *p, void *data) { @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file) return single_open(file, rtcan_read_proc_version, NULL); } -static const struct file_operations rtcan_proc_version_ops = { - .open = rtcan_proc_version_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - +static const DEFINE_PROC_OPS(rtcan_proc_version_ops, + rtcan_proc_version_open, + single_release, + seq_read, + NULL +); void rtcan_dev_remove_proc(struct rtcan_device* dev) { diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c index b4af8ab2e..0fdee8c37 100644 --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file) return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode)); } -static const struct file_operations rtcan_sja_proc_regs_ops = { - .open = rtcan_sja_proc_regs_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops, + rtcan_sja_proc_regs_open, + single_release, + seq_read, + NULL +); int rtcan_sja_create_proc(struct rtcan_device* dev) { -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops 2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum @ 2021-04-07 9:52 ` Jan Kiszka 2021-04-07 10:17 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-07 9:52 UTC (permalink / raw) To: Philippe Gerum, xenomai On 27.03.21 11:19, Philippe Gerum wrote: > From: Philippe Gerum <rpm@xenomai.org> > > Signed-off-by: Philippe Gerum <rpm@xenomai.org> > --- > .../include/asm-generic/xenomai/wrappers.h | 20 ++++++ > kernel/cobalt/vfile.c | 26 ++++---- > kernel/drivers/analogy/device.c | 12 ++-- > kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 ++-- > kernel/drivers/can/rtcan_module.c | 66 +++++++++---------- > .../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 ++-- > 6 files changed, 80 insertions(+), 68 deletions(-) > > diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > index cd22a8db5..cc0cb0896 100644 > --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, > #define __kernel_old_timeval timeval > #endif > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0) > +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ > + struct file_operations __name = { \ > + .open = (__open), \ > + .release = (__release), \ > + .read = (__read), \ > + .write = (__write), \ > + .llseek = seq_lseek, \ > +} > +#else > +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ > + struct proc_ops __name = { \ > + .proc_open = (__open), \ > + .proc_release = (__release), \ > + .proc_read = (__read), \ > + .proc_write = (__write), \ > + .proc_lseek = seq_lseek, \ > +} > +#endif > + > #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) > #define old_timespec32 compat_timespec > #define old_itimerspec32 compat_itimerspec > diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c > index f65d46ddf..e9e10ce8d 100644 > --- a/kernel/cobalt/vfile.c > +++ b/kernel/cobalt/vfile.c > @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf, > return ret; > } > > -static struct file_operations vfile_snapshot_fops = { > - .open = vfile_snapshot_open, > - .read = seq_read, > - .write = vfile_snapshot_write, > - .llseek = seq_lseek, > - .release = vfile_snapshot_release, > -}; > +static const DEFINE_PROC_OPS(vfile_snapshot_fops, > + vfile_snapshot_open, > + vfile_snapshot_release, > + seq_read, > + vfile_snapshot_write > +); > > /** > * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent) > @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf, > return ret; > } > > -static struct file_operations vfile_regular_fops = { > - .open = vfile_regular_open, > - .read = seq_read, > - .write = vfile_regular_write, > - .llseek = seq_lseek, > - .release = vfile_regular_release, > -}; > +static const DEFINE_PROC_OPS(vfile_regular_fops, > + vfile_regular_open, > + vfile_regular_release, > + seq_read, > + vfile_regular_write > +); > > /** > * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent) > diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c > index 160dcf547..6ed588708 100644 > --- a/kernel/drivers/analogy/device.c > +++ b/kernel/drivers/analogy/device.c > @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file) > return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode)); > } > > -static const struct file_operations a4l_proc_transfer_ops = { > - .open = a4l_proc_transfer_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops, > + a4l_proc_transfer_open, > + single_release, > + seq_read, > + NULL > +); > > int a4l_proc_attach(struct a4l_device_context * cxt) > { > diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c > index 6b54ad4c7..732a02765 100644 > --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c > +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c > @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode)); > } > > -static const struct file_operations rtcan_mscan_proc_regs_ops = { > - .open = rtcan_mscan_proc_regs_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops, > + rtcan_mscan_proc_regs_open, > + single_elease, > + seq_read, > + NULL > +); > > int rtcan_mscan_create_proc(struct rtcan_device* dev) > { > diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c > index fbc5c35e4..7f3d4c395 100644 > --- a/kernel/drivers/can/rtcan_module.c > +++ b/kernel/drivers/can/rtcan_module.c > @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_read_proc_devices, NULL); > } > > -static const struct file_operations rtcan_proc_devices_ops = { > - .open = rtcan_proc_devices_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops, > + rtcan_proc_devices_open, > + single_release, > + seq_read, > + NULL > +); > > static int rtcan_read_proc_sockets(struct seq_file *p, void *data) > { > @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_read_proc_sockets, NULL); > } > > -static const struct file_operations rtcan_proc_sockets_ops = { > - .open = rtcan_proc_sockets_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > - > +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops, > + rtcan_proc_sockets_open, > + single_release, > + seq_read, > + NULL > +); > > static int rtcan_read_proc_info(struct seq_file *p, void *data) > { > @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_read_proc_info, PDE_DATA(inode)); > } > > -static const struct file_operations rtcan_proc_info_ops = { > - .open = rtcan_proc_info_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > - > - > +static const DEFINE_PROC_OPS(rtcan_proc_info_ops, > + rtcan_proc_info_open, > + single_release, > + seq_read, > + NULL > +); > > static int rtcan_read_proc_filter(struct seq_file *p, void *data) > { > @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode)); > } > > -static const struct file_operations rtcan_proc_filter_ops = { > - .open = rtcan_proc_filter_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > - > - > +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops, > + rtcan_proc_filter_open, > + single_release, > + seq_read, > + NULL > +); > > static int rtcan_read_proc_version(struct seq_file *p, void *data) > { > @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_read_proc_version, NULL); > } > > -static const struct file_operations rtcan_proc_version_ops = { > - .open = rtcan_proc_version_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > - > +static const DEFINE_PROC_OPS(rtcan_proc_version_ops, > + rtcan_proc_version_open, > + single_release, > + seq_read, > + NULL > +); > > void rtcan_dev_remove_proc(struct rtcan_device* dev) > { > diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c > index b4af8ab2e..0fdee8c37 100644 > --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c > +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c > @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file) > return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode)); > } > > -static const struct file_operations rtcan_sja_proc_regs_ops = { > - .open = rtcan_sja_proc_regs_open, > - .read = seq_read, > - .llseek = seq_lseek, > - .release = single_release, > -}; > +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops, > + rtcan_sja_proc_regs_open, > + single_release, > + seq_read, > + NULL > +); > > int rtcan_sja_create_proc(struct rtcan_device* dev) > { > This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while converting a4l_proc_transfer_ops. By accident? In any case, I've added those two cases to the patch. Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops 2021-04-07 9:52 ` Jan Kiszka @ 2021-04-07 10:17 ` Philippe Gerum 2021-04-07 10:37 ` Jan Kiszka 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-04-07 10:17 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 27.03.21 11:19, Philippe Gerum wrote: >> From: Philippe Gerum <rpm@xenomai.org> >> >> Signed-off-by: Philippe Gerum <rpm@xenomai.org> >> --- >> .../include/asm-generic/xenomai/wrappers.h | 20 ++++++ >> kernel/cobalt/vfile.c | 26 ++++---- >> kernel/drivers/analogy/device.c | 12 ++-- >> kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 ++-- >> kernel/drivers/can/rtcan_module.c | 66 +++++++++---------- >> .../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 ++-- >> 6 files changed, 80 insertions(+), 68 deletions(-) >> >> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >> index cd22a8db5..cc0cb0896 100644 >> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, >> #define __kernel_old_timeval timeval >> #endif >> >> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0) >> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ >> + struct file_operations __name = { \ >> + .open = (__open), \ >> + .release = (__release), \ >> + .read = (__read), \ >> + .write = (__write), \ >> + .llseek = seq_lseek, \ >> +} >> +#else >> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ >> + struct proc_ops __name = { \ >> + .proc_open = (__open), \ >> + .proc_release = (__release), \ >> + .proc_read = (__read), \ >> + .proc_write = (__write), \ >> + .proc_lseek = seq_lseek, \ >> +} >> +#endif >> + >> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) >> #define old_timespec32 compat_timespec >> #define old_itimerspec32 compat_itimerspec >> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c >> index f65d46ddf..e9e10ce8d 100644 >> --- a/kernel/cobalt/vfile.c >> +++ b/kernel/cobalt/vfile.c >> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf, >> return ret; >> } >> >> -static struct file_operations vfile_snapshot_fops = { >> - .open = vfile_snapshot_open, >> - .read = seq_read, >> - .write = vfile_snapshot_write, >> - .llseek = seq_lseek, >> - .release = vfile_snapshot_release, >> -}; >> +static const DEFINE_PROC_OPS(vfile_snapshot_fops, >> + vfile_snapshot_open, >> + vfile_snapshot_release, >> + seq_read, >> + vfile_snapshot_write >> +); >> >> /** >> * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent) >> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf, >> return ret; >> } >> >> -static struct file_operations vfile_regular_fops = { >> - .open = vfile_regular_open, >> - .read = seq_read, >> - .write = vfile_regular_write, >> - .llseek = seq_lseek, >> - .release = vfile_regular_release, >> -}; >> +static const DEFINE_PROC_OPS(vfile_regular_fops, >> + vfile_regular_open, >> + vfile_regular_release, >> + seq_read, >> + vfile_regular_write >> +); >> >> /** >> * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent) >> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c >> index 160dcf547..6ed588708 100644 >> --- a/kernel/drivers/analogy/device.c >> +++ b/kernel/drivers/analogy/device.c >> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file) >> return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode)); >> } >> >> -static const struct file_operations a4l_proc_transfer_ops = { >> - .open = a4l_proc_transfer_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops, >> + a4l_proc_transfer_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> int a4l_proc_attach(struct a4l_device_context * cxt) >> { >> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c >> index 6b54ad4c7..732a02765 100644 >> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c >> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c >> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode)); >> } >> >> -static const struct file_operations rtcan_mscan_proc_regs_ops = { >> - .open = rtcan_mscan_proc_regs_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops, >> + rtcan_mscan_proc_regs_open, >> + single_elease, >> + seq_read, >> + NULL >> +); >> >> int rtcan_mscan_create_proc(struct rtcan_device* dev) >> { >> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c >> index fbc5c35e4..7f3d4c395 100644 >> --- a/kernel/drivers/can/rtcan_module.c >> +++ b/kernel/drivers/can/rtcan_module.c >> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_read_proc_devices, NULL); >> } >> >> -static const struct file_operations rtcan_proc_devices_ops = { >> - .open = rtcan_proc_devices_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops, >> + rtcan_proc_devices_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> static int rtcan_read_proc_sockets(struct seq_file *p, void *data) >> { >> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_read_proc_sockets, NULL); >> } >> >> -static const struct file_operations rtcan_proc_sockets_ops = { >> - .open = rtcan_proc_sockets_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> - >> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops, >> + rtcan_proc_sockets_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> static int rtcan_read_proc_info(struct seq_file *p, void *data) >> { >> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_read_proc_info, PDE_DATA(inode)); >> } >> >> -static const struct file_operations rtcan_proc_info_ops = { >> - .open = rtcan_proc_info_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> - >> - >> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops, >> + rtcan_proc_info_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> static int rtcan_read_proc_filter(struct seq_file *p, void *data) >> { >> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode)); >> } >> >> -static const struct file_operations rtcan_proc_filter_ops = { >> - .open = rtcan_proc_filter_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> - >> - >> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops, >> + rtcan_proc_filter_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> static int rtcan_read_proc_version(struct seq_file *p, void *data) >> { >> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_read_proc_version, NULL); >> } >> >> -static const struct file_operations rtcan_proc_version_ops = { >> - .open = rtcan_proc_version_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> - >> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops, >> + rtcan_proc_version_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> void rtcan_dev_remove_proc(struct rtcan_device* dev) >> { >> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >> index b4af8ab2e..0fdee8c37 100644 >> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file) >> return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode)); >> } >> >> -static const struct file_operations rtcan_sja_proc_regs_ops = { >> - .open = rtcan_sja_proc_regs_open, >> - .read = seq_read, >> - .llseek = seq_lseek, >> - .release = single_release, >> -}; >> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops, >> + rtcan_sja_proc_regs_open, >> + single_release, >> + seq_read, >> + NULL >> +); >> >> int rtcan_sja_create_proc(struct rtcan_device* dev) >> { >> > > This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while > converting a4l_proc_transfer_ops. By accident? In any case, I've added > those two cases to the patch. > Not by accident, Analogy is off the map as far as I'm concerned. I would simply disable it from the CI stuff, until a patch dropping it entirely is pushed upstream. -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops 2021-04-07 10:17 ` Philippe Gerum @ 2021-04-07 10:37 ` Jan Kiszka 2021-04-07 11:03 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-07 10:37 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On 07.04.21 12:17, Philippe Gerum wrote: > > Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 27.03.21 11:19, Philippe Gerum wrote: >>> From: Philippe Gerum <rpm@xenomai.org> >>> >>> Signed-off-by: Philippe Gerum <rpm@xenomai.org> >>> --- >>> .../include/asm-generic/xenomai/wrappers.h | 20 ++++++ >>> kernel/cobalt/vfile.c | 26 ++++---- >>> kernel/drivers/analogy/device.c | 12 ++-- >>> kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 ++-- >>> kernel/drivers/can/rtcan_module.c | 66 +++++++++---------- >>> .../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 ++-- >>> 6 files changed, 80 insertions(+), 68 deletions(-) >>> >>> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >>> index cd22a8db5..cc0cb0896 100644 >>> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >>> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >>> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, >>> #define __kernel_old_timeval timeval >>> #endif >>> >>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0) >>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ >>> + struct file_operations __name = { \ >>> + .open = (__open), \ >>> + .release = (__release), \ >>> + .read = (__read), \ >>> + .write = (__write), \ >>> + .llseek = seq_lseek, \ >>> +} >>> +#else >>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ >>> + struct proc_ops __name = { \ >>> + .proc_open = (__open), \ >>> + .proc_release = (__release), \ >>> + .proc_read = (__read), \ >>> + .proc_write = (__write), \ >>> + .proc_lseek = seq_lseek, \ >>> +} >>> +#endif >>> + >>> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) >>> #define old_timespec32 compat_timespec >>> #define old_itimerspec32 compat_itimerspec >>> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c >>> index f65d46ddf..e9e10ce8d 100644 >>> --- a/kernel/cobalt/vfile.c >>> +++ b/kernel/cobalt/vfile.c >>> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf, >>> return ret; >>> } >>> >>> -static struct file_operations vfile_snapshot_fops = { >>> - .open = vfile_snapshot_open, >>> - .read = seq_read, >>> - .write = vfile_snapshot_write, >>> - .llseek = seq_lseek, >>> - .release = vfile_snapshot_release, >>> -}; >>> +static const DEFINE_PROC_OPS(vfile_snapshot_fops, >>> + vfile_snapshot_open, >>> + vfile_snapshot_release, >>> + seq_read, >>> + vfile_snapshot_write >>> +); >>> >>> /** >>> * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent) >>> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf, >>> return ret; >>> } >>> >>> -static struct file_operations vfile_regular_fops = { >>> - .open = vfile_regular_open, >>> - .read = seq_read, >>> - .write = vfile_regular_write, >>> - .llseek = seq_lseek, >>> - .release = vfile_regular_release, >>> -}; >>> +static const DEFINE_PROC_OPS(vfile_regular_fops, >>> + vfile_regular_open, >>> + vfile_regular_release, >>> + seq_read, >>> + vfile_regular_write >>> +); >>> >>> /** >>> * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent) >>> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c >>> index 160dcf547..6ed588708 100644 >>> --- a/kernel/drivers/analogy/device.c >>> +++ b/kernel/drivers/analogy/device.c >>> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file) >>> return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode)); >>> } >>> >>> -static const struct file_operations a4l_proc_transfer_ops = { >>> - .open = a4l_proc_transfer_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops, >>> + a4l_proc_transfer_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> int a4l_proc_attach(struct a4l_device_context * cxt) >>> { >>> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c >>> index 6b54ad4c7..732a02765 100644 >>> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c >>> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c >>> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode)); >>> } >>> >>> -static const struct file_operations rtcan_mscan_proc_regs_ops = { >>> - .open = rtcan_mscan_proc_regs_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops, >>> + rtcan_mscan_proc_regs_open, >>> + single_elease, >>> + seq_read, >>> + NULL >>> +); >>> >>> int rtcan_mscan_create_proc(struct rtcan_device* dev) >>> { >>> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c >>> index fbc5c35e4..7f3d4c395 100644 >>> --- a/kernel/drivers/can/rtcan_module.c >>> +++ b/kernel/drivers/can/rtcan_module.c >>> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_read_proc_devices, NULL); >>> } >>> >>> -static const struct file_operations rtcan_proc_devices_ops = { >>> - .open = rtcan_proc_devices_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops, >>> + rtcan_proc_devices_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> static int rtcan_read_proc_sockets(struct seq_file *p, void *data) >>> { >>> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_read_proc_sockets, NULL); >>> } >>> >>> -static const struct file_operations rtcan_proc_sockets_ops = { >>> - .open = rtcan_proc_sockets_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> - >>> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops, >>> + rtcan_proc_sockets_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> static int rtcan_read_proc_info(struct seq_file *p, void *data) >>> { >>> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_read_proc_info, PDE_DATA(inode)); >>> } >>> >>> -static const struct file_operations rtcan_proc_info_ops = { >>> - .open = rtcan_proc_info_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> - >>> - >>> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops, >>> + rtcan_proc_info_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> static int rtcan_read_proc_filter(struct seq_file *p, void *data) >>> { >>> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode)); >>> } >>> >>> -static const struct file_operations rtcan_proc_filter_ops = { >>> - .open = rtcan_proc_filter_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> - >>> - >>> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops, >>> + rtcan_proc_filter_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> static int rtcan_read_proc_version(struct seq_file *p, void *data) >>> { >>> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_read_proc_version, NULL); >>> } >>> >>> -static const struct file_operations rtcan_proc_version_ops = { >>> - .open = rtcan_proc_version_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> - >>> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops, >>> + rtcan_proc_version_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> void rtcan_dev_remove_proc(struct rtcan_device* dev) >>> { >>> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >>> index b4af8ab2e..0fdee8c37 100644 >>> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >>> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >>> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file) >>> return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode)); >>> } >>> >>> -static const struct file_operations rtcan_sja_proc_regs_ops = { >>> - .open = rtcan_sja_proc_regs_open, >>> - .read = seq_read, >>> - .llseek = seq_lseek, >>> - .release = single_release, >>> -}; >>> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops, >>> + rtcan_sja_proc_regs_open, >>> + single_release, >>> + seq_read, >>> + NULL >>> +); >>> >>> int rtcan_sja_create_proc(struct rtcan_device* dev) >>> { >>> >> >> This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while >> converting a4l_proc_transfer_ops. By accident? In any case, I've added >> those two cases to the patch. >> > > Not by accident, Analogy is off the map as far as I'm concerned. I would > simply disable it from the CI stuff, until a patch dropping it entirely > is pushed upstream. > Well, touch it or leave it, but not half-convert it. Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 3/7] cobalt/kernel: convert to proc_ops 2021-04-07 10:37 ` Jan Kiszka @ 2021-04-07 11:03 ` Philippe Gerum 0 siblings, 0 replies; 27+ messages in thread From: Philippe Gerum @ 2021-04-07 11:03 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 07.04.21 12:17, Philippe Gerum wrote: >> >> Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 27.03.21 11:19, Philippe Gerum wrote: >>>> From: Philippe Gerum <rpm@xenomai.org> >>>> >>>> Signed-off-by: Philippe Gerum <rpm@xenomai.org> >>>> --- >>>> .../include/asm-generic/xenomai/wrappers.h | 20 ++++++ >>>> kernel/cobalt/vfile.c | 26 ++++---- >>>> kernel/drivers/analogy/device.c | 12 ++-- >>>> kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 ++-- >>>> kernel/drivers/can/rtcan_module.c | 66 +++++++++---------- >>>> .../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 ++-- >>>> 6 files changed, 80 insertions(+), 68 deletions(-) >>>> >>>> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >>>> index cd22a8db5..cc0cb0896 100644 >>>> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >>>> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >>>> @@ -170,6 +170,26 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, >>>> #define __kernel_old_timeval timeval >>>> #endif >>>> >>>> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0) >>>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ >>>> + struct file_operations __name = { \ >>>> + .open = (__open), \ >>>> + .release = (__release), \ >>>> + .read = (__read), \ >>>> + .write = (__write), \ >>>> + .llseek = seq_lseek, \ >>>> +} >>>> +#else >>>> +#define DEFINE_PROC_OPS(__name, __open, __release, __read, __write) \ >>>> + struct proc_ops __name = { \ >>>> + .proc_open = (__open), \ >>>> + .proc_release = (__release), \ >>>> + .proc_read = (__read), \ >>>> + .proc_write = (__write), \ >>>> + .proc_lseek = seq_lseek, \ >>>> +} >>>> +#endif >>>> + >>>> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,8,0) >>>> #define old_timespec32 compat_timespec >>>> #define old_itimerspec32 compat_itimerspec >>>> diff --git a/kernel/cobalt/vfile.c b/kernel/cobalt/vfile.c >>>> index f65d46ddf..e9e10ce8d 100644 >>>> --- a/kernel/cobalt/vfile.c >>>> +++ b/kernel/cobalt/vfile.c >>>> @@ -340,13 +340,12 @@ ssize_t vfile_snapshot_write(struct file *file, const char __user *buf, >>>> return ret; >>>> } >>>> >>>> -static struct file_operations vfile_snapshot_fops = { >>>> - .open = vfile_snapshot_open, >>>> - .read = seq_read, >>>> - .write = vfile_snapshot_write, >>>> - .llseek = seq_lseek, >>>> - .release = vfile_snapshot_release, >>>> -}; >>>> +static const DEFINE_PROC_OPS(vfile_snapshot_fops, >>>> + vfile_snapshot_open, >>>> + vfile_snapshot_release, >>>> + seq_read, >>>> + vfile_snapshot_write >>>> +); >>>> >>>> /** >>>> * @fn int xnvfile_init_snapshot(const char *name, struct xnvfile_snapshot *vfile, struct xnvfile_directory *parent) >>>> @@ -592,13 +591,12 @@ ssize_t vfile_regular_write(struct file *file, const char __user *buf, >>>> return ret; >>>> } >>>> >>>> -static struct file_operations vfile_regular_fops = { >>>> - .open = vfile_regular_open, >>>> - .read = seq_read, >>>> - .write = vfile_regular_write, >>>> - .llseek = seq_lseek, >>>> - .release = vfile_regular_release, >>>> -}; >>>> +static const DEFINE_PROC_OPS(vfile_regular_fops, >>>> + vfile_regular_open, >>>> + vfile_regular_release, >>>> + seq_read, >>>> + vfile_regular_write >>>> +); >>>> >>>> /** >>>> * @fn int xnvfile_init_regular(const char *name, struct xnvfile_regular *vfile, struct xnvfile_directory *parent) >>>> diff --git a/kernel/drivers/analogy/device.c b/kernel/drivers/analogy/device.c >>>> index 160dcf547..6ed588708 100644 >>>> --- a/kernel/drivers/analogy/device.c >>>> +++ b/kernel/drivers/analogy/device.c >>>> @@ -95,12 +95,12 @@ static int a4l_proc_transfer_open(struct inode *inode, struct file *file) >>>> return single_open(file, a4l_rdproc_transfer, PDE_DATA(inode)); >>>> } >>>> >>>> -static const struct file_operations a4l_proc_transfer_ops = { >>>> - .open = a4l_proc_transfer_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> +static const DEFINE_PROC_OPS(a4l_proc_transfer_ops, >>>> + a4l_proc_transfer_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> int a4l_proc_attach(struct a4l_device_context * cxt) >>>> { >>>> diff --git a/kernel/drivers/can/mscan/rtcan_mscan_proc.c b/kernel/drivers/can/mscan/rtcan_mscan_proc.c >>>> index 6b54ad4c7..732a02765 100644 >>>> --- a/kernel/drivers/can/mscan/rtcan_mscan_proc.c >>>> +++ b/kernel/drivers/can/mscan/rtcan_mscan_proc.c >>>> @@ -114,12 +114,12 @@ static int rtcan_mscan_proc_regs_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_mscan_proc_regs, PDE_DATA(inode)); >>>> } >>>> >>>> -static const struct file_operations rtcan_mscan_proc_regs_ops = { >>>> - .open = rtcan_mscan_proc_regs_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> +static const DEFINE_PROC_OPS(rtcan_mscan_proc_regs_ops, >>>> + rtcan_mscan_proc_regs_open, >>>> + single_elease, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> int rtcan_mscan_create_proc(struct rtcan_device* dev) >>>> { >>>> diff --git a/kernel/drivers/can/rtcan_module.c b/kernel/drivers/can/rtcan_module.c >>>> index fbc5c35e4..7f3d4c395 100644 >>>> --- a/kernel/drivers/can/rtcan_module.c >>>> +++ b/kernel/drivers/can/rtcan_module.c >>>> @@ -157,12 +157,12 @@ static int rtcan_proc_devices_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_read_proc_devices, NULL); >>>> } >>>> >>>> -static const struct file_operations rtcan_proc_devices_ops = { >>>> - .open = rtcan_proc_devices_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> +static const DEFINE_PROC_OPS(rtcan_proc_devices_ops, >>>> + rtcan_proc_devices_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> static int rtcan_read_proc_sockets(struct seq_file *p, void *data) >>>> { >>>> @@ -220,13 +220,12 @@ static int rtcan_proc_sockets_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_read_proc_sockets, NULL); >>>> } >>>> >>>> -static const struct file_operations rtcan_proc_sockets_ops = { >>>> - .open = rtcan_proc_sockets_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> - >>>> +static const DEFINE_PROC_OPS(rtcan_proc_sockets_ops, >>>> + rtcan_proc_sockets_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> static int rtcan_read_proc_info(struct seq_file *p, void *data) >>>> { >>>> @@ -271,14 +270,12 @@ static int rtcan_proc_info_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_read_proc_info, PDE_DATA(inode)); >>>> } >>>> >>>> -static const struct file_operations rtcan_proc_info_ops = { >>>> - .open = rtcan_proc_info_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> - >>>> - >>>> +static const DEFINE_PROC_OPS(rtcan_proc_info_ops, >>>> + rtcan_proc_info_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> static int rtcan_read_proc_filter(struct seq_file *p, void *data) >>>> { >>>> @@ -319,14 +316,12 @@ static int rtcan_proc_filter_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_read_proc_filter, PDE_DATA(inode)); >>>> } >>>> >>>> -static const struct file_operations rtcan_proc_filter_ops = { >>>> - .open = rtcan_proc_filter_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> - >>>> - >>>> +static const DEFINE_PROC_OPS(rtcan_proc_filter_ops, >>>> + rtcan_proc_filter_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> static int rtcan_read_proc_version(struct seq_file *p, void *data) >>>> { >>>> @@ -341,13 +336,12 @@ static int rtcan_proc_version_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_read_proc_version, NULL); >>>> } >>>> >>>> -static const struct file_operations rtcan_proc_version_ops = { >>>> - .open = rtcan_proc_version_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> - >>>> +static const DEFINE_PROC_OPS(rtcan_proc_version_ops, >>>> + rtcan_proc_version_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> void rtcan_dev_remove_proc(struct rtcan_device* dev) >>>> { >>>> diff --git a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >>>> index b4af8ab2e..0fdee8c37 100644 >>>> --- a/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >>>> +++ b/kernel/drivers/can/sja1000/rtcan_sja1000_proc.c >>>> @@ -50,12 +50,12 @@ static int rtcan_sja_proc_regs_open(struct inode *inode, struct file *file) >>>> return single_open(file, rtcan_sja_proc_regs, PDE_DATA(inode)); >>>> } >>>> >>>> -static const struct file_operations rtcan_sja_proc_regs_ops = { >>>> - .open = rtcan_sja_proc_regs_open, >>>> - .read = seq_read, >>>> - .llseek = seq_lseek, >>>> - .release = single_release, >>>> -}; >>>> +static const DEFINE_PROC_OPS(rtcan_sja_proc_regs_ops, >>>> + rtcan_sja_proc_regs_open, >>>> + single_release, >>>> + seq_read, >>>> + NULL >>>> +); >>>> >>>> int rtcan_sja_create_proc(struct rtcan_device* dev) >>>> { >>>> >>> >>> This leaves out a4l_proc_devs_ops and a4l_proc_drvs_ops, while >>> converting a4l_proc_transfer_ops. By accident? In any case, I've added >>> those two cases to the patch. >>> >> >> Not by accident, Analogy is off the map as far as I'm concerned. I would >> simply disable it from the CI stuff, until a patch dropping it entirely >> is pushed upstream. >> > > Well, touch it or leave it, but not half-convert it. > Then let's not fix the unfixable at all. Please drop all changes, then issue a patch removing Analogy entirely. -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum ` (2 preceding siblings ...) 2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum ` (3 subsequent siblings) 7 siblings, 0 replies; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> dump_stack() also prints out the system information, which is always good to have. Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- kernel/cobalt/debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cobalt/debug.c b/kernel/cobalt/debug.c index f97144aeb..a6e2cc42d 100644 --- a/kernel/cobalt/debug.c +++ b/kernel/cobalt/debug.c @@ -592,7 +592,7 @@ int xnlock_dbg_release(struct xnlock *lock, " last owner = %s:%u (%s(), CPU #%d)\n", lock, cpu, lock->file, lock->line, lock->function, lock->cpu); - show_stack(NULL,NULL); + dump_stack(); return 1; } -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum ` (3 preceding siblings ...) 2021-03-27 10:19 ` [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-04-07 10:06 ` Jan Kiszka 2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum ` (2 subsequent siblings) 7 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its last argument to csum_partial(). According to #cc44c17baf7f3, passing a non-zero value would not even yield the proper result on some architectures. Nevertheless, the current ICMP code does expect a non-zero csum seed to be used in the next computation, so let's wrap net_csum_copy() to csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for later kernels so that we still feed csum_partial() with the user-given csum. We still expect the x86, ARM and arm64 implementations of csum_partial() to do the right thing. Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- .../include/asm-generic/xenomai/wrappers.h | 11 +++++++++++ kernel/drivers/net/stack/ipv4/icmp.c | 18 +++++++++--------- kernel/drivers/net/stack/rtskb.c | 3 +-- 3 files changed, 21 insertions(+), 11 deletions(-) diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h index cc0cb0896..cfd28fc47 100644 --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h @@ -209,4 +209,15 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, #define vmalloc_kernel(__size, __flags) __vmalloc(__size, GFP_KERNEL|__flags) #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) +#define net_csum_copy(__src, __dst, __len, __csum) \ + csum_partial_copy_nocheck(__src, __dst, __len, __csum) +#else +#define net_csum_copy(__src, __dst, __len, __csum) \ + ({ \ + memcpy(__dst, __src, __len); \ + csum_partial(__dst, __len, (__force __wsum)__csum); \ + }) +#endif + #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */ diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c index b723040aa..0671cbc93 100644 --- a/kernel/drivers/net/stack/ipv4/icmp.c +++ b/kernel/drivers/net/stack/ipv4/icmp.c @@ -142,9 +142,9 @@ static int rt_icmp_glue_reply_bits(const void *p, unsigned char *to, if (offset != 0) return -EMSGSIZE; - csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to, - icmp_param->head_len, - icmp_param->csum); + csum = net_csum_copy((void *)&icmp_param->head, to, + icmp_param->head_len, + icmp_param->csum); csum = rtskb_copy_and_csum_bits(icmp_param->data.skb, icmp_param->offset, @@ -259,13 +259,13 @@ static int rt_icmp_glue_request_bits(const void *p, unsigned char *to, __FUNCTION__); return -1;); - csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to, - icmp_param->head_len, - icmp_param->csum); + csum = net_csum_copy((void *)&icmp_param->head, to, + icmp_param->head_len, + icmp_param->csum); - csum = csum_partial_copy_nocheck(icmp_param->data.buf, - to + icmp_param->head_len, - fraglen - icmp_param->head_len, csum); + csum = net_csum_copy(icmp_param->data.buf, + to + icmp_param->head_len, + fraglen - icmp_param->head_len, csum); icmph = (struct icmphdr *)to; diff --git a/kernel/drivers/net/stack/rtskb.c b/kernel/drivers/net/stack/rtskb.c index 84d021d24..0d0c66e58 100644 --- a/kernel/drivers/net/stack/rtskb.c +++ b/kernel/drivers/net/stack/rtskb.c @@ -69,8 +69,7 @@ unsigned int rtskb_copy_and_csum_bits(const struct rtskb *skb, int offset, if ((copy = skb->len - offset) > 0) { if (copy > len) copy = len; - csum = csum_partial_copy_nocheck(skb->data + offset, to, copy, - csum); + csum = net_csum_copy(skb->data + offset, to, copy, csum); if ((len -= copy) == 0) return csum; offset += copy; -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum @ 2021-04-07 10:06 ` Jan Kiszka 2021-04-15 7:21 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-07 10:06 UTC (permalink / raw) To: Philippe Gerum, xenomai On 27.03.21 11:19, Philippe Gerum wrote: > From: Philippe Gerum <rpm@xenomai.org> > > Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its > last argument to csum_partial(). According to #cc44c17baf7f3, passing > a non-zero value would not even yield the proper result on some > architectures. > > Nevertheless, the current ICMP code does expect a non-zero csum seed > to be used in the next computation, so let's wrap net_csum_copy() to > csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for > later kernels so that we still feed csum_partial() with the user-given > csum. We still expect the x86, ARM and arm64 implementations of > csum_partial() to do the right thing. > If that issue only affects the ICMP code path, why not only changing that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy and csum in one step? Jan > Signed-off-by: Philippe Gerum <rpm@xenomai.org> > --- > .../include/asm-generic/xenomai/wrappers.h | 11 +++++++++++ > kernel/drivers/net/stack/ipv4/icmp.c | 18 +++++++++--------- > kernel/drivers/net/stack/rtskb.c | 3 +-- > 3 files changed, 21 insertions(+), 11 deletions(-) > > diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > index cc0cb0896..cfd28fc47 100644 > --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > @@ -209,4 +209,15 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, > #define vmalloc_kernel(__size, __flags) __vmalloc(__size, GFP_KERNEL|__flags) > #endif > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) > +#define net_csum_copy(__src, __dst, __len, __csum) \ > + csum_partial_copy_nocheck(__src, __dst, __len, __csum) > +#else > +#define net_csum_copy(__src, __dst, __len, __csum) \ > + ({ \ > + memcpy(__dst, __src, __len); \ > + csum_partial(__dst, __len, (__force __wsum)__csum); \ > + }) > +#endif > + > #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */ > diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c > index b723040aa..0671cbc93 100644 > --- a/kernel/drivers/net/stack/ipv4/icmp.c > +++ b/kernel/drivers/net/stack/ipv4/icmp.c > @@ -142,9 +142,9 @@ static int rt_icmp_glue_reply_bits(const void *p, unsigned char *to, > if (offset != 0) > return -EMSGSIZE; > > - csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to, > - icmp_param->head_len, > - icmp_param->csum); > + csum = net_csum_copy((void *)&icmp_param->head, to, > + icmp_param->head_len, > + icmp_param->csum); > > csum = rtskb_copy_and_csum_bits(icmp_param->data.skb, > icmp_param->offset, > @@ -259,13 +259,13 @@ static int rt_icmp_glue_request_bits(const void *p, unsigned char *to, > __FUNCTION__); > return -1;); > > - csum = csum_partial_copy_nocheck((void *)&icmp_param->head, to, > - icmp_param->head_len, > - icmp_param->csum); > + csum = net_csum_copy((void *)&icmp_param->head, to, > + icmp_param->head_len, > + icmp_param->csum); > > - csum = csum_partial_copy_nocheck(icmp_param->data.buf, > - to + icmp_param->head_len, > - fraglen - icmp_param->head_len, csum); > + csum = net_csum_copy(icmp_param->data.buf, > + to + icmp_param->head_len, > + fraglen - icmp_param->head_len, csum); > > icmph = (struct icmphdr *)to; > > diff --git a/kernel/drivers/net/stack/rtskb.c b/kernel/drivers/net/stack/rtskb.c > index 84d021d24..0d0c66e58 100644 > --- a/kernel/drivers/net/stack/rtskb.c > +++ b/kernel/drivers/net/stack/rtskb.c > @@ -69,8 +69,7 @@ unsigned int rtskb_copy_and_csum_bits(const struct rtskb *skb, int offset, > if ((copy = skb->len - offset) > 0) { > if (copy > len) > copy = len; > - csum = csum_partial_copy_nocheck(skb->data + offset, to, copy, > - csum); > + csum = net_csum_copy(skb->data + offset, to, copy, csum); > if ((len -= copy) == 0) > return csum; > offset += copy; > -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-07 10:06 ` Jan Kiszka @ 2021-04-15 7:21 ` Philippe Gerum 2021-04-15 7:35 ` Jan Kiszka 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-04-15 7:21 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 27.03.21 11:19, Philippe Gerum wrote: >> From: Philippe Gerum <rpm@xenomai.org> >> >> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >> last argument to csum_partial(). According to #cc44c17baf7f3, passing >> a non-zero value would not even yield the proper result on some >> architectures. >> >> Nevertheless, the current ICMP code does expect a non-zero csum seed >> to be used in the next computation, so let's wrap net_csum_copy() to >> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >> later kernels so that we still feed csum_partial() with the user-given >> csum. We still expect the x86, ARM and arm64 implementations of >> csum_partial() to do the right thing. >> > > If that issue only affects the ICMP code path, why not only changing > that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy > and csum in one step? > As a result of #cc44c17baf7f3, I see no common helper available from the kernel folding the copy and checksum operations anymore, so I see no way to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one replacement for csum_partial_copy_nocheck() which would allow a csum value to be fed in? -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-15 7:21 ` Philippe Gerum @ 2021-04-15 7:35 ` Jan Kiszka 2021-04-15 7:54 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-15 7:35 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On 15.04.21 09:21, Philippe Gerum wrote: > > Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 27.03.21 11:19, Philippe Gerum wrote: >>> From: Philippe Gerum <rpm@xenomai.org> >>> >>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>> a non-zero value would not even yield the proper result on some >>> architectures. >>> >>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>> to be used in the next computation, so let's wrap net_csum_copy() to >>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>> later kernels so that we still feed csum_partial() with the user-given >>> csum. We still expect the x86, ARM and arm64 implementations of >>> csum_partial() to do the right thing. >>> >> >> If that issue only affects the ICMP code path, why not only changing >> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >> and csum in one step? >> > > As a result of #cc44c17baf7f3, I see no common helper available from the > kernel folding the copy and checksum operations anymore, so I see no way > to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one > replacement for csum_partial_copy_nocheck() which would allow a csum > value to be fed in? > rtskb_copy_and_csum_dev does not need that. Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-15 7:35 ` Jan Kiszka @ 2021-04-15 7:54 ` Philippe Gerum 2021-04-15 8:10 ` Jan Kiszka 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-04-15 7:54 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 15.04.21 09:21, Philippe Gerum wrote: >> >> Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 27.03.21 11:19, Philippe Gerum wrote: >>>> From: Philippe Gerum <rpm@xenomai.org> >>>> >>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>> a non-zero value would not even yield the proper result on some >>>> architectures. >>>> >>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>> later kernels so that we still feed csum_partial() with the user-given >>>> csum. We still expect the x86, ARM and arm64 implementations of >>>> csum_partial() to do the right thing. >>>> >>> >>> If that issue only affects the ICMP code path, why not only changing >>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>> and csum in one step? >>> >> >> As a result of #cc44c17baf7f3, I see no common helper available from the >> kernel folding the copy and checksum operations anymore, so I see no way >> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >> replacement for csum_partial_copy_nocheck() which would allow a csum >> value to be fed in? >> > > rtskb_copy_and_csum_dev does not need that. > You made rtskb_copy_and_csum_bits() part of the exported API. So how do you want to deal with this? -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-15 7:54 ` Philippe Gerum @ 2021-04-15 8:10 ` Jan Kiszka 2021-04-16 16:48 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-15 8:10 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On 15.04.21 09:54, Philippe Gerum wrote: > > Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 15.04.21 09:21, Philippe Gerum wrote: >>> >>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>> >>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>> >>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>> a non-zero value would not even yield the proper result on some >>>>> architectures. >>>>> >>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>> later kernels so that we still feed csum_partial() with the user-given >>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>> csum_partial() to do the right thing. >>>>> >>>> >>>> If that issue only affects the ICMP code path, why not only changing >>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>> and csum in one step? >>>> >>> >>> As a result of #cc44c17baf7f3, I see no common helper available from the >>> kernel folding the copy and checksum operations anymore, so I see no way >>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>> replacement for csum_partial_copy_nocheck() which would allow a csum >>> value to be fed in? >>> >> >> rtskb_copy_and_csum_dev does not need that. >> > > You made rtskb_copy_and_csum_bits() part of the exported API. So how do > you want to deal with this? > That is an internal API, so we don't care. But even if we converted rtskb_copy_and_csum_bits to work as it used to (with a csum != 0), there would be not reason to make rtskb_copy_and_csum_dev pay that price. Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-15 8:10 ` Jan Kiszka @ 2021-04-16 16:48 ` Philippe Gerum 2021-04-16 17:12 ` Jan Kiszka 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-04-16 16:48 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 15.04.21 09:54, Philippe Gerum wrote: >> >> Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 15.04.21 09:21, Philippe Gerum wrote: >>>> >>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>> >>>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>>> >>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>>> a non-zero value would not even yield the proper result on some >>>>>> architectures. >>>>>> >>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>>> later kernels so that we still feed csum_partial() with the user-given >>>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>>> csum_partial() to do the right thing. >>>>>> >>>>> >>>>> If that issue only affects the ICMP code path, why not only changing >>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>>> and csum in one step? >>>>> >>>> >>>> As a result of #cc44c17baf7f3, I see no common helper available from the >>>> kernel folding the copy and checksum operations anymore, so I see no way >>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>>> replacement for csum_partial_copy_nocheck() which would allow a csum >>>> value to be fed in? >>>> >>> >>> rtskb_copy_and_csum_dev does not need that. >>> >> >> You made rtskb_copy_and_csum_bits() part of the exported API. So how do >> you want to deal with this? >> > > That is an internal API, so we don't care. > > But even if we converted rtskb_copy_and_csum_bits to work as it used to > (with a csum != 0), there would be not reason to make > rtskb_copy_and_csum_dev pay that price. > Well, there may be a reason to challenge the idea that a folded copy_and_csum is better for a real-time system than a split memcpy+csum in the first place. Out of curiosity, I ran a few benchmarks lately on a few SoCs regarding this, and it turned out that optimizing the data copy to get the buffer quickly in place before checksumming the result may well yield much better results with respect to jitter than what csum_and_copy currently achieves on these SoCs. Inline csum_and_copy did perform slightly better on average (a couple of hundreds of nanosecs at best) but with pathological jittery in the worst case at times. On the contrary, the split memcpy+csum method did not exhibit such jittery during these tests, not once. - four SoCs tested (2 x x86, armv7, armv8a) - test code ran in kernel space (real-time task context, out-of-band/primary context) - csum_partial_copy_nocheck() vs memcpy()+csum_partial() - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each - all buffers (src & dst) aligned on L1_CACHE_BYTES - each run performed 1000,000 iterations of a given checksum loop, no pause in between. - no concurrent load on the board - all results in nanoseconds The worst results obtained are presented here for each SoC. x86[1] ------ vendor_id : GenuineIntel cpu family : 6 model : 92 model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz stepping : 9 cpu MHz : 1593.600 cache size : 1024 KB cpuid level : 21 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs == CSUM_COPY 32b: min=68, max=653, avg=70 CSUM_COPY 1024b: min=248, max=373, avg=251 CSUM_COPY 1500b: min=344, max=3123, avg=350 <================= COPY+CSUM 32b: min=101, max=790, avg=103 COPY+CSUM 1024b: min=297, max=397, avg=300 COPY+CSUM 1500b: min=402, max=490, avg=405 == CSUM_COPY 32b: min=68, max=1420, avg=70 CSUM_COPY 1024b: min=248, max=29706, avg=251 <================= CSUM_COPY 1500b: min=344, max=792, avg=350 COPY+CSUM 32b: min=101, max=872, avg=103 COPY+CSUM 1024b: min=297, max=776, avg=300 COPY+CSUM 1500b: min=402, max=853, avg=405 == CSUM_COPY 32b: min=68, max=661, avg=70 CSUM_COPY 1024b: min=248, max=1714, avg=251 CSUM_COPY 1500b: min=344, max=33937, avg=350 <================= COPY+CSUM 32b: min=101, max=610, avg=103 COPY+CSUM 1024b: min=297, max=605, avg=300 COPY+CSUM 1500b: min=402, max=712, avg=405 x86[2] ------ vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz stepping : 6 microcode : 0x60c cpu MHz : 1627.113 cache size : 3072 KB cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm CSUM_COPY 32b: min=558, max=31010, avg=674 <================= CSUM_COPY 1024b: min=558, max=2794, avg=745 CSUM_COPY 1500b: min=558, max=2794, avg=841 COPY+CSUM 32b: min=558, max=2794, avg=671 COPY+CSUM 1024b: min=558, max=2794, avg=778 COPY+CSUM 1500b: min=838, max=2794, avg=865 == CSUM_COPY 32b: min=59, max=532, avg=62 CSUM_COPY 1024b: min=198, max=270, avg=201 CSUM_COPY 1500b: min=288, max=34921, avg=289 <================= COPY+CSUM 32b: min=53, max=326, avg=56 COPY+CSUM 1024b: min=228, max=461, avg=232 COPY+CSUM 1500b: min=311, max=341, avg=317 == CSUM_COPY 32b: min=59, max=382, avg=62 CSUM_COPY 1024b: min=198, max=383, avg=201 CSUM_COPY 1500b: min=285, max=21235, avg=289 <================= COPY+CSUM 32b: min=52, max=300, avg=56 COPY+CSUM 1024b: min=228, max=348, avg=232 COPY+CSUM 1500b: min=311, max=409, avg=317 Cortex A9 quad-core 1.2Ghz (iMX6qp) ----------------------------------- model name : ARMv7 Processor rev 10 (v7l) Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 CPU implementer : 0x41 CPU architecture: 7 CPU variant : 0x2 CPU part : 0xc09 CPU revision : 10 CSUM_COPY 32b: min=333, max=1334, avg=440 CSUM_COPY 1024b: min=1000, max=2666, avg=1060 CSUM_COPY 1500b: min=1333, max=45333, avg=1357 <================= COPY+CSUM 32b: min=333, max=1334, avg=476 COPY+CSUM 1024b: min=1000, max=2333, avg=1324 COPY+CSUM 1500b: min=1666, max=2334, avg=1713 == CSUM_COPY 32b: min=333, max=1334, avg=439 CSUM_COPY 1024b: min=1000, max=46000, avg=1060 <================= CSUM_COPY 1500b: min=1333, max=5000, avg=1351 COPY+CSUM 32b: min=333, max=1334, avg=476 COPY+CSUM 1024b: min=1000, max=2334, avg=1324 COPY+CSUM 1500b: min=1666, max=2667, avg=1713 == CSUM_COPY 32b: min=333, max=1666, avg=454 CSUM_COPY 1024b: min=1000, max=2000, avg=1060 CSUM_COPY 1500b: min=1333, max=45000, avg=1348 <================= COPY+CSUM 32b: min=333, max=1334, avg=454 COPY+CSUM 1024b: min=1000, max=2334, avg=1317 COPY+CSUM 1500b: min=1666, max=6000, avg=1712 Cortex A55 quad-core 2Ghz (Odroid C4) ------------------------------------- Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd05 CPU revision : 0 CSUM_COPY 32b: min=125, max=833, avg=140 CSUM_COPY 1024b: min=625, max=41916, avg=673 <================= CSUM_COPY 1500b: min=875, max=3875, avg=923 COPY+CSUM 32b: min=125, max=458, avg=140 COPY+CSUM 1024b: min=625, max=1166, avg=666 COPY+CSUM 1500b: min=875, max=1167, avg=913 == CSUM_COPY 32b: min=125, max=1292, avg=139 CSUM_COPY 1024b: min=541, max=48333, avg=555 CSUM_COPY 1500b: min=708, max=3458, avg=740 COPY+CSUM 32b: min=125, max=292, avg=136 COPY+CSUM 1024b: min=541, max=750, avg=556 COPY+CSUM 1500b: min=708, max=834, avg=740 == CSUM_COPY 32b: min=125, max=833, avg=140 CSUM_COPY 1024b: min=666, max=55667, avg=673 <================= CSUM_COPY 1500b: min=875, max=4208, avg=913 COPY+CSUM 32b: min=125, max=375, avg=140 COPY+CSUM 1024b: min=666, max=916, avg=673 COPY+CSUM 1500b: min=875, max=1042, avg=913 ============ A few additional observations from looking at the implementation: For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb which is faster. arm32/armv7 optimizes memcpy by loading up to 8 words in a single instruction. csum_and_copy loads/stores at best 4 words at a time, only when src and dst are 32bit aligned (which matches the test case). arm64/armv8a uses load/store pair instructions to copy memory blocks. It does not have asm-optimized csum_and_copy support, so it uses the generic C version. What could be inferred in terms of prefetching and speculation might explain some differences between the approaches too. I would be interested in any converging / diverging results testing the same combo with a different test code, because from my standpoint, things do not seem as obvious as they are supposed to be at the moment. -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-16 16:48 ` Philippe Gerum @ 2021-04-16 17:12 ` Jan Kiszka 2021-04-18 9:18 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-16 17:12 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On 16.04.21 18:48, Philippe Gerum wrote: > > Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 15.04.21 09:54, Philippe Gerum wrote: >>> >>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>> >>>> On 15.04.21 09:21, Philippe Gerum wrote: >>>>> >>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>> >>>>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>>>> >>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>>>> a non-zero value would not even yield the proper result on some >>>>>>> architectures. >>>>>>> >>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>>>> later kernels so that we still feed csum_partial() with the user-given >>>>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>>>> csum_partial() to do the right thing. >>>>>>> >>>>>> >>>>>> If that issue only affects the ICMP code path, why not only changing >>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>>>> and csum in one step? >>>>>> >>>>> >>>>> As a result of #cc44c17baf7f3, I see no common helper available from the >>>>> kernel folding the copy and checksum operations anymore, so I see no way >>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>>>> replacement for csum_partial_copy_nocheck() which would allow a csum >>>>> value to be fed in? >>>>> >>>> >>>> rtskb_copy_and_csum_dev does not need that. >>>> >>> >>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do >>> you want to deal with this? >>> >> >> That is an internal API, so we don't care. >> >> But even if we converted rtskb_copy_and_csum_bits to work as it used to >> (with a csum != 0), there would be not reason to make >> rtskb_copy_and_csum_dev pay that price. >> > > Well, there may be a reason to challenge the idea that a folded > copy_and_csum is better for a real-time system than a split memcpy+csum > in the first place. Out of curiosity, I ran a few benchmarks lately on a > few SoCs regarding this, and it turned out that optimizing the data copy > to get the buffer quickly in place before checksumming the result may > well yield much better results with respect to jitter than what > csum_and_copy currently achieves on these SoCs. > > Inline csum_and_copy did perform slightly better on average (a couple of > hundreds of nanosecs at best) but with pathological jittery in the worst > case at times. On the contrary, the split memcpy+csum method did not > exhibit such jittery during these tests, not once. > > - four SoCs tested (2 x x86, armv7, armv8a) > - test code ran in kernel space (real-time task context, > out-of-band/primary context) > - csum_partial_copy_nocheck() vs memcpy()+csum_partial() > - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each > - all buffers (src & dst) aligned on L1_CACHE_BYTES > - each run performed 1000,000 iterations of a given checksum loop, no > pause in between. > - no concurrent load on the board > - all results in nanoseconds > > The worst results obtained are presented here for each SoC. > > x86[1] > ------ > > vendor_id : GenuineIntel > cpu family : 6 > model : 92 > model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz > stepping : 9 > cpu MHz : 1593.600 > cache size : 1024 KB > cpuid level : 21 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts > vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs > > == > > CSUM_COPY 32b: min=68, max=653, avg=70 > CSUM_COPY 1024b: min=248, max=373, avg=251 > CSUM_COPY 1500b: min=344, max=3123, avg=350 <================= > COPY+CSUM 32b: min=101, max=790, avg=103 > COPY+CSUM 1024b: min=297, max=397, avg=300 > COPY+CSUM 1500b: min=402, max=490, avg=405 > > == > > CSUM_COPY 32b: min=68, max=1420, avg=70 > CSUM_COPY 1024b: min=248, max=29706, avg=251 <================= > CSUM_COPY 1500b: min=344, max=792, avg=350 > COPY+CSUM 32b: min=101, max=872, avg=103 > COPY+CSUM 1024b: min=297, max=776, avg=300 > COPY+CSUM 1500b: min=402, max=853, avg=405 > > == > > CSUM_COPY 32b: min=68, max=661, avg=70 > CSUM_COPY 1024b: min=248, max=1714, avg=251 > CSUM_COPY 1500b: min=344, max=33937, avg=350 <================= > COPY+CSUM 32b: min=101, max=610, avg=103 > COPY+CSUM 1024b: min=297, max=605, avg=300 > COPY+CSUM 1500b: min=402, max=712, avg=405 > > x86[2] > ------ > > vendor_id : GenuineIntel > cpu family : 6 > model : 23 > model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz > stepping : 6 > microcode : 0x60c > cpu MHz : 1627.113 > cache size : 3072 KB > cpuid level : 10 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm > > CSUM_COPY 32b: min=558, max=31010, avg=674 <================= > CSUM_COPY 1024b: min=558, max=2794, avg=745 > CSUM_COPY 1500b: min=558, max=2794, avg=841 > COPY+CSUM 32b: min=558, max=2794, avg=671 > COPY+CSUM 1024b: min=558, max=2794, avg=778 > COPY+CSUM 1500b: min=838, max=2794, avg=865 > > == > > CSUM_COPY 32b: min=59, max=532, avg=62 > CSUM_COPY 1024b: min=198, max=270, avg=201 > CSUM_COPY 1500b: min=288, max=34921, avg=289 <================= > COPY+CSUM 32b: min=53, max=326, avg=56 > COPY+CSUM 1024b: min=228, max=461, avg=232 > COPY+CSUM 1500b: min=311, max=341, avg=317 > > == > > CSUM_COPY 32b: min=59, max=382, avg=62 > CSUM_COPY 1024b: min=198, max=383, avg=201 > CSUM_COPY 1500b: min=285, max=21235, avg=289 <================= > COPY+CSUM 32b: min=52, max=300, avg=56 > COPY+CSUM 1024b: min=228, max=348, avg=232 > COPY+CSUM 1500b: min=311, max=409, avg=317 > > Cortex A9 quad-core 1.2Ghz (iMX6qp) > ----------------------------------- > > model name : ARMv7 Processor rev 10 (v7l) > Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 > CPU implementer : 0x41 > CPU architecture: 7 > CPU variant : 0x2 > CPU part : 0xc09 > CPU revision : 10 > > CSUM_COPY 32b: min=333, max=1334, avg=440 > CSUM_COPY 1024b: min=1000, max=2666, avg=1060 > CSUM_COPY 1500b: min=1333, max=45333, avg=1357 <================= > COPY+CSUM 32b: min=333, max=1334, avg=476 > COPY+CSUM 1024b: min=1000, max=2333, avg=1324 > COPY+CSUM 1500b: min=1666, max=2334, avg=1713 > > == > > CSUM_COPY 32b: min=333, max=1334, avg=439 > CSUM_COPY 1024b: min=1000, max=46000, avg=1060 <================= > CSUM_COPY 1500b: min=1333, max=5000, avg=1351 > COPY+CSUM 32b: min=333, max=1334, avg=476 > COPY+CSUM 1024b: min=1000, max=2334, avg=1324 > COPY+CSUM 1500b: min=1666, max=2667, avg=1713 > > == > > CSUM_COPY 32b: min=333, max=1666, avg=454 > CSUM_COPY 1024b: min=1000, max=2000, avg=1060 > CSUM_COPY 1500b: min=1333, max=45000, avg=1348 <================= > COPY+CSUM 32b: min=333, max=1334, avg=454 > COPY+CSUM 1024b: min=1000, max=2334, avg=1317 > COPY+CSUM 1500b: min=1666, max=6000, avg=1712 > > Cortex A55 quad-core 2Ghz (Odroid C4) > ------------------------------------- > > Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp > CPU implementer : 0x41 > CPU architecture: 8 > CPU variant : 0x1 > CPU part : 0xd05 > CPU revision : 0 > > > CSUM_COPY 32b: min=125, max=833, avg=140 > CSUM_COPY 1024b: min=625, max=41916, avg=673 <================= > CSUM_COPY 1500b: min=875, max=3875, avg=923 > COPY+CSUM 32b: min=125, max=458, avg=140 > COPY+CSUM 1024b: min=625, max=1166, avg=666 > COPY+CSUM 1500b: min=875, max=1167, avg=913 > > == > > CSUM_COPY 32b: min=125, max=1292, avg=139 > CSUM_COPY 1024b: min=541, max=48333, avg=555 > CSUM_COPY 1500b: min=708, max=3458, avg=740 > COPY+CSUM 32b: min=125, max=292, avg=136 > COPY+CSUM 1024b: min=541, max=750, avg=556 > COPY+CSUM 1500b: min=708, max=834, avg=740 > > == > > CSUM_COPY 32b: min=125, max=833, avg=140 > CSUM_COPY 1024b: min=666, max=55667, avg=673 <================= > CSUM_COPY 1500b: min=875, max=4208, avg=913 > COPY+CSUM 32b: min=125, max=375, avg=140 > COPY+CSUM 1024b: min=666, max=916, avg=673 > COPY+CSUM 1500b: min=875, max=1042, avg=913 > > ============ > > A few additional observations from looking at the implementation: > > For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete > buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb > which is faster. > > arm32/armv7 optimizes memcpy by loading up to 8 words in a single > instruction. csum_and_copy loads/stores at best 4 words at a time, > only when src and dst are 32bit aligned (which matches the test case). > > arm64/armv8a uses load/store pair instructions to copy memory > blocks. It does not have asm-optimized csum_and_copy support, so it > uses the generic C version. > > What could be inferred in terms of prefetching and speculation might > explain some differences between the approaches too. > > I would be interested in any converging / diverging results testing the > same combo with a different test code, because from my standpoint, > things do not seem as obvious as they are supposed to be at the moment. > If copy+csum is not using any recent memcopy optimizations, that is an argument for at least equivalent performance. But I don't get yet where the huge jittery should be coming from. Were the measurement loop preemptible? In that case I would expect a split copy followed by another loop to csum should give much worse results as it needs the cache to stay warm - while copy-csum only touches the data once. Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-16 17:12 ` Jan Kiszka @ 2021-04-18 9:18 ` Philippe Gerum 2021-04-18 15:50 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-04-18 9:18 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 16.04.21 18:48, Philippe Gerum wrote: >> >> Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 15.04.21 09:54, Philippe Gerum wrote: >>>> >>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>> >>>>> On 15.04.21 09:21, Philippe Gerum wrote: >>>>>> >>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>>> >>>>>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>>>>> >>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>>>>> a non-zero value would not even yield the proper result on some >>>>>>>> architectures. >>>>>>>> >>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>>>>> later kernels so that we still feed csum_partial() with the user-given >>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>>>>> csum_partial() to do the right thing. >>>>>>>> >>>>>>> >>>>>>> If that issue only affects the ICMP code path, why not only changing >>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>>>>> and csum in one step? >>>>>>> >>>>>> >>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the >>>>>> kernel folding the copy and checksum operations anymore, so I see no way >>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum >>>>>> value to be fed in? >>>>>> >>>>> >>>>> rtskb_copy_and_csum_dev does not need that. >>>>> >>>> >>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do >>>> you want to deal with this? >>>> >>> >>> That is an internal API, so we don't care. >>> >>> But even if we converted rtskb_copy_and_csum_bits to work as it used to >>> (with a csum != 0), there would be not reason to make >>> rtskb_copy_and_csum_dev pay that price. >>> >> >> Well, there may be a reason to challenge the idea that a folded >> copy_and_csum is better for a real-time system than a split memcpy+csum >> in the first place. Out of curiosity, I ran a few benchmarks lately on a >> few SoCs regarding this, and it turned out that optimizing the data copy >> to get the buffer quickly in place before checksumming the result may >> well yield much better results with respect to jitter than what >> csum_and_copy currently achieves on these SoCs. >> >> Inline csum_and_copy did perform slightly better on average (a couple of >> hundreds of nanosecs at best) but with pathological jittery in the worst >> case at times. On the contrary, the split memcpy+csum method did not >> exhibit such jittery during these tests, not once. >> >> - four SoCs tested (2 x x86, armv7, armv8a) >> - test code ran in kernel space (real-time task context, >> out-of-band/primary context) >> - csum_partial_copy_nocheck() vs memcpy()+csum_partial() >> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each >> - all buffers (src & dst) aligned on L1_CACHE_BYTES >> - each run performed 1000,000 iterations of a given checksum loop, no >> pause in between. >> - no concurrent load on the board >> - all results in nanoseconds >> >> The worst results obtained are presented here for each SoC. >> >> x86[1] >> ------ >> >> vendor_id : GenuineIntel >> cpu family : 6 >> model : 92 >> model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz >> stepping : 9 >> cpu MHz : 1593.600 >> cache size : 1024 KB >> cpuid level : 21 >> wp : yes >> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts >> vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs >> >> == >> >> CSUM_COPY 32b: min=68, max=653, avg=70 >> CSUM_COPY 1024b: min=248, max=373, avg=251 >> CSUM_COPY 1500b: min=344, max=3123, avg=350 <================= >> COPY+CSUM 32b: min=101, max=790, avg=103 >> COPY+CSUM 1024b: min=297, max=397, avg=300 >> COPY+CSUM 1500b: min=402, max=490, avg=405 >> >> == >> >> CSUM_COPY 32b: min=68, max=1420, avg=70 >> CSUM_COPY 1024b: min=248, max=29706, avg=251 <================= >> CSUM_COPY 1500b: min=344, max=792, avg=350 >> COPY+CSUM 32b: min=101, max=872, avg=103 >> COPY+CSUM 1024b: min=297, max=776, avg=300 >> COPY+CSUM 1500b: min=402, max=853, avg=405 >> >> == >> >> CSUM_COPY 32b: min=68, max=661, avg=70 >> CSUM_COPY 1024b: min=248, max=1714, avg=251 >> CSUM_COPY 1500b: min=344, max=33937, avg=350 <================= >> COPY+CSUM 32b: min=101, max=610, avg=103 >> COPY+CSUM 1024b: min=297, max=605, avg=300 >> COPY+CSUM 1500b: min=402, max=712, avg=405 >> >> x86[2] >> ------ >> >> vendor_id : GenuineIntel >> cpu family : 6 >> model : 23 >> model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz >> stepping : 6 >> microcode : 0x60c >> cpu MHz : 1627.113 >> cache size : 3072 KB >> cpuid level : 10 >> wp : yes >> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm >> >> CSUM_COPY 32b: min=558, max=31010, avg=674 <================= >> CSUM_COPY 1024b: min=558, max=2794, avg=745 >> CSUM_COPY 1500b: min=558, max=2794, avg=841 >> COPY+CSUM 32b: min=558, max=2794, avg=671 >> COPY+CSUM 1024b: min=558, max=2794, avg=778 >> COPY+CSUM 1500b: min=838, max=2794, avg=865 >> >> == >> >> CSUM_COPY 32b: min=59, max=532, avg=62 >> CSUM_COPY 1024b: min=198, max=270, avg=201 >> CSUM_COPY 1500b: min=288, max=34921, avg=289 <================= >> COPY+CSUM 32b: min=53, max=326, avg=56 >> COPY+CSUM 1024b: min=228, max=461, avg=232 >> COPY+CSUM 1500b: min=311, max=341, avg=317 >> >> == >> >> CSUM_COPY 32b: min=59, max=382, avg=62 >> CSUM_COPY 1024b: min=198, max=383, avg=201 >> CSUM_COPY 1500b: min=285, max=21235, avg=289 <================= >> COPY+CSUM 32b: min=52, max=300, avg=56 >> COPY+CSUM 1024b: min=228, max=348, avg=232 >> COPY+CSUM 1500b: min=311, max=409, avg=317 >> >> Cortex A9 quad-core 1.2Ghz (iMX6qp) >> ----------------------------------- >> >> model name : ARMv7 Processor rev 10 (v7l) >> Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 >> CPU implementer : 0x41 >> CPU architecture: 7 >> CPU variant : 0x2 >> CPU part : 0xc09 >> CPU revision : 10 >> >> CSUM_COPY 32b: min=333, max=1334, avg=440 >> CSUM_COPY 1024b: min=1000, max=2666, avg=1060 >> CSUM_COPY 1500b: min=1333, max=45333, avg=1357 <================= >> COPY+CSUM 32b: min=333, max=1334, avg=476 >> COPY+CSUM 1024b: min=1000, max=2333, avg=1324 >> COPY+CSUM 1500b: min=1666, max=2334, avg=1713 >> >> == >> >> CSUM_COPY 32b: min=333, max=1334, avg=439 >> CSUM_COPY 1024b: min=1000, max=46000, avg=1060 <================= >> CSUM_COPY 1500b: min=1333, max=5000, avg=1351 >> COPY+CSUM 32b: min=333, max=1334, avg=476 >> COPY+CSUM 1024b: min=1000, max=2334, avg=1324 >> COPY+CSUM 1500b: min=1666, max=2667, avg=1713 >> >> == >> >> CSUM_COPY 32b: min=333, max=1666, avg=454 >> CSUM_COPY 1024b: min=1000, max=2000, avg=1060 >> CSUM_COPY 1500b: min=1333, max=45000, avg=1348 <================= >> COPY+CSUM 32b: min=333, max=1334, avg=454 >> COPY+CSUM 1024b: min=1000, max=2334, avg=1317 >> COPY+CSUM 1500b: min=1666, max=6000, avg=1712 >> >> Cortex A55 quad-core 2Ghz (Odroid C4) >> ------------------------------------- >> >> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp >> CPU implementer : 0x41 >> CPU architecture: 8 >> CPU variant : 0x1 >> CPU part : 0xd05 >> CPU revision : 0 >> >> >> CSUM_COPY 32b: min=125, max=833, avg=140 >> CSUM_COPY 1024b: min=625, max=41916, avg=673 <================= >> CSUM_COPY 1500b: min=875, max=3875, avg=923 >> COPY+CSUM 32b: min=125, max=458, avg=140 >> COPY+CSUM 1024b: min=625, max=1166, avg=666 >> COPY+CSUM 1500b: min=875, max=1167, avg=913 >> >> == >> >> CSUM_COPY 32b: min=125, max=1292, avg=139 >> CSUM_COPY 1024b: min=541, max=48333, avg=555 >> CSUM_COPY 1500b: min=708, max=3458, avg=740 >> COPY+CSUM 32b: min=125, max=292, avg=136 >> COPY+CSUM 1024b: min=541, max=750, avg=556 >> COPY+CSUM 1500b: min=708, max=834, avg=740 >> >> == >> >> CSUM_COPY 32b: min=125, max=833, avg=140 >> CSUM_COPY 1024b: min=666, max=55667, avg=673 <================= >> CSUM_COPY 1500b: min=875, max=4208, avg=913 >> COPY+CSUM 32b: min=125, max=375, avg=140 >> COPY+CSUM 1024b: min=666, max=916, avg=673 >> COPY+CSUM 1500b: min=875, max=1042, avg=913 >> >> ============ >> >> A few additional observations from looking at the implementation: >> >> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete >> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb >> which is faster. >> >> arm32/armv7 optimizes memcpy by loading up to 8 words in a single >> instruction. csum_and_copy loads/stores at best 4 words at a time, >> only when src and dst are 32bit aligned (which matches the test case). >> >> arm64/armv8a uses load/store pair instructions to copy memory >> blocks. It does not have asm-optimized csum_and_copy support, so it >> uses the generic C version. >> >> What could be inferred in terms of prefetching and speculation might >> explain some differences between the approaches too. >> >> I would be interested in any converging / diverging results testing the >> same combo with a different test code, because from my standpoint, >> things do not seem as obvious as they are supposed to be at the moment. >> > > If copy+csum is not using any recent memcopy optimizations, that is an > argument for at least equivalent performance. > You mean the folded version, i.e. copy_and_csum? If so, I can't see any way for that one to optimize via fast string operations. > But I don't get yet where the huge jittery should be coming from. Were > the measurement loop preemptible? In that case I would expect a split Out of band stage, so only preemptible by Xenomai timer ticks, which means only the host tick emulation at this point since there was no outstanding Xenomai timers started yet when running the loops. Pretty slim chance to see these latency spots consistently reproduced, and only for the folded copy_sum version. > copy followed by another loop to csum should give much worse results as > it needs the cache to stay warm - while copy-csum only touches the data > once. > Conversely, if the copy is much faster, the odds of being preempted may increase, yielding better results overall. This said, I'm unsure this is related to preemption anyway, this looks like the fingerprints of minor faults with PTEs. Why this would only happen in the folded version is still a mystery to me at the moment. -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-18 9:18 ` Philippe Gerum @ 2021-04-18 15:50 ` Philippe Gerum 2021-05-04 14:48 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-04-18 15:50 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Philippe Gerum <rpm@xenomai.org> writes: > Jan Kiszka <jan.kiszka@siemens.com> writes: > >> On 16.04.21 18:48, Philippe Gerum wrote: >>> >>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>> >>>> On 15.04.21 09:54, Philippe Gerum wrote: >>>>> >>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>> >>>>>> On 15.04.21 09:21, Philippe Gerum wrote: >>>>>>> >>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>>>> >>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>>>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>>>>>> >>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>>>>>> a non-zero value would not even yield the proper result on some >>>>>>>>> architectures. >>>>>>>>> >>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>>>>>> later kernels so that we still feed csum_partial() with the user-given >>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>>>>>> csum_partial() to do the right thing. >>>>>>>>> >>>>>>>> >>>>>>>> If that issue only affects the ICMP code path, why not only changing >>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>>>>>> and csum in one step? >>>>>>>> >>>>>>> >>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the >>>>>>> kernel folding the copy and checksum operations anymore, so I see no way >>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum >>>>>>> value to be fed in? >>>>>>> >>>>>> >>>>>> rtskb_copy_and_csum_dev does not need that. >>>>>> >>>>> >>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do >>>>> you want to deal with this? >>>>> >>>> >>>> That is an internal API, so we don't care. >>>> >>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to >>>> (with a csum != 0), there would be not reason to make >>>> rtskb_copy_and_csum_dev pay that price. >>>> >>> >>> Well, there may be a reason to challenge the idea that a folded >>> copy_and_csum is better for a real-time system than a split memcpy+csum >>> in the first place. Out of curiosity, I ran a few benchmarks lately on a >>> few SoCs regarding this, and it turned out that optimizing the data copy >>> to get the buffer quickly in place before checksumming the result may >>> well yield much better results with respect to jitter than what >>> csum_and_copy currently achieves on these SoCs. >>> >>> Inline csum_and_copy did perform slightly better on average (a couple of >>> hundreds of nanosecs at best) but with pathological jittery in the worst >>> case at times. On the contrary, the split memcpy+csum method did not >>> exhibit such jittery during these tests, not once. >>> >>> - four SoCs tested (2 x x86, armv7, armv8a) >>> - test code ran in kernel space (real-time task context, >>> out-of-band/primary context) >>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial() >>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each >>> - all buffers (src & dst) aligned on L1_CACHE_BYTES >>> - each run performed 1000,000 iterations of a given checksum loop, no >>> pause in between. >>> - no concurrent load on the board >>> - all results in nanoseconds >>> >>> The worst results obtained are presented here for each SoC. >>> >>> x86[1] >>> ------ >>> >>> vendor_id : GenuineIntel >>> cpu family : 6 >>> model : 92 >>> model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz >>> stepping : 9 >>> cpu MHz : 1593.600 >>> cache size : 1024 KB >>> cpuid level : 21 >>> wp : yes >>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts >>> vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs >>> >>> == >>> >>> CSUM_COPY 32b: min=68, max=653, avg=70 >>> CSUM_COPY 1024b: min=248, max=373, avg=251 >>> CSUM_COPY 1500b: min=344, max=3123, avg=350 <================= >>> COPY+CSUM 32b: min=101, max=790, avg=103 >>> COPY+CSUM 1024b: min=297, max=397, avg=300 >>> COPY+CSUM 1500b: min=402, max=490, avg=405 >>> >>> == >>> >>> CSUM_COPY 32b: min=68, max=1420, avg=70 >>> CSUM_COPY 1024b: min=248, max=29706, avg=251 <================= >>> CSUM_COPY 1500b: min=344, max=792, avg=350 >>> COPY+CSUM 32b: min=101, max=872, avg=103 >>> COPY+CSUM 1024b: min=297, max=776, avg=300 >>> COPY+CSUM 1500b: min=402, max=853, avg=405 >>> >>> == >>> >>> CSUM_COPY 32b: min=68, max=661, avg=70 >>> CSUM_COPY 1024b: min=248, max=1714, avg=251 >>> CSUM_COPY 1500b: min=344, max=33937, avg=350 <================= >>> COPY+CSUM 32b: min=101, max=610, avg=103 >>> COPY+CSUM 1024b: min=297, max=605, avg=300 >>> COPY+CSUM 1500b: min=402, max=712, avg=405 >>> >>> x86[2] >>> ------ >>> >>> vendor_id : GenuineIntel >>> cpu family : 6 >>> model : 23 >>> model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz >>> stepping : 6 >>> microcode : 0x60c >>> cpu MHz : 1627.113 >>> cache size : 3072 KB >>> cpuid level : 10 >>> wp : yes >>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm >>> >>> CSUM_COPY 32b: min=558, max=31010, avg=674 <================= >>> CSUM_COPY 1024b: min=558, max=2794, avg=745 >>> CSUM_COPY 1500b: min=558, max=2794, avg=841 >>> COPY+CSUM 32b: min=558, max=2794, avg=671 >>> COPY+CSUM 1024b: min=558, max=2794, avg=778 >>> COPY+CSUM 1500b: min=838, max=2794, avg=865 >>> >>> == >>> >>> CSUM_COPY 32b: min=59, max=532, avg=62 >>> CSUM_COPY 1024b: min=198, max=270, avg=201 >>> CSUM_COPY 1500b: min=288, max=34921, avg=289 <================= >>> COPY+CSUM 32b: min=53, max=326, avg=56 >>> COPY+CSUM 1024b: min=228, max=461, avg=232 >>> COPY+CSUM 1500b: min=311, max=341, avg=317 >>> >>> == >>> >>> CSUM_COPY 32b: min=59, max=382, avg=62 >>> CSUM_COPY 1024b: min=198, max=383, avg=201 >>> CSUM_COPY 1500b: min=285, max=21235, avg=289 <================= >>> COPY+CSUM 32b: min=52, max=300, avg=56 >>> COPY+CSUM 1024b: min=228, max=348, avg=232 >>> COPY+CSUM 1500b: min=311, max=409, avg=317 >>> >>> Cortex A9 quad-core 1.2Ghz (iMX6qp) >>> ----------------------------------- >>> >>> model name : ARMv7 Processor rev 10 (v7l) >>> Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 >>> CPU implementer : 0x41 >>> CPU architecture: 7 >>> CPU variant : 0x2 >>> CPU part : 0xc09 >>> CPU revision : 10 >>> >>> CSUM_COPY 32b: min=333, max=1334, avg=440 >>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060 >>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357 <================= >>> COPY+CSUM 32b: min=333, max=1334, avg=476 >>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324 >>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713 >>> >>> == >>> >>> CSUM_COPY 32b: min=333, max=1334, avg=439 >>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060 <================= >>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351 >>> COPY+CSUM 32b: min=333, max=1334, avg=476 >>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324 >>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713 >>> >>> == >>> >>> CSUM_COPY 32b: min=333, max=1666, avg=454 >>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060 >>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348 <================= >>> COPY+CSUM 32b: min=333, max=1334, avg=454 >>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317 >>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712 >>> >>> Cortex A55 quad-core 2Ghz (Odroid C4) >>> ------------------------------------- >>> >>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp >>> CPU implementer : 0x41 >>> CPU architecture: 8 >>> CPU variant : 0x1 >>> CPU part : 0xd05 >>> CPU revision : 0 >>> >>> >>> CSUM_COPY 32b: min=125, max=833, avg=140 >>> CSUM_COPY 1024b: min=625, max=41916, avg=673 <================= >>> CSUM_COPY 1500b: min=875, max=3875, avg=923 >>> COPY+CSUM 32b: min=125, max=458, avg=140 >>> COPY+CSUM 1024b: min=625, max=1166, avg=666 >>> COPY+CSUM 1500b: min=875, max=1167, avg=913 >>> >>> == >>> >>> CSUM_COPY 32b: min=125, max=1292, avg=139 >>> CSUM_COPY 1024b: min=541, max=48333, avg=555 >>> CSUM_COPY 1500b: min=708, max=3458, avg=740 >>> COPY+CSUM 32b: min=125, max=292, avg=136 >>> COPY+CSUM 1024b: min=541, max=750, avg=556 >>> COPY+CSUM 1500b: min=708, max=834, avg=740 >>> >>> == >>> >>> CSUM_COPY 32b: min=125, max=833, avg=140 >>> CSUM_COPY 1024b: min=666, max=55667, avg=673 <================= >>> CSUM_COPY 1500b: min=875, max=4208, avg=913 >>> COPY+CSUM 32b: min=125, max=375, avg=140 >>> COPY+CSUM 1024b: min=666, max=916, avg=673 >>> COPY+CSUM 1500b: min=875, max=1042, avg=913 >>> >>> ============ >>> >>> A few additional observations from looking at the implementation: >>> >>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete >>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb >>> which is faster. >>> >>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single >>> instruction. csum_and_copy loads/stores at best 4 words at a time, >>> only when src and dst are 32bit aligned (which matches the test case). >>> >>> arm64/armv8a uses load/store pair instructions to copy memory >>> blocks. It does not have asm-optimized csum_and_copy support, so it >>> uses the generic C version. >>> >>> What could be inferred in terms of prefetching and speculation might >>> explain some differences between the approaches too. >>> >>> I would be interested in any converging / diverging results testing the >>> same combo with a different test code, because from my standpoint, >>> things do not seem as obvious as they are supposed to be at the moment. >>> >> >> If copy+csum is not using any recent memcopy optimizations, that is an >> argument for at least equivalent performance. >> > > You mean the folded version, i.e. copy_and_csum? If so, I can't see any > way for that one to optimize via fast string operations. > >> But I don't get yet where the huge jittery should be coming from. Were >> the measurement loop preemptible? In that case I would expect a split > > Out of band stage, so only preemptible by Xenomai timer ticks, which > means only the host tick emulation at this point since there was no > outstanding Xenomai timers started yet when running the loops. Pretty > slim chance to see these latency spots consistently reproduced, and only > for the folded copy_sum version. > >> copy followed by another loop to csum should give much worse results as >> it needs the cache to stay warm - while copy-csum only touches the data >> once. >> > > Conversely, if the copy is much faster, the odds of being preempted may > increase, yielding better results overall. False alarm. Preemption was the issue, by the top half of the host tick handling in primary mode. The latest clock event scheduled by the kernel managed to enter the pipeline at a random time, but always within the execution window of the all-in-one csum_and_copy code. Although this event was deferred and not immediately passed to the in-band context, the time spent dealing with it was enough to show up in the results. > This said, I'm unsure this is > related to preemption anyway, this looks like the fingerprints of minor > faults with PTEs. Why this would only happen in the folded version is > still a mystery to me at the moment. It did not actually, no minor faults. The results are now consistent, both implementations are comparable performance-wise as the optimized memcpy tends to offset the advantage of calculating the checksum on the fly, saving a read access. armv8 benefits more from the former, since it does not have an optimized csum_and_copy but uses the generic C version instead. == x86[1] CSUM_COPY 32b: min=68, max=640, avg=70 CSUM_COPY 1024b: min=247, max=773, avg=252 CSUM_COPY 1500b: min=343, max=832, avg=350 COPY+CSUM 32b: min=100, max=651, avg=131 COPY+CSUM 1024b: min=296, max=752, avg=298 COPY+CSUM 1500b: min=397, max=845, avg=400 == x86[2] CSUM_COPY 32b: min=63, max=267, avg=66 CSUM_COPY 1024b: min=198, max=300, avg=201 CSUM_COPY 1500b: min=288, max=611, avg=291 COPY+CSUM 32b: min=56, max=360, avg=56 COPY+CSUM 1024b: min=228, max=420, avg=231 COPY+CSUM 1500b: min=307, max=337, avg=318 == armv7 (imx6qp) CSUM_COPY 32b: min=333, max=1334, avg=439 CSUM_COPY 1024b: min=1000, max=2000, avg=1045 CSUM_COPY 1500b: min=1000, max=2334, avg=1325 COPY+CSUM 32b: min=333, max=1334, avg=454 COPY+CSUM 1024b: min=1333, max=2334, avg=1347 COPY+CSUM 1500b: min=1666, max=2667, avg=1734 == armv8a (C4) CSUM_COPY 32b: min=125, max=792, avg=130 CSUM_COPY 1024b: min=500, max=1125, avg=550 CSUM_COPY 1500b: min=708, max=1833, avg=726 COPY+CSUM 32b: min=125, max=292, avg=130 COPY+CSUM 1024b: min=541, max=708, avg=550 COPY+CSUM 1500b: min=708, max=875, avg=730 -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-04-18 15:50 ` Philippe Gerum @ 2021-05-04 14:48 ` Philippe Gerum 2021-05-05 5:43 ` Jan Kiszka 0 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-05-04 14:48 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Philippe Gerum <rpm@xenomai.org> writes: > Philippe Gerum <rpm@xenomai.org> writes: > >> Jan Kiszka <jan.kiszka@siemens.com> writes: >> >>> On 16.04.21 18:48, Philippe Gerum wrote: >>>> >>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>> >>>>> On 15.04.21 09:54, Philippe Gerum wrote: >>>>>> >>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>>> >>>>>>> On 15.04.21 09:21, Philippe Gerum wrote: >>>>>>>> >>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>>>>> >>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>>>>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>>>>>>> >>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>>>>>>> a non-zero value would not even yield the proper result on some >>>>>>>>>> architectures. >>>>>>>>>> >>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>>>>>>> later kernels so that we still feed csum_partial() with the user-given >>>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>>>>>>> csum_partial() to do the right thing. >>>>>>>>>> >>>>>>>>> >>>>>>>>> If that issue only affects the ICMP code path, why not only changing >>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>>>>>>> and csum in one step? >>>>>>>>> >>>>>>>> >>>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the >>>>>>>> kernel folding the copy and checksum operations anymore, so I see no way >>>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum >>>>>>>> value to be fed in? >>>>>>>> >>>>>>> >>>>>>> rtskb_copy_and_csum_dev does not need that. >>>>>>> >>>>>> >>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do >>>>>> you want to deal with this? >>>>>> >>>>> >>>>> That is an internal API, so we don't care. >>>>> >>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to >>>>> (with a csum != 0), there would be not reason to make >>>>> rtskb_copy_and_csum_dev pay that price. >>>>> >>>> >>>> Well, there may be a reason to challenge the idea that a folded >>>> copy_and_csum is better for a real-time system than a split memcpy+csum >>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a >>>> few SoCs regarding this, and it turned out that optimizing the data copy >>>> to get the buffer quickly in place before checksumming the result may >>>> well yield much better results with respect to jitter than what >>>> csum_and_copy currently achieves on these SoCs. >>>> >>>> Inline csum_and_copy did perform slightly better on average (a couple of >>>> hundreds of nanosecs at best) but with pathological jittery in the worst >>>> case at times. On the contrary, the split memcpy+csum method did not >>>> exhibit such jittery during these tests, not once. >>>> >>>> - four SoCs tested (2 x x86, armv7, armv8a) >>>> - test code ran in kernel space (real-time task context, >>>> out-of-band/primary context) >>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial() >>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each >>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES >>>> - each run performed 1000,000 iterations of a given checksum loop, no >>>> pause in between. >>>> - no concurrent load on the board >>>> - all results in nanoseconds >>>> >>>> The worst results obtained are presented here for each SoC. >>>> >>>> x86[1] >>>> ------ >>>> >>>> vendor_id : GenuineIntel >>>> cpu family : 6 >>>> model : 92 >>>> model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz >>>> stepping : 9 >>>> cpu MHz : 1593.600 >>>> cache size : 1024 KB >>>> cpuid level : 21 >>>> wp : yes >>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts >>>> vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=68, max=653, avg=70 >>>> CSUM_COPY 1024b: min=248, max=373, avg=251 >>>> CSUM_COPY 1500b: min=344, max=3123, avg=350 <================= >>>> COPY+CSUM 32b: min=101, max=790, avg=103 >>>> COPY+CSUM 1024b: min=297, max=397, avg=300 >>>> COPY+CSUM 1500b: min=402, max=490, avg=405 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=68, max=1420, avg=70 >>>> CSUM_COPY 1024b: min=248, max=29706, avg=251 <================= >>>> CSUM_COPY 1500b: min=344, max=792, avg=350 >>>> COPY+CSUM 32b: min=101, max=872, avg=103 >>>> COPY+CSUM 1024b: min=297, max=776, avg=300 >>>> COPY+CSUM 1500b: min=402, max=853, avg=405 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=68, max=661, avg=70 >>>> CSUM_COPY 1024b: min=248, max=1714, avg=251 >>>> CSUM_COPY 1500b: min=344, max=33937, avg=350 <================= >>>> COPY+CSUM 32b: min=101, max=610, avg=103 >>>> COPY+CSUM 1024b: min=297, max=605, avg=300 >>>> COPY+CSUM 1500b: min=402, max=712, avg=405 >>>> >>>> x86[2] >>>> ------ >>>> >>>> vendor_id : GenuineIntel >>>> cpu family : 6 >>>> model : 23 >>>> model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz >>>> stepping : 6 >>>> microcode : 0x60c >>>> cpu MHz : 1627.113 >>>> cache size : 3072 KB >>>> cpuid level : 10 >>>> wp : yes >>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm >>>> >>>> CSUM_COPY 32b: min=558, max=31010, avg=674 <================= >>>> CSUM_COPY 1024b: min=558, max=2794, avg=745 >>>> CSUM_COPY 1500b: min=558, max=2794, avg=841 >>>> COPY+CSUM 32b: min=558, max=2794, avg=671 >>>> COPY+CSUM 1024b: min=558, max=2794, avg=778 >>>> COPY+CSUM 1500b: min=838, max=2794, avg=865 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=59, max=532, avg=62 >>>> CSUM_COPY 1024b: min=198, max=270, avg=201 >>>> CSUM_COPY 1500b: min=288, max=34921, avg=289 <================= >>>> COPY+CSUM 32b: min=53, max=326, avg=56 >>>> COPY+CSUM 1024b: min=228, max=461, avg=232 >>>> COPY+CSUM 1500b: min=311, max=341, avg=317 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=59, max=382, avg=62 >>>> CSUM_COPY 1024b: min=198, max=383, avg=201 >>>> CSUM_COPY 1500b: min=285, max=21235, avg=289 <================= >>>> COPY+CSUM 32b: min=52, max=300, avg=56 >>>> COPY+CSUM 1024b: min=228, max=348, avg=232 >>>> COPY+CSUM 1500b: min=311, max=409, avg=317 >>>> >>>> Cortex A9 quad-core 1.2Ghz (iMX6qp) >>>> ----------------------------------- >>>> >>>> model name : ARMv7 Processor rev 10 (v7l) >>>> Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 >>>> CPU implementer : 0x41 >>>> CPU architecture: 7 >>>> CPU variant : 0x2 >>>> CPU part : 0xc09 >>>> CPU revision : 10 >>>> >>>> CSUM_COPY 32b: min=333, max=1334, avg=440 >>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060 >>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357 <================= >>>> COPY+CSUM 32b: min=333, max=1334, avg=476 >>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324 >>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=333, max=1334, avg=439 >>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060 <================= >>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351 >>>> COPY+CSUM 32b: min=333, max=1334, avg=476 >>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324 >>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=333, max=1666, avg=454 >>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060 >>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348 <================= >>>> COPY+CSUM 32b: min=333, max=1334, avg=454 >>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317 >>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712 >>>> >>>> Cortex A55 quad-core 2Ghz (Odroid C4) >>>> ------------------------------------- >>>> >>>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp >>>> CPU implementer : 0x41 >>>> CPU architecture: 8 >>>> CPU variant : 0x1 >>>> CPU part : 0xd05 >>>> CPU revision : 0 >>>> >>>> >>>> CSUM_COPY 32b: min=125, max=833, avg=140 >>>> CSUM_COPY 1024b: min=625, max=41916, avg=673 <================= >>>> CSUM_COPY 1500b: min=875, max=3875, avg=923 >>>> COPY+CSUM 32b: min=125, max=458, avg=140 >>>> COPY+CSUM 1024b: min=625, max=1166, avg=666 >>>> COPY+CSUM 1500b: min=875, max=1167, avg=913 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=125, max=1292, avg=139 >>>> CSUM_COPY 1024b: min=541, max=48333, avg=555 >>>> CSUM_COPY 1500b: min=708, max=3458, avg=740 >>>> COPY+CSUM 32b: min=125, max=292, avg=136 >>>> COPY+CSUM 1024b: min=541, max=750, avg=556 >>>> COPY+CSUM 1500b: min=708, max=834, avg=740 >>>> >>>> == >>>> >>>> CSUM_COPY 32b: min=125, max=833, avg=140 >>>> CSUM_COPY 1024b: min=666, max=55667, avg=673 <================= >>>> CSUM_COPY 1500b: min=875, max=4208, avg=913 >>>> COPY+CSUM 32b: min=125, max=375, avg=140 >>>> COPY+CSUM 1024b: min=666, max=916, avg=673 >>>> COPY+CSUM 1500b: min=875, max=1042, avg=913 >>>> >>>> ============ >>>> >>>> A few additional observations from looking at the implementation: >>>> >>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete >>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb >>>> which is faster. >>>> >>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single >>>> instruction. csum_and_copy loads/stores at best 4 words at a time, >>>> only when src and dst are 32bit aligned (which matches the test case). >>>> >>>> arm64/armv8a uses load/store pair instructions to copy memory >>>> blocks. It does not have asm-optimized csum_and_copy support, so it >>>> uses the generic C version. >>>> >>>> What could be inferred in terms of prefetching and speculation might >>>> explain some differences between the approaches too. >>>> >>>> I would be interested in any converging / diverging results testing the >>>> same combo with a different test code, because from my standpoint, >>>> things do not seem as obvious as they are supposed to be at the moment. >>>> >>> >>> If copy+csum is not using any recent memcopy optimizations, that is an >>> argument for at least equivalent performance. >>> >> >> You mean the folded version, i.e. copy_and_csum? If so, I can't see any >> way for that one to optimize via fast string operations. >> >>> But I don't get yet where the huge jittery should be coming from. Were >>> the measurement loop preemptible? In that case I would expect a split >> >> Out of band stage, so only preemptible by Xenomai timer ticks, which >> means only the host tick emulation at this point since there was no >> outstanding Xenomai timers started yet when running the loops. Pretty >> slim chance to see these latency spots consistently reproduced, and only >> for the folded copy_sum version. >> >>> copy followed by another loop to csum should give much worse results as >>> it needs the cache to stay warm - while copy-csum only touches the data >>> once. >>> >> >> Conversely, if the copy is much faster, the odds of being preempted may >> increase, yielding better results overall. > > False alarm. Preemption was the issue, by the top half of the host tick > handling in primary mode. The latest clock event scheduled by the kernel > managed to enter the pipeline at a random time, but always within the > execution window of the all-in-one csum_and_copy code. Although this > event was deferred and not immediately passed to the in-band context, > the time spent dealing with it was enough to show up in the results. > >> This said, I'm unsure this is >> related to preemption anyway, this looks like the fingerprints of minor >> faults with PTEs. Why this would only happen in the folded version is >> still a mystery to me at the moment. > > It did not actually, no minor faults. > > The results are now consistent, both implementations are comparable > performance-wise as the optimized memcpy tends to offset the advantage > of calculating the checksum on the fly, saving a read access. armv8 > benefits more from the former, since it does not have an optimized > csum_and_copy but uses the generic C version instead. > > == x86[1] > > CSUM_COPY 32b: min=68, max=640, avg=70 > CSUM_COPY 1024b: min=247, max=773, avg=252 > CSUM_COPY 1500b: min=343, max=832, avg=350 > COPY+CSUM 32b: min=100, max=651, avg=131 > COPY+CSUM 1024b: min=296, max=752, avg=298 > COPY+CSUM 1500b: min=397, max=845, avg=400 > > == x86[2] > > CSUM_COPY 32b: min=63, max=267, avg=66 > CSUM_COPY 1024b: min=198, max=300, avg=201 > CSUM_COPY 1500b: min=288, max=611, avg=291 > COPY+CSUM 32b: min=56, max=360, avg=56 > COPY+CSUM 1024b: min=228, max=420, avg=231 > COPY+CSUM 1500b: min=307, max=337, avg=318 > > == armv7 (imx6qp) > > CSUM_COPY 32b: min=333, max=1334, avg=439 > CSUM_COPY 1024b: min=1000, max=2000, avg=1045 > CSUM_COPY 1500b: min=1000, max=2334, avg=1325 > COPY+CSUM 32b: min=333, max=1334, avg=454 > COPY+CSUM 1024b: min=1333, max=2334, avg=1347 > COPY+CSUM 1500b: min=1666, max=2667, avg=1734 > > == armv8a (C4) > > CSUM_COPY 32b: min=125, max=792, avg=130 > CSUM_COPY 1024b: min=500, max=1125, avg=550 > CSUM_COPY 1500b: min=708, max=1833, avg=726 > COPY+CSUM 32b: min=125, max=292, avg=130 > COPY+CSUM 1024b: min=541, max=708, avg=550 > COPY+CSUM 1500b: min=708, max=875, avg=730 Last round of results about this issue, now measuring the csum_copy vs csum+copy performances in idle vs busy scenarios. Busy means hackbench+dd loop streaming 128M in the background from zero -> null, in order to badly trash the D-caches while the test runs. All figures in nanosecs. iMX6QP (Cortex A9) ------------------ === idle CSUM_COPY 32b: min=333, max=1333, avg=439 CSUM_COPY 1024b: min=1000, max=2000, avg=1045 CSUM_COPY 1500b: min=1333, max=2000, avg=1333 COPY+CSUM 32b: min=333, max=1333, avg=443 COPY+CSUM 1024b: min=1000, max=2334, avg=1345 COPY+CSUM 1500b: min=1666, max=2667, avg=1737 === busy CSUM_COPY 32b: min=333, max=4333, avg=466 CSUM_COPY 1024b: min=1000, max=5000, avg=1088 CSUM_COPY 1500b: min=1333, max=5667, avg=1393 COPY+CSUM 32b: min=333, max=1334, avg=454 COPY+CSUM 1024b: min=1000, max=2000, avg=1341 COPY+CSUM 1500b: min=1666, max=2666, avg=1745 C4 (Cortex A55) --------------- === idle CSUM_COPY 32b: min=125, max=791, avg=130 CSUM_COPY 1024b: min=541, max=834, avg=550 CSUM_COPY 1500b: min=708, max=1875, avg=740 COPY+CSUM 32b: min=125, max=167, avg=133 COPY+CSUM 1024b: min=541, max=625, avg=553 COPY+CSUM 1500b: min=708, max=750, avg=730 === busy CSUM_COPY 32b: min=125, max=792, avg=133 CSUM_COPY 1024b: min=500, max=2000, avg=552 CSUM_COPY 1500b: min=708, max=1542, avg=744 COPY+CSUM 32b: min=125, max=375, avg=133 COPY+CSUM 1024b: min=500, max=709, avg=553 COPY+CSUM 1500b: min=708, max=916, avg=743 x86 (atom x5) ------------- === idle CSUM_COPY 32b: min=67, max=590, avg=70 CSUM_COPY 1024b: min=245, max=385, avg=251 CSUM_COPY 1500b: min=343, max=521, avg=350 COPY+CSUM 32b: min=101, max=679, avg=117 COPY+CSUM 1024b: min=296, max=379, avg=298 COPY+CSUM 1500b: min=399, max=502, avg=404 == busy CSUM_COPY 32b: min=65, max=709, avg=71 CSUM_COPY 1024b: min=243, max=702, avg=252 CSUM_COPY 1500b: min=340, max=1055, avg=351 COPY+CSUM 32b: min=100, max=665, avg=120 COPY+CSUM 1024b: min=295, max=669, avg=298 COPY+CSUM 1500b: min=399, max=686, avg=403 As expected from the code, arm64 which has no folded csum_copy implementation makes the best of using the split copy+csum path. All architectures seem to benefit from optimized memcpy under load when it comes to the worst case execution time. x86 is less prone to jittery under cache trashing than others as usual, but even there, the max. figures for csum+copy in busy context look pretty much on par with the csum_copy version. -- Philippe. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() 2021-05-04 14:48 ` Philippe Gerum @ 2021-05-05 5:43 ` Jan Kiszka 0 siblings, 0 replies; 27+ messages in thread From: Jan Kiszka @ 2021-05-05 5:43 UTC (permalink / raw) To: Philippe Gerum; +Cc: xenomai On 04.05.21 16:48, Philippe Gerum wrote: > > Philippe Gerum <rpm@xenomai.org> writes: > >> Philippe Gerum <rpm@xenomai.org> writes: >> >>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>> >>>> On 16.04.21 18:48, Philippe Gerum wrote: >>>>> >>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>> >>>>>> On 15.04.21 09:54, Philippe Gerum wrote: >>>>>>> >>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>>>> >>>>>>>> On 15.04.21 09:21, Philippe Gerum wrote: >>>>>>>>> >>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes: >>>>>>>>> >>>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote: >>>>>>>>>>> From: Philippe Gerum <rpm@xenomai.org> >>>>>>>>>>> >>>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its >>>>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing >>>>>>>>>>> a non-zero value would not even yield the proper result on some >>>>>>>>>>> architectures. >>>>>>>>>>> >>>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed >>>>>>>>>>> to be used in the next computation, so let's wrap net_csum_copy() to >>>>>>>>>>> csum_partial_copy_nocheck() for pre-5.9 kernels, and open code it for >>>>>>>>>>> later kernels so that we still feed csum_partial() with the user-given >>>>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of >>>>>>>>>>> csum_partial() to do the right thing. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> If that issue only affects the ICMP code path, why not only changing >>>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy >>>>>>>>>> and csum in one step? >>>>>>>>>> >>>>>>>>> >>>>>>>>> As a result of #cc44c17baf7f3, I see no common helper available from the >>>>>>>>> kernel folding the copy and checksum operations anymore, so I see no way >>>>>>>>> to keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one >>>>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum >>>>>>>>> value to be fed in? >>>>>>>>> >>>>>>>> >>>>>>>> rtskb_copy_and_csum_dev does not need that. >>>>>>>> >>>>>>> >>>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do >>>>>>> you want to deal with this? >>>>>>> >>>>>> >>>>>> That is an internal API, so we don't care. >>>>>> >>>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to >>>>>> (with a csum != 0), there would be not reason to make >>>>>> rtskb_copy_and_csum_dev pay that price. >>>>>> >>>>> >>>>> Well, there may be a reason to challenge the idea that a folded >>>>> copy_and_csum is better for a real-time system than a split memcpy+csum >>>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a >>>>> few SoCs regarding this, and it turned out that optimizing the data copy >>>>> to get the buffer quickly in place before checksumming the result may >>>>> well yield much better results with respect to jitter than what >>>>> csum_and_copy currently achieves on these SoCs. >>>>> >>>>> Inline csum_and_copy did perform slightly better on average (a couple of >>>>> hundreds of nanosecs at best) but with pathological jittery in the worst >>>>> case at times. On the contrary, the split memcpy+csum method did not >>>>> exhibit such jittery during these tests, not once. >>>>> >>>>> - four SoCs tested (2 x x86, armv7, armv8a) >>>>> - test code ran in kernel space (real-time task context, >>>>> out-of-band/primary context) >>>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial() >>>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each >>>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES >>>>> - each run performed 1000,000 iterations of a given checksum loop, no >>>>> pause in between. >>>>> - no concurrent load on the board >>>>> - all results in nanoseconds >>>>> >>>>> The worst results obtained are presented here for each SoC. >>>>> >>>>> x86[1] >>>>> ------ >>>>> >>>>> vendor_id : GenuineIntel >>>>> cpu family : 6 >>>>> model : 92 >>>>> model name : Intel(R) Atom(TM) Processor E3940 @ 1.60GHz >>>>> stepping : 9 >>>>> cpu MHz : 1593.600 >>>>> cache size : 1024 KB >>>>> cpuid level : 21 >>>>> wp : yes >>>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts >>>>> vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=68, max=653, avg=70 >>>>> CSUM_COPY 1024b: min=248, max=373, avg=251 >>>>> CSUM_COPY 1500b: min=344, max=3123, avg=350 <================= >>>>> COPY+CSUM 32b: min=101, max=790, avg=103 >>>>> COPY+CSUM 1024b: min=297, max=397, avg=300 >>>>> COPY+CSUM 1500b: min=402, max=490, avg=405 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=68, max=1420, avg=70 >>>>> CSUM_COPY 1024b: min=248, max=29706, avg=251 <================= >>>>> CSUM_COPY 1500b: min=344, max=792, avg=350 >>>>> COPY+CSUM 32b: min=101, max=872, avg=103 >>>>> COPY+CSUM 1024b: min=297, max=776, avg=300 >>>>> COPY+CSUM 1500b: min=402, max=853, avg=405 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=68, max=661, avg=70 >>>>> CSUM_COPY 1024b: min=248, max=1714, avg=251 >>>>> CSUM_COPY 1500b: min=344, max=33937, avg=350 <================= >>>>> COPY+CSUM 32b: min=101, max=610, avg=103 >>>>> COPY+CSUM 1024b: min=297, max=605, avg=300 >>>>> COPY+CSUM 1500b: min=402, max=712, avg=405 >>>>> >>>>> x86[2] >>>>> ------ >>>>> >>>>> vendor_id : GenuineIntel >>>>> cpu family : 6 >>>>> model : 23 >>>>> model name : Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz >>>>> stepping : 6 >>>>> microcode : 0x60c >>>>> cpu MHz : 1627.113 >>>>> cache size : 3072 KB >>>>> cpuid level : 10 >>>>> wp : yes >>>>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm >>>>> >>>>> CSUM_COPY 32b: min=558, max=31010, avg=674 <================= >>>>> CSUM_COPY 1024b: min=558, max=2794, avg=745 >>>>> CSUM_COPY 1500b: min=558, max=2794, avg=841 >>>>> COPY+CSUM 32b: min=558, max=2794, avg=671 >>>>> COPY+CSUM 1024b: min=558, max=2794, avg=778 >>>>> COPY+CSUM 1500b: min=838, max=2794, avg=865 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=59, max=532, avg=62 >>>>> CSUM_COPY 1024b: min=198, max=270, avg=201 >>>>> CSUM_COPY 1500b: min=288, max=34921, avg=289 <================= >>>>> COPY+CSUM 32b: min=53, max=326, avg=56 >>>>> COPY+CSUM 1024b: min=228, max=461, avg=232 >>>>> COPY+CSUM 1500b: min=311, max=341, avg=317 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=59, max=382, avg=62 >>>>> CSUM_COPY 1024b: min=198, max=383, avg=201 >>>>> CSUM_COPY 1500b: min=285, max=21235, avg=289 <================= >>>>> COPY+CSUM 32b: min=52, max=300, avg=56 >>>>> COPY+CSUM 1024b: min=228, max=348, avg=232 >>>>> COPY+CSUM 1500b: min=311, max=409, avg=317 >>>>> >>>>> Cortex A9 quad-core 1.2Ghz (iMX6qp) >>>>> ----------------------------------- >>>>> >>>>> model name : ARMv7 Processor rev 10 (v7l) >>>>> Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 >>>>> CPU implementer : 0x41 >>>>> CPU architecture: 7 >>>>> CPU variant : 0x2 >>>>> CPU part : 0xc09 >>>>> CPU revision : 10 >>>>> >>>>> CSUM_COPY 32b: min=333, max=1334, avg=440 >>>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060 >>>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357 <================= >>>>> COPY+CSUM 32b: min=333, max=1334, avg=476 >>>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324 >>>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=333, max=1334, avg=439 >>>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060 <================= >>>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351 >>>>> COPY+CSUM 32b: min=333, max=1334, avg=476 >>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324 >>>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=333, max=1666, avg=454 >>>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060 >>>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348 <================= >>>>> COPY+CSUM 32b: min=333, max=1334, avg=454 >>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317 >>>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712 >>>>> >>>>> Cortex A55 quad-core 2Ghz (Odroid C4) >>>>> ------------------------------------- >>>>> >>>>> Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp >>>>> CPU implementer : 0x41 >>>>> CPU architecture: 8 >>>>> CPU variant : 0x1 >>>>> CPU part : 0xd05 >>>>> CPU revision : 0 >>>>> >>>>> >>>>> CSUM_COPY 32b: min=125, max=833, avg=140 >>>>> CSUM_COPY 1024b: min=625, max=41916, avg=673 <================= >>>>> CSUM_COPY 1500b: min=875, max=3875, avg=923 >>>>> COPY+CSUM 32b: min=125, max=458, avg=140 >>>>> COPY+CSUM 1024b: min=625, max=1166, avg=666 >>>>> COPY+CSUM 1500b: min=875, max=1167, avg=913 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=125, max=1292, avg=139 >>>>> CSUM_COPY 1024b: min=541, max=48333, avg=555 >>>>> CSUM_COPY 1500b: min=708, max=3458, avg=740 >>>>> COPY+CSUM 32b: min=125, max=292, avg=136 >>>>> COPY+CSUM 1024b: min=541, max=750, avg=556 >>>>> COPY+CSUM 1500b: min=708, max=834, avg=740 >>>>> >>>>> == >>>>> >>>>> CSUM_COPY 32b: min=125, max=833, avg=140 >>>>> CSUM_COPY 1024b: min=666, max=55667, avg=673 <================= >>>>> CSUM_COPY 1500b: min=875, max=4208, avg=913 >>>>> COPY+CSUM 32b: min=125, max=375, avg=140 >>>>> COPY+CSUM 1024b: min=666, max=916, avg=673 >>>>> COPY+CSUM 1500b: min=875, max=1042, avg=913 >>>>> >>>>> ============ >>>>> >>>>> A few additional observations from looking at the implementation: >>>>> >>>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete >>>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb >>>>> which is faster. >>>>> >>>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single >>>>> instruction. csum_and_copy loads/stores at best 4 words at a time, >>>>> only when src and dst are 32bit aligned (which matches the test case). >>>>> >>>>> arm64/armv8a uses load/store pair instructions to copy memory >>>>> blocks. It does not have asm-optimized csum_and_copy support, so it >>>>> uses the generic C version. >>>>> >>>>> What could be inferred in terms of prefetching and speculation might >>>>> explain some differences between the approaches too. >>>>> >>>>> I would be interested in any converging / diverging results testing the >>>>> same combo with a different test code, because from my standpoint, >>>>> things do not seem as obvious as they are supposed to be at the moment. >>>>> >>>> >>>> If copy+csum is not using any recent memcopy optimizations, that is an >>>> argument for at least equivalent performance. >>>> >>> >>> You mean the folded version, i.e. copy_and_csum? If so, I can't see any >>> way for that one to optimize via fast string operations. >>> >>>> But I don't get yet where the huge jittery should be coming from. Were >>>> the measurement loop preemptible? In that case I would expect a split >>> >>> Out of band stage, so only preemptible by Xenomai timer ticks, which >>> means only the host tick emulation at this point since there was no >>> outstanding Xenomai timers started yet when running the loops. Pretty >>> slim chance to see these latency spots consistently reproduced, and only >>> for the folded copy_sum version. >>> >>>> copy followed by another loop to csum should give much worse results as >>>> it needs the cache to stay warm - while copy-csum only touches the data >>>> once. >>>> >>> >>> Conversely, if the copy is much faster, the odds of being preempted may >>> increase, yielding better results overall. >> >> False alarm. Preemption was the issue, by the top half of the host tick >> handling in primary mode. The latest clock event scheduled by the kernel >> managed to enter the pipeline at a random time, but always within the >> execution window of the all-in-one csum_and_copy code. Although this >> event was deferred and not immediately passed to the in-band context, >> the time spent dealing with it was enough to show up in the results. >> >>> This said, I'm unsure this is >>> related to preemption anyway, this looks like the fingerprints of minor >>> faults with PTEs. Why this would only happen in the folded version is >>> still a mystery to me at the moment. >> >> It did not actually, no minor faults. >> >> The results are now consistent, both implementations are comparable >> performance-wise as the optimized memcpy tends to offset the advantage >> of calculating the checksum on the fly, saving a read access. armv8 >> benefits more from the former, since it does not have an optimized >> csum_and_copy but uses the generic C version instead. >> >> == x86[1] >> >> CSUM_COPY 32b: min=68, max=640, avg=70 >> CSUM_COPY 1024b: min=247, max=773, avg=252 >> CSUM_COPY 1500b: min=343, max=832, avg=350 >> COPY+CSUM 32b: min=100, max=651, avg=131 >> COPY+CSUM 1024b: min=296, max=752, avg=298 >> COPY+CSUM 1500b: min=397, max=845, avg=400 >> >> == x86[2] >> >> CSUM_COPY 32b: min=63, max=267, avg=66 >> CSUM_COPY 1024b: min=198, max=300, avg=201 >> CSUM_COPY 1500b: min=288, max=611, avg=291 >> COPY+CSUM 32b: min=56, max=360, avg=56 >> COPY+CSUM 1024b: min=228, max=420, avg=231 >> COPY+CSUM 1500b: min=307, max=337, avg=318 >> >> == armv7 (imx6qp) >> >> CSUM_COPY 32b: min=333, max=1334, avg=439 >> CSUM_COPY 1024b: min=1000, max=2000, avg=1045 >> CSUM_COPY 1500b: min=1000, max=2334, avg=1325 >> COPY+CSUM 32b: min=333, max=1334, avg=454 >> COPY+CSUM 1024b: min=1333, max=2334, avg=1347 >> COPY+CSUM 1500b: min=1666, max=2667, avg=1734 >> >> == armv8a (C4) >> >> CSUM_COPY 32b: min=125, max=792, avg=130 >> CSUM_COPY 1024b: min=500, max=1125, avg=550 >> CSUM_COPY 1500b: min=708, max=1833, avg=726 >> COPY+CSUM 32b: min=125, max=292, avg=130 >> COPY+CSUM 1024b: min=541, max=708, avg=550 >> COPY+CSUM 1500b: min=708, max=875, avg=730 > > Last round of results about this issue, now measuring the csum_copy vs > csum+copy performances in idle vs busy scenarios. Busy means > hackbench+dd loop streaming 128M in the background from zero -> null, in > order to badly trash the D-caches while the test runs. All figures in > nanosecs. > > iMX6QP (Cortex A9) > ------------------ > > === idle > > CSUM_COPY 32b: min=333, max=1333, avg=439 > CSUM_COPY 1024b: min=1000, max=2000, avg=1045 > CSUM_COPY 1500b: min=1333, max=2000, avg=1333 > COPY+CSUM 32b: min=333, max=1333, avg=443 > COPY+CSUM 1024b: min=1000, max=2334, avg=1345 > COPY+CSUM 1500b: min=1666, max=2667, avg=1737 > > === busy > > CSUM_COPY 32b: min=333, max=4333, avg=466 > CSUM_COPY 1024b: min=1000, max=5000, avg=1088 > CSUM_COPY 1500b: min=1333, max=5667, avg=1393 > COPY+CSUM 32b: min=333, max=1334, avg=454 > COPY+CSUM 1024b: min=1000, max=2000, avg=1341 > COPY+CSUM 1500b: min=1666, max=2666, avg=1745 > > C4 (Cortex A55) > --------------- > > === idle > > CSUM_COPY 32b: min=125, max=791, avg=130 > CSUM_COPY 1024b: min=541, max=834, avg=550 > CSUM_COPY 1500b: min=708, max=1875, avg=740 > COPY+CSUM 32b: min=125, max=167, avg=133 > COPY+CSUM 1024b: min=541, max=625, avg=553 > COPY+CSUM 1500b: min=708, max=750, avg=730 > > === busy > > CSUM_COPY 32b: min=125, max=792, avg=133 > CSUM_COPY 1024b: min=500, max=2000, avg=552 > CSUM_COPY 1500b: min=708, max=1542, avg=744 > COPY+CSUM 32b: min=125, max=375, avg=133 > COPY+CSUM 1024b: min=500, max=709, avg=553 > COPY+CSUM 1500b: min=708, max=916, avg=743 > > x86 (atom x5) > ------------- > > === idle > > CSUM_COPY 32b: min=67, max=590, avg=70 > CSUM_COPY 1024b: min=245, max=385, avg=251 > CSUM_COPY 1500b: min=343, max=521, avg=350 > COPY+CSUM 32b: min=101, max=679, avg=117 > COPY+CSUM 1024b: min=296, max=379, avg=298 > COPY+CSUM 1500b: min=399, max=502, avg=404 > > == busy > > CSUM_COPY 32b: min=65, max=709, avg=71 > CSUM_COPY 1024b: min=243, max=702, avg=252 > CSUM_COPY 1500b: min=340, max=1055, avg=351 > COPY+CSUM 32b: min=100, max=665, avg=120 > COPY+CSUM 1024b: min=295, max=669, avg=298 > COPY+CSUM 1500b: min=399, max=686, avg=403 > > As expected from the code, arm64 which has no folded csum_copy > implementation makes the best of using the split copy+csum path. All > architectures seem to benefit from optimized memcpy under load when it > comes to the worst case execution time. x86 is less prone to jittery > under cache trashing than others as usual, but even there, the > max. figures for csum+copy in busy context look pretty much on par with > the csum_copy version. > Then let's go for your conversion - but then possibly even unconditionally, no? Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 6/7] drivers/net: icmp: remove variable-length array 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum ` (4 preceding siblings ...) 2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-04-07 10:24 ` Jan Kiszka 2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum 2021-04-07 10:39 ` [PATCH v2 0/7] assorted v5.x related fixups Jan Kiszka 7 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- kernel/drivers/net/stack/ipv4/icmp.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c index 0671cbc93..c3563595e 100644 --- a/kernel/drivers/net/stack/ipv4/icmp.c +++ b/kernel/drivers/net/stack/ipv4/icmp.c @@ -313,8 +313,17 @@ static int rt_icmp_send_request(u32 daddr, struct icmp_bxm *icmp_param) int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size) { struct icmp_bxm icmp_param; - unsigned char pattern_buf[msg_size]; + unsigned char *pattern_buf; off_t pos; + int ret; + + /* + * This is ping, exec time is not that critical, so + * rtdm_malloc() is ok for the purpose of echoing. + */ + pattern_buf = rtdm_malloc(msg_size); + if (pattern_buf == NULL) + return -ENOMEM; /* first purge any potentially pending ICMP fragments */ rt_ip_frag_invalidate_socket(icmp_socket); @@ -343,7 +352,10 @@ int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size) } icmp_param.data.buf = pattern_buf; - return rt_icmp_send_request(daddr, &icmp_param); + ret = rt_icmp_send_request(daddr, &icmp_param); + rtdm_free(pattern_buf); + + return ret; } /*** -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 6/7] drivers/net: icmp: remove variable-length array 2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum @ 2021-04-07 10:24 ` Jan Kiszka 0 siblings, 0 replies; 27+ messages in thread From: Jan Kiszka @ 2021-04-07 10:24 UTC (permalink / raw) To: Philippe Gerum, xenomai On 27.03.21 11:19, Philippe Gerum wrote: > From: Philippe Gerum <rpm@xenomai.org> > > Signed-off-by: Philippe Gerum <rpm@xenomai.org> > --- > kernel/drivers/net/stack/ipv4/icmp.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/kernel/drivers/net/stack/ipv4/icmp.c b/kernel/drivers/net/stack/ipv4/icmp.c > index 0671cbc93..c3563595e 100644 > --- a/kernel/drivers/net/stack/ipv4/icmp.c > +++ b/kernel/drivers/net/stack/ipv4/icmp.c > @@ -313,8 +313,17 @@ static int rt_icmp_send_request(u32 daddr, struct icmp_bxm *icmp_param) > int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size) > { > struct icmp_bxm icmp_param; > - unsigned char pattern_buf[msg_size]; > + unsigned char *pattern_buf; > off_t pos; > + int ret; > + > + /* > + * This is ping, exec time is not that critical, so > + * rtdm_malloc() is ok for the purpose of echoing. Solution is fine but comment is misleading: This is the submission path, not the reflection. We are setting up the ping message here, and this is indeed not time-critical. I'll fix that up on merge. > + */ > + pattern_buf = rtdm_malloc(msg_size); > + if (pattern_buf == NULL) > + return -ENOMEM; > > /* first purge any potentially pending ICMP fragments */ > rt_ip_frag_invalidate_socket(icmp_socket); > @@ -343,7 +352,10 @@ int rt_icmp_send_echo(u32 daddr, u16 id, u16 sequence, size_t msg_size) > } > icmp_param.data.buf = pattern_buf; > > - return rt_icmp_send_request(daddr, &icmp_param); > + ret = rt_icmp_send_request(daddr, &icmp_param); > + rtdm_free(pattern_buf); > + > + return ret; > } > > /*** > Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v2 7/7] drivers/net: cfg: fix config file load up 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum ` (5 preceding siblings ...) 2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum @ 2021-03-27 10:19 ` Philippe Gerum 2021-04-07 10:35 ` Jan Kiszka 2021-04-07 10:39 ` [PATCH v2 0/7] assorted v5.x related fixups Jan Kiszka 7 siblings, 1 reply; 27+ messages in thread From: Philippe Gerum @ 2021-03-27 10:19 UTC (permalink / raw) To: xenomai From: Philippe Gerum <rpm@xenomai.org> set_fs() is on its way out, so we cannot open code a file read operation by calling the VFS handler directly anymore, faking a user address space. We do have kernel interfaces for loading files though, particularly kernel_read_file(). So let's use that one for loading the configuration file contents. Unfortunately, the signature of this service changed during the 5.9-rc cycle, so we have to resort to an ugly wrapper to cope with all supported kernels once again. Sigh. Signed-off-by: Philippe Gerum <rpm@xenomai.org> --- .../include/asm-generic/xenomai/wrappers.h | 15 ++++ kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c | 74 +++++++++---------- 2 files changed, 52 insertions(+), 37 deletions(-) diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h index cfd28fc47..f15fe4389 100644 --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h @@ -220,4 +220,19 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, }) #endif +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \ + ({ \ + loff_t ___file_size; \ + int __ret; \ + __ret = kernel_read_file(__file, __buf, &___file_size, \ + __buf_size, __id); \ + (*__file_size) = ___file_size; \ + __ret; \ + }) +#else +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \ + kernel_read_file(__file, 0, __buf, __buf_size, __file_size, __id) +#endif + #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */ diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c index 769b4e143..e460571c2 100644 --- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c +++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c @@ -22,6 +22,7 @@ * */ +#include <linux/fs.h> #include <linux/file.h> #include <linux/vmalloc.h> @@ -196,6 +197,40 @@ void cleanup_cmd_detach(void *priv_data) kfree_rtskb(cmd->args.detach.stage2_chain); } +static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd) +{ + size_t file_size = 0; + struct file *filp; + loff_t i_size; + int ret; + + filp = filp_open(cfgfile->name, O_RDONLY, 0); + if (IS_ERR(filp)) + return PTR_ERR(filp); + + i_size = i_size_read(file_inode(filp)); + if (i_size <= 0) { + /* allocate buffer even for empty files */ + cfgfile->buffer = vmalloc(1); + } else { + cfgfile->buffer = NULL; + /* Assume 1Mb should be enough for a config file... */ + ret = read_file_from_kernel(filp, &cfgfile->buffer, + 1UL << 30, &file_size, READING_UNKNOWN); + if (ret < 0) { + fput(filp); + return ret; + } + } + + fput(filp); + cfgfile->size = file_size; + + /* dispatch again, this time with new file attached */ + return rtpc_dispatch_call(rtcfg_event_handler, 0, cmd, + sizeof(*cmd), NULL, cleanup_cmd_add); +} + int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd) { struct rtcfg_connection *conn_buf; @@ -264,46 +299,11 @@ int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd) /* load file if missing */ if (ret > 0) { - struct file *filp; - mm_segment_t oldfs; - - filp = filp_open(file->name, O_RDONLY, 0); - if (IS_ERR(filp)) { + ret = load_cfg_file(file, cmd); + if (ret) { rtcfg_unlockwr_proc(cmd->internal.data.ifindex); - ret = PTR_ERR(filp); goto err; } - - file->size = filp->f_path.dentry->d_inode->i_size; - - /* allocate buffer even for empty files */ - file->buffer = vmalloc((file->size) ? file->size : 1); - if (file->buffer == NULL) { - rtcfg_unlockwr_proc(cmd->internal.data.ifindex); - fput(filp); - ret = -ENOMEM; - goto err; - } - - oldfs = get_fs(); - set_fs(KERNEL_DS); - filp->f_pos = 0; - - ret = filp->f_op->read(filp, file->buffer, file->size, - &filp->f_pos); - - set_fs(oldfs); - fput(filp); - - if (ret != (int)file->size) { - rtcfg_unlockwr_proc(cmd->internal.data.ifindex); - ret = -EIO; - goto err; - } - - /* dispatch again, this time with new file attached */ - ret = rtpc_dispatch_call(rtcfg_event_handler, 0, cmd, - sizeof(*cmd), NULL, cleanup_cmd_add); } return ret; -- 2.29.2 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 7/7] drivers/net: cfg: fix config file load up 2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum @ 2021-04-07 10:35 ` Jan Kiszka 2021-05-04 17:18 ` Philippe Gerum 0 siblings, 1 reply; 27+ messages in thread From: Jan Kiszka @ 2021-04-07 10:35 UTC (permalink / raw) To: Philippe Gerum, xenomai On 27.03.21 11:19, Philippe Gerum wrote: > From: Philippe Gerum <rpm@xenomai.org> > > set_fs() is on its way out, so we cannot open code a file read > operation by calling the VFS handler directly anymore, faking a user > address space. > > We do have kernel interfaces for loading files though, particularly > kernel_read_file(). So let's use that one for loading the > configuration file contents. Unfortunately, the signature of this > service changed during the 5.9-rc cycle, so we have to resort to an > ugly wrapper to cope with all supported kernels once again. Sigh. > > Signed-off-by: Philippe Gerum <rpm@xenomai.org> > --- > .../include/asm-generic/xenomai/wrappers.h | 15 ++++ > kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c | 74 +++++++++---------- > 2 files changed, 52 insertions(+), 37 deletions(-) > > diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > index cfd28fc47..f15fe4389 100644 > --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h > @@ -220,4 +220,19 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, > }) > #endif > > +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) > +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \ > + ({ \ > + loff_t ___file_size; \ > + int __ret; \ > + __ret = kernel_read_file(__file, __buf, &___file_size, \ > + __buf_size, __id); \ > + (*__file_size) = ___file_size; \ > + __ret; \ > + }) > +#else > +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \ > + kernel_read_file(__file, 0, __buf, __buf_size, __file_size, __id) > +#endif > + > #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */ > diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c > index 769b4e143..e460571c2 100644 > --- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c > +++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c > @@ -22,6 +22,7 @@ > * > */ > > +#include <linux/fs.h> > #include <linux/file.h> > #include <linux/vmalloc.h> > > @@ -196,6 +197,40 @@ void cleanup_cmd_detach(void *priv_data) > kfree_rtskb(cmd->args.detach.stage2_chain); > } > > +static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd) > +{ > + size_t file_size = 0; > + struct file *filp; > + loff_t i_size; > + int ret; > + > + filp = filp_open(cfgfile->name, O_RDONLY, 0); > + if (IS_ERR(filp)) > + return PTR_ERR(filp); > + > + i_size = i_size_read(file_inode(filp)); > + if (i_size <= 0) { > + /* allocate buffer even for empty files */ > + cfgfile->buffer = vmalloc(1); > + } else { > + cfgfile->buffer = NULL; > + /* Assume 1Mb should be enough for a config file... */ This limitation is new, and I'm not sure if that would be a good idea. But I need to read up the protocol details again. Why do we need that limit? We know i_size already, no? Jan > + ret = read_file_from_kernel(filp, &cfgfile->buffer, > + 1UL << 30, &file_size, READING_UNKNOWN); > + if (ret < 0) { > + fput(filp); > + return ret; > + } > + } > + > + fput(filp); > + cfgfile->size = file_size; > + > + /* dispatch again, this time with new file attached */ > + return rtpc_dispatch_call(rtcfg_event_handler, 0, cmd, > + sizeof(*cmd), NULL, cleanup_cmd_add); > +} > + > int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd) > { > struct rtcfg_connection *conn_buf; > @@ -264,46 +299,11 @@ int rtcfg_ioctl_add(struct rtnet_device *rtdev, struct rtcfg_cmd *cmd) > > /* load file if missing */ > if (ret > 0) { > - struct file *filp; > - mm_segment_t oldfs; > - > - filp = filp_open(file->name, O_RDONLY, 0); > - if (IS_ERR(filp)) { > + ret = load_cfg_file(file, cmd); > + if (ret) { > rtcfg_unlockwr_proc(cmd->internal.data.ifindex); > - ret = PTR_ERR(filp); > goto err; > } > - > - file->size = filp->f_path.dentry->d_inode->i_size; > - > - /* allocate buffer even for empty files */ > - file->buffer = vmalloc((file->size) ? file->size : 1); > - if (file->buffer == NULL) { > - rtcfg_unlockwr_proc(cmd->internal.data.ifindex); > - fput(filp); > - ret = -ENOMEM; > - goto err; > - } > - > - oldfs = get_fs(); > - set_fs(KERNEL_DS); > - filp->f_pos = 0; > - > - ret = filp->f_op->read(filp, file->buffer, file->size, > - &filp->f_pos); > - > - set_fs(oldfs); > - fput(filp); > - > - if (ret != (int)file->size) { > - rtcfg_unlockwr_proc(cmd->internal.data.ifindex); > - ret = -EIO; > - goto err; > - } > - > - /* dispatch again, this time with new file attached */ > - ret = rtpc_dispatch_call(rtcfg_event_handler, 0, cmd, > - sizeof(*cmd), NULL, cleanup_cmd_add); > } > > return ret; > -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v2 7/7] drivers/net: cfg: fix config file load up 2021-04-07 10:35 ` Jan Kiszka @ 2021-05-04 17:18 ` Philippe Gerum 0 siblings, 0 replies; 27+ messages in thread From: Philippe Gerum @ 2021-05-04 17:18 UTC (permalink / raw) To: Jan Kiszka; +Cc: xenomai Jan Kiszka <jan.kiszka@siemens.com> writes: > On 27.03.21 11:19, Philippe Gerum wrote: >> From: Philippe Gerum <rpm@xenomai.org> >> >> set_fs() is on its way out, so we cannot open code a file read >> operation by calling the VFS handler directly anymore, faking a user >> address space. >> >> We do have kernel interfaces for loading files though, particularly >> kernel_read_file(). So let's use that one for loading the >> configuration file contents. Unfortunately, the signature of this >> service changed during the 5.9-rc cycle, so we have to resort to an >> ugly wrapper to cope with all supported kernels once again. Sigh. >> >> Signed-off-by: Philippe Gerum <rpm@xenomai.org> >> --- >> .../include/asm-generic/xenomai/wrappers.h | 15 ++++ >> kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c | 74 +++++++++---------- >> 2 files changed, 52 insertions(+), 37 deletions(-) >> >> diff --git a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >> index cfd28fc47..f15fe4389 100644 >> --- a/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >> +++ b/kernel/cobalt/include/asm-generic/xenomai/wrappers.h >> @@ -220,4 +220,19 @@ devm_hwmon_device_register_with_groups(struct device *dev, const char *name, >> }) >> #endif >> >> +#if LINUX_VERSION_CODE < KERNEL_VERSION(5,9,0) >> +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \ >> + ({ \ >> + loff_t ___file_size; \ >> + int __ret; \ >> + __ret = kernel_read_file(__file, __buf, &___file_size, \ >> + __buf_size, __id); \ >> + (*__file_size) = ___file_size; \ >> + __ret; \ >> + }) >> +#else >> +#define read_file_from_kernel(__file, __buf, __buf_size, __file_size, __id) \ >> + kernel_read_file(__file, 0, __buf, __buf_size, __file_size, __id) >> +#endif >> + >> #endif /* _COBALT_ASM_GENERIC_WRAPPERS_H */ >> diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c >> index 769b4e143..e460571c2 100644 >> --- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c >> +++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c >> @@ -22,6 +22,7 @@ >> * >> */ >> >> +#include <linux/fs.h> >> #include <linux/file.h> >> #include <linux/vmalloc.h> >> >> @@ -196,6 +197,40 @@ void cleanup_cmd_detach(void *priv_data) >> kfree_rtskb(cmd->args.detach.stage2_chain); >> } >> >> +static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd) >> +{ >> + size_t file_size = 0; >> + struct file *filp; >> + loff_t i_size; >> + int ret; >> + >> + filp = filp_open(cfgfile->name, O_RDONLY, 0); >> + if (IS_ERR(filp)) >> + return PTR_ERR(filp); >> + >> + i_size = i_size_read(file_inode(filp)); >> + if (i_size <= 0) { >> + /* allocate buffer even for empty files */ >> + cfgfile->buffer = vmalloc(1); >> + } else { >> + cfgfile->buffer = NULL; >> + /* Assume 1Mb should be enough for a config file... */ > > This limitation is new, and I'm not sure if that would be a good idea. This would be the size of a config file. 1MB would be too tight in some real world scenario(s)? Anyway, we don't need to impose this limit, I agree. > But I need to read up the protocol details again. > > Why do we need that limit? We know i_size already, no? > Yes we do. So here is a better option I'm going to push upstream instead: diff --git a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c index e460571c2..158d7118f 100644 --- a/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c +++ b/kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c @@ -213,10 +213,10 @@ static int load_cfg_file(struct rtcfg_file *cfgfile, struct rtcfg_cmd *cmd) /* allocate buffer even for empty files */ cfgfile->buffer = vmalloc(1); } else { - cfgfile->buffer = NULL; - /* Assume 1Mb should be enough for a config file... */ + cfgfile->buffer = NULL; /* Leave allocation to the kernel. */ ret = read_file_from_kernel(filp, &cfgfile->buffer, - 1UL << 30, &file_size, READING_UNKNOWN); + i_size_read(file_inode(filp)), + &file_size, READING_UNKNOWN); if (ret < 0) { fput(filp); return ret; -- Philippe. ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH v2 0/7] assorted v5.x related fixups 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum ` (6 preceding siblings ...) 2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum @ 2021-04-07 10:39 ` Jan Kiszka 7 siblings, 0 replies; 27+ messages in thread From: Jan Kiszka @ 2021-04-07 10:39 UTC (permalink / raw) To: Philippe Gerum, xenomai On 27.03.21 11:19, Philippe Gerum wrote: > From: Philippe Gerum <rpm@xenomai.org> > > More wrappers and other changes required to build for the kernel v5.x > series. > > Philippe Gerum (7): > cobalt/memory: fix __vmalloc() calls > cobalt/debug: switch to mmap_lock interface > cobalt/kernel: convert to proc_ops > cobalt/debug: prefer dump_stack() to show_stack() > drivers/net: wrap csum_partial_copy_nocheck() > drivers/net: icmp: remove variable-length array > drivers/net: cfg: fix config file load up > > kernel/cobalt/debug.c | 6 +- > kernel/cobalt/heap.c | 3 +- > .../include/asm-generic/xenomai/wrappers.h | 59 +++++++++++++++ > kernel/cobalt/posix/memory.c | 7 +- > kernel/cobalt/vfile.c | 26 +++---- > kernel/drivers/analogy/device.c | 12 +-- > kernel/drivers/can/mscan/rtcan_mscan_proc.c | 12 +-- > kernel/drivers/can/rtcan_module.c | 66 ++++++++--------- > .../drivers/can/sja1000/rtcan_sja1000_proc.c | 12 +-- > kernel/drivers/net/stack/ipv4/icmp.c | 34 ++++++--- > kernel/drivers/net/stack/rtcfg/rtcfg_ioctl.c | 74 +++++++++---------- > kernel/drivers/net/stack/rtskb.c | 3 +- > 12 files changed, 189 insertions(+), 125 deletions(-) > Applied 1-4 and 6. 5 and 7 may need further work. Thanks, Jan -- Siemens AG, T RDA IOT Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2021-05-05 5:43 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2021-03-27 10:19 [PATCH v2 0/7] assorted v5.x related fixups Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 1/7] cobalt/memory: fix __vmalloc() calls Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 2/7] cobalt/debug: switch to mmap_lock interface Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 3/7] cobalt/kernel: convert to proc_ops Philippe Gerum 2021-04-07 9:52 ` Jan Kiszka 2021-04-07 10:17 ` Philippe Gerum 2021-04-07 10:37 ` Jan Kiszka 2021-04-07 11:03 ` Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 4/7] cobalt/debug: prefer dump_stack() to show_stack() Philippe Gerum 2021-03-27 10:19 ` [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck() Philippe Gerum 2021-04-07 10:06 ` Jan Kiszka 2021-04-15 7:21 ` Philippe Gerum 2021-04-15 7:35 ` Jan Kiszka 2021-04-15 7:54 ` Philippe Gerum 2021-04-15 8:10 ` Jan Kiszka 2021-04-16 16:48 ` Philippe Gerum 2021-04-16 17:12 ` Jan Kiszka 2021-04-18 9:18 ` Philippe Gerum 2021-04-18 15:50 ` Philippe Gerum 2021-05-04 14:48 ` Philippe Gerum 2021-05-05 5:43 ` Jan Kiszka 2021-03-27 10:19 ` [PATCH v2 6/7] drivers/net: icmp: remove variable-length array Philippe Gerum 2021-04-07 10:24 ` Jan Kiszka 2021-03-27 10:19 ` [PATCH v2 7/7] drivers/net: cfg: fix config file load up Philippe Gerum 2021-04-07 10:35 ` Jan Kiszka 2021-05-04 17:18 ` Philippe Gerum 2021-04-07 10:39 ` [PATCH v2 0/7] assorted v5.x related fixups Jan Kiszka
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.