Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [v8 4/5] ext4: adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
From: Dave Chinner @ 2014-12-09 22:57 UTC (permalink / raw)
  To: Li Xi
  Cc: linux-fsdevel, linux-ext4, linux-api, tytso, adilger, jack, viro,
	hch, dmonakhov
In-Reply-To: <1418102548-5469-5-git-send-email-lixi@ddn.com>

On Tue, Dec 09, 2014 at 01:22:27PM +0800, Li Xi wrote:
> This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
> support for ext4. The interface is kept consistent with
> XFS_IOC_FSGETXATTR/XFS_IOC_FSGETXATTR.
> 
> Signed-off-by: Li Xi <lixi@ddn.com>
> ---
>  fs/ext4/ext4.h          |  111 ++++++++++++++++
>  fs/ext4/ioctl.c         |  330 +++++++++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_fs.h         |   47 +++-----
>  include/uapi/linux/fs.h |   58 ++++++++
>  4 files changed, 418 insertions(+), 128 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 136e18c..43a2a88 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -384,6 +384,115 @@ struct flex_groups {
>  #define EXT4_FL_USER_VISIBLE		0x204BDFFF /* User visible flags */
>  #define EXT4_FL_USER_MODIFIABLE		0x204380FF /* User modifiable flags */
>  
> +/* Transfer internal flags to xflags */
> +static inline __u32 ext4_iflags_to_xflags(unsigned long iflags)
> +{
> +	__u32 xflags = 0;
> +
> +	if (iflags & EXT4_SECRM_FL)
> +		xflags |= FS_XFLAG_SECRM;
> +	if (iflags & EXT4_UNRM_FL)
> +		xflags |= FS_XFLAG_UNRM;
> +	if (iflags & EXT4_COMPR_FL)
> +		xflags |= FS_XFLAG_COMPR;
....

NACK.

> +	if (iflags & EXT4_SYNC_FL)
> +		xflags |= FS_XFLAG_SYNC;
> +	if (iflags & EXT4_IMMUTABLE_FL)
> +		xflags |= FS_XFLAG_IMMUTABLE;
> +	if (iflags & EXT4_APPEND_FL)
> +		xflags |= FS_XFLAG_APPEND;
> +	if (iflags & EXT4_NODUMP_FL)
> +		xflags |= FS_XFLAG_NODUMP;
> +	if (iflags & EXT4_NOATIME_FL)
> +		xflags |= FS_XFLAG_NOATIME;

These are OK because they already exist in the interface, but all
the ext4 specific flags you've added have no place in this patchset.


> +
>  /* Flags that should be inherited by new inodes from their parent. */
>  #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
>  			   EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
> @@ -606,6 +715,8 @@ enum {
>  #define EXT4_IOC_RESIZE_FS		_IOW('f', 16, __u64)
>  #define EXT4_IOC_SWAP_BOOT		_IO('f', 17)
>  #define EXT4_IOC_PRECACHE_EXTENTS	_IO('f', 18)
> +#define EXT4_IOC_FSGETXATTR		FS_IOC_FSGETXATTR
> +#define EXT4_IOC_FSSETXATTR		FS_IOC_FSSETXATTR

NACK. Userspace only needs FS_IOC_FS[GS]ETXATTR so that it works across
all filesystems. We need to retain XFS_IOC_FS[GS]ETXATTR so we
don't break existing userspace tools, but we do not need new
filesystem specific definitions anywhere.

> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index fcbf647..872fed5 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -58,6 +58,62 @@ struct inodes_stat_t {
>  	long dummy[5];		/* padding for sysctl ABI compatibility */
>  };
>  
> +/*
> + * Extend attribute flags. These should be or-ed together to figure out what
> + * is valid.
> + */
> +#define FSX_XFLAGS	(1 << 0)
> +#define FSX_EXTSIZE	(1 << 1)
> +#define FSX_NEXTENTS	(1 << 2)
> +#define FSX_PROJID	(1 << 3)

NACK.

I've said this more than once: these are *private to XFS's
implementation* and are not be part of the user interface. Do not
move them from their existing location.


> +
> +/*
> + * Structure for FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.
> + */
> +struct fsxattr {
> +	__u32		fsx_xflags;	/* xflags field value (get/set) */
> +	__u32		fsx_extsize;	/* extsize field value (get/set)*/
> +	__u32		fsx_nextents;	/* nextents field value (get)	*/
> +	__u32		fsx_projid;	/* project identifier (get/set) */
> +	unsigned char	fsx_pad[12];
> +};
> +
> +/*
> + * Flags for the fsx_xflags field
> + */
> +#define FS_XFLAG_REALTIME	0x00000001	/* data in realtime volume */
> +#define FS_XFLAG_PREALLOC	0x00000002	/* preallocated file extents */
> +#define FS_XFLAG_SECRM		0x00000004	/* secure deletion */

NACK - ext4 specific.

> +#define FS_XFLAG_IMMUTABLE	0x00000008	/* file cannot be modified */
> +#define FS_XFLAG_APPEND		0x00000010	/* all writes append */
> +#define FS_XFLAG_SYNC		0x00000020	/* all writes synchronous */
> +#define FS_XFLAG_NOATIME	0x00000040	/* do not update access time */
> +#define FS_XFLAG_NODUMP		0x00000080	/* do not include in backups */
> +#define FS_XFLAG_RTINHERIT	0x00000100	/* create with rt bit set */
> +#define FS_XFLAG_PROJINHERIT	0x00000200	/* create with parents projid */
> +#define FS_XFLAG_NOSYMLINKS	0x00000400	/* disallow symlink creation */
> +#define FS_XFLAG_EXTSIZE	0x00000800	/* extent size allocator hint */
> +#define FS_XFLAG_EXTSZINHERIT	0x00001000	/* inherit inode extent size */
> +#define FS_XFLAG_NODEFRAG	0x00002000  	/* do not defragment */
> +#define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */

existing flags.

> +#define FS_XFLAG_UNRM		0x00008000	/* undelete */
> +#define FS_XFLAG_COMPR		0x00010000	/* compress file */
> +#define FS_XFLAG_COMPRBLK	0x00020000	/* one or more compressed clusters */
> +#define FS_XFLAG_NOCOMPR	0x00040000	/* don't compress */
> +#define FS_XFLAG_ECOMPR		0x00080000	/* compression error */
> +#define FS_XFLAG_INDEX		0x00100000	/* hash-indexed directory */
> +#define FS_XFLAG_IMAGIC		0x00200000	/* AFS directory */
> +#define FS_XFLAG_JOURNAL_DATA	0x00400000	/* file data should be journaled */
> +#define FS_XFLAG_NOTAIL		0x00800000	/* file tail should not be merged */
> +#define FS_XFLAG_DIRSYNC	0x01000000	/* dirsync behaviour (directories only) */
> +#define FS_XFLAG_TOPDIR		0x02000000	/* top of directory hierarchies*/
> +#define FS_XFLAG_HUGE_FILE	0x04000000	/* set to each huge file */
> +#define FS_XFLAG_EXTENTS	0x08000000	/* inode uses extents */
> +#define FS_XFLAG_EA_INODE	0x10000000	/* inode used for large EA */
> +#define FS_XFLAG_EOFBLOCKS	0x20000000	/* blocks allocated beyond EOF */
> +#define FS_XFLAG_INLINE_DATA	0x40000000	/* inode has inline data. */

And a bunch more ext4 specific flags that *uses all the remaining
flag space*.  At minimum, we need to keep space in this existing
flags field for flags to future indication of how the padding is
used, so that's yet another NACK.

Further, none of these have any relevance to project quotas so
should not be a part of this patchset. Nor are they relevant to any
other filesystem, and many are duplicated by information you can get
from FIEMAP and other interfaces. NACK, again.

Because I'm getting annoyed at being repeatedly ignored about what
needs to be done here, I'm now going to shout a bit. DO NOT CHANGE
THE INTERFACE.  DO NOT ADD any new flags to the interface. DO NOT
REDEFINE how the interface works. DO NOT "ext4-ise" the interface.

The only changes to the interface code should be moving the XFS
definitions and renaming them so as to provide the new generic
ioctl definition as well as the historic XFS definitions. The ext4
implementation needs to be done in a separate patch to the interface
rename, and it must only implement the functionality the interface
already provides. Anything else is outside the scope of this
patchset and requires separate discussion.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* Re: [CFT][PATCH 3/8] userns: Don't allow unprivileged creation of gid mappings
From: Andy Lutomirski @ 2014-12-09 23:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
	stable, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kenton Varda, LSM, Michael Kerrisk-manpages, Richard Weinberger,
	Casey Schaufler, Andrew Morton
In-Reply-To: <874mt4lfr6.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Tue, Dec 9, 2014 at 12:39 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> As any gid mapping will allow and must allow for backwards
> compatibility dropping groups don't allow any gid mappings to be
> established without CAP_SETGID in the parent user namespace.
>
> For a small class of applications this change breaks userspace
> and removes useful functionality.  This small class of applications
> includes tools/testing/selftests/mount/unprivilged-remount-test.c
>
> Most of the removed functionality will be added back with the addition
> of a one way knob to disable setgroups.  Once setgroups is disabled
> setting the gid_map becomes as safe as setting the uid_map.
>
> For more common applications that set the uid_map and the gid_map
> with privilege this change will have no affect.
>

Reviewed-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

> This is part of a fix for CVE-2014-8989.
>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  kernel/user_namespace.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 27c8dab48c07..1ce6d67c07b7 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -821,10 +821,6 @@ static bool new_idmap_permitted(const struct file *file,
>                         kuid_t uid = make_kuid(ns->parent, id);
>                         if (uid_eq(uid, file->f_cred->fsuid))
>                                 return true;
> -               } else if (cap_setid == CAP_SETGID) {
> -                       kgid_t gid = make_kgid(ns->parent, id);
> -                       if (gid_eq(gid, file->f_cred->fsgid))
> -                               return true;
>                 }
>         }
>
> --
> 1.9.1
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [CFT][PATCH 7/8] userns: Add a knob to disable setgroups on a per user namespace basis
From: Eric W.Biederman @ 2014-12-10  0:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
	stable, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kenton Varda, LSM, Michael Kerrisk-manpages, Richard Weinberger,
	Casey Schaufler, Andrew Morton
In-Reply-To: <CALCETrVfKiXuY=KY_=nHpkTyLWgpy_3DK=4Mr2mhpyX9z1TzrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>



On December 9, 2014 4:28:38 PM CST, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>On Tue, Dec 9, 2014 at 12:42 PM, Eric W. Biederman
><ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> - Expose the knob to user space through a proc file
>/proc/<pid>/setgroups
>>
>>   A value of "deny" means the setgroups system call is disabled in
>the
>>   current processes user namespace and can not be enabled in the
>>   future in this user namespace.
>>
>>   A value of "allow" means the segtoups system call is enabled.
>>
>> - Descendant user namespaces inherit the value of setgroups from
>>   their parents.
>>
>> - A proc file is used (instead of a sysctl) as sysctls
>>   currently do not pass in a struct file so file_ns_capable
>>   is unusable.
>
>Reviewed-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
>
>But I still don't like the name "setgroups".  People may look at that
>and have no clue what the scope of the setting is.  And anyone who, as
>root, writes "deny" to /proc/self/setgroups, thinking that it acts on
>self, will be in for a surprise.

True setgroups isn't perfect.  Documenting it in a manpage may have to be enough. The only real improvement I can think of would be to make the setting a sysctl.   But I think pursuing that approaches the point where perfection is the enemy of getting this problem fixed.

Eric

^ permalink raw reply

* Re: [CFT][PATCH 7/8] userns: Add a knob to disable setgroups on a per user namespace basis
From: Andy Lutomirski @ 2014-12-10  0:21 UTC (permalink / raw)
  To: Eric W.Biederman
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM, Casey Schaufler,
	Serge E. Hallyn, Richard Weinberger, Kenton Varda, stable
In-Reply-To: <971ad3f6-90fd-4e3f-916c-8988af3c826d@email.android.com>

On Tue, Dec 9, 2014 at 4:04 PM, Eric W.Biederman <ebiederm@xmission.com> wrote:
>
>
> On December 9, 2014 4:28:38 PM CST, Andy Lutomirski <luto@amacapital.net> wrote:
>>On Tue, Dec 9, 2014 at 12:42 PM, Eric W. Biederman
>><ebiederm@xmission.com> wrote:
>>>
>>> - Expose the knob to user space through a proc file
>>/proc/<pid>/setgroups
>>>
>>>   A value of "deny" means the setgroups system call is disabled in
>>the
>>>   current processes user namespace and can not be enabled in the
>>>   future in this user namespace.
>>>
>>>   A value of "allow" means the segtoups system call is enabled.
>>>
>>> - Descendant user namespaces inherit the value of setgroups from
>>>   their parents.
>>>
>>> - A proc file is used (instead of a sysctl) as sysctls
>>>   currently do not pass in a struct file so file_ns_capable
>>>   is unusable.
>>
>>Reviewed-by: Andy Lutomirski <luto@amacapital.net>
>>
>>But I still don't like the name "setgroups".  People may look at that
>>and have no clue what the scope of the setting is.  And anyone who, as
>>root, writes "deny" to /proc/self/setgroups, thinking that it acts on
>>self, will be in for a surprise.
>
> True setgroups isn't perfect.  Documenting it in a manpage may have to be enough. The only real improvement I can think of would be to make the setting a sysctl.   But I think pursuing that approaches the point where perfection is the enemy of getting this problem fixed.
>

Would "userns_setgroups" be okay?

--Andy

> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
From: Chen Gang @ 2014-12-10  1:45 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org
  Cc: Minchan Kim, rth-hL46jP5Bxq7R7s880joybQ@public.gmane.org,
	ink-biIs/Y0ymYJMZLIVYojuPNP0rXTJTi09@public.gmane.org,
	mattst88-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, Ralf Baechle,
	jejb-6jwH94ZQLHl74goWV3ctuw@public.gmane.org,
	deller-Mmb7MZpHnFY@public.gmane.org,
	chris-YvXeqwSYzG2sTnJN9+BGXg@public.gmane.org,
	jcmvbkbc-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, Arnd Bergmann,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arch-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Geert Uytterhoeven,
	Carlos O'Donell, Darrick J. Wong, dietlibc-a6ha8lKT7ZQ,
	Felix von Leitner
In-Reply-To: <54822B19.2070804-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Hello Andrew:

Could you help check this patch, when you have time. And apply it if
possible.

I guess, this patch is a little urgent, since it changes uapi, if wait
a little longer, more user mode programs (e.g. libc) may use MADV_FREE
as various values.

Thanks.

On 12/6/14 06:00, Chen Gang wrote:
> On 12/05/2014 02:54 PM, Minchan Kim wrote:
>> On Fri, Dec 05, 2014 at 06:58:29AM +0800, Chen Gang wrote:
>>> For uapi, need try to let all macros have same value, and MADV_FREE is
>>> added into main branch recently, so need redefine MADV_FREE for it.
>>>
>>> At present, '8' can be shared with all architectures, so redefine it to
>>> '8'.
>>>
>>> Signed-off-by: Chen Gang <gang.chen.5i5j-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>
>>
>> Hello Chen,
>>
>> Thanks for looking at this.
>> Feel free to add my sign.
>>
>> Acked-by: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>
> 
> OK, thanks.
> 
> Originally I sent the same patch like you sent (but later than yours).
> Geert suggested to use same value for MADV_FREE, and Carlos confirmed
> it and suggested more things (sorry, I forgot to Cc them in this mail).
> 

-- 
Chen Gang

Open, share, and attitude like air, water, and life which God blessed

^ permalink raw reply

* Re: [PATCH V4] powerpc: add little endian flag to syscall_get_arch()
From: Paul Moore @ 2014-12-10  2:11 UTC (permalink / raw)
  To: Richard Guy Briggs, linux-audit-H+wXaHxf7aLQT0dZR+AlfA,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mpe-Gsx/Oe8HsFggBc27wqDAHg,
	anton-eUNUBHrolfbYtjvyW6yDsg, sgrubb-H+wXaHxf7aLQT0dZR+AlfA,
	eparis-FjpueFixGhCM4zKIHC2jIg, tonyj-l3A5Bk7waGM
In-Reply-To: <24fcd2ea2c7633152e74a7a92abab1dd1782745d.1418157273.git.rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tuesday, December 09, 2014 03:37:07 PM Richard Guy Briggs wrote:
> Since both ppc and ppc64 have LE variants which are now reported by uname,
> add that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add
> AUDIT_ARCH_PPC64LE variant.
> 
> Without this,  perf trace and auditctl fail.
> 
> Mainline kernel reports ppc64le (per a058801) but there is no matching
> AUDIT_ARCH_PPC64LE.
> 
> Since 32-bit PPC LE is not supported by audit, don't advertise it in
> AUDIT_ARCH_PPC* variants.
> 
> See:
> 	https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
> 	https://www.redhat.com/archives/linux-audit/2014-December/msg00004.html
> 
> Signed-off-by: Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  arch/powerpc/include/asm/syscall.h |    6 +++++-
>  include/uapi/linux/audit.h         |    2 ++
>  2 files changed, 7 insertions(+), 1 deletions(-)

The audit changes look fine to me, but as I mentioned earlier, this should go 
in via the ppc tree and not the audit tree.

Acked-by: Paul Moore <paul-r2n+y4ga6xFZroRs9YW3xA@public.gmane.org>

> diff --git a/arch/powerpc/include/asm/syscall.h
> b/arch/powerpc/include/asm/syscall.h index 6fa2708..d1934e5 100644
> --- a/arch/powerpc/include/asm/syscall.h
> +++ b/arch/powerpc/include/asm/syscall.h
> @@ -90,6 +90,10 @@ static inline void syscall_set_arguments(struct
> task_struct *task,
> 
>  static inline int syscall_get_arch(void)
>  {
> -	return is_32bit_task() ? AUDIT_ARCH_PPC : AUDIT_ARCH_PPC64;
> +	int arch = is_32bit_task() ? AUDIT_ARCH_PPC : AUDIT_ARCH_PPC64;
> +#ifdef __LITTLE_ENDIAN__
> +	arch |= __AUDIT_ARCH_LE;
> +#endif
> +	return arch;
>  }
>  #endif	/* _ASM_SYSCALL_H */
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 4d100c8..d82beec 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -364,7 +364,9 @@ enum {
>  #define AUDIT_ARCH_PARISC	(EM_PARISC)
>  #define AUDIT_ARCH_PARISC64	(EM_PARISC|__AUDIT_ARCH_64BIT)
>  #define AUDIT_ARCH_PPC		(EM_PPC)
> +/* do not define AUDIT_ARCH_PPCLE since it is not supported by audit */
>  #define AUDIT_ARCH_PPC64	(EM_PPC64|__AUDIT_ARCH_64BIT)
> +#define AUDIT_ARCH_PPC64LE	(EM_PPC64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
>  #define AUDIT_ARCH_S390		(EM_S390)
>  #define AUDIT_ARCH_S390X	(EM_S390|__AUDIT_ARCH_64BIT)
>  #define AUDIT_ARCH_SH		(EM_SH)

-- 
paul moore
security and virtualization @ redhat

^ permalink raw reply

* Re: [PATCH V4] powerpc: add little endian flag to syscall_get_arch()
From: Michael Ellerman @ 2014-12-10  2:20 UTC (permalink / raw)
  To: Paul Moore
  Cc: Richard Guy Briggs, linux-audit-H+wXaHxf7aLQT0dZR+AlfA,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, anton-eUNUBHrolfbYtjvyW6yDsg,
	sgrubb-H+wXaHxf7aLQT0dZR+AlfA, eparis-FjpueFixGhCM4zKIHC2jIg,
	tonyj-l3A5Bk7waGM
In-Reply-To: <1557787.lOBez6GG22@sifl>

On Tue, 2014-12-09 at 21:11 -0500, Paul Moore wrote:
> On Tuesday, December 09, 2014 03:37:07 PM Richard Guy Briggs wrote:
> > Since both ppc and ppc64 have LE variants which are now reported by uname,
> > add that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add
> > AUDIT_ARCH_PPC64LE variant.
> > 
> > Without this,  perf trace and auditctl fail.
> > 
> > Mainline kernel reports ppc64le (per a058801) but there is no matching
> > AUDIT_ARCH_PPC64LE.
> > 
> > Since 32-bit PPC LE is not supported by audit, don't advertise it in
> > AUDIT_ARCH_PPC* variants.
> > 
> > See:
> > 	https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
> > 	https://www.redhat.com/archives/linux-audit/2014-December/msg00004.html
> > 
> > Signed-off-by: Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  arch/powerpc/include/asm/syscall.h |    6 +++++-
> >  include/uapi/linux/audit.h         |    2 ++
> >  2 files changed, 7 insertions(+), 1 deletions(-)
> 
> The audit changes look fine to me, but as I mentioned earlier, this should go 
> in via the ppc tree and not the audit tree.
> 
> Acked-by: Paul Moore <paul-r2n+y4ga6xFZroRs9YW3xA@public.gmane.org>

Thanks.

Yep I'll take it via the powerpc tree, I'll CC stable as well as presumably we
want this to work in all versions that had LE support.

cheers

^ permalink raw reply

* Re: [PATCH V4] powerpc: add little endian flag to syscall_get_arch()
From: Richard Guy Briggs @ 2014-12-10  2:43 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Paul Moore, linux-audit-H+wXaHxf7aLQT0dZR+AlfA,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, anton-eUNUBHrolfbYtjvyW6yDsg,
	sgrubb-H+wXaHxf7aLQT0dZR+AlfA, eparis-FjpueFixGhCM4zKIHC2jIg,
	tonyj-l3A5Bk7waGM
In-Reply-To: <1418178015.30244.5.camel@concordia>

On 14/12/10, Michael Ellerman wrote:
> On Tue, 2014-12-09 at 21:11 -0500, Paul Moore wrote:
> > On Tuesday, December 09, 2014 03:37:07 PM Richard Guy Briggs wrote:
> > > Since both ppc and ppc64 have LE variants which are now reported by uname,
> > > add that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add
> > > AUDIT_ARCH_PPC64LE variant.
> > > 
> > > Without this,  perf trace and auditctl fail.
> > > 
> > > Mainline kernel reports ppc64le (per a058801) but there is no matching
> > > AUDIT_ARCH_PPC64LE.
> > > 
> > > Since 32-bit PPC LE is not supported by audit, don't advertise it in
> > > AUDIT_ARCH_PPC* variants.
> > > 
> > > See:
> > > 	https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
> > > 	https://www.redhat.com/archives/linux-audit/2014-December/msg00004.html
> > > 
> > > Signed-off-by: Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > ---
> > >  arch/powerpc/include/asm/syscall.h |    6 +++++-
> > >  include/uapi/linux/audit.h         |    2 ++
> > >  2 files changed, 7 insertions(+), 1 deletions(-)
> > 
> > The audit changes look fine to me, but as I mentioned earlier, this should go 
> > in via the ppc tree and not the audit tree.
> > 
> > Acked-by: Paul Moore <paul-r2n+y4ga6xFZroRs9YW3xA@public.gmane.org>
> 
> Thanks.
> 
> Yep I'll take it via the powerpc tree, I'll CC stable as well as presumably we
> want this to work in all versions that had LE support.

Yes, please!

(I was very tempted to change the #error to #warning, but figured the
placeholder comment in the uapi file was sufficient.)

> cheers

- RGB

--
Richard Guy Briggs <rbriggs-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

^ permalink raw reply

* [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
	Christoph Lameter
  Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
	Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
	Mathieu Desnoyers, Steven Rostedt
In-Reply-To: <20141210033902.2114.68658.stgit@ahduyck-vm-fedora20>

The network stack have some use-cases that puts some extreme demands
on the memory allocator.  One use-case, 10Gbit/s wirespeed at smallest
packet size[1], requires handling a packet every 67.2 ns (nanosec).

Micro benchmarking[2] the SLUB allocator (with skb size 256bytes
elements), show "fast-path" instant reuse only costs 19 ns, but a
closer to network usage pattern show the cost rise to 45 ns.

This patchset introduce a quick mempool (qmempool), which when used
in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on "fast-path"
drop in iptables "raw" table, but more importantly saves 40 ns with
IP-forwarding, which were hitting the slower SLUB use-case.


One of the building blocks for achieving this speedup is a cmpxchg
based Lock-Free queue that supports bulking, named alf_queue for
Array-based Lock-Free queue.  By bulking elements (pointers) from the
queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
elements.

 Patch1: alf_queue (Lock-Free queue)

 Patch2: qmempool using alf_queue

 Patch3: usage of qmempool for SKB caching


Notice, this patchset depend on introduction of napi_alloc_skb(),
which is part of Alexander Duyck's work patchset [3].

Different correctness tests and micro benchmarks are avail via my
github repo "prototype-kernel"[4], where the alf_queue and qmempool is
also kept in sync with this patchset.

Links:
 [1]: http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
 [2]: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c
 [3]: http://thread.gmane.org/gmane.linux.network/342347
 [4]: https://github.com/netoptimizer/prototype-kernel

---

Jesper Dangaard Brouer (3):
      net: use qmempool in-front of sk_buff kmem_cache
      mm: qmempool - quick queue based memory pool
      lib: adding an Array-based Lock-Free (ALF) queue


 include/linux/alf_queue.h |  303 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/qmempool.h  |  205 +++++++++++++++++++++++++++++
 include/linux/skbuff.h    |    4 -
 lib/Kconfig               |   13 ++
 lib/Makefile              |    2 
 lib/alf_queue.c           |   47 +++++++
 mm/Kconfig                |   12 ++
 mm/Makefile               |    1 
 mm/qmempool.c             |  322 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |    5 +
 net/core/skbuff.c         |   43 +++++-
 11 files changed, 950 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/alf_queue.h
 create mode 100644 include/linux/qmempool.h
 create mode 100644 lib/alf_queue.c
 create mode 100644 mm/qmempool.c

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
	Christoph Lameter
  Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
	Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
	Mathieu Desnoyers, Steven Rostedt
In-Reply-To: <20141210141332.31779.56391.stgit@dragon>

This Array-based Lock-Free (ALF) queue, is a very fast bounded
Producer-Consumer queue, supporting bulking.  The MPMC
(Multi-Producer/Multi-Consumer) variant uses a locked cmpxchg, but the
cost can be amorized by utilizing bulk enqueue/dequeue.

Results on x86_64 CPU E5-2695, for variants:
 MPMC = Multi-Producer-Multi-Consumer
 SPSC = Single-Producer-Single-Consumer

(none-bulking):  per element cost MPMC and SPSC
                   MPMC     -- SPSC
 simple         :  9.519 ns -- 1.282 ns
 multi(step:128): 12.905 ns -- 2.240 ns

The majority of the cost is associated with the locked cmpxchg in the
MPMC variant.  Bulking helps amortize this cost:

(bulking) cost per element comparing MPMC -> SPSC:
         MPMC     -- SPSC
 bulk2 : 5.849 ns -- 1.748 ns
 bulk3 : 4.102 ns -- 1.531 ns
 bulk4 : 3.281 ns -- 1.383 ns
 bulk6 : 2.530 ns -- 1.238 ns
 bulk8 : 2.125 ns -- 1.196 ns
 bulk16: 1.552 ns -- 1.109 ns

Joint work with Hannes Frederic Sowa.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---

Correctness of memory barries on different arch's need to be evaluated.

The API might need some adjustments/discussions regarding:

1) the semantics of enq/deq when doing bulking, choosing what
to do when e.g. a bulk enqueue cannot fit fully.

2) some better way to detect miss-use of API, e.g. using
single-enqueue variant function call on a multi-enqueue variant data
structure.  Now no detection happens.

 include/linux/alf_queue.h |  303 +++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig               |   13 ++
 lib/Makefile              |    2 
 lib/alf_queue.c           |   47 +++++++
 4 files changed, 365 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/alf_queue.h
 create mode 100644 lib/alf_queue.c

diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..fb1a774
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,303 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ *  - Array based for cache-line optimization
+ *  - Bounded by the array size
+ *  - FIFO Producer/Consumer queue, no queue traversal supported
+ *  - Very fast
+ *  - Designed as a queue for pointers to objects
+ *  - Bulk enqueue and dequeue support
+ *  - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ *  by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ *  for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+	u32 head;
+	u32 tail;
+};
+
+struct alf_queue {
+	u32 size;
+	u32 mask;
+	u32 flags;
+	struct alf_actor producer ____cacheline_aligned_in_smp;
+	struct alf_actor consumer ____cacheline_aligned_in_smp;
+	void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void		  alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ *  1. They can be reused for both "Single" and "Multi" variants
+ *  2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+static inline void
+__helper_alf_enqueue_store(u32 p_head, struct alf_queue *q,
+			   void **ptr, const u32 n)
+{
+	int i, index = p_head;
+
+	for (i = 0; i < n; i++, index++)
+		q->ring[index & q->mask] = ptr[i];
+}
+
+static inline void
+__helper_alf_dequeue_load(u32 c_head, struct alf_queue *q,
+			  void **ptr, const u32 elems)
+{
+	int i, index = c_head;
+
+	for (i = 0; i < elems; i++, index++)
+		ptr[i] = q->ring[index & q->mask];
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even-though current API have a "fixed" semantics of aborting if it
+ * cannot enqueue the full bulk size.  Users of this API should check
+ * on the returned number of enqueue elements match, to verify enqueue
+ * was successful.  This allow us to introduce a "variable" enqueue
+ * scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 p_head, p_next, c_tail, space;
+
+	/* Reserve part of the array for enqueue STORE/WRITE */
+	do {
+		p_head = ACCESS_ONCE(q->producer.head);
+		c_tail = ACCESS_ONCE(q->consumer.tail);
+
+		space = q->size + c_tail - p_head;
+		if (n > space)
+			return 0;
+
+		p_next = p_head + n;
+	}
+	while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+	/* STORE the elems into the queue array */
+	__helper_alf_enqueue_store(p_head, q, ptr, n);
+	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+	/* Wait for other concurrent preceding enqueues not yet done,
+	 * this part make us none-wait-free and could be problematic
+	 * in case of congestion with many CPUs
+	 */
+	while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+		cpu_relax();
+	/* Mark this enq done and avail for consumption */
+	ACCESS_ONCE(q->producer.tail) = p_next;
+
+	return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 c_head, c_next, p_tail, elems;
+
+	/* Reserve part of the array for dequeue LOAD/READ */
+	do {
+		c_head = ACCESS_ONCE(q->consumer.head);
+		p_tail = ACCESS_ONCE(q->producer.tail);
+
+		elems = p_tail - c_head;
+
+		if (elems == 0)
+			return 0;
+		else
+			elems = min(elems, n);
+
+		c_next = c_head + elems;
+	}
+	while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+	/* LOAD the elems from the queue array.
+	 *   We don't need a smb_rmb() Read-Memory-Barrier here because
+	 *   the above cmpxchg is an implied full Memory-Barrier.
+	 */
+	__helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+	/* Archs with weak Memory Ordering need a memory barrier here.
+	 * As the STORE to q->consumer.tail, must happen after the
+	 * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+	 * ptr, thus a smp_wmb() is enough. Paired with enqueue
+	 * implicit full-MB in cmpxchg.
+	 */
+	smp_wmb();
+
+	/* Wait for other concurrent preceding dequeues not yet done */
+	while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+		cpu_relax();
+	/* Mark this deq done and avail for producers */
+	ACCESS_ONCE(q->consumer.tail) = c_next;
+
+	return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x)							\
+	do {								\
+		if (unlikely(!(x))) {					\
+			pr_crit("Assertion failed %s:%d: \"%s\"\n",	\
+				__FILE__, __LINE__, #x);		\
+			BUG();						\
+		}							\
+	} while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ *  caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 p_head, p_next, c_tail, space;
+
+	/* Reserve part of the array for enqueue STORE/WRITE */
+	p_head = q->producer.head;
+	smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+	c_tail = ACCESS_ONCE(q->consumer.tail);
+
+	space = q->size + c_tail - p_head;
+	if (n > space)
+		return 0;
+
+	p_next = p_head + n;
+	ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+	q->producer.head = p_next;
+
+	/* STORE the elems into the queue array */
+	__helper_alf_enqueue_store(p_head, q, ptr, n);
+	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+	/* Assert no other CPU (or same CPU via preemption) changed queue */
+	ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+	/* Mark this enq done and avail for consumption */
+	ACCESS_ONCE(q->producer.tail) = p_next;
+
+	return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ *  caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 c_head, c_next, p_tail, elems;
+
+	/* Reserve part of the array for dequeue LOAD/READ */
+	c_head = q->consumer.head;
+	p_tail = ACCESS_ONCE(q->producer.tail);
+
+	elems = p_tail - c_head;
+
+	if (elems == 0)
+		return 0;
+	else
+		elems = min(elems, n);
+
+	c_next = c_head + elems;
+	ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+	q->consumer.head = c_next;
+
+	smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+	__helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+	/* Archs with weak Memory Ordering need a memory barrier here.
+	 * As the STORE to q->consumer.tail, must happen after the
+	 * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+	 * ptr, thus a smp_wmb() is enough.
+	 */
+	smp_wmb();
+
+	/* Assert no other CPU (or same CPU via preemption) changed queue */
+	ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+	/* Mark this deq done and avail for producers */
+	ACCESS_ONCE(q->consumer.tail) = c_next;
+
+	return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+	u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+	u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+	/* The empty (and initial state) is when consumer have reached
+	 * up with producer.
+	 *
+	 * DOUBLE-CHECK: Should we use producer.head, as this indicate
+	 * a producer is in-progress(?)
+	 */
+	return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+	u32 c_head = ACCESS_ONCE(q->consumer.head);
+	u32 p_tail = ACCESS_ONCE(q->producer.tail);
+	u32 elems;
+
+	/* Due to u32 arithmetic the values are implicitly
+	 * masked/modulo 32-bit, thus saving one mask operation
+	 */
+	elems = p_tail - c_head;
+	/* Thus, same as:
+	 *  elems = (p_tail - c_head) & q->mask;
+	 */
+	return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+	u32 p_head = ACCESS_ONCE(q->producer.head);
+	u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+	u32 space;
+
+	/* The max avail space is q->size and
+	 * the empty state is when (consumer == producer)
+	 */
+
+	/* Due to u32 arithmetic the values are implicitly
+	 * masked/modulo 32-bit, thus saving one mask operation
+	 */
+	space = q->size + c_tail - p_head;
+	/* Thus, same as:
+	 *  space = (q->size + c_tail - p_head) & q->mask;
+	 */
+	return space;
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 54cf309..3c0cd58 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -439,6 +439,19 @@ config NLATTR
 	bool
 
 #
+# ALF queue
+#
+config ALF_QUEUE
+	bool "ALF: Array-based Lock-Free (Producer-Consumer) queue"
+	default y
+	help
+	  This Array-based Lock-Free (ALF) queue, is a very fast
+	  bounded Producer-Consumer queue, supporting bulking.  The
+	  MPMC (Multi-Producer/Multi-Consumer) variant uses a locked
+	  cmpxchg, but the cost can be amorized by utilizing bulk
+	  enqueue/dequeue.
+
+#
 # Generic 64-bit atomic support is selected if needed
 #
 config GENERIC_ATOMIC64
diff --git a/lib/Makefile b/lib/Makefile
index 0211d2b..cd3a2d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -119,6 +119,8 @@ obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
 
 obj-$(CONFIG_NLATTR) += nlattr.o
 
+obj-$(CONFIG_ALF_QUEUE) += alf_queue.o
+
 obj-$(CONFIG_LRU_CACHE) += lru_cache.o
 
 obj-$(CONFIG_DMA_API_DEBUG) += dma-debug.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..d6c9b69
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,47 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ *  - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ *  by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ *  for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h> /* kzalloc */
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+	struct alf_queue *q;
+	size_t mem_size;
+
+	if (!(is_power_of_2(size)) || size > 65536)
+		return ERR_PTR(-EINVAL);
+
+	/* The ring array is allocated together with the queue struct */
+	mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+	q = kzalloc(mem_size, gfp);
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
+	q->size = size;
+	q->mask = size - 1;
+
+	return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+	kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);
+
+MODULE_DESCRIPTION("ALF: Array-based Lock-Free queue");
+MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@brouer.com>");
+MODULE_LICENSE("GPL");

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [RFC PATCH 2/3] mm: qmempool - quick queue based memory pool
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
	Christoph Lameter
  Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
	Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
	Mathieu Desnoyers, Steven Rostedt
In-Reply-To: <20141210141332.31779.56391.stgit@dragon>

A quick queue-based memory pool, that functions as a cache in-front
of kmem_cache SLAB/SLUB allocators.  Which allows faster than
SLAB/SLUB reuse/caching of fixed size memory elements

The speed gain comes from, the shared storage, using a Lock-Free
queue that supports bulk refilling elements (to a percpu cache)
with a single cmpxchg.  Thus, the (lock-prefixed) cmpxchg cost is
amortize over the bulk size.

Qmempool cannot easily replace all kmem_cache usage, because it is
restricted in which contexts is can be used in, as the Lock-Free
queue is not preemption safe. E.g. only supports GFP_ATOMIC allocations
from SLAB.

This version is optimized for usage from softirq context, and cannot
be used from hardirq context.  Usage from none-softirq requires usage
of local_bh_{disable,enable}, which have a fairly high cost.

Performance micro benchmarks against SLUB. First test is fast-path
reuse of same element. Second test is allocating 256 element before
freeing elements again, this pattern comes from how NIC ring queue
cleanups often run.

On CPU E5-2695, CONFIG_PREEMPT=y, showing cost of alloc+free:

                 SLUB      - softirq   - none-softirq
 fastpath-reuse: 19.563 ns -  7.837 ns - 18.536 ns
 N(256)-pattern: 45.039 ns - 11.782 ns - 24.186 ns

A significant win for usage from softirq, and a smaller win for
none-softirq which requires taking local_bh_{disable,enable}.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/linux/qmempool.h |  205 +++++++++++++++++++++++++++++
 mm/Kconfig               |   12 ++
 mm/Makefile              |    1 
 mm/qmempool.c            |  322 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 540 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/qmempool.h
 create mode 100644 mm/qmempool.c

diff --git a/include/linux/qmempool.h b/include/linux/qmempool.h
new file mode 100644
index 0000000..922ed27
--- /dev/null
+++ b/include/linux/qmempool.h
@@ -0,0 +1,205 @@
+/*
+ * qmempool - a quick queue based mempool
+ *
+ * A quick queue-based memory pool, that functions as a cache in-front
+ * of kmem_cache SLAB/SLUB allocators.  Which allows faster than
+ * SLAB/SLUB reuse/caching of fixed size memory elements
+ *
+ * The speed gain comes from, the shared storage, using a Lock-Free
+ * queue that supports bulk refilling elements (to a percpu cache)
+ * with a single cmpxchg.  Thus, the lock-prefixed cmpxchg cost is
+ * amortize over the bulk size.
+ *
+ * The Lock-Free queue is based on an array (of pointer to elements).
+ * This make access more cache optimal, as e.g. on 64bit 8 pointers
+ * can be stored per cache-line (which is superior to a linked list
+ * approach).  Only storing the pointers to elements, is also
+ * beneficial as we don't touch the elements data.
+ *
+ * Qmempool cannot easily replace all kmem_cache usage, because it is
+ * restricted in which contexts is can be used in, as the Lock-Free
+ * queue is not preemption safe.  This version is optimized for usage
+ * from softirq context, and cannot be used from hardirq context.
+ *
+ * Only support GFP_ATOMIC allocations from SLAB.
+ *
+ * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
+ *  for licensing details see kernel-base/COPYING
+ */
+
+#ifndef _LINUX_QMEMPOOL_H
+#define _LINUX_QMEMPOOL_H
+
+#include <linux/alf_queue.h>
+#include <linux/prefetch.h>
+#include <linux/hardirq.h>
+
+/* Bulking is an essential part of the performance gains as this
+ * amortize the cost of cmpxchg ops used when accessing sharedq
+ */
+#define QMEMPOOL_BULK 16
+#define QMEMPOOL_REFILL_MULTIPLIER 2
+
+struct qmempool_percpu {
+	struct alf_queue *localq;
+};
+
+struct qmempool {
+	/* The shared queue (sharedq) is a Multi-Producer-Multi-Consumer
+	 *  queue where access is protected by an atomic cmpxchg operation.
+	 *  The queue support bulk transfers, which amortize the cost
+	 *  of the atomic cmpxchg operation.
+	 */
+	struct alf_queue	*sharedq;
+
+	/* Per CPU local "cache" queues for faster atomic free access.
+	 * The local queues (localq) are Single-Producer-Single-Consumer
+	 * queues as they are per CPU.
+	 */
+	struct qmempool_percpu __percpu *percpu;
+
+	/* Backed by some SLAB kmem_cache */
+	struct kmem_cache	*kmem;
+
+	/* Setup */
+	uint32_t prealloc;
+	gfp_t gfp_mask;
+};
+
+extern void qmempool_destroy(struct qmempool *pool);
+extern struct qmempool *qmempool_create(
+	uint32_t localq_sz, uint32_t sharedq_sz, uint32_t prealloc,
+	struct kmem_cache *kmem, gfp_t gfp_mask);
+
+extern void *__qmempool_alloc_from_sharedq(
+	struct qmempool *pool, gfp_t gfp_mask, struct alf_queue *localq);
+extern void __qmempool_free_to_sharedq(void *elem, struct qmempool *pool,
+				       struct alf_queue *localq);
+
+/* The percpu variables (SPSC queues) needs preempt protection, and
+ * the shared MPMC queue also needs protection against the same CPU
+ * access the same queue.
+ *
+ * Specialize and optimize the qmempool to run from softirq.
+ * Don't allow qmempool to be used from interrupt context.
+ *
+ * IDEA: When used from softirq, take advantage of the protection
+ * softirq gives.  A softirq will never preempt another softirq,
+ * running on the same CPU.  The only event that can preempt a softirq
+ * is an interrupt handler (and perhaps we don't need to support
+ * calling qmempool from an interrupt).  Another softirq, even the
+ * same one, can run on another CPU however, but these helpers are
+ * only protecting our percpu variables.
+ *
+ * Thus, our percpu variables are safe if current the CPU is the one
+ * serving the softirq (tested via in_serving_softirq()), like:
+ *
+ *  if (!in_serving_softirq())
+ *		local_bh_disable();
+ *
+ * This makes qmempool very fast, when accesses from softirq, but
+ * slower when accessed outside softirq.  The other contexts need to
+ * disable bottom-halves "bh" via local_bh_{disable,enable} (which on
+ * have been measured add cost if 7.5ns on CPU E5-2695).
+ *
+ * MUST not be used from interrupt context, when relying on softirq usage.
+ */
+static inline int __qmempool_preempt_disable(void)
+{
+	int in_serving_softirq = in_serving_softirq();
+
+	if (!in_serving_softirq)
+		local_bh_disable();
+
+	return in_serving_softirq;
+}
+
+static inline void __qmempool_preempt_enable(int in_serving_softirq)
+{
+	if (!in_serving_softirq)
+		local_bh_enable();
+}
+
+/* Elements - alloc and free functions are inlined here for
+ * performance reasons, as the per CPU lockless access should be as
+ * fast as possible.
+ */
+
+/* Main allocation function
+ *
+ * Caller must make sure this is called from a preemptive safe context
+ */
+static inline void * main_qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+	/* NUMA considerations, for now the numa node is not handles,
+	 * this could be handled via e.g. numa_mem_id()
+	 */
+	void *elem;
+	struct qmempool_percpu *cpu;
+	int num;
+
+	/* 1. attempt get element from local per CPU queue */
+	cpu = this_cpu_ptr(pool->percpu);
+	num = alf_sc_dequeue(cpu->localq, (void **)&elem, 1);
+	if (num == 1) /* Succes: alloc elem by deq from localq cpu cache */
+		return elem;
+
+	/* 2. attempt get element from shared queue.  This involves
+	 * refilling the localq for next round. Side-effect can be
+	 * alloc from SLAB.
+	 */
+	elem = __qmempool_alloc_from_sharedq(pool, gfp_mask, cpu->localq);
+	return elem;
+}
+
+static inline void *__qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+	void *elem;
+	int state;
+
+	state = __qmempool_preempt_disable();
+	elem  = main_qmempool_alloc(pool, gfp_mask);
+	__qmempool_preempt_enable(state);
+	return elem;
+}
+
+static inline void *__qmempool_alloc_softirq(struct qmempool *pool,
+					     gfp_t gfp_mask)
+{
+	return main_qmempool_alloc(pool, gfp_mask);
+}
+
+/* Main free function */
+static inline void __qmempool_free(struct qmempool *pool, void *elem)
+{
+	struct qmempool_percpu *cpu;
+	int num;
+	int state;
+
+	/* NUMA considerations, how do we make sure to avoid caching
+	 * elements from a different NUMA node.
+	 */
+	state = __qmempool_preempt_disable();
+
+	/* 1. attempt to free/return element to local per CPU queue */
+	cpu = this_cpu_ptr(pool->percpu);
+	num = alf_sp_enqueue(cpu->localq, &elem, 1);
+	if (num == 1) /* success: element free'ed by enqueue to localq */
+		goto done;
+
+	/* 2. localq cannot store more elements, need to return some
+	 * from localq to sharedq, to make room. Side-effect can be
+	 * free to SLAB.
+	 */
+	__qmempool_free_to_sharedq(elem, pool, cpu->localq);
+
+done:
+	__qmempool_preempt_enable(state);
+}
+
+/* API users can choose to use "__" prefixed versions for inlining */
+extern void *qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask);
+extern void *qmempool_alloc_softirq(struct qmempool *pool, gfp_t gfp_mask);
+extern void qmempool_free(struct qmempool *pool, void *elem);
+
+#endif /* _LINUX_QMEMPOOL_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..abaa94c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,15 @@ config MAX_STACK_SIZE_MB
 	  changed to a smaller value in which case that is used.
 
 	  A sane initial value is 80 MB.
+
+config QMEMPOOL
+	bool "Quick queue based mempool (qmempool)"
+	default y
+	select ALF_QUEUE
+	help
+	  A mempool designed for faster than SLAB/kmem_cache
+	  reuse/caching of fixed size memory elements.  Works as a
+	  caching layer in-front of existing kmem_cache SLABs.  Speed
+	  is achieved by _bulk_ refilling percpu local cache, from a
+	  Lock-Free queue requiring a single (locked) cmpxchg per bulk
+	  transfer, thus amortizing the cost of the cmpxchg.
diff --git a/mm/Makefile b/mm/Makefile
index 8405eb0..49c1e18 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -69,3 +69,4 @@ obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_QMEMPOOL) += qmempool.o
diff --git a/mm/qmempool.c b/mm/qmempool.c
new file mode 100644
index 0000000..d6debcc
--- /dev/null
+++ b/mm/qmempool.c
@@ -0,0 +1,322 @@
+/*
+ * qmempool - a quick queue based mempool
+ *
+ * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
+ *  for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/percpu.h>
+#include <linux/qmempool.h>
+#include <linux/log2.h>
+
+/* Due to hotplug CPU support, we need access to all qmempools
+ * in-order to cleanup elements in localq for the CPU going offline.
+ *
+ * TODO: implement HOTPLUG_CPU
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(qmempool_list);
+static DEFINE_SPINLOCK(qmempool_list_lock);
+#endif
+ */
+
+void qmempool_destroy(struct qmempool *pool)
+{
+	void *elem = NULL;
+	int j;
+
+	if (pool->percpu) {
+		for_each_possible_cpu(j) {
+			struct qmempool_percpu *cpu =
+				per_cpu_ptr(pool->percpu, j);
+
+			while (alf_mc_dequeue(cpu->localq, &elem, 1) == 1)
+				kmem_cache_free(pool->kmem, elem);
+			BUG_ON(!alf_queue_empty(cpu->localq));
+			alf_queue_free(cpu->localq);
+		}
+		free_percpu(pool->percpu);
+	}
+
+	if (pool->sharedq) {
+		while (alf_mc_dequeue(pool->sharedq, &elem, 1) == 1)
+			kmem_cache_free(pool->kmem, elem);
+		BUG_ON(!alf_queue_empty(pool->sharedq));
+		alf_queue_free(pool->sharedq);
+	}
+
+	kfree(pool);
+}
+EXPORT_SYMBOL(qmempool_destroy);
+
+struct qmempool *
+qmempool_create(uint32_t localq_sz, uint32_t sharedq_sz, uint32_t prealloc,
+		struct kmem_cache *kmem, gfp_t gfp_mask)
+{
+	struct qmempool *pool;
+	int i, j, num;
+	void *elem;
+
+	/* Validate constraints, e.g. due to bulking */
+	if (localq_sz < QMEMPOOL_BULK) {
+		pr_err("%s() localq size(%d) too small for bulking\n",
+		       __func__, localq_sz);
+		return NULL;
+	}
+	if (sharedq_sz < (QMEMPOOL_BULK * QMEMPOOL_REFILL_MULTIPLIER)) {
+		pr_err("%s() sharedq size(%d) too small for bulk refill\n",
+		       __func__, sharedq_sz);
+		return NULL;
+	}
+	if (!is_power_of_2(localq_sz) || !is_power_of_2(sharedq_sz)) {
+		pr_err("%s() queue sizes (%d/%d) must be power-of-2\n",
+		       __func__, localq_sz, sharedq_sz);
+		return NULL;
+	}
+	if (prealloc > sharedq_sz) {
+		pr_err("%s() prealloc(%d) req > sharedq size(%d)\n",
+		       __func__, prealloc, sharedq_sz);
+		return NULL;
+	}
+	if ((prealloc % QMEMPOOL_BULK) != 0) {
+		pr_warn("%s() prealloc(%d) should be div by BULK size(%d)\n",
+			__func__, prealloc, QMEMPOOL_BULK);
+	}
+	if (!kmem) {
+		pr_err("%s() kmem_cache is a NULL ptr\n",  __func__);
+		return NULL;
+	}
+
+	pool = kzalloc(sizeof(*pool), gfp_mask);
+	if (!pool)
+		return NULL;
+	pool->kmem     = kmem;
+	pool->gfp_mask = gfp_mask;
+
+	/* MPMC (Multi-Producer-Multi-Consumer) queue */
+	pool->sharedq = alf_queue_alloc(sharedq_sz, gfp_mask);
+	if (IS_ERR_OR_NULL(pool->sharedq)) {
+		pr_err("%s() failed to create shared queue(%d) ERR_PTR:0x%p\n",
+		       __func__, sharedq_sz, pool->sharedq);
+		qmempool_destroy(pool);
+		return NULL;
+	}
+
+	pool->prealloc = prealloc;
+	for (i = 0; i < prealloc; i++) {
+		elem = kmem_cache_alloc(pool->kmem, gfp_mask);
+		if (!elem) {
+			pr_err("%s() kmem_cache out of memory?!\n",  __func__);
+			qmempool_destroy(pool);
+			return NULL;
+		}
+		/* Could use the SP version given it is not visible yet */
+		num = alf_mp_enqueue(pool->sharedq, &elem, 1);
+		BUG_ON(num <= 0);
+	}
+
+	pool->percpu = alloc_percpu(struct qmempool_percpu);
+	if (pool->percpu == NULL) {
+		pr_err("%s() failed to alloc percpu\n", __func__);
+		qmempool_destroy(pool);
+		return NULL;
+	}
+
+	/* SPSC (Single-Consumer-Single-Producer) queue per CPU */
+	for_each_possible_cpu(j) {
+		struct qmempool_percpu *cpu = per_cpu_ptr(pool->percpu, j);
+
+		cpu->localq = alf_queue_alloc(localq_sz, gfp_mask);
+		if (IS_ERR_OR_NULL(cpu->localq)) {
+			pr_err("%s() failed alloc localq(sz:%d) on cpu:%d\n",
+			       __func__, localq_sz, j);
+			qmempool_destroy(pool);
+			return NULL;
+		}
+	}
+
+	return pool;
+}
+EXPORT_SYMBOL(qmempool_create);
+
+/* Element handling
+ */
+
+/* This function is called when sharedq runs-out of elements.
+ * Thus, sharedq needs to be refilled (enq) with elems from slab.
+ *
+ * Caller must assure this is called in an preemptive safe context due
+ * to alf_mp_enqueue() call.
+ */
+void *__qmempool_alloc_from_slab(struct qmempool *pool, gfp_t gfp_mask)
+{
+	void *elems[QMEMPOOL_BULK]; /* on stack variable */
+	void *elem;
+	int num, i, j;
+
+	/* Cannot use SLAB that can sleep if (gfp_mask & __GFP_WAIT),
+	 * else preemption disable/enable scheme becomes too complicated
+	 */
+	BUG_ON(gfp_mask & __GFP_WAIT);
+
+	elem = kmem_cache_alloc(pool->kmem, gfp_mask);
+	if (elem == NULL) /* slab depleted, no reason to call below allocs */
+		return NULL;
+
+	/* SLAB considerations, we need a kmem_cache interface that
+	 * supports allocating a bulk of elements.
+	 */
+
+	for (i = 0; i < QMEMPOOL_REFILL_MULTIPLIER; i++) {
+		for (j = 0; j < QMEMPOOL_BULK; j++) {
+			elems[j] = kmem_cache_alloc(pool->kmem, gfp_mask);
+			/* Handle if slab gives us NULL elem */
+			if (elems[j] == NULL) {
+				pr_err("%s() ARGH - slab returned NULL",
+				       __func__);
+				num = alf_mp_enqueue(pool->sharedq, elems, j-1);
+				BUG_ON(num == 0); //FIXME handle
+				return elem;
+			}
+		}
+		num = alf_mp_enqueue(pool->sharedq, elems, QMEMPOOL_BULK);
+		/* FIXME: There is a theoretical chance that multiple
+		 * CPU enter here, refilling sharedq at the same time,
+		 * thus we must handle "full" situation, for now die
+		 * hard so someone will need to fix this.
+		 */
+		BUG_ON(num == 0); /* sharedq should have room */
+	}
+
+	/* What about refilling localq here? (else it will happen on
+	 * next cycle, and will cost an extra cmpxchg).
+	 */
+	return elem;
+}
+
+/* This function is called when the localq runs out-of elements.
+ * Thus, localq is refilled (enq) with elements (deq) from sharedq.
+ *
+ * Caller must assure this is called in an preemptive safe context due
+ * to alf_mp_dequeue() call.
+ */
+void *__qmempool_alloc_from_sharedq(struct qmempool *pool, gfp_t gfp_mask,
+				    struct alf_queue *localq)
+{
+	void *elems[QMEMPOOL_BULK]; /* on stack variable */
+	void *elem;
+	int num;
+
+	/* Costs atomic "cmpxchg", but amortize cost by bulk dequeue */
+	num = alf_mc_dequeue(pool->sharedq, elems, QMEMPOOL_BULK);
+	if (likely(num > 0)) {
+		/* Consider prefetching data part of elements here, it
+		 * should be an optimal place to hide memory prefetching.
+		 * Especially given the localq is known to be an empty FIFO
+		 * which guarantees the order objs are accessed in.
+		 */
+		elem = elems[0]; /* extract one element */
+		if (num > 1) {
+			num = alf_sp_enqueue(localq, &elems[1], num-1);
+			/* Refill localq, should be empty, must succeed */
+			BUG_ON(num == 0);
+		}
+		return elem;
+	}
+	/* Use slab if sharedq runs out of elements */
+	elem = __qmempool_alloc_from_slab(pool, gfp_mask);
+	return elem;
+}
+EXPORT_SYMBOL(__qmempool_alloc_from_sharedq);
+
+/* Called when sharedq is full. Thus also make room in sharedq,
+ * besides also freeing the "elems" given.
+ */
+bool __qmempool_free_to_slab(struct qmempool *pool, void **elems, int n)
+{
+	int num, i, j;
+	/* SLAB considerations, we could use kmem_cache interface that
+	 * supports returning a bulk of elements.
+	 */
+
+	/* free these elements for real */
+	for (i = 0; i < n; i++)
+		kmem_cache_free(pool->kmem, elems[i]);
+
+	/* Make room in sharedq for next round */
+	for (i = 0; i < QMEMPOOL_REFILL_MULTIPLIER; i++) {
+		num = alf_mc_dequeue(pool->sharedq, elems, QMEMPOOL_BULK);
+		for (j = 0; j < num; j++)
+			kmem_cache_free(pool->kmem, elems[j]);
+	}
+	return true;
+}
+
+/* This function is called when the localq is full. Thus, elements
+ * from localq needs to be (dequeued) and returned (enqueued) to
+ * sharedq (or if shared is full, need to be free'ed to slab)
+ *
+ * MUST be called from a preemptive safe context.
+ */
+void __qmempool_free_to_sharedq(void *elem, struct qmempool *pool,
+				struct alf_queue *localq)
+{
+	void *elems[QMEMPOOL_BULK]; /* on stack variable */
+	int num_enq, num_deq;
+
+	elems[0] = elem;
+	/* Make room in localq */
+	num_deq = alf_sc_dequeue(localq, &elems[1], QMEMPOOL_BULK-1);
+	if (unlikely(num_deq == 0))
+		goto failed;
+	num_deq++; /* count first 'elem' */
+
+	/* Successful dequeued 'num_deq' elements from localq, "free"
+	 * these elems by enqueuing to sharedq
+	 */
+	num_enq = alf_mp_enqueue(pool->sharedq, elems, num_deq);
+	if (likely(num_enq == num_deq)) /* Success enqueued to sharedq */
+		return;
+
+	/* If sharedq is full (num_enq == 0) dequeue elements will be
+	 * returned directly to the SLAB allocator.
+	 *
+	 * Note: This usage of alf_queue API depend on enqueue is
+	 * fixed, by only enqueueing if all elements could fit, this
+	 * is an API that might change.
+	 */
+
+	__qmempool_free_to_slab(pool, elems, num_deq);
+	return;
+failed:
+	/* dequeing from a full localq should always be possible */
+	BUG();
+}
+EXPORT_SYMBOL(__qmempool_free_to_sharedq);
+
+/* API users can choose to use "__" prefixed versions for inlining */
+void *qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+	return __qmempool_alloc(pool, gfp_mask);
+}
+EXPORT_SYMBOL(qmempool_alloc);
+
+void *qmempool_alloc_softirq(struct qmempool *pool, gfp_t gfp_mask)
+{
+	return __qmempool_alloc_softirq(pool, gfp_mask);
+}
+EXPORT_SYMBOL(qmempool_alloc_softirq);
+
+void qmempool_free(struct qmempool *pool, void *elem)
+{
+	return __qmempool_free(pool, elem);
+}
+EXPORT_SYMBOL(qmempool_free);
+
+MODULE_DESCRIPTION("Quick queue based mempool (qmempool)");
+MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@brouer.com>");
+MODULE_LICENSE("GPL");

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [RFC PATCH 3/3] net: use qmempool in-front of sk_buff kmem_cache
From: Jesper Dangaard Brouer @ 2014-12-10 14:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, linux-kernel, linux-mm,
	Christoph Lameter
  Cc: linux-api, Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
	Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
	Mathieu Desnoyers, Steven Rostedt
In-Reply-To: <20141210141332.31779.56391.stgit@dragon>

This patch uses qmempool for faster than SLAB caching of SKBs.

Only use this caching in connection with napi_alloc_skb() which runs
in softirq context.  This softirq context provides the needed
protection for qmempool and the underlying alf_queue.

Current caching settings are max 32 elements per CPU, which is 8192
bytes given SKB is SLAB_HWCACHE_ALIGN'ed.  The shared queue max limit
is 1024 which corresponds to worst-case 263KB memory usage.  Systems
with a NR_CPUS <= 8 will get a smaller max shared queue.

Benchmarked on a E5-2695 12-cores (no-HT) with ixgbe.
Baseline is Alex'es napi_alloc_skb patchset.

Single flow/CPU, early drop in iptables RAW table (fast path compare):
 * baseline: 3,159,160 pps
 * qmempool: 3,282,508 pps
 - Diff to baseline: +123348 pps => -11.89 ns

IP-forward single flow/cpu (slower path compare):
 * baseline: 1,137,284 pps
 * qmempool: 1,191,856 pps
 - Diff to baseline: +54572 pps => -40.26 ns

Some of the improvements also come from qmempool_{alloc,free} have
smaller code size than kmem_cache_{alloc,free}, which helps reduce
instruction-cache misses.

Also did some scaling tests, to stress qmempool sharedq allocs (which
stress the alf_queue's concurrency).

IP-forward MULTI flow/cpu (12 CPUs E5-2695 no-HT, 12 HWQs):
 * baseline: 11,946,666 pps
 * qmempool: 11,988,042 pps
 - Diff to baseline: +41376 pps => -0.29 ns

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/linux/skbuff.h |    4 +++-
 net/core/dev.c         |    5 ++++-
 net/core/skbuff.c      |   43 ++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index af79302..8881215 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -152,6 +152,7 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct qmempool;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -557,8 +558,8 @@ struct sk_buff {
 				fclone:2,
 				peeked:1,
 				head_frag:1,
+				qmempool:1,
 				xmit_more:1;
-	/* one bit hole */
 	kmemcheck_bitfield_end(flags1);
 
 	/* fields enclosed in headers_start/headers_end are copied
@@ -755,6 +756,7 @@ void skb_tx_error(struct sk_buff *skb);
 void consume_skb(struct sk_buff *skb);
 void  __kfree_skb(struct sk_buff *skb);
 extern struct kmem_cache *skbuff_head_cache;
+extern struct qmempool *skbuff_head_pool;
 
 void kfree_skb_partial(struct sk_buff *skb, bool head_stolen);
 bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/dev.c b/net/core/dev.c
index 80f798d..0c95fbd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -135,6 +135,7 @@
 #include <linux/if_macvlan.h>
 #include <linux/errqueue.h>
 #include <linux/hrtimer.h>
+#include <linux/qmempool.h>
 
 #include "net-sysfs.h"
 
@@ -4125,7 +4126,9 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 
 	case GRO_MERGED_FREE:
 		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
-			kmem_cache_free(skbuff_head_cache, skb);
+			(skb->qmempool) ?
+				qmempool_free(skbuff_head_pool, skb) :
+				kmem_cache_free(skbuff_head_cache, skb);
 		else
 			__kfree_skb(skb);
 		break;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ae13ef6..a96ce75 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -74,10 +74,24 @@
 #include <asm/uaccess.h>
 #include <trace/events/skb.h>
 #include <linux/highmem.h>
+#include <linux/qmempool.h>
+#include <linux/log2.h>
 
 struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+/* Keep max 32 skbs per CPU = 8192 bytes per CPU (as skb is
+ * SLAB_HWCACHE_ALIGN).  Sharedq cache is limited to max 1024 elems
+ * which is max 262KB skb memory, on small systems it allocs 32*2
+ * elems * NR_CPUS.
+ */
+struct qmempool *skbuff_head_pool __read_mostly;
+#define QMEMPOOL_LOCALQ 32
+#define QMEMPOOL_SCALE  (QMEMPOOL_LOCALQ * 2)
+#define QMEMPOOL_SYSTEM_SIZE roundup_pow_of_two(NR_CPUS * QMEMPOOL_SCALE)
+#define QMEMPOOL_SHAREDQ min(1024UL, QMEMPOOL_SYSTEM_SIZE)
+#define QMEMPOOL_PREALLOC 0
+
 /**
  *	skb_panic - private function for out-of-line support
  *	@skb:	buffer
@@ -278,13 +292,14 @@ nodata:
 EXPORT_SYMBOL(__alloc_skb);
 
 /**
- * build_skb - build a network buffer
+ * __build_skb - build a network buffer
  * @data: data buffer provided by caller
  * @frag_size: size of fragment, or 0 if head was kmalloced
  *
  * Allocate a new &sk_buff. Caller provides space holding head and
  * skb_shared_info. @data must have been allocated by kmalloc() only if
  * @frag_size is 0, otherwise data should come from the page allocator.
+ * @flags: FIXME-DESCRIBE
  * The return is the new skb buffer.
  * On a failure the return is %NULL, and @data is not freed.
  * Notes :
@@ -295,13 +310,16 @@ EXPORT_SYMBOL(__alloc_skb);
  *  before giving packet to stack.
  *  RX rings only contains data buffers, not full skbs.
  */
-struct sk_buff *build_skb(void *data, unsigned int frag_size)
+static struct sk_buff *__build_skb(void *data, unsigned int frag_size,
+				   int flags)
 {
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	skb = (flags & SKB_ALLOC_NAPI) ?
+		qmempool_alloc_softirq(skbuff_head_pool, GFP_ATOMIC) :
+		kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
 	if (!skb)
 		return NULL;
 
@@ -310,6 +328,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = SKB_TRUESIZE(size);
 	skb->head_frag = frag_size != 0;
+	skb->qmempool = !!(flags & SKB_ALLOC_NAPI);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
 	skb->data = data;
@@ -326,6 +345,11 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 
 	return skb;
 }
+
+struct sk_buff *build_skb(void *data, unsigned int frag_size)
+{
+	return __build_skb(data, frag_size, 0);
+}
 EXPORT_SYMBOL(build_skb);
 
 struct netdev_alloc_cache {
@@ -477,7 +501,7 @@ static struct sk_buff *__alloc_rx_skb(unsigned int length, gfp_t gfp_mask,
 			__netdev_alloc_frag(fragsz, gfp_mask);
 
 		if (likely(data)) {
-			skb = build_skb(data, fragsz);
+			skb = __build_skb(data, fragsz, flags);
 			if (unlikely(!skb))
 				put_page(virt_to_head_page(data));
 		}
@@ -637,7 +661,9 @@ static void kfree_skbmem(struct sk_buff *skb)
 
 	switch (skb->fclone) {
 	case SKB_FCLONE_UNAVAILABLE:
-		kmem_cache_free(skbuff_head_cache, skb);
+		(skb->qmempool) ?
+			qmempool_free(skbuff_head_pool, skb) :
+			kmem_cache_free(skbuff_head_cache, skb);
 		return;
 
 	case SKB_FCLONE_ORIG:
@@ -862,6 +888,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	C(end);
 	C(head);
 	C(head_frag);
+	C(qmempool);
 	C(data);
 	C(truesize);
 	atomic_set(&n->users, 1);
@@ -3370,6 +3397,12 @@ void __init skb_init(void)
 						0,
 						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 						NULL);
+	/* connect qmempools to slabs */
+	skbuff_head_pool = qmempool_create(QMEMPOOL_LOCALQ,
+					   QMEMPOOL_SHAREDQ,
+					   QMEMPOOL_PREALLOC,
+					   skbuff_head_cache, GFP_ATOMIC);
+	BUG_ON(skbuff_head_pool == NULL);
 }
 
 /**

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* RE: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: David Laight @ 2014-12-10 14:22 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Christoph Lameter
  Cc: linux-api@vger.kernel.org, Eric Dumazet, David S. Miller,
	Hannes Frederic Sowa, Alexander Duyck, Alexei Starovoitov,
	Paul E. McKenney, Mathieu Desnoyers, Steven Rostedt
In-Reply-To: <20141210141332.31779.56391.stgit@dragon>

From: Jesper Dangaard Brouer
> The network stack have some use-cases that puts some extreme demands
> on the memory allocator.  One use-case, 10Gbit/s wirespeed at smallest
> packet size[1], requires handling a packet every 67.2 ns (nanosec).
> 
> Micro benchmarking[2] the SLUB allocator (with skb size 256bytes
> elements), show "fast-path" instant reuse only costs 19 ns, but a
> closer to network usage pattern show the cost rise to 45 ns.
> 
> This patchset introduce a quick mempool (qmempool), which when used
> in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on "fast-path"
> drop in iptables "raw" table, but more importantly saves 40 ns with
> IP-forwarding, which were hitting the slower SLUB use-case.
> 
> 
> One of the building blocks for achieving this speedup is a cmpxchg
> based Lock-Free queue that supports bulking, named alf_queue for
> Array-based Lock-Free queue.  By bulking elements (pointers) from the
> queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> elements.

It seems to me that these improvements could be added to the
underlying allocator itself.
Nesting allocators doesn't really seem right to me.

	David


^ permalink raw reply

* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: Jesper Dangaard Brouer @ 2014-12-10 14:40 UTC (permalink / raw)
  To: David Laight
  Cc: brouer, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, Christoph Lameter, linux-api@vger.kernel.org,
	Eric Dumazet, David S. Miller, Hannes Frederic Sowa,
	Alexander Duyck, Alexei Starovoitov, Paul E. McKenney,
	Mathieu Desnoyers, Steven Rostedt
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1CA0A193@AcuExch.aculab.com>

On Wed, 10 Dec 2014 14:22:22 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > The network stack have some use-cases that puts some extreme demands
> > on the memory allocator.  One use-case, 10Gbit/s wirespeed at smallest
> > packet size[1], requires handling a packet every 67.2 ns (nanosec).
> > 
> > Micro benchmarking[2] the SLUB allocator (with skb size 256bytes
> > elements), show "fast-path" instant reuse only costs 19 ns, but a
> > closer to network usage pattern show the cost rise to 45 ns.
> > 
> > This patchset introduce a quick mempool (qmempool), which when used
> > in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on "fast-path"
> > drop in iptables "raw" table, but more importantly saves 40 ns with
> > IP-forwarding, which were hitting the slower SLUB use-case.
> > 
> > 
> > One of the building blocks for achieving this speedup is a cmpxchg
> > based Lock-Free queue that supports bulking, named alf_queue for
> > Array-based Lock-Free queue.  By bulking elements (pointers) from the
> > queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> > elements.
> 
> It seems to me that these improvements could be added to the
> underlying allocator itself.
> Nesting allocators doesn't really seem right to me.

Yes, I would very much like to see these ideas integrated into the
underlying allocators (hence addressing the mm-list).

This patchset demonstrates that it is possible to do something faster
than the existing SLUB allocator.  Which the network stack have a need
for.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [REPORT PATCH] driver core: amba: add device binding path 'driver_override'
From: Antonios Motakis @ 2014-12-10 14:47 UTC (permalink / raw)
  To: Russell King
  Cc: Eric Auger, Marc Zyngier, open list:ABI/API, Will Deacon,
	open list, Linux IOMMU, Antonios Motakis,
	VirtualOpenSystems Technical Team, kvm-arm, Christoffer Dall
In-Reply-To: <1417109154-10331-1-git-send-email-a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>

Dear Russel,

Do you think this patch can be included eventually?

If not, what would we still need to change, or is there another
approach you would accept?

Thanks and best regards,
Antonios

On Thu, Nov 27, 2014 at 6:25 PM, Antonios Motakis
<a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org> wrote:
> As already demonstrated with PCI [1] and the platform bus [2], a
> driver_override property in sysfs can be used to bypass the id matching
> of a device to a AMBA driver. This can be used by VFIO to bind to any AMBA
> device requested by the user.
>
> [1] http://lists-archives.com/linux-kernel/28030441-pci-introduce-new-device-binding-path-using-pci_dev-driver_override.html
> [2] https://www.redhat.com/archives/libvir-list/2014-April/msg00382.html
>
> Signed-off-by: Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
> Reviewed-by: Kim Phillips <kim.phillips-KZfg59tc24xl57MIdRCFDg@public.gmane.org>
> ---
>  Documentation/ABI/testing/sysfs-bus-amba | 20 ++++++++++++++
>  drivers/amba/bus.c                       | 47 ++++++++++++++++++++++++++++++++
>  include/linux/amba/bus.h                 |  1 +
>  3 files changed, 68 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-amba
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-amba b/Documentation/ABI/testing/sysfs-bus-amba
> new file mode 100644
> index 0000000..e7b5467
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-amba
> @@ -0,0 +1,20 @@
> +What:          /sys/bus/amba/devices/.../driver_override
> +Date:          September 2014
> +Contact:       Antonios Motakis <a.motakis-lrHrjnjw1UfHK3s98zE1ajGjJy/sRE9J@public.gmane.org>
> +Description:
> +               This file allows the driver for a device to be specified which
> +               will override standard OF, ACPI, ID table, and name matching.
> +               When specified, only a driver with a name matching the value
> +               written to driver_override will have an opportunity to bind to
> +               the device. The override is specified by writing a string to the
> +               driver_override file (echo vfio-amba > driver_override) and may
> +               be cleared with an empty string (echo > driver_override).
> +               This returns the device to standard matching rules binding.
> +               Writing to driver_override does not automatically unbind the
> +               device from its current driver or make any attempt to
> +               automatically load the specified driver. If no driver with a
> +               matching name is currently loaded in the kernel, the device will
> +               not bind to any driver. This also allows devices to opt-out of
> +               driver binding using a driver_override name such as "none".
> +               Only a single driver may be specified in the override, there is
> +               no support for parsing delimiters.
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index 47bbdc1..14bb08b 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -18,6 +18,7 @@
>  #include <linux/pm_domain.h>
>  #include <linux/amba/bus.h>
>  #include <linux/sizes.h>
> +#include <linux/limits.h>
>
>  #include <asm/irq.h>
>
> @@ -43,6 +44,10 @@ static int amba_match(struct device *dev, struct device_driver *drv)
>         struct amba_device *pcdev = to_amba_device(dev);
>         struct amba_driver *pcdrv = to_amba_driver(drv);
>
> +       /* When driver_override is set, only bind to the matching driver */
> +       if (pcdev->driver_override)
> +               return !strcmp(pcdev->driver_override, drv->name);
> +
>         return amba_lookup(pcdrv->id_table, pcdev) != NULL;
>  }
>
> @@ -59,6 +64,47 @@ static int amba_uevent(struct device *dev, struct kobj_uevent_env *env)
>         return retval;
>  }
>
> +static ssize_t driver_override_show(struct device *_dev,
> +                                   struct device_attribute *attr, char *buf)
> +{
> +       struct amba_device *dev = to_amba_device(_dev);
> +
> +       if (!dev->driver_override)
> +               return 0;
> +
> +       return sprintf(buf, "%s\n", dev->driver_override);
> +}
> +
> +static ssize_t driver_override_store(struct device *_dev,
> +                                    struct device_attribute *attr,
> +                                    const char *buf, size_t count)
> +{
> +       struct amba_device *dev = to_amba_device(_dev);
> +       char *driver_override, *old = dev->driver_override, *cp;
> +
> +       if (count > PATH_MAX)
> +               return -EINVAL;
> +
> +       driver_override = kstrndup(buf, count, GFP_KERNEL);
> +       if (!driver_override)
> +               return -ENOMEM;
> +
> +       cp = strchr(driver_override, '\n');
> +       if (cp)
> +               *cp = '\0';
> +
> +       if (strlen(driver_override)) {
> +               dev->driver_override = driver_override;
> +       } else {
> +              kfree(driver_override);
> +              dev->driver_override = NULL;
> +       }
> +
> +       kfree(old);
> +
> +       return count;
> +}
> +
>  #define amba_attr_func(name,fmt,arg...)                                        \
>  static ssize_t name##_show(struct device *_dev,                                \
>                            struct device_attribute *attr, char *buf)    \
> @@ -81,6 +127,7 @@ amba_attr_func(resource, "\t%016llx\t%016llx\t%016lx\n",
>  static struct device_attribute amba_dev_attrs[] = {
>         __ATTR_RO(id),
>         __ATTR_RO(resource),
> +       __ATTR_RW(driver_override),
>         __ATTR_NULL,
>  };
>
> diff --git a/include/linux/amba/bus.h b/include/linux/amba/bus.h
> index c324f57..34a7004 100644
> --- a/include/linux/amba/bus.h
> +++ b/include/linux/amba/bus.h
> @@ -32,6 +32,7 @@ struct amba_device {
>         struct clk              *pclk;
>         unsigned int            periphid;
>         unsigned int            irq[AMBA_NR_IRQS];
> +       char                    *driver_override;
>  };
>
>  struct amba_driver {
> --
> 2.1.3
>

^ permalink raw reply

* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: Christoph Lameter @ 2014-12-10 15:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, linux-kernel, linux-mm, linux-api
In-Reply-To: <20141210141332.31779.56391.stgit@dragon>

On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:

>  Patch1: alf_queue (Lock-Free queue)

For some reason that key patch is not in my linux-mm archives nor in my
inbox.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: Jesper Dangaard Brouer @ 2014-12-10 15:33 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, linux-kernel, linux-mm, linux-api, brouer
In-Reply-To: <alpine.DEB.2.11.1412100917020.4047@gentwo.org>

On Wed, 10 Dec 2014 09:17:32 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:

> On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
> 
> >  Patch1: alf_queue (Lock-Free queue)
> 
> For some reason that key patch is not in my linux-mm archives nor in my
> inbox.

That is very strange! I did notice that it was somehow delayed in
showing up on gmane.org (http://thread.gmane.org/gmane.linux.network/342347/focus=126148)
and didn't show up on netdev either...

Inserting it below (hoping my email client does not screw it up):


[PATCH RFC] lib: adding an Array-based Lock-Free (ALF) queue

From: Jesper Dangaard Brouer <brouer@redhat.com>

This Array-based Lock-Free (ALF) queue, is a very fast bounded
Producer-Consumer queue, supporting bulking.  The MPMC
(Multi-Producer/Multi-Consumer) variant uses a locked cmpxchg, but the
cost can be amorized by utilizing bulk enqueue/dequeue.

Results on x86_64 CPU E5-2695, for variants:
 MPMC = Multi-Producer-Multi-Consumer
 SPSC = Single-Producer-Single-Consumer

(none-bulking):  per element cost MPMC and SPSC
                   MPMC     -- SPSC
 simple         :  9.519 ns -- 1.282 ns
 multi(step:128): 12.905 ns -- 2.240 ns

The majority of the cost is associated with the locked cmpxchg in the
MPMC variant.  Bulking helps amortize this cost:

(bulking) cost per element comparing MPMC -> SPSC:
         MPMC     -- SPSC
 bulk2 : 5.849 ns -- 1.748 ns
 bulk3 : 4.102 ns -- 1.531 ns
 bulk4 : 3.281 ns -- 1.383 ns
 bulk6 : 2.530 ns -- 1.238 ns
 bulk8 : 2.125 ns -- 1.196 ns
 bulk16: 1.552 ns -- 1.109 ns

Joint work with Hannes Frederic Sowa.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
----

Correctness of memory barries on different arch's need to be evaluated.

The API might need some adjustments/discussions regarding:

1) the semantics of enq/deq when doing bulking, choosing what
to do when e.g. a bulk enqueue cannot fit fully.

2) some better way to detect miss-use of API, e.g. using
single-enqueue variant function call on a multi-enqueue variant data
structure.  Now no detection happens.
---

 include/linux/alf_queue.h |  303 +++++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig               |   13 ++
 lib/Makefile              |    2 
 lib/alf_queue.c           |   47 +++++++
 4 files changed, 365 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/alf_queue.h
 create mode 100644 lib/alf_queue.c


diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..fb1a774
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,303 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ *  - Array based for cache-line optimization
+ *  - Bounded by the array size
+ *  - FIFO Producer/Consumer queue, no queue traversal supported
+ *  - Very fast
+ *  - Designed as a queue for pointers to objects
+ *  - Bulk enqueue and dequeue support
+ *  - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ *  by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ *  for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+	u32 head;
+	u32 tail;
+};
+
+struct alf_queue {
+	u32 size;
+	u32 mask;
+	u32 flags;
+	struct alf_actor producer ____cacheline_aligned_in_smp;
+	struct alf_actor consumer ____cacheline_aligned_in_smp;
+	void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void		  alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ *  1. They can be reused for both "Single" and "Multi" variants
+ *  2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+static inline void
+__helper_alf_enqueue_store(u32 p_head, struct alf_queue *q,
+			   void **ptr, const u32 n)
+{
+	int i, index = p_head;
+
+	for (i = 0; i < n; i++, index++)
+		q->ring[index & q->mask] = ptr[i];
+}
+
+static inline void
+__helper_alf_dequeue_load(u32 c_head, struct alf_queue *q,
+			  void **ptr, const u32 elems)
+{
+	int i, index = c_head;
+
+	for (i = 0; i < elems; i++, index++)
+		ptr[i] = q->ring[index & q->mask];
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even-though current API have a "fixed" semantics of aborting if it
+ * cannot enqueue the full bulk size.  Users of this API should check
+ * on the returned number of enqueue elements match, to verify enqueue
+ * was successful.  This allow us to introduce a "variable" enqueue
+ * scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 p_head, p_next, c_tail, space;
+
+	/* Reserve part of the array for enqueue STORE/WRITE */
+	do {
+		p_head = ACCESS_ONCE(q->producer.head);
+		c_tail = ACCESS_ONCE(q->consumer.tail);
+
+		space = q->size + c_tail - p_head;
+		if (n > space)
+			return 0;
+
+		p_next = p_head + n;
+	}
+	while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+	/* STORE the elems into the queue array */
+	__helper_alf_enqueue_store(p_head, q, ptr, n);
+	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+	/* Wait for other concurrent preceding enqueues not yet done,
+	 * this part make us none-wait-free and could be problematic
+	 * in case of congestion with many CPUs
+	 */
+	while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+		cpu_relax();
+	/* Mark this enq done and avail for consumption */
+	ACCESS_ONCE(q->producer.tail) = p_next;
+
+	return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 c_head, c_next, p_tail, elems;
+
+	/* Reserve part of the array for dequeue LOAD/READ */
+	do {
+		c_head = ACCESS_ONCE(q->consumer.head);
+		p_tail = ACCESS_ONCE(q->producer.tail);
+
+		elems = p_tail - c_head;
+
+		if (elems == 0)
+			return 0;
+		else
+			elems = min(elems, n);
+
+		c_next = c_head + elems;
+	}
+	while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+	/* LOAD the elems from the queue array.
+	 *   We don't need a smb_rmb() Read-Memory-Barrier here because
+	 *   the above cmpxchg is an implied full Memory-Barrier.
+	 */
+	__helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+	/* Archs with weak Memory Ordering need a memory barrier here.
+	 * As the STORE to q->consumer.tail, must happen after the
+	 * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+	 * ptr, thus a smp_wmb() is enough. Paired with enqueue
+	 * implicit full-MB in cmpxchg.
+	 */
+	smp_wmb();
+
+	/* Wait for other concurrent preceding dequeues not yet done */
+	while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+		cpu_relax();
+	/* Mark this deq done and avail for producers */
+	ACCESS_ONCE(q->consumer.tail) = c_next;
+
+	return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x)							\
+	do {								\
+		if (unlikely(!(x))) {					\
+			pr_crit("Assertion failed %s:%d: \"%s\"\n",	\
+				__FILE__, __LINE__, #x);		\
+			BUG();						\
+		}							\
+	} while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ *  caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 p_head, p_next, c_tail, space;
+
+	/* Reserve part of the array for enqueue STORE/WRITE */
+	p_head = q->producer.head;
+	smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+	c_tail = ACCESS_ONCE(q->consumer.tail);
+
+	space = q->size + c_tail - p_head;
+	if (n > space)
+		return 0;
+
+	p_next = p_head + n;
+	ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+	q->producer.head = p_next;
+
+	/* STORE the elems into the queue array */
+	__helper_alf_enqueue_store(p_head, q, ptr, n);
+	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+	/* Assert no other CPU (or same CPU via preemption) changed queue */
+	ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+	/* Mark this enq done and avail for consumption */
+	ACCESS_ONCE(q->producer.tail) = p_next;
+
+	return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ *  caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 c_head, c_next, p_tail, elems;
+
+	/* Reserve part of the array for dequeue LOAD/READ */
+	c_head = q->consumer.head;
+	p_tail = ACCESS_ONCE(q->producer.tail);
+
+	elems = p_tail - c_head;
+
+	if (elems == 0)
+		return 0;
+	else
+		elems = min(elems, n);
+
+	c_next = c_head + elems;
+	ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+	q->consumer.head = c_next;
+
+	smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+	__helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+	/* Archs with weak Memory Ordering need a memory barrier here.
+	 * As the STORE to q->consumer.tail, must happen after the
+	 * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+	 * ptr, thus a smp_wmb() is enough.
+	 */
+	smp_wmb();
+
+	/* Assert no other CPU (or same CPU via preemption) changed queue */
+	ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+	/* Mark this deq done and avail for producers */
+	ACCESS_ONCE(q->consumer.tail) = c_next;
+
+	return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+	u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+	u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+	/* The empty (and initial state) is when consumer have reached
+	 * up with producer.
+	 *
+	 * DOUBLE-CHECK: Should we use producer.head, as this indicate
+	 * a producer is in-progress(?)
+	 */
+	return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+	u32 c_head = ACCESS_ONCE(q->consumer.head);
+	u32 p_tail = ACCESS_ONCE(q->producer.tail);
+	u32 elems;
+
+	/* Due to u32 arithmetic the values are implicitly
+	 * masked/modulo 32-bit, thus saving one mask operation
+	 */
+	elems = p_tail - c_head;
+	/* Thus, same as:
+	 *  elems = (p_tail - c_head) & q->mask;
+	 */
+	return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+	u32 p_head = ACCESS_ONCE(q->producer.head);
+	u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+	u32 space;
+
+	/* The max avail space is q->size and
+	 * the empty state is when (consumer == producer)
+	 */
+
+	/* Due to u32 arithmetic the values are implicitly
+	 * masked/modulo 32-bit, thus saving one mask operation
+	 */
+	space = q->size + c_tail - p_head;
+	/* Thus, same as:
+	 *  space = (q->size + c_tail - p_head) & q->mask;
+	 */
+	return space;
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 54cf309..3c0cd58 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -439,6 +439,19 @@ config NLATTR
 	bool
 
 #
+# ALF queue
+#
+config ALF_QUEUE
+	bool "ALF: Array-based Lock-Free (Producer-Consumer) queue"
+	default y
+	help
+	  This Array-based Lock-Free (ALF) queue, is a very fast
+	  bounded Producer-Consumer queue, supporting bulking.  The
+	  MPMC (Multi-Producer/Multi-Consumer) variant uses a locked
+	  cmpxchg, but the cost can be amorized by utilizing bulk
+	  enqueue/dequeue.
+
+#
 # Generic 64-bit atomic support is selected if needed
 #
 config GENERIC_ATOMIC64
diff --git a/lib/Makefile b/lib/Makefile
index 0211d2b..cd3a2d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -119,6 +119,8 @@ obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o
 
 obj-$(CONFIG_NLATTR) += nlattr.o
 
+obj-$(CONFIG_ALF_QUEUE) += alf_queue.o
+
 obj-$(CONFIG_LRU_CACHE) += lru_cache.o
 
 obj-$(CONFIG_DMA_API_DEBUG) += dma-debug.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..d6c9b69
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,47 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ *  - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ *  by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ *  for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h> /* kzalloc */
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+	struct alf_queue *q;
+	size_t mem_size;
+
+	if (!(is_power_of_2(size)) || size > 65536)
+		return ERR_PTR(-EINVAL);
+
+	/* The ring array is allocated together with the queue struct */
+	mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+	q = kzalloc(mem_size, gfp);
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
+	q->size = size;
+	q->mask = size - 1;
+
+	return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+	kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);
+
+MODULE_DESCRIPTION("ALF: Array-based Lock-Free queue");
+MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@brouer.com>");
+MODULE_LICENSE("GPL");


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: Christoph Lameter @ 2014-12-10 16:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141210163321.0e4e4fd2-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:

> That is very strange! I did notice that it was somehow delayed in
> showing up on gmane.org (http://thread.gmane.org/gmane.linux.network/342347/focus=126148)
> and didn't show up on netdev either...

It finally got through.

^ permalink raw reply

* [CFT] Can I get some Tested-By's on this series?
From: Eric W. Biederman @ 2014-12-10 16:39 UTC (permalink / raw)
  To: Andy Lutomirski, Serge E. Hallyn, Richard Weinberger
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM, Casey Schaufler, Kenton Varda,
	stable
In-Reply-To: <87iohklfvj.fsf_-_@x220.int.ebiederm.org>


Will people please test these patches with their container project?

These changes break container userspace (hopefully in a minimal way) if
I could have that confirmed by testing I would really appreciate it.  I
really don't want to send out a bug fix that accidentally breaks
userspace again.

The only issue sort of under discussion is if there is a better name for
/proc/<pid>/setgroups, and the name of the file will not affect the
functionality of the patchset.

With the code reviewed and written in simple obviously correct, easily
reviewable ways I am hoping/planning to send this to Linus ASAP.

Eric

^ permalink raw reply

* Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)
From: Christoph Lameter @ 2014-12-10 19:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, linux-kernel, linux-mm, linux-api, Eric Dumazet,
	David S. Miller, Hannes Frederic Sowa, Alexander Duyck,
	Alexei Starovoitov, Paul E. McKenney, Mathieu Desnoyers,
	Steven Rostedt
In-Reply-To: <20141210141332.31779.56391.stgit@dragon>

On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:

> One of the building blocks for achieving this speedup is a cmpxchg
> based Lock-Free queue that supports bulking, named alf_queue for
> Array-based Lock-Free queue.  By bulking elements (pointers) from the
> queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> elements.

This is a bit of an issue since the design of the SLUB allocator is such
that you should pick up an object, apply some processing and then take the
next one. The fetching of an object warms up the first cacheline and this
is tied into the way free objects are linked in SLUB.

So a bulk fetch from SLUB will not that effective and cause the touching
of many cachelines if we are dealing with just a few objects. If we are
looking at whole slab pages with all objects then SLUB can be effective
since we do not have to build up the linked pointer structure in each
page. SLAB has a different architecture there and a bulk fetch there is
possible without touching objects even for small sets since the freelist
management is separate from the objects.

If you do this bulking then you will later access cache cold objects?
Doesnt that negate the benefit that you gain? Or are these objects written
to by hardware and therefore by necessity cache cold?

We could provide a faster bulk alloc/free function.

	int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
		size_t objects, void **array)

and this could be optimized by each slab allocator to provide fast
population of objects in that array. We then assume that the number of
objects is in the hundreds or so right?

The corresponding free function

	void kmem_cache_free_array(struct kmem_cache *s,
		size_t objects, void **array)


I think the queue management of the array can be improved by using a
similar technique as used the SLUB allocator using the cmpxchg_local.
cmpxchg_local is much faster than a full cmpxchg and we are operating on
per cpu structures anyways. So the overhead could still be reduced.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [CFT][PATCH 7/8] userns: Add a knob to disable setgroups on a per user namespace basis
From: Eric W. Biederman @ 2014-12-10 22:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
	stable, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kenton Varda, LSM, Michael Kerrisk-manpages, Richard Weinberger,
	Casey Schaufler, Andrew Morton
In-Reply-To: <CALCETrWpzvNm=fjOa3_+4QOqYP8qZUJvQAd6AsRZ71xyHZQRCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Tue, Dec 9, 2014 at 4:04 PM, Eric W.Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>>
>> On December 9, 2014 4:28:38 PM CST, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>On Tue, Dec 9, 2014 at 12:42 PM, Eric W. Biederman
>>><ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>>>
>>>> - Expose the knob to user space through a proc file
>>>/proc/<pid>/setgroups
>>>>
>>>>   A value of "deny" means the setgroups system call is disabled in
>>>the
>>>>   current processes user namespace and can not be enabled in the
>>>>   future in this user namespace.
>>>>
>>>>   A value of "allow" means the segtoups system call is enabled.
>>>>
>>>> - Descendant user namespaces inherit the value of setgroups from
>>>>   their parents.
>>>>
>>>> - A proc file is used (instead of a sysctl) as sysctls
>>>>   currently do not pass in a struct file so file_ns_capable
>>>>   is unusable.
>>>
>>>Reviewed-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
>>>
>>>But I still don't like the name "setgroups".  People may look at that
>>>and have no clue what the scope of the setting is.  And anyone who, as
>>>root, writes "deny" to /proc/self/setgroups, thinking that it acts on
>>>self, will be in for a surprise.
>>
>> True setgroups isn't perfect.  Documenting it in a manpage may have to be enough. The only real improvement I can think of would be to make the setting a sysctl.   But I think pursuing that approaches the point where perfection is the enemy of getting this problem fixed.
>>
>
> Would "userns_setgroups" be okay?

Maybe.

I just played with this and this is a much bigger booby trap than I had
realized.  Disabling setgroups disables the possibility of logging in the
future and since it is a one way switch the only way out is to reboot.

Hooray our software checks the returns of setgroups.  Booh.  This is a
really nasty knob to have anywhere.

I need to think about this a little bit.  Giving root the power to shoot
himself in the foot is one thing.  Giving root a loaded gun pointed at
his foot with the hammer pulled back, and a sign that says I dare you to
pull the trigger, seems like a bad idea.

I think I need to reduce when that knob can be used.  Grr.
Back to the drawing board!

Eric

^ permalink raw reply

* Re: [CFT] Can I get some Tested-By's on this series?
From: Serge Hallyn @ 2014-12-10 22:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Richard Weinberger, Linux Containers,
	Josh Triplett, stable, Andy Lutomirski, Kenton Varda, LSM,
	Michael Kerrisk-manpages, Linux API, Casey Schaufler,
	Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <87mw6vh31e.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> 
> Will people please test these patches with their container project?
> 
> These changes break container userspace (hopefully in a minimal way) if
> I could have that confirmed by testing I would really appreciate it.  I
> really don't want to send out a bug fix that accidentally breaks
> userspace again.
> 
> The only issue sort of under discussion is if there is a better name for
> /proc/<pid>/setgroups, and the name of the file will not affect the
> functionality of the patchset.
> 
> With the code reviewed and written in simple obviously correct, easily
> reviewable ways I am hoping/planning to send this to Linus ASAP.
> 
> Eric

Is there a git tree we can clone?

^ permalink raw reply

* Re: [CFT] Can I get some Tested-By's on this series?
From: Richard Weinberger @ 2014-12-10 22:50 UTC (permalink / raw)
  To: Serge Hallyn, Eric W. Biederman
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
	stable, Andy Lutomirski, Kenton Varda, LSM,
	Michael Kerrisk-manpages, Casey Schaufler, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20141210224822.GG20012@ubuntumail>

Am 10.12.2014 um 23:48 schrieb Serge Hallyn:
> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>
>> Will people please test these patches with their container project?
>>
>> These changes break container userspace (hopefully in a minimal way) if
>> I could have that confirmed by testing I would really appreciate it.  I
>> really don't want to send out a bug fix that accidentally breaks
>> userspace again.
>>
>> The only issue sort of under discussion is if there is a better name for
>> /proc/<pid>/setgroups, and the name of the file will not affect the
>> functionality of the patchset.
>>
>> With the code reviewed and written in simple obviously correct, easily
>> reviewable ways I am hoping/planning to send this to Linus ASAP.
>>
>> Eric
> 
> Is there a git tree we can clone?

I was about to ask that too.
Hopefully I can tomorrow find some time for testing.

Thanks,
//richard

^ permalink raw reply

* Re: [CFT] Can I get some Tested-By's on this series?
From: Eric W. Biederman @ 2014-12-10 23:19 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Serge Hallyn,
	Josh Triplett, stable, Andy Lutomirski, Kenton Varda, LSM,
	Michael Kerrisk-manpages, Casey Schaufler, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <5488CE4D.1000606-/L3Ra7n9ekc@public.gmane.org>

Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:

> Am 10.12.2014 um 23:48 schrieb Serge Hallyn:
>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>>
>>> Will people please test these patches with their container project?
>>>
>>> These changes break container userspace (hopefully in a minimal way) if
>>> I could have that confirmed by testing I would really appreciate it.  I
>>> really don't want to send out a bug fix that accidentally breaks
>>> userspace again.
>>>
>>> The only issue sort of under discussion is if there is a better name for
>>> /proc/<pid>/setgroups, and the name of the file will not affect the
>>> functionality of the patchset.
>>>
>>> With the code reviewed and written in simple obviously correct, easily
>>> reviewable ways I am hoping/planning to send this to Linus ASAP.
>>>
>>> Eric
>> 
>> Is there a git tree we can clone?
>
> I was about to ask that too.
> Hopefully I can tomorrow find some time for testing.

git pull git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-testing

That holds my entire queue of fixes against 3.18-rc6

Eric

^ permalink raw reply

* Re: [PATCH] Documentation: Add entry for dell-laptop sysfs interface
From: Darren Hart @ 2014-12-11  5:08 UTC (permalink / raw)
  To: Gabriele Mazzotta
  Cc: mjg59-1xO5oi07KQx4cg9Nei1l7Q, linux-api-u79uwXL29TY76Z2rM5mHXA,
	platform-driver-x86-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	libsmbios-devel-XtjxT7Vmt5ZskZv2Y/7f+AC/G2K4zDHf,
	Srinivas_G_Gowda-8PEkshWhKlo, Michael_E_Brown-8PEkshWhKlo,
	pali.rohar-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <1417628493-29323-1-git-send-email-gabriele.mzt-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Wed, Dec 03, 2014 at 06:41:33PM +0100, Gabriele Mazzotta wrote:
> Add the documentation for the new sysfs interface of dell-laptop
> that allows to configure the keyboard illumination on Dell systems.

Queued to for-next.

Thanks,

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox