Linux userland API discussions

* Re: [PATCH v1 5/5] IB/core: ib_copy_to_udata(): don't silently truncate response
From: Haggai Eran @ 2015-02-01  8:47 UTC
  To: Yann Droneaud, Sagi Grimberg, Shachar Raindel, Eli Cohen,
	Roland Dreier
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <c69af8952bf25fdbcdfc527b0636bc3177798b95.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On 29/01/2015 20:00, Yann Droneaud wrote:
> While ib_copy_to_udata() should check for the available output
> space as already proposed in some other patches [1][2][3], the
> changes brought by commit 5a77abf9a97a ("IB/core: Add support for
> extended query device caps") are silently truncating the data to
> be written to userspace if the output buffer is not large enough
> to hold the response data.
> 
> Silently truncating the response is not a reliable behavior as
> userspace is not given any hint about this truncation: userspace
> is left with garbage to play with.
> 
> Not checking the response buffer size and writing past the
> userspace buffer is no good either, but it's the current behavior.
> 
> So this patch reverts the particular change on ib_copy_to_udata()
> as a better behavior is implemented in the upper level function
> ib_uverbs_ex_query_device().
> 
> [1] "[PATCH 00/22] infiniband: improve userspace input check"
> 
> http://mid.gmane.org/cover.1376847403.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
> 
> [2] "[PATCH 03/22] infiniband: ib_copy_from_udata(): check input length"
> 
> http://mid.gmane.org/2bf102a41c51f61965ee09df827abe8fefb523a9.1376847403.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
> 
> [3] "[PATCH 04/22] infiniband: ib_copy_to_udata(): check output length"
> 
> http://mid.gmane.org/d27716a3a1c180f832d153a7402f65ea8a75b734.1376847403.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
> 
> Link: http://mid.gmane.org/cover.1422553023.git.ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org
> Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Reviewed-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

> Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
> ---
>  include/rdma/ib_verbs.h | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 0d74f1de99aa..65994a19e840 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -1707,10 +1707,7 @@ static inline int ib_copy_from_udata(void *dest, struct ib_udata *udata, size_t
>  
>  static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len)
>  {
> -	size_t copy_sz;
> -
> -	copy_sz = min_t(size_t, len, udata->outlen);
> -	return copy_to_user(udata->outbuf, src, copy_sz) ? -EFAULT : 0;
> +	return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0;
>  }
>  
>  /**
> 
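To make the trade-off in the commit message concrete, here is a minimal userspace sketch (not kernel code) contrasting the two behaviours. Everything in it is an assumption for illustration: struct fake_udata stands in for the output half of struct ib_udata, memcpy() stands in for copy_to_user(), and the caller-side length check returning -ENOSPC is only a guess at what a check in the upper-level function could look like.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the output side of struct ib_udata. */
struct fake_udata {
	void   *outbuf;
	size_t  outlen;
};

/*
 * Behaviour removed by the patch: copy at most outlen bytes and
 * report success, so the caller never learns about the truncation.
 */
int copy_truncating(struct fake_udata *udata, const void *src, size_t len)
{
	size_t copy_sz = len < udata->outlen ? len : udata->outlen;

	memcpy(udata->outbuf, src, copy_sz);
	return 0;
}

/*
 * Assumed alternative: the caller checks the available space first
 * and fails loudly instead of writing a partial response.
 */
int copy_checked(struct fake_udata *udata, const void *src, size_t len)
{
	if (udata->outlen < len)
		return -ENOSPC;	/* error code chosen for illustration */

	memcpy(udata->outbuf, src, len);
	return 0;
}
```

With a 4-byte buffer and an 8-byte response, copy_truncating() reports success after writing only the first 4 bytes, whereas copy_checked() returns an error and leaves the buffer untouched.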


* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Haggai Eran @ 2015-02-01 11:25 UTC
  To: Yann Droneaud, Roland Dreier
  Cc: Jason Gunthorpe, Sagi Grimberg, Shachar Raindel, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1422638760.3133.260.camel-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>

On 30/01/2015 19:26, Yann Droneaud wrote:
> Hi,
> 
> On Thursday, 29 January 2015 at 15:17 -0800, Roland Dreier wrote:
>> On Thu, Jan 29, 2015 at 1:59 PM, Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org> wrote:
>>>> Roland: I agree with Yann, these patches need to go in, or the ODP
>>>> patches reverted.
>>
>>> Reverting all On Demand Paging patches seems overkill:
>>> if something has to be reverted it should be commit 5a77abf9a97a
>>> ("IB/core: Add support for extended query device caps") and the part of
>>> commit 860f10a799c8 ("IB/core: Add flags for on demand paging support")
>>> which modify ib_uverbs_ex_query_device().
>>
>> Thank you and Jason for taking on this interface.
>>
>> At this point I feel like I do about the IPoIB changes -- we should
>> revert the broken stuff and get it right for 3.20.
>>
>> If we revert the two things you describe above, is everything else OK
>> to leave in 3.19 with respect to ABI?
>>
> 
> I've tried to review every change since v3.18 in drivers/infiniband,
> include/rdma and include/uapi/rdma with respect to ABI issues.
> 
> I've noticed no other issue, but I have to admit I've not thoroughly
> reviewed the drivers' (hw/) internal changes.
> 
> If the IB_USER_VERBS_EX_CMD_QUERY_DEVICE and ib_uverbs_ex_query_device
> changes are going to be reverted for v3.19, the on-demand-paging
> feature will be available (IB_DEVICE_ON_DEMAND_PAGING will be set in
> device_cap_flags in response to non-extended QUERY_DEVICE for mlx5 HCA
> and the IB_ACCESS_ON_DEMAND access flag will be effective for REG_MR
> uverbs), but its parameters won't be.

For user-space to make use of on-demand paging, it should verify that the
specific transport and operation are supported. If it doesn't, it will
encounter errors when a page fault occurs.

> I don't know if it's a no-go for
> the usage of on-demand paging by userspace: I don't have an HCA
> that supports this feature, nor the patches for
> libibverbs / libmlx5 ... (anyway, I would not have the time to test).
> I had hoped people from Mellanox would comment on the revert
> option too.

I would prefer it if Yann's patches were accepted. I understand it is
very late, but they are quite short, and I think they provide the right
semantics for this new verb.

As a second option, in case you prefer to revert the extended query
device patch, I will send a patch shortly to do that.

Regards,
Haggai


* Re: [PATCH v1 2/5] IB/uverbs: ex_query_device: check request's comp_mask
From: Yann Droneaud @ 2015-02-01 11:55 UTC
  To: Haggai Eran
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Roland Dreier,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <54CDDFE4.7030003-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Hi,

On Sunday, 1 February 2015 at 10:12 +0200, Haggai Eran wrote:
> On 29/01/2015 19:59, Yann Droneaud wrote:
> > This patch ensures the extended QUERY_DEVICE uverbs request's
> > comp_mask has only known and supported bits (currently none).
> > 
> > If userspace sets unknown feature bits, -EINVAL will be returned,
> > ensuring current programs are not allowed to set random feature
> > bits: such bits could enable new extended features in future kernel
> > versions, and those features could trigger a behavior not supported
> > by the older programs or make the newer kernels return an error
> > for a request which was valid on older kernels.
> > 
> > Additionally, returning an error for an unsupported feature would
> > allow userspace to probe/discover which extended features are
> > currently supported by a kernel.
> 
> As I wrote before, I hope in the future we don't force userspace to
> probe features this way, because it may be unnecessarily complex.
> 

I believe that most use cases won't need probing, as applications are
often built with the current kernel features in mind.

If applications need to use new features, it seems to be a small price
to pay to be prepared to get -EINVAL.

In other words, backward compatibility from the application's point of
view: a newer application wanting to run on an older kernel must be
prepared for it.

> I agree though that we should have a way to extend this verb in the future.
> 
> Reviewed-by: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 

Thanks.

> > 
> > Link: http://mid.gmane.org/cover.1422553023.git.ydroneaud-RlY5vtjFyJ1hl2p70BpVqQ@public.gmane.orgm
> > Cc: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Shachar Raindel <raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Signed-off-by: Yann Droneaud <ydroneaud-RlY5vtjFyJ3QT0dZR+AlfA@public.gmane.org>
> > ---
> >  drivers/infiniband/core/uverbs_cmd.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
> > index 6ef06a9b4362..fbcc54b86795 100644
> > --- a/drivers/infiniband/core/uverbs_cmd.c
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -3312,6 +3312,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
> >  	if (err)
> >  		return err;
> >  
> > +	if (cmd.comp_mask)
> > +		return -EINVAL;
> > +
> >  	if (cmd.reserved)
> >  		return -EINVAL;
> >  
> > 
> 


* Re: [PATCH-v9 0/3] add support for lazytime mount option
From: Michael Kerrisk @ 2015-02-02  6:03 UTC
  To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
In-Reply-To: <1422855422-7444-1-git-send-email-tytso@mit.edu>

Hi Ted,

Since this is an API change, linux-api@ should be CCed. Added.

Thanks,

Michael


On Mon, Feb 2, 2015 at 6:36 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> This is an updated version of what had originally been an
> ext4-specific patch which significantly improves performance by lazily
> writing timestamp updates (and in particular, mtime updates) to disk.
> The in-memory timestamps are always correct, but they are only written
> to disk when required for correctness.
>
> This provides a huge performance boost for ext4 due to how it handles
> journalling, but it's valuable for all file systems running on flash
> storage or drive-managed SMR disks by reducing the metadata write
> load.  So upon request, I've moved the functionality to the VFS layer.
> Once the /sbin/mount program adds support for MS_LAZYTIME, all file
> systems should be able to benefit from this optimization.
>
> There is still an ext4-specific optimization, which may be applicable
> for other file systems which store more than one inode in a block, but
> it will require file system specific code.  It is purely optional,
> however.
>
> For people interested in seeing how timestamp updates are held back, the
> following example commands to enable the tracepoint debugging may be
> helpful:
>
>   mount -o remount,lazytime /
>   cd /sys/kernel/debug/tracing
>   echo 1 > events/writeback/writeback_lazytime/enable
>   echo 1 > events/writeback/writeback_lazytime_iput/enable
>   echo "state & 2048" > events/writeback/writeback_dirty_inode_enqueue/filter
>   echo 1 > events/writeback/writeback_dirty_inode_enqueue/enable
>   echo 1 > events/ext4/ext4_other_inode_update_time/enable
>   cat trace_pipe
>
> You can also see how many lazytime inodes are in memory by looking in
> /sys/kernel/debug/bdi/<bdi>/stats
>
> Changes since -v8:
>   - in ext4_update_other_inodes_time() clear I_DIRTY_TIME_EXPIRED as
>     well as I_DIRTY_TIME
>   - Fixed a bug which broke writeback in some cases (introduced in -v7)
>
> Changes since -v7:
>    - Fix comment typos
>    - Clear the I_DIRTY_TIME flag if I_DIRTY_INODE gets added in
>      __mark_inode_dirty()
>    - Fix a bug accidentally introduced in -v7 which broke lazytime altogether
>
> Changes since -v6:
>    - Add a new tracepoint writeback_dirty_inode_enqueue
>    - Move generic handling of update_time() to generic_update_time(),
>      so filesystems can more easily hook or modify update_time()
>    - The file system's dirty_inode() will now always get called with
>      I_DIRTY_TIME when the inode time is updated.   (I_DIRTY_SYNC will
>      also be set if the inode should be updated right away.)   This allows
>      file systems such as XFS to update its on-disk copy of the inode if
>      I_DIRTY_TIME is set.
>
> Changes since -v5:
>    - Tweak move_expired_inodes to handle sync() and syncfs(), and drop
>      flush_sb_dirty_time().
>    - Move logic for handling the b_dirty_time list into
>      __mark_inode_dirty().
>    - Move I_DIRTY back to its original definition, and use I_DIRTY_ALL
>      for I_DIRTY plus I_DIRTY_TIME.
>    - Fold some patches together to make the first patch easier to
>      review (and modify/update).
>    - Use the pre-existing writeback tracepoints instead of creating
>      new fs tracepoints.
>
> Changes since -v4:
>    - Fix ext4 optimization so it does not need to increment (and more
>      problematically, decrement) the inode reference count
>    - Per Christoph's suggestion, drop support for btrfs and xfs for now,
>      due to issues with how btrfs and xfs handle dirty inode tracking.  We
>      can add btrfs and xfs support back later or at the end of this series
>      if we want to revisit this decision.
>    - Miscellaneous cleanups
>
> Changes since -v3:
>    - inodes with I_DIRTY_TIME set are placed on a new bdi list,
>         b_dirty_time.  This allows filesystem-level syncs to more
>         easily iterate over those inodes that need to have their
>         timestamps written to disk.
>    - dirty timestamps will be written out asynchronously on the final
>         iput, instead of when the inode gets evicted.
>    - separate the definition of the new function
>         find_active_inode_nowait() to a separate patch
>    - create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which
>        indicate whether the inode needs to be on the write back lists,
>        or whether the inode itself is dirty, while I_DIRTY means any one
>        of the inode dirty flags are set.  This simplifies the fs
>        writeback logic which needs to test for different combinations of
>        the inode dirty flags in different places.
>
> Changes since -v2:
>    - If update_time() updates i_version, it will not use lazytime (i.e.,
>        the inode will be marked dirty so the change will be persisted on to
>        disk sooner rather than later).  Yes, this eliminates the
>        benefits of lazytime if the user is exporting the file system via
>        NFSv4.  Sad, but NFS's requirements seem to mandate this.
>    - Fix time wrapping bug 49 days after the system boots (on a system
>         with a 32-bit jiffies).   Use get_monotonic_boottime() instead.
>    - Clean up type warning in include/tracing/ext4.h
>    - Added explicit parentheses for stylistic reasons
>    - Added an is_readonly() inode operations method so btrfs doesn't
>        have to duplicate code in update_time().
>
> Changes since -v1:
>    - Added explanatory comments in update_time() regarding i_ts_dirty_days
>    - Fix type used for days_since_boot
>    - Improve SMP scalability in update_time and ext4_update_other_inodes_time
>    - Added tracepoints to help test and characterize how often and under
>          what circumstances inodes have their timestamps lazily updated
>
> Theodore Ts'o (3):
>   vfs: add support for a lazytime mount option
>   vfs: add find_inode_nowait() function
>   ext4: add optimization for the lazytime mount option
>
>  fs/ext4/inode.c                  |  70 +++++++++++++++++++++++++-
>  fs/ext4/super.c                  |  10 ++++
>  fs/fs-writeback.c                |  62 +++++++++++++++++++----
>  fs/gfs2/file.c                   |   4 +-
>  fs/inode.c                       | 106 +++++++++++++++++++++++++++++++++------
>  fs/jfs/file.c                    |   2 +-
>  fs/libfs.c                       |   2 +-
>  fs/proc_namespace.c              |   1 +
>  fs/sync.c                        |   8 +++
>  include/linux/backing-dev.h      |   1 +
>  include/linux/fs.h               |  10 ++++
>  include/trace/events/ext4.h      |  30 +++++++++++
>  include/trace/events/writeback.h |  60 +++++++++++++++++++++-
>  include/uapi/linux/fs.h          |   4 +-
>  mm/backing-dev.c                 |  10 +++-
>  15 files changed, 343 insertions(+), 37 deletions(-)
>
> --
> 2.1.0



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
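For readers following the changelog, the interaction between the dirty flags can be modelled in a few lines of userspace C. This is only a sketch of the flag bookkeeping described in the cover letter (lazytime sets I_DIRTY_TIME alone, an i_version update or a non-lazytime mount adds I_DIRTY_SYNC, and I_DIRTY_TIME is cleared once I_DIRTY_INODE is added). The struct, the function names and the numeric flag values are assumptions; in particular 2048 for I_DIRTY_TIME is merely suggested by the "state & 2048" trace filter, and locking and the writeback lists are left out entirely.

```c
#include <stdbool.h>

/* Flag values assumed for illustration only. */
#define I_DIRTY_SYNC     (1u << 0)
#define I_DIRTY_DATASYNC (1u << 1)
#define I_DIRTY_PAGES    (1u << 2)
#define I_DIRTY_TIME     (1u << 11)	/* 2048, cf. the trace filter */

#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY       (I_DIRTY_INODE | I_DIRTY_PAGES)
#define I_DIRTY_ALL   (I_DIRTY | I_DIRTY_TIME)

/* Hypothetical stand-in for struct inode: just the state word. */
struct fake_inode {
	unsigned int i_state;
};

/* Which dirty flags a timestamp update asks for. */
unsigned int update_time_flags(bool lazytime, bool version_changed)
{
	unsigned int iflags = I_DIRTY_TIME;

	/* Without lazytime, or when i_version changes, write soon. */
	if (!lazytime || version_changed)
		iflags |= I_DIRTY_SYNC;
	return iflags;
}

/* Simplified flag handling when an inode is marked dirty. */
void mark_dirty(struct fake_inode *inode, unsigned int flags)
{
	/* A full inode write also covers the timestamps. */
	if (flags & I_DIRTY_INODE)
		flags &= ~I_DIRTY_TIME;

	/* Timestamp-only request on an inode already due for a write. */
	if ((flags & I_DIRTY_TIME) && (inode->i_state & I_DIRTY_INODE))
		return;

	if (flags & I_DIRTY_INODE)
		inode->i_state &= ~I_DIRTY_TIME;
	inode->i_state |= flags;
}
```

In this model an inode touched only by lazy timestamp updates has I_DIRTY_TIME set but tests clean against I_DIRTY, which is why code that must notice such inodes checks I_DIRTY_ALL instead.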


* Re: [PATCH-v9 1/3] vfs: add support for a lazytime mount option
From: Michael Kerrisk @ 2015-02-02  6:03 UTC
  To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
In-Reply-To: <1422855422-7444-2-git-send-email-tytso-3s7WtUTddSA@public.gmane.org>

[CC += linux-api@]

On Mon, Feb 2, 2015 at 6:37 AM, Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org> wrote:
> Add a new mount option which enables a new "lazytime" mode.  This mode
> causes atime, mtime, and ctime updates to only be made to the
> in-memory version of the inode.  The on-disk times will only get
> updated (a) if the inode needs to be updated for some non-time
> related change, (b) if userspace calls fsync(), syncfs() or sync(), or
> (c) just before an undeleted inode is evicted from memory.
>
> This is OK according to POSIX because there are no guarantees after a
> crash unless userspace explicitly requests them via an fsync(2) call.
>
> For workloads which feature a large number of random writes to a
> preallocated file, the lazytime mount option significantly reduces
> writes to the inode table.  The repeated 4k writes to a single block
> will result in undesirable stress on flash devices and SMR disk
> drives.  Even on conventional HDDs, the repeated writes to the inode
> table block will trigger Adjacent Track Interference (ATI) remediation
> latencies, which very negatively impact long tail latencies --- which
> is a very big deal for web serving tiers (for example).
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org>
> ---
>  fs/ext4/inode.c                  |  6 ++++
>  fs/fs-writeback.c                | 62 +++++++++++++++++++++++++++++++++-------
>  fs/gfs2/file.c                   |  4 +--
>  fs/inode.c                       | 56 +++++++++++++++++++++++++-----------
>  fs/jfs/file.c                    |  2 +-
>  fs/libfs.c                       |  2 +-
>  fs/proc_namespace.c              |  1 +
>  fs/sync.c                        |  8 ++++++
>  include/linux/backing-dev.h      |  1 +
>  include/linux/fs.h               |  5 ++++
>  include/trace/events/writeback.h | 60 +++++++++++++++++++++++++++++++++++++-
>  include/uapi/linux/fs.h          |  4 ++-
>  mm/backing-dev.c                 | 10 +++++--
>  13 files changed, 186 insertions(+), 35 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5653fa4..628df5b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4840,11 +4840,17 @@ int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
>   * If the inode is marked synchronous, we don't honour that here - doing
>   * so would cause a commit on atime updates, which we don't bother doing.
>   * We handle synchronous inodes at the highest possible level.
> + *
> + * If only the I_DIRTY_TIME flag is set, we can skip everything.  If
> + * I_DIRTY_TIME and I_DIRTY_SYNC is set, the only inode fields we need
> + * to copy into the on-disk inode structure are the timestamp files.
>   */
>  void ext4_dirty_inode(struct inode *inode, int flags)
>  {
>         handle_t *handle;
>
> +       if (flags == I_DIRTY_TIME)
> +               return;
>         handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>         if (IS_ERR(handle))
>                 goto out;
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2d609a5..0046861 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -247,14 +247,19 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
>         return ret;
>  }
>
> +#define EXPIRE_DIRTY_ATIME 0x0001
> +
>  /*
>   * Move expired (dirtied before work->older_than_this) dirty inodes from
>   * @delaying_queue to @dispatch_queue.
>   */
>  static int move_expired_inodes(struct list_head *delaying_queue,
>                                struct list_head *dispatch_queue,
> +                              int flags,
>                                struct wb_writeback_work *work)
>  {
> +       unsigned long *older_than_this = NULL;
> +       unsigned long expire_time;
>         LIST_HEAD(tmp);
>         struct list_head *pos, *node;
>         struct super_block *sb = NULL;
> @@ -262,13 +267,21 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>         int do_sb_sort = 0;
>         int moved = 0;
>
> +       if ((flags & EXPIRE_DIRTY_ATIME) == 0)
> +               older_than_this = work->older_than_this;
> +       else if ((work->reason == WB_REASON_SYNC) == 0) {
> +               expire_time = jiffies - (HZ * 86400);
> +               older_than_this = &expire_time;
> +       }
>         while (!list_empty(delaying_queue)) {
>                 inode = wb_inode(delaying_queue->prev);
> -               if (work->older_than_this &&
> -                   inode_dirtied_after(inode, *work->older_than_this))
> +               if (older_than_this &&
> +                   inode_dirtied_after(inode, *older_than_this))
>                         break;
>                 list_move(&inode->i_wb_list, &tmp);
>                 moved++;
> +               if (flags & EXPIRE_DIRTY_ATIME)
> +                       set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
>                 if (sb_is_blkdev_sb(inode->i_sb))
>                         continue;
>                 if (sb && sb != inode->i_sb)
> @@ -309,9 +322,12 @@ out:
>  static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
>  {
>         int moved;
> +
>         assert_spin_locked(&wb->list_lock);
>         list_splice_init(&wb->b_more_io, &wb->b_io);
> -       moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
> +       moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work);
> +       moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
> +                                    EXPIRE_DIRTY_ATIME, work);
>         trace_writeback_queue_io(wb, work, moved);
>  }
>
> @@ -435,6 +451,8 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
>                  * updates after data IO completion.
>                  */
>                 redirty_tail(inode, wb);
> +       } else if (inode->i_state & I_DIRTY_TIME) {
> +               list_move(&inode->i_wb_list, &wb->b_dirty_time);
>         } else {
>                 /* The inode is clean. Remove from writeback lists. */
>                 list_del_init(&inode->i_wb_list);
> @@ -481,7 +499,13 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>         spin_lock(&inode->i_lock);
>
>         dirty = inode->i_state & I_DIRTY;
> -       inode->i_state &= ~I_DIRTY;
> +       if (((dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) &&
> +            (inode->i_state & I_DIRTY_TIME)) ||
> +           (inode->i_state & I_DIRTY_TIME_EXPIRED)) {
> +               dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED;
> +               trace_writeback_lazytime(inode);
> +       }
> +       inode->i_state &= ~dirty;
>
>         /*
>          * Paired with smp_mb() in __mark_inode_dirty().  This allows
> @@ -501,8 +525,10 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>
>         spin_unlock(&inode->i_lock);
>
> +       if (dirty & I_DIRTY_TIME)
> +               mark_inode_dirty_sync(inode);
>         /* Don't write the inode if only I_DIRTY_PAGES was set */
> -       if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> +       if (dirty & ~I_DIRTY_PAGES) {
>                 int err = write_inode(inode, wbc);
>                 if (ret == 0)
>                         ret = err;
> @@ -550,7 +576,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>          * make sure inode is on some writeback list and leave it there unless
>          * we have completely cleaned the inode.
>          */
> -       if (!(inode->i_state & I_DIRTY) &&
> +       if (!(inode->i_state & I_DIRTY_ALL) &&
>             (wbc->sync_mode != WB_SYNC_ALL ||
>              !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
>                 goto out;
> @@ -565,7 +591,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>          * If inode is clean, remove it from writeback lists. Otherwise don't
>          * touch it. See comment above for explanation.
>          */
> -       if (!(inode->i_state & I_DIRTY))
> +       if (!(inode->i_state & I_DIRTY_ALL))
>                 list_del_init(&inode->i_wb_list);
>         spin_unlock(&wb->list_lock);
>         inode_sync_complete(inode);
> @@ -707,7 +733,7 @@ static long writeback_sb_inodes(struct super_block *sb,
>                 wrote += write_chunk - wbc.nr_to_write;
>                 spin_lock(&wb->list_lock);
>                 spin_lock(&inode->i_lock);
> -               if (!(inode->i_state & I_DIRTY))
> +               if (!(inode->i_state & I_DIRTY_ALL))
>                         wrote++;
>                 requeue_inode(inode, wb, &wbc);
>                 inode_sync_complete(inode);
> @@ -1145,16 +1171,20 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
>   * page->mapping->host, so the page-dirtying time is recorded in the internal
>   * blockdev inode.
>   */
> +#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
>  void __mark_inode_dirty(struct inode *inode, int flags)
>  {
>         struct super_block *sb = inode->i_sb;
>         struct backing_dev_info *bdi = NULL;
> +       int dirtytime;
> +
> +       trace_writeback_mark_inode_dirty(inode, flags);
>
>         /*
>          * Don't do this for I_DIRTY_PAGES - that doesn't actually
>          * dirty the inode itself
>          */
> -       if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> +       if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME)) {
>                 trace_writeback_dirty_inode_start(inode, flags);
>
>                 if (sb->s_op->dirty_inode)
> @@ -1162,6 +1192,9 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>
>                 trace_writeback_dirty_inode(inode, flags);
>         }
> +       if (flags & I_DIRTY_INODE)
> +               flags &= ~I_DIRTY_TIME;
> +       dirtytime = flags & I_DIRTY_TIME;
>
>         /*
>          * Paired with smp_mb() in __writeback_single_inode() for the
> @@ -1169,16 +1202,21 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>          */
>         smp_mb();
>
> -       if ((inode->i_state & flags) == flags)
> +       if (((inode->i_state & flags) == flags) ||
> +           (dirtytime && (inode->i_state & I_DIRTY_INODE)))
>                 return;
>
>         if (unlikely(block_dump))
>                 block_dump___mark_inode_dirty(inode);
>
>         spin_lock(&inode->i_lock);
> +       if (dirtytime && (inode->i_state & I_DIRTY_INODE))
> +               goto out_unlock_inode;
>         if ((inode->i_state & flags) != flags) {
>                 const int was_dirty = inode->i_state & I_DIRTY;
>
> +               if (flags & I_DIRTY_INODE)
> +                       inode->i_state &= ~I_DIRTY_TIME;
>                 inode->i_state |= flags;
>
>                 /*
> @@ -1225,8 +1263,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>                         }
>
>                         inode->dirtied_when = jiffies;
> -                       list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> +                       list_move(&inode->i_wb_list, dirtytime ?
> +                                 &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
>                         spin_unlock(&bdi->wb.list_lock);
> +                       trace_writeback_dirty_inode_enqueue(inode);
>
>                         if (wakeup_bdi)
>                                 bdi_wakeup_thread_delayed(bdi);
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 6e600ab..15c44cf 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -655,7 +655,7 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
>  {
>         struct address_space *mapping = file->f_mapping;
>         struct inode *inode = mapping->host;
> -       int sync_state = inode->i_state & I_DIRTY;
> +       int sync_state = inode->i_state & I_DIRTY_ALL;
>         struct gfs2_inode *ip = GFS2_I(inode);
>         int ret = 0, ret1 = 0;
>
> @@ -668,7 +668,7 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
>         if (!gfs2_is_jdata(ip))
>                 sync_state &= ~I_DIRTY_PAGES;
>         if (datasync)
> -               sync_state &= ~I_DIRTY_SYNC;
> +               sync_state &= ~(I_DIRTY_SYNC | I_DIRTY_TIME);
>
>         if (sync_state) {
>                 ret = sync_inode_metadata(inode, 1);
> diff --git a/fs/inode.c b/fs/inode.c
> index aa149e7..4feb85c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -18,6 +18,7 @@
>  #include <linux/buffer_head.h> /* for inode_has_buffers */
>  #include <linux/ratelimit.h>
>  #include <linux/list_lru.h>
> +#include <trace/events/writeback.h>
>  #include "internal.h"
>
>  /*
> @@ -30,7 +31,7 @@
>   * inode_sb_list_lock protects:
>   *   sb->s_inodes, inode->i_sb_list
>   * bdi->wb.list_lock protects:
> - *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
> + *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
>   * inode_hash_lock protects:
>   *   inode_hashtable, inode->i_hash
>   *
> @@ -416,7 +417,8 @@ static void inode_lru_list_add(struct inode *inode)
>   */
>  void inode_add_lru(struct inode *inode)
>  {
> -       if (!(inode->i_state & (I_DIRTY | I_SYNC | I_FREEING | I_WILL_FREE)) &&
> +       if (!(inode->i_state & (I_DIRTY_ALL | I_SYNC |
> +                               I_FREEING | I_WILL_FREE)) &&
>             !atomic_read(&inode->i_count) && inode->i_sb->s_flags & MS_ACTIVE)
>                 inode_lru_list_add(inode);
>  }
> @@ -647,7 +649,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>                         spin_unlock(&inode->i_lock);
>                         continue;
>                 }
> -               if (inode->i_state & I_DIRTY && !kill_dirty) {
> +               if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
>                         spin_unlock(&inode->i_lock);
>                         busy = 1;
>                         continue;
> @@ -1432,11 +1434,20 @@ static void iput_final(struct inode *inode)
>   */
>  void iput(struct inode *inode)
>  {
> -       if (inode) {
> -               BUG_ON(inode->i_state & I_CLEAR);
> -
> -               if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock))
> -                       iput_final(inode);
> +       if (!inode)
> +               return;
> +       BUG_ON(inode->i_state & I_CLEAR);
> +retry:
> +       if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
> +               if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
> +                       atomic_inc(&inode->i_count);
> +                       inode->i_state &= ~I_DIRTY_TIME;
> +                       spin_unlock(&inode->i_lock);
> +                       trace_writeback_lazytime_iput(inode);
> +                       mark_inode_dirty_sync(inode);
> +                       goto retry;
> +               }
> +               iput_final(inode);
>         }
>  }
>  EXPORT_SYMBOL(iput);
> @@ -1495,14 +1506,9 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
>         return 0;
>  }
>
> -/*
> - * This does the actual work of updating an inodes time or version.  Must have
> - * had called mnt_want_write() before calling this.
> - */
> -static int update_time(struct inode *inode, struct timespec *time, int flags)
> +int generic_update_time(struct inode *inode, struct timespec *time, int flags)
>  {
> -       if (inode->i_op->update_time)
> -               return inode->i_op->update_time(inode, time, flags);
> +       int iflags = I_DIRTY_TIME;
>
>         if (flags & S_ATIME)
>                 inode->i_atime = *time;
> @@ -1512,9 +1518,27 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
>                 inode->i_ctime = *time;
>         if (flags & S_MTIME)
>                 inode->i_mtime = *time;
> -       mark_inode_dirty_sync(inode);
> +
> +       if (!(inode->i_sb->s_flags & MS_LAZYTIME) || (flags & S_VERSION))
> +               iflags |= I_DIRTY_SYNC;
> +       __mark_inode_dirty(inode, iflags);
>         return 0;
>  }
> +EXPORT_SYMBOL(generic_update_time);
> +
> +/*
> + * This does the actual work of updating an inodes time or version.  Must have
> + * had called mnt_want_write() before calling this.
> + */
> +static int update_time(struct inode *inode, struct timespec *time, int flags)
> +{
> +       int (*update_time)(struct inode *, struct timespec *, int);
> +
> +       update_time = inode->i_op->update_time ? inode->i_op->update_time :
> +               generic_update_time;
> +
> +       return update_time(inode, time, flags);
> +}
>
>  /**
>   *     touch_atime     -       update the access time
> diff --git a/fs/jfs/file.c b/fs/jfs/file.c
> index 33aa0cc..10815f8 100644
> --- a/fs/jfs/file.c
> +++ b/fs/jfs/file.c
> @@ -39,7 +39,7 @@ int jfs_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>                 return rc;
>
>         mutex_lock(&inode->i_mutex);
> -       if (!(inode->i_state & I_DIRTY) ||
> +       if (!(inode->i_state & I_DIRTY_ALL) ||
>             (datasync && !(inode->i_state & I_DIRTY_DATASYNC))) {
>                 /* Make sure committed changes hit the disk */
>                 jfs_flush_journal(JFS_SBI(inode->i_sb)->log, 1);
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 005843c..b2ffdb0 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -948,7 +948,7 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
>
>         mutex_lock(&inode->i_mutex);
>         ret = sync_mapping_buffers(inode->i_mapping);
> -       if (!(inode->i_state & I_DIRTY))
> +       if (!(inode->i_state & I_DIRTY_ALL))
>                 goto out;
>         if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
>                 goto out;
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 0f96f71..8db932d 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -44,6 +44,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>                 { MS_SYNCHRONOUS, ",sync" },
>                 { MS_DIRSYNC, ",dirsync" },
>                 { MS_MANDLOCK, ",mand" },
> +               { MS_LAZYTIME, ",lazytime" },
>                 { 0, NULL }
>         };
>         const struct proc_fs_info *fs_infop;
> diff --git a/fs/sync.c b/fs/sync.c
> index 01d9f18..fbc98ee 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -177,8 +177,16 @@ SYSCALL_DEFINE1(syncfs, int, fd)
>   */
>  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
>  {
> +       struct inode *inode = file->f_mapping->host;
> +
>         if (!file->f_op->fsync)
>                 return -EINVAL;
> +       if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> +               spin_lock(&inode->i_lock);
> +               inode->i_state &= ~I_DIRTY_TIME;
> +               spin_unlock(&inode->i_lock);
> +               mark_inode_dirty_sync(inode);
> +       }
>         return file->f_op->fsync(file, start, end, datasync);
>  }
>  EXPORT_SYMBOL(vfs_fsync_range);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 5da6012..4cdf733 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -55,6 +55,7 @@ struct bdi_writeback {
>         struct list_head b_dirty;       /* dirty inodes */
>         struct list_head b_io;          /* parked for writeback */
>         struct list_head b_more_io;     /* parked for more writeback */
> +       struct list_head b_dirty_time;  /* time stamps are dirty */
>         spinlock_t list_lock;           /* protects the b_* lists */
>  };
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f90c028..5ca285f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1746,8 +1746,12 @@ struct super_operations {
>  #define __I_DIO_WAKEUP         9
>  #define I_DIO_WAKEUP           (1 << I_DIO_WAKEUP)
>  #define I_LINKABLE             (1 << 10)
> +#define I_DIRTY_TIME           (1 << 11)
> +#define __I_DIRTY_TIME_EXPIRED 12
> +#define I_DIRTY_TIME_EXPIRED   (1 << __I_DIRTY_TIME_EXPIRED)
>
>  #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
> +#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME)
>
>  extern void __mark_inode_dirty(struct inode *, int);
>  static inline void mark_inode_dirty(struct inode *inode)
> @@ -1910,6 +1914,7 @@ extern int current_umask(void);
>
>  extern void ihold(struct inode * inode);
>  extern void iput(struct inode *);
> +extern int generic_update_time(struct inode *, struct timespec *, int);
>
>  static inline struct inode *file_inode(const struct file *f)
>  {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index cee02d6..5ecb4c2 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -18,6 +18,8 @@
>                 {I_FREEING,             "I_FREEING"},           \
>                 {I_CLEAR,               "I_CLEAR"},             \
>                 {I_SYNC,                "I_SYNC"},              \
> +               {I_DIRTY_TIME,          "I_DIRTY_TIME"},        \
> +               {I_DIRTY_TIME_EXPIRED,  "I_DIRTY_TIME_EXPIRED"}, \
>                 {I_REFERENCED,          "I_REFERENCED"}         \
>         )
>
> @@ -68,6 +70,7 @@ DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
>         TP_STRUCT__entry (
>                 __array(char, name, 32)
>                 __field(unsigned long, ino)
> +               __field(unsigned long, state)
>                 __field(unsigned long, flags)
>         ),
>
> @@ -78,16 +81,25 @@ DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
>                 strncpy(__entry->name,
>                         bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
>                 __entry->ino            = inode->i_ino;
> +               __entry->state          = inode->i_state;
>                 __entry->flags          = flags;
>         ),
>
> -       TP_printk("bdi %s: ino=%lu flags=%s",
> +       TP_printk("bdi %s: ino=%lu state=%s flags=%s",
>                 __entry->name,
>                 __entry->ino,
> +               show_inode_state(__entry->state),
>                 show_inode_state(__entry->flags)
>         )
>  );
>
> +DEFINE_EVENT(writeback_dirty_inode_template, writeback_mark_inode_dirty,
> +
> +       TP_PROTO(struct inode *inode, int flags),
> +
> +       TP_ARGS(inode, flags)
> +);
> +
>  DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode_start,
>
>         TP_PROTO(struct inode *inode, int flags),
> @@ -598,6 +610,52 @@ DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
>         TP_ARGS(inode, wbc, nr_to_write)
>  );
>
> +DECLARE_EVENT_CLASS(writeback_lazytime_template,
> +       TP_PROTO(struct inode *inode),
> +
> +       TP_ARGS(inode),
> +
> +       TP_STRUCT__entry(
> +               __field(        dev_t,  dev                     )
> +               __field(unsigned long,  ino                     )
> +               __field(unsigned long,  state                   )
> +               __field(        __u16, mode                     )
> +               __field(unsigned long, dirtied_when             )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->ino    = inode->i_ino;
> +               __entry->state  = inode->i_state;
> +               __entry->mode   = inode->i_mode;
> +               __entry->dirtied_when = inode->dirtied_when;
> +       ),
> +
> +       TP_printk("dev %d,%d ino %lu dirtied %lu state %s mode 0%o",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 __entry->ino, __entry->dirtied_when,
> +                 show_inode_state(__entry->state), __entry->mode)
> +);
> +
> +DEFINE_EVENT(writeback_lazytime_template, writeback_lazytime,
> +       TP_PROTO(struct inode *inode),
> +
> +       TP_ARGS(inode)
> +);
> +
> +DEFINE_EVENT(writeback_lazytime_template, writeback_lazytime_iput,
> +       TP_PROTO(struct inode *inode),
> +
> +       TP_ARGS(inode)
> +);
> +
> +DEFINE_EVENT(writeback_lazytime_template, writeback_dirty_inode_enqueue,
> +
> +       TP_PROTO(struct inode *inode),
> +
> +       TP_ARGS(inode)
> +);
> +
>  #endif /* _TRACE_WRITEBACK_H */
>
>  /* This part must be outside protection */
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 3735fa0..9b964a5 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -90,6 +90,7 @@ struct inodes_stat_t {
>  #define MS_KERNMOUNT   (1<<22) /* this is a kern_mount call */
>  #define MS_I_VERSION   (1<<23) /* Update inode I_version field */
>  #define MS_STRICTATIME (1<<24) /* Always perform atime updates */
> +#define MS_LAZYTIME    (1<<25) /* Update the on-disk [acm]times lazily */
>
>  /* These sb flags are internal to the kernel */
>  #define MS_NOSEC       (1<<28)
> @@ -100,7 +101,8 @@ struct inodes_stat_t {
>  /*
>   * Superblock flags that can be altered by MS_REMOUNT
>   */
> -#define MS_RMT_MASK    (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION)
> +#define MS_RMT_MASK    (MS_RDONLY|MS_SYNCHRONOUS|MS_MANDLOCK|MS_I_VERSION|\
> +                        MS_LAZYTIME)
>
>  /*
>   * Old magic mount flag and mask
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 0ae0df5..915feea 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -69,10 +69,10 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>         unsigned long background_thresh;
>         unsigned long dirty_thresh;
>         unsigned long bdi_thresh;
> -       unsigned long nr_dirty, nr_io, nr_more_io;
> +       unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
>         struct inode *inode;
>
> -       nr_dirty = nr_io = nr_more_io = 0;
> +       nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
>         spin_lock(&wb->list_lock);
>         list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
>                 nr_dirty++;
> @@ -80,6 +80,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>                 nr_io++;
>         list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
>                 nr_more_io++;
> +       list_for_each_entry(inode, &wb->b_dirty_time, i_wb_list)
> +               if (inode->i_state & I_DIRTY_TIME)
> +                       nr_dirty_time++;
>         spin_unlock(&wb->list_lock);
>
>         global_dirty_limits(&background_thresh, &dirty_thresh);
> @@ -98,6 +101,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>                    "b_dirty:            %10lu\n"
>                    "b_io:               %10lu\n"
>                    "b_more_io:          %10lu\n"
> +                  "b_dirty_time:       %10lu\n"
>                    "bdi_list:           %10u\n"
>                    "state:              %10lx\n",
>                    (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> @@ -111,6 +115,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>                    nr_dirty,
>                    nr_io,
>                    nr_more_io,
> +                  nr_dirty_time,
>                    !list_empty(&bdi->bdi_list), bdi->state);
>  #undef K
>
> @@ -418,6 +423,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>         INIT_LIST_HEAD(&wb->b_dirty);
>         INIT_LIST_HEAD(&wb->b_io);
>         INIT_LIST_HEAD(&wb->b_more_io);
> +       INIT_LIST_HEAD(&wb->b_dirty_time);
>         spin_lock_init(&wb->list_lock);
>         INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
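
For readers skimming the quoted generic_update_time() hunk, the choice of
dirty flags can be modelled in user space as below. This is only a sketch:
the TOY_* constants and toy_time_dirty_flags() are hypothetical names
invented for illustration, and their values are not taken from kernel
headers. It assumes the rule reads as in the quoted patch: timestamps are
always marked I_DIRTY_TIME, and I_DIRTY_SYNC is added unless the mount is
lazytime and the update is not an i_version bump.

```c
/* Illustrative stand-ins only; NOT the kernel's definitions. */
enum { TOY_S_ATIME = 1, TOY_S_MTIME = 2, TOY_S_CTIME = 4, TOY_S_VERSION = 8 };
enum { TOY_I_DIRTY_SYNC = 1 << 0, TOY_I_DIRTY_TIME = 1 << 11 };

/*
 * Model of the flag selection in the quoted hunk:
 *   lazytime != 0 stands for "sb->s_flags & MS_LAZYTIME",
 *   flags is the S_* mask describing what is being updated.
 * Returns the mask that would be handed to __mark_inode_dirty().
 */
int toy_time_dirty_flags(int lazytime, int flags)
{
	int iflags = TOY_I_DIRTY_TIME;

	/* Without lazytime, or for a version bump, write back promptly. */
	if (!lazytime || (flags & TOY_S_VERSION))
		iflags |= TOY_I_DIRTY_SYNC;
	return iflags;
}
```

Under this model, a plain atime update on a lazytime mount yields only the
time-dirty bit, which is what lets the inode wait on the new b_dirty_time
list instead of being queued for normal writeback.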



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply

* Re: [PATCH-v9 3/3] ext4: add optimization for the lazytime mount option
From: Michael Kerrisk @ 2015-02-02  6:03 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
In-Reply-To: <1422855422-7444-4-git-send-email-tytso@mit.edu>

[CC += linux-api@]

On Mon, Feb 2, 2015 at 6:37 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Add an optimization for the MS_LAZYTIME mount option so that we will
> opportunistically write out any inodes with the I_DIRTY_TIME flag set
> in a particular inode table block when we need to update some inode in
> that inode table block anyway.
>
> Also add some temporary code so that we can set the lazytime mount
> option without needing a modified /sbin/mount program which can set
> MS_LAZYTIME.  We can eventually make this go away once util-linux has
> added support.
>
> Google-Bug-Id: 18297052
>
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/ext4/inode.c             | 64 +++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c             | 10 +++++++
>  include/trace/events/ext4.h | 30 +++++++++++++++++++++
>  3 files changed, 102 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 628df5b..9193ea1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4139,6 +4139,65 @@ static int ext4_inode_blocks_set(handle_t *handle,
>         return 0;
>  }
>
> +struct other_inode {
> +       unsigned long           orig_ino;
> +       struct ext4_inode       *raw_inode;
> +};
> +
> +static int other_inode_match(struct inode * inode, unsigned long ino,
> +                            void *data)
> +{
> +       struct other_inode *oi = (struct other_inode *) data;
> +
> +       if ((inode->i_ino != ino) ||
> +           (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
> +                              I_DIRTY_SYNC | I_DIRTY_DATASYNC)) ||
> +           ((inode->i_state & I_DIRTY_TIME) == 0))
> +               return 0;
> +       spin_lock(&inode->i_lock);
> +       if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW |
> +                               I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0) &&
> +           (inode->i_state & I_DIRTY_TIME)) {
> +               struct ext4_inode_info  *ei = EXT4_I(inode);
> +
> +               inode->i_state &= ~(I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED);
> +               spin_unlock(&inode->i_lock);
> +
> +               spin_lock(&ei->i_raw_lock);
> +               EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
> +               EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
> +               EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
> +               ext4_inode_csum_set(inode, oi->raw_inode, ei);
> +               spin_unlock(&ei->i_raw_lock);
> +               trace_ext4_other_inode_update_time(inode, oi->orig_ino);
> +               return -1;
> +       }
> +       spin_unlock(&inode->i_lock);
> +       return -1;
> +}
> +
> +/*
> + * Opportunistically update the other time fields for other inodes in
> + * the same inode table block.
> + */
> +static void ext4_update_other_inodes_time(struct super_block *sb,
> +                                         unsigned long orig_ino, char *buf)
> +{
> +       struct other_inode oi;
> +       unsigned long ino;
> +       int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> +       int inode_size = EXT4_INODE_SIZE(sb);
> +
> +       oi.orig_ino = orig_ino;
> +       ino = orig_ino & ~(inodes_per_block - 1);
> +       for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
> +               if (ino == orig_ino)
> +                       continue;
> +               oi.raw_inode = (struct ext4_inode *) buf;
> +               (void) find_inode_nowait(sb, ino, other_inode_match, &oi);
> +       }
> +}
> +
>  /*
>   * Post the struct inode info into an on-disk inode location in the
>   * buffer-cache.  This gobbles the caller's reference to the
> @@ -4248,10 +4307,11 @@ static int ext4_do_update_inode(handle_t *handle,
>                                 cpu_to_le16(ei->i_extra_isize);
>                 }
>         }
> -
>         ext4_inode_csum_set(inode, raw_inode, ei);
> -
>         spin_unlock(&ei->i_raw_lock);
> +       if (inode->i_sb->s_flags & MS_LAZYTIME)
> +               ext4_update_other_inodes_time(inode->i_sb, inode->i_ino,
> +                                             bh->b_data);
>
>         BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>         rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 74c5f53..362b23c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1139,6 +1139,7 @@ enum {
>         Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
>         Opt_usrquota, Opt_grpquota, Opt_i_version,
>         Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
> +       Opt_lazytime, Opt_nolazytime,
>         Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
>         Opt_inode_readahead_blks, Opt_journal_ioprio,
>         Opt_dioread_nolock, Opt_dioread_lock,
> @@ -1202,6 +1203,8 @@ static const match_table_t tokens = {
>         {Opt_i_version, "i_version"},
>         {Opt_stripe, "stripe=%u"},
>         {Opt_delalloc, "delalloc"},
> +       {Opt_lazytime, "lazytime"},
> +       {Opt_nolazytime, "nolazytime"},
>         {Opt_nodelalloc, "nodelalloc"},
>         {Opt_removed, "mblk_io_submit"},
>         {Opt_removed, "nomblk_io_submit"},
> @@ -1459,6 +1462,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>         case Opt_i_version:
>                 sb->s_flags |= MS_I_VERSION;
>                 return 1;
> +       case Opt_lazytime:
> +               sb->s_flags |= MS_LAZYTIME;
> +               return 1;
> +       case Opt_nolazytime:
> +               sb->s_flags &= ~MS_LAZYTIME;
> +               return 1;
>         }
>
>         for (m = ext4_mount_opts; m->token != Opt_err; m++)
> @@ -5020,6 +5029,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
>         }
>  #endif
>
> +       *flags = (*flags & ~MS_LAZYTIME) | (sb->s_flags & MS_LAZYTIME);
>         ext4_msg(sb, KERN_INFO, "re-mounted. Opts: %s", orig_data);
>         kfree(orig_data);
>         return 0;
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 6cfb841..6e5abd6 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -73,6 +73,36 @@ struct extent_status;
>         { FALLOC_FL_ZERO_RANGE,         "ZERO_RANGE"})
>
>
> +TRACE_EVENT(ext4_other_inode_update_time,
> +       TP_PROTO(struct inode *inode, ino_t orig_ino),
> +
> +       TP_ARGS(inode, orig_ino),
> +
> +       TP_STRUCT__entry(
> +               __field(        dev_t,  dev                     )
> +               __field(        ino_t,  ino                     )
> +               __field(        ino_t,  orig_ino                )
> +               __field(        uid_t,  uid                     )
> +               __field(        gid_t,  gid                     )
> +               __field(        __u16, mode                     )
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->orig_ino = orig_ino;
> +               __entry->dev    = inode->i_sb->s_dev;
> +               __entry->ino    = inode->i_ino;
> +               __entry->uid    = i_uid_read(inode);
> +               __entry->gid    = i_gid_read(inode);
> +               __entry->mode   = inode->i_mode;
> +       ),
> +
> +       TP_printk("dev %d,%d orig_ino %lu ino %lu mode 0%o uid %u gid %u",
> +                 MAJOR(__entry->dev), MINOR(__entry->dev),
> +                 (unsigned long) __entry->orig_ino,
> +                 (unsigned long) __entry->ino, __entry->mode,
> +                 __entry->uid, __entry->gid)
> +);
> +
>  TRACE_EVENT(ext4_free_inode,
>         TP_PROTO(struct inode *inode),
>
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply

* Re: [PATCH-v9 2/3] vfs: add find_inode_nowait() function
From: Michael Kerrisk @ 2015-02-02  6:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
In-Reply-To: <1422855422-7444-3-git-send-email-tytso-3s7WtUTddSA@public.gmane.org>

[CC += linux-api@]

On Mon, Feb 2, 2015 at 6:37 AM, Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org> wrote:
> Add a new function find_inode_nowait() which is an even more general
> version of ilookup5_nowait().  It is designed for callers which need
> very fine grained control over when the function is allowed to block
> or increment the inode's reference count.
>
> Signed-off-by: Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org>
> ---
>  fs/inode.c         | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h |  5 +++++
>  2 files changed, 55 insertions(+)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 4feb85c..740cba7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1284,6 +1284,56 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
>  }
>  EXPORT_SYMBOL(ilookup);
>
> +/**
> + * find_inode_nowait - find an inode in the inode cache
> + * @sb:                super block of file system to search
> + * @hashval:   hash value (usually inode number) to search for
> + * @match:     callback used for comparisons between inodes
> + * @data:      opaque data pointer to pass to @match
> + *
> + * Search for the inode specified by @hashval and @data in the inode
> + * cache, where the helper function @match will return 0 if the inode
> + * does not match, 1 if the inode does match, and -1 if the search
> + * should be stopped.  The @match function must be responsible for
> + * taking the i_lock spin_lock and checking i_state for an inode being
> + * freed or being initialized, and incrementing the reference count
> + * before returning 1.  It also must not sleep, since it is called with
> + * the inode_hash_lock spinlock held.
> + *
> + * This is an even more generalized version of ilookup5() when the
> + * function must never block --- find_inode() can block in
> + * __wait_on_freeing_inode() --- or when the caller cannot increment
> + * the reference count because the resulting iput() might cause an
> + * inode eviction.  The tradeoff is that the @match function must be
> + * very carefully implemented.
> + */
> +struct inode *find_inode_nowait(struct super_block *sb,
> +                               unsigned long hashval,
> +                               int (*match)(struct inode *, unsigned long,
> +                                            void *),
> +                               void *data)
> +{
> +       struct hlist_head *head = inode_hashtable + hash(sb, hashval);
> +       struct inode *inode, *ret_inode = NULL;
> +       int mval;
> +
> +       spin_lock(&inode_hash_lock);
> +       hlist_for_each_entry(inode, head, i_hash) {
> +               if (inode->i_sb != sb)
> +                       continue;
> +               mval = match(inode, hashval, data);
> +               if (mval == 0)
> +                       continue;
> +               if (mval == 1)
> +                       ret_inode = inode;
> +               goto out;
> +       }
> +out:
> +       spin_unlock(&inode_hash_lock);
> +       return ret_inode;
> +}
> +EXPORT_SYMBOL(find_inode_nowait);
> +
>  int insert_inode_locked(struct inode *inode)
>  {
>         struct super_block *sb = inode->i_sb;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5ca285f..af810cc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2441,6 +2441,11 @@ extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
>
>  extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
>  extern struct inode * iget_locked(struct super_block *, unsigned long);
> +extern struct inode *find_inode_nowait(struct super_block *,
> +                                      unsigned long,
> +                                      int (*match)(struct inode *,
> +                                                   unsigned long, void *),
> +                                      void *data);
>  extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
>  extern int insert_inode_locked(struct inode *);
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> --
> 2.1.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
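
The three-valued @match contract described in the quoted kernel-doc
comment (0 = no match, 1 = match, -1 = stop searching) can be illustrated
with a user-space toy. This is a sketch under stated assumptions, not
kernel code: struct toy_inode, toy_find_nowait() and toy_match() are
hypothetical names, the hash chain is a plain singly linked list, and all
locking (inode_hash_lock, i_lock) is left out.

```c
#include <stddef.h>

/* Toy stand-in for struct inode; NOT the kernel structure. */
struct toy_inode {
	unsigned long ino;
	int freeing;		/* models I_FREEING | I_WILL_FREE | I_NEW */
	int refcount;
	struct toy_inode *next;	/* next entry on the same hash chain */
};

typedef int (*toy_match_t)(struct toy_inode *, unsigned long, void *);

/*
 * Walk one chain. The callback returns 0 to keep looking, 1 to accept
 * the inode, and -1 to abandon the search with a NULL result.
 */
struct toy_inode *toy_find_nowait(struct toy_inode *head,
				  unsigned long hashval,
				  toy_match_t match, void *data)
{
	struct toy_inode *inode;

	for (inode = head; inode; inode = inode->next) {
		int mval = match(inode, hashval, data);

		if (mval == 0)
			continue;
		return mval == 1 ? inode : NULL;
	}
	return NULL;
}

/*
 * Example callback: skip other inode numbers, refuse an inode that is
 * being torn down, and take a reference before reporting a match.
 */
int toy_match(struct toy_inode *inode, unsigned long ino, void *data)
{
	(void)data;
	if (inode->ino != ino)
		return 0;
	if (inode->freeing)
		return -1;
	inode->refcount++;
	return 1;
}
```

The point of the -1 return is that the callback, not the lookup helper,
decides what to do about an inode in a transitional state, so the helper
never has to wait for it.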



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply

* [PATCH] fcntl.h: Fix a typo
From: Bart Van Assche @ 2015-02-02  7:43 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David S. Miller, Stephen Rothwell, linux-kernel,
	linux-api-u79uwXL29TY76Z2rM5mHXA

In the source file fs/fcntl.c and also in the fcntl() man page one
can see that the FD_CLOEXEC flag can be manipulated via F_GETFD and
F_SETFD. Update the comment in <fcntl.h> accordingly.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
Cc: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: Stephen Rothwell <sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org>
---
 include/uapi/asm-generic/fcntl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063eff..584fa2b 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -157,7 +157,7 @@ struct f_owner_ex {
 	__kernel_pid_t	pid;
 };
 
-/* for F_[GET|SET]FL */
+/* for F_[GET|SET]FD */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
 
 /* for posix fcntl() and lockf() */
-- 
2.1.4
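
For reference, a minimal user-space sketch of how FD_CLOEXEC is read and
changed through F_GETFD and F_SETFD, which is what the corrected comment
points at. The helper names set_cloexec() and get_cloexec() are
hypothetical and exist only for this illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Set (on != 0) or clear (on == 0) FD_CLOEXEC on fd, leaving any other
 * descriptor flags untouched. Returns 0 on success, -1 on error with
 * errno set by fcntl().
 */
int set_cloexec(int fd, int on)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags == -1)
		return -1;
	if (on)
		flags |= FD_CLOEXEC;
	else
		flags &= ~FD_CLOEXEC;
	return fcntl(fd, F_SETFD, flags);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
int get_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	return flags == -1 ? -1 : !!(flags & FD_CLOEXEC);
}
```

F_GETFL and F_SETFL operate on the file status flags (O_NONBLOCK, O_APPEND
and so on) and do not touch the close-on-exec bit.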

^ permalink raw reply related

* Re: [PATCH] fcntl.h: Fix a typo
From: Stephen Rothwell @ 2015-02-02  8:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Arnd Bergmann, David S. Miller, linux-kernel,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <54CF2A94.9080107-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1301 bytes --]

Hi Bart,

On Mon, 02 Feb 2015 08:43:16 +0100 Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>
> In the source file fs/fcntl.c and also in the fcntl() man page one
> can see that the FD_CLOEXEC flag can be manipulated via F_GETFD and
> F_SETFD. Update the comment in <fcntl.h> accordingly.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> Cc: Stephen Rothwell <sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org>
> ---
>  include/uapi/asm-generic/fcntl.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index e063eff..584fa2b 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -157,7 +157,7 @@ struct f_owner_ex {
>  	__kernel_pid_t	pid;
>  };
>  
> -/* for F_[GET|SET]FL */
> +/* for F_[GET|SET]FD */
>  #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
>  
>  /* for posix fcntl() and lockf() */
> -- 
> 2.1.4

Looks good to me

Acked-by: Stephen Rothwell <sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org>

-- 
Cheers,
Stephen Rothwell                    sfr-3FnU+UHB4dNDw9hX6IcOSA@public.gmane.org

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-02-02  9:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Ted Ts'o, Linux API, Michael Kerrisk,
	One Thousand Gnomes, Austin S Hemmelgarn, Tom Gundersen,
	Greg Kroah-Hartman, linux-kernel, David Herrmann,
	Eric W. Biederman, Djalal Harouni, Johannes Stezenbach,
	Christoph Hellwig
In-Reply-To: <CALCETrXD41=ohFSkCmBD8zPHFVUtr49QXMhYnChAxqQtmUjJYw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Andy,

On 01/29/2015 01:09 PM, Andy Lutomirski wrote:
> On Jan 29, 2015 6:42 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:

>> As we explained before, D-Bus peers currently collect the same
>> information already if they need it, but they have to deal with the
>> inherent races in such cases. kdbus is closing the gap by optionally
>> providing the same information along with each message, if
>> requested.
> 
> In all these discussions, no one ever gave a decent example use case.
> If a process drops some privilege, it must close all fds it has that
> captured its old privilege.  This has nothing to do with kdbus.

kdbus does not implement any new concept here but sticks to what
SCM_CREDENTIALS does on SOCK_SEQPACKET sockets. An application can get a
file descriptor from socket() or socketpair() and freely pass it around
between different tasks or threads, but messages will always have the
credentials attached that are valid at *send* time. SO_PEERCRED,
however, still reports the connect-time credentials, and kdbus provides
exactly the same semantics and both ways of retrieving information.
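
To make the connect-time half of that comparison concrete, here is a
small sketch of querying SO_PEERCRED on an AF_UNIX socket. It is an
illustration only and has nothing kdbus-specific in it; peer_cred() is a
hypothetical helper name, and struct ucred needs _GNU_SOURCE with glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Fetch the credentials the kernel recorded for the peer of fd when the
 * connection (or socket pair) was set up. These do not change if the
 * peer later drops privileges or hands the descriptor to another task.
 * Returns 0 on success, -1 on error.
 */
int peer_cred(int fd, struct ucred *out)
{
	socklen_t len = sizeof(*out);

	assert(out != NULL);
	if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, out, &len) == -1)
		return -1;
	return len == sizeof(*out) ? 0 : -1;
}
```

On a pair created with socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv), calling
peer_cred(sv[0], &uc) reports the creating process. Per-message
credentials would instead be read from an SCM_CREDENTIALS control message
after enabling SO_PASSCRED on the receiving socket.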

> I agree that the design seems to have improved to a state of being at
> least decent,

One reason for that is your feedback. Thanks for that again!

> It's an optional feature that will get used, non-optionally, thousands
> of times on each boot, apparently.  Keep in mind that it's also a
> scalability problem because it takes locks.  If it ever gets used
> thousands of times per CPU on a big thousand-core machine, it's going
> to suck, and you'll have backed yourself into a corner.

That's right, but again - if an application wants to gather this kind of
information about tasks it interacts with, it can do so today by looking
at /proc or similar sources. Desktop machines do exactly that already,
and the kernel code executed in such cases very much resembles that in
metadata.c, and is certainly not cheaper. kdbus just makes such
information more accessible when requested. Which information is
collected is defined by bit-masks on both the sender and the receiver
connection, and most applications will effectively only use a very
limited set by default if they go through one of the more high-level
libraries.

Also, when metadata is collected, the code mostly takes temporary
references on objects like PIDs, namespaces etc. Which operation would
you consider particularly expensive?

Thanks again,
Daniel

^ permalink raw reply

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Austin S Hemmelgarn @ 2015-02-02 14:01 UTC (permalink / raw)
  To: Calvin Owens, Kees Cook
  Cc: Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov, Linux API
In-Reply-To: <20150131015842.GA431662@mail.thefacebook.com>

On 2015-01-30 20:58, Calvin Owens wrote:
> On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
>> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
>>> On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
>>>> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>>>>
>>>>> On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
>>>>>> On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
>>>>>>> Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>>>>>>> is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
>>>>>>> is very useful for enumerating the files mapped into a process when
>>>>>>> the more verbose information in /proc/<pid>/maps is not needed.
>>>>
>>>> This is the main (actually only) justification for the patch, and it is
>>>> far too thin.  What does "not needed" mean.  Why can't people just use
>>>> /proc/pid/maps?
>>>
>>> The biggest difference is that if you do something like this:
>>>
>>>          fd = open("/stuff", O_BLAH);
>>>          map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
>>>          close(fd);
>>>          unlink("/stuff");
>>>
>>> ...then map_files/ gives you a way to get a file descriptor for
>>> "/stuff", which you couldn't do with /proc/pid/maps.
>>>
>>> It's also something of a win if you just want to see what is mapped at a
>>> specific address, since you can just readlink() the symlink for the
>>> address range you care about and it will go grab the appropriate VMA and
>>> give you the answer. /proc/pid/maps requires walking the VMA tree, which
>>> is quite expensive for processes with many thousands of threads, even
>>> without the O(N^2) issue.
>>>
>>> (You have to know what address range you want though, since readdir() on
>>> map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
>>>
>>>>>>> This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>>>>>>> removes the CAP_SYS_ADMIN restrictions. Following the links requires
>>>>>>> the ability to ptrace the process in question, so this doesn't allow
>>>>>>> an attacker to do anything they couldn't already do before.
>>>>>>>
>>>>>>> Signed-off-by: Calvin Owens <calvinowens@fb.com>
>>>>>>
>>>>>> Cc +linux-api@
>>>>>
>>>>> Looks good to me, thanks! Though I would really appreciate if someone
>>>>> from security camp take a look as well.
>>>>
>>>> hm, who's that.  Kees comes to mind.
>>>>
>>>> And reviewers' task would be a heck of a lot easier if they knew what
>>>> /proc/pid/map_files actually does.  This:
>>>>
>>>> akpm3:/usr/src/25> grep -r map_files Documentation
>>>> akpm3:/usr/src/25>
>>>>
>>>> does not help.
>>>>
>>>> The 640708a2cff7f81 changelog says:
>>>>
>>>> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
>>>> :     symlinks one for each mapping with file, the name of a symlink is
>>>> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
>>>> :     results in a file that points to exactly the same inode as the vma's one.
>>>> :
>>>> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
>>>> :
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>>>> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
>>>>
>>>> AFAICT this info is also available in /proc/pid/maps, so things
>>>> shouldn't get worse if the /proc/pid/map_files permissions are at least
>>>> as restrictive as the /proc/pid/maps permissions.  Is that the case?
>>>> (Please add to changelog).
>>>
>>> Yes, the only difference is that you can follow the link as per above.
>>> I'll resend with a new message explaining that and the deletion thing.
>>>
>>>> There's one other problem here: we're assuming that the map_files
>>>> implementation doesn't have bugs.  If it does have bugs then relaxing
>>>> permissions like this will create new vulnerabilities.  And the
>>>> map_files implementation is surprisingly complex.  Is it bug-free?
>>>
>>> While I was messing with it I used it a good bit and didn't see any
>>> issues, although I didn't actively try to fuzz it or anything. I'd be
>>> happy to write something to test hammering it in weird ways if you like.
>>> I'm also happy to write testcases for namespaces.
>>>
>>> So far as security issues, as others have pointed out you can't follow
>>> the links unless you can ptrace the process in question, which seems
>>> like a pretty solid guarantee. As Cyrill pointed out in the discussion
>>> about the documentation, that's the same protection as /proc/N/fd/*, and
>>> those links function in the same way.
>>
>> My concern here is that fd/* are connected as streams, and while that
>> has a certain level of badness as an external-to-the-process attacker,
>> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
>> required for access to /proc/N/mem). Since these fds are the things
>> mapped into memory on a process, writing to them is a subset of access
>> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
>
> If you haven't done close() on a mmapped file, doesn't fd/* allow the
> same access to the corresponding regions of memory? Or am I missing
> something?
>
> But that said, I can't think of any reason making it MODE_ATTACH would
> be a problem. Would you rather that be enforced on follow_link() like
> the original patch did, or enforce it for the whole directory?
>
Whole directory would probably be better, as even just the mapped ranges 
could be considered sensitive information.  Ideally, the check should be 
done on both follow_link(), and the directory itself.
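
For reference, the unlinked-mapping case quoted above can be
reproduced with a small sketch like the following. It only calls
readlink() on the map_files entry and does not try to open it. The
helper name deleted_mapping_target() and the temporary file template
are invented for the example, and whether the lookup succeeds depends
on the kernel version, configuration and the caller's privileges, so
failure is reported rather than treated as a bug.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a temporary file, close and unlink it, then ask
 * /proc/self/map_files/ what is mapped at that address range.
 * On success the link target is stored in 'target' and 0 is returned;
 * -1 means the file, the mapping or the map_files lookup failed. */
int deleted_mapping_target(char *target, size_t len)
{
    char tmpl[] = "/tmp/map_files_demo_XXXXXX";
    char link[64];
    long page = sysconf(_SC_PAGESIZE);
    unsigned long start;
    ssize_t n;
    void *map;
    int fd;

    assert(target != NULL && len > 1);
    fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) < 0) {
        close(fd);
        unlink(tmpl);
        return -1;
    }
    map = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping keeps the file alive ... */
    unlink(tmpl);       /* ... even though its name is now gone */
    if (map == MAP_FAILED)
        return -1;

    /* Entries are named "<vm_start>-<vm_end>" in hex, without "0x". */
    start = (unsigned long)map;
    snprintf(link, sizeof(link), "/proc/self/map_files/%lx-%lx",
             start, start + (unsigned long)page);
    n = readlink(link, target, len - 1);
    munmap(map, page);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return 0;
}
```

When the lookup works, the target is the old path with " (deleted)"
appended, which is the same string /proc/<pid>/maps would show; the
extra thing map_files offers is that the entry can also be opened,
subject to whatever permission check ends up being enforced.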



* Re: [PATCH-v9 0/3] add support for lazytime mount option
From: Theodore Ts'o @ 2015-02-02 14:48 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Linux Filesystem Development List, Al Viro, Linux API
In-Reply-To: <CAHO5Pa0ySnLb_UGUw3deVyZEr8gdzzdeyMP5rXcT1MLOeccLGg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Feb 02, 2015 at 07:03:11AM +0100, Michael Kerrisk wrote:
> Hi Ted,
> 
> Since this is an API change, linux-api@ should be CCed. Added.

I didn't realize a mount option would be considered an API change.
The man page project isn't documenting these things, are they? 

                                        - Ted

* Re: [PATCH 1/2] proc.5: Document /proc/[pid]/setgroups
From: Michael Kerrisk (man-pages) @ 2015-02-02 15:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, Linux Containers, Josh Triplett, Andrew Morton,
	Kees Cook, Linux API, linux-man, linux-kernel@vger.kernel.org,
	LSM, Casey Schaufler, Serge E. Hallyn, Richard Weinberger,
	Kenton Varda, stable, Andy Lutomirski
In-Reply-To: <87vblg1qme.fsf@x220.int.ebiederm.org>

[Adding Josh to CC in case he has anything to add.]

On 12/12/2014 10:54 PM, Eric W. Biederman wrote:
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>  man5/proc.5 | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/man5/proc.5 b/man5/proc.5
> index 96077d0dd195..d661e8cfeac9 100644
> --- a/man5/proc.5
> +++ b/man5/proc.5
> @@ -1097,6 +1097,21 @@ are not available if the main thread has already terminated
>  .\"       Added in 2.6.9
>  .\"       CONFIG_SCHEDSTATS
>  .TP
> +.IR /proc/[pid]/setgroups " (since Linux 3.19-rc1)"
> +This file reports
> +.BR allow
> +if the setgroups system call is allowed in the current user namespace.
> +This file reports
> +.BR deny
> +if the setgroups system call is not allowed in the current user namespace.
> +This file may be written to with values of
> +.BR allow
> +and
> +.BR deny
> +before
> +.IR /proc/[pid]/gid_map
> +is written to (enabling setgroups) in a user namespace.
> +.TP
>  .IR /proc/[pid]/smaps " (since Linux 2.6.14)"
>  This file shows memory consumption for each of the process's mappings.
>  (The

Hi Eric,

Thanks for this patch. I applied it, and then tried to work in
quite a few other details gleaned from the source code and commit 
message, and Jon Corbet's article at http://lwn.net/Articles/626665/.
Could you please let me know if the following is correct:

    /proc/[pid]/setgroups (since Linux 3.19)
           This file displays the string "allow" if processes in the
           user namespace that contains the process pid are permitted
           to employ the setgroups(2) system call, and "deny" if
           setgroups(2) is not permitted in that user namespace.

           A privileged process (one with the CAP_SYS_ADMIN capability
           in the namespace) may write either of the strings "allow"
           or "deny" to this file before writing a group ID mapping
           for this user namespace to the file /proc/[pid]/gid_map.
           Writing the string "deny" prevents any process in the user
           namespace from employing setgroups(2).

           The default value of this file in the initial user
           namespace is "allow".

           Once /proc/[pid]/gid_map has been written to (which has the
           effect of enabling setgroups(2) in the user namespace), it
           is no longer possible to deny setgroups(2) by writing to
           /proc/[pid]/setgroups.

           A child user namespace inherits the /proc/[pid]/gid_map
           setting from its parent.

           If the setgroups file has the value "deny", then the
           setgroups(2) system call can't subsequently be reenabled
           (by writing "allow" to the file) in this user namespace.
           This restriction also propagates down to all child user
           namespaces of this user namespace.
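
A minimal sketch of how a program might query the setting described
above; the helper name read_setgroups() is hypothetical, and the code
assumes nothing beyond the file containing a single "allow" or "deny"
line as the draft text says. A missing file (no /proc, or a kernel
older than 3.19) is reported as -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read /proc/<pid>/setgroups into buf with the trailing newline
 * stripped.  'pid' is a string so that "self" can be passed.
 * Returns 0 on success, -1 if the file is missing or unreadable. */
int read_setgroups(const char *pid, char *buf, size_t len)
{
    char path[64];
    FILE *f;

    assert(pid != NULL && buf != NULL && len > 0);
    snprintf(path, sizeof(path), "/proc/%s/setgroups", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Calling read_setgroups("self", buf, sizeof(buf)) in the initial user
namespace would be expected to yield "allow", per the default stated
in the draft.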

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH 2/2] user_namespaces.7: Update the documention to reflect the fixes for negative groups
From: Michael Kerrisk (man-pages) @ 2015-02-02 15:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Linux Containers,
	Josh Triplett, Andrew Morton, Kees Cook, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, LSM,
	Casey Schaufler, Serge E. Hallyn, Richard Weinberger,
	Kenton Varda, stable, Andy Lutomirski
In-Reply-To: <87ppbo1ql4.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Hi Eric,

Thanks for writing this up!

On 12/12/2014 10:54 PM, Eric W. Biederman wrote:
> 
> Files with access permissions such as ---rwx---rwx give fewer
> permissions to their group than they do to everyone else.  Which means
> dropping groups with setgroups(0, NULL) actually grants a process
> privileges.
> 
> The unprivileged setting of gid_map turned out not to be safe after
> this change.  Privilege setting of gid_map can be interpreted as
> meaning yes it is ok to drop groups.

I had trouble parsing that sentence (and I'd like to make sure that
the right sentence ends up in the commit message). Did you mean: 

    "*Unprivileged* setting of gid_map can be interpreted as meaning
     yes it is ok to drop groups"

?

Or something else?

> To prevent this problem and future problems user namespaces were
> changed in such a way as to guarantee a user can not obtain
> credentials without privilege they could not obtain without the
> help of user namespaces.
> 
> This meant testing the effective user ID and not the filesystem user
> ID as setresuid and setregid allow setting any process uid or gid
> (except the supplemental groups) to the effective ID.
> 
> Furthermore to preserve in some form the useful applications that have
> been setting gid_map without privilege the file /proc/[pid]/setgroups
> was added to allow disabling setgroups.  With the setgroups system
> call permanently disabled in a user namespace it again becomes safe to
> allow writes to gid_map without privilege.
> 
> Here is my meager attempt to update user_namespaces.7 to reflect these
> issues.

It looked pretty serviceable as a patch, IMO. So, thanks again. I've applied,
tweaking some wordings afterward, but changing nothing essential. See below
for a question.

> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  man7/user_namespaces.7 | 52 +++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 49 insertions(+), 3 deletions(-)
> 
> diff --git a/man7/user_namespaces.7 b/man7/user_namespaces.7
> index d76721d9a0a1..f8333a762308 100644
> --- a/man7/user_namespaces.7
> +++ b/man7/user_namespaces.7
> @@ -533,11 +533,16 @@ One of the following is true:
>  The data written to
>  .I uid_map
>  .RI ( gid_map )
> -consists of a single line that maps the writing process's filesystem user ID
> +consists of a single line that maps the writing process's effective user ID
>  (group ID) in the parent user namespace to a user ID (group ID)
>  in the user namespace.
> -The usual case here is that this single line provides a mapping for user ID
> -of the process that created the namespace.
> +The writing process must have the same effective user ID as the process
> +that created the user namespace.
> +In the case of
> +.I gid_map
> +the
> +.I setgroups
> +file must have been written to earlier and disabled the setgroups system call.
>  .IP * 3
>  The opening process has the
>  .BR CAP_SETUID
> @@ -552,6 +557,47 @@ Writes that violate the above rules fail with the error
>  .\"
>  .\" ============================================================
>  .\"
> +.SS Interaction with system calls that change the uid or gid values
> +When in a user namespace where the
> +.I uid_map
> +or
> +.I gid_map
> +file has not been written the system calls that change user IDs
> +or group IDs respectively will fail.  After the
> +.I uid_map
> +and
> +.I gid_map
> +file have been written only the mapped values may be used in
> +system calls that change user IDs and group IDs.
> +
> +For user IDs these system calls include
> +.BR setuid ,
> +.BR setfsuid ,
> +.BR setreuid ,
> +and
> +.BR setresuid .
> +
> +For group IDs these system calls include
> +.BR setgid ,
> +.BR setfsgid ,
> +.BR setregid ,
> +.BR setresgid ,
> +and
> +.BR setgroups.
> +
> +Writing
> +.BR deny
> +to the
> +.I /proc/[pid]/setgroups
> +file before writing to
> +.I /proc/[pid]/gid_map
> +will permanently disable the setgroups system call in a user namespace
> +and allow writing to
> +.I /proc/[pid]/gid_map
> +without
> +.BR CAP_SETGID
> +in the parent user namespace.

I just want to double check: you really did mean to write "*parent* namespace"
above, right?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH-v9 0/3] add support for lazytime mount option
From: Michael Kerrisk (man-pages) @ 2015-02-02 15:40 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux Filesystem Development List, Al Viro, Linux API
In-Reply-To: <20150202144833.GB2509-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>

Hi Ted,

On 2 February 2015 at 15:48, Theodore Ts'o <tytso-3s7WtUTddSA@public.gmane.org> wrote:
> On Mon, Feb 02, 2015 at 07:03:11AM +0100, Michael Kerrisk wrote:
>> Hi Ted,
>>
>> Since this is an API change, linux-api@ should be CCed. Added.
>
> I didn't realize a mount option would be considered an API change.

Well, inasmuch as it's exposed via a system call, sure it is.

> The man page project isn't documenting these things, are they?

Indeed it is. See http://man7.org/linux/man-pages/man2/mount.2.html.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
From: Pawel Moll @ 2015-02-02 16:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1421872037-12559-1-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

Afternoon, Peter,

On Wed, 2015-01-21 at 20:27 +0000, Pawel Moll wrote:
> Until now, perf framework never defined the meaning of the timestamps
> captured as PERF_SAMPLE_TIME sample type. The values were obtained
> from local (sched) clock, which is unavailable in userspace. This made
> it impossible to correlate perf data with any other events. Other
> tracing solutions have the source configurable (ftrace) or just share
> a common time domain between kernel and userspace (LTTng).
> 
> Follow the trend by using monotonic clock, which is readily available
> as POSIX CLOCK_MONOTONIC.
> 
> Also add a sysctl "perf_sample_time_clk_id" attribute (usually available
> as "/proc/sys/kernel/perf_sample_time_clk_id") which can be used by the
> user to obtain the clk_id to be used with POSIX clock API (eg.
> clock_gettime()) to obtain a time value comparable with perf samples.
> 
> Old behaviour can be restored by using "perf_use_local_clock" kernel
> parameter.
> 
> Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
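
A sketch of how userspace might consume this, assuming the sysctl
proposed in the patch above. The path is taken from the patch
description and may not exist on a given kernel, so the code falls
back to CLOCK_MONOTONIC, which is what the patch says the samples
use. The function names perf_sample_clock_id() and now_ns() are
invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Return the clock id advertised by the (proposed) sysctl, or
 * CLOCK_MONOTONIC if the file is absent or unparsable. */
clockid_t perf_sample_clock_id(void)
{
    int id;
    FILE *f = fopen("/proc/sys/kernel/perf_sample_time_clk_id", "r");

    if (f) {
        int ok = (fscanf(f, "%d", &id) == 1);
        fclose(f);
        if (ok)
            return (clockid_t)id;
    }
    return CLOCK_MONOTONIC;
}

/* Current time on 'clk' in nanoseconds, or -1 on error.  Values from
 * the same clock can be compared directly with each other. */
int64_t now_ns(clockid_t clk)
{
    struct timespec ts;

    if (clock_gettime(clk, &ts) != 0)
        return -1;
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

A tool would call now_ns(perf_sample_clock_id()) around its own
userspace events and merge them with the perf samples by timestamp.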

I know that you're busy with other stuff, but it's already rc7 time
again... We can leave the other two patches from the series for later,
but how about getting this one merged for 3.20 and ending the 2 or 3
years long struggle? I'm not saying that everyone is happy about it, but
no one seems to be unhappy enough to speak :-)

Cheers!

Pawel

* Re: [PATCH v2] tpm: fix suspend/resume paths for TPM 2.0
From: Jarkko Sakkinen @ 2015-02-02 19:20 UTC (permalink / raw)
  To: Scot Doyle
  Cc: Peter Huewe, Ashley Lai,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, josh-iaAMLnmF4UmaiuxdJuQwMA,
	christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
	jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	stefanb-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	trousers-tech-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <alpine.DEB.2.11.1501291832460.1678-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

Are we good with this?

/Jarkko

On Thu, Jan 29, 2015 at 06:43:12PM +0000, Scot Doyle wrote:
> On Thu, 29 Jan 2015, Jarkko Sakkinen wrote:
> > Fixed suspend/resume paths for TPM 2.0 and consolidated all the
> > associated code to the tpm_pm_suspend() and tpm_pm_resume()
> > functions. Resume path should be handled by the firmware, i.e.
> > Startup(CLEAR) for hibernate and Startup(STATE) for suspend.
> > 
> > There might be some non-PC embedded devices in the future where
> > Startup() is not handled by the FW but fixing the code for
> > those IMHO should be postponed until there is hardware available
> > to test the fixes although extra Startup in the driver code is
> > essentially a NOP.
> > 
> > Added Shutdown(CLEAR) to the remove paths of TIS and CRB drivers.
> > Changed tpm2_shutdown() to a void function because there isn't
> > much you can do except print an error message if this fails with
> > a system error.
> > 
> > Reported-by: Peter Hüwe <PeterHuewe-Mmb7MZpHnFY@public.gmane.org>
> > Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> > ---
> >  drivers/char/tpm/tpm-interface.c |  6 ++++--
> >  drivers/char/tpm/tpm.h           |  2 +-
> >  drivers/char/tpm/tpm2-cmd.c      | 19 +++++++++++--------
> >  drivers/char/tpm/tpm_crb.c       | 20 +++++---------------
> >  drivers/char/tpm/tpm_tis.c       | 26 +++++++++++++-------------
> >  5 files changed, 34 insertions(+), 39 deletions(-)
> 
> Resume still functions on TPM 1.2 chip, with and without CONFIG_TCG_CRB.
> 
> Tested-by: Scot Doyle <lkml14-enLWO88E2pdl57MIdRCFDg@public.gmane.org>

* Re: [PATCH 01/13] kdbus: add documentation
From: Andy Lutomirski @ 2015-02-02 20:12 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Arnd Bergmann, Ted Ts'o, Michael Kerrisk, Linux API,
	One Thousand Gnomes, Austin S Hemmelgarn, Tom Gundersen,
	Greg Kroah-Hartman, linux-kernel, Eric W. Biederman,
	David Herrmann, Djalal Harouni, Johannes Stezenbach,
	Christoph Hellwig
In-Reply-To: <54CF44B9.8000005-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>
> Hi Andy,
>
> On 01/29/2015 01:09 PM, Andy Lutomirski wrote:
> > On Jan 29, 2015 6:42 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>
> >> As we explained before, currently, D-Bus peers do collect the same
> >> information already if they need to have them, but they have to deal
> >> with the inherent races in such cases. kdbus is closing the gap by
> >> optionally providing the same information along with each message, if
> >> requested.
> >
> > In all these discussions, no one ever gave a decent example use case.
> > If a process drops some privilege, it must close all fds it has that
> > captured its old privilege.  This has nothing to do with kdbus.
>
> kdbus does not implement any new concept here but sticks to what
> SCM_CREDENTIALS does on SOCK_SEQPACKET. An application can get a
> file descriptor from socket() or socketpair() and freely pass it around
> between different tasks or threads, but messages will always have the
> credentials attached that are valid at *send* time. SO_PEERCRED,
> however, still reports the connect-time credentials, and kdbus provides
> exactly the same semantics and both ways of retrieving information.
>
> > I agree that the design seems to have improved to a state of being at
> > least decent,
>
> One reason for that is your feedback. Thanks for that again!
>
> > It's an optional feature that will get used, non-optionally, thousands
> > of times on each boot, apparently.  Keep in mind that it's also a
> > scalability problem because it takes locks.  If it ever gets used
> > thousands of times per CPU on a big thousand-core machine, it's going
> > to suck, and you'll have backed yourself into a corner.
>
> That's right, but again - if an application wants to gather this kind of
> information about tasks it interacts with, it can do so today by looking
> at /proc or similar sources. Desktop machines do exactly that already,
> and the kernel code executed in such cases very much resembles that in
> metadata.c, and is certainly not cheaper. kdbus just makes such
> information more accessible when requested. Which information is
> collected is defined by bit-masks on both the sender and the receiver
> connection, and most applications will effectively only use a very
> limited set by default if they go through one of the more high-level
> libraries.

I should rephrase a bit.  Kdbus doesn't require use of send-time
metadata.  It does, however, strongly encourage it, and it sounds like
systemd and other major users will use send-time metadata.  Once that
happens, it's ABI (even if it's purely in userspace), and changing it
is asking for security holes to pop up.  So you'll be mostly stuck
with it.

>
> Also, when metadata is collected, the code mostly takes temporary
> references on objects like PIDs, namespaces etc. Which operation would
> you consider particularly expensive?

The refcounting, copies of some of the data, and counting bytes and
allocating space.  The refcounting is the part that will scale
particularly badly on many CPUs.

Do you have some simple benchmark code you can share?  I'd like to
play with it a bit.

--Andy

>
>
> Thanks again,
> Daniel
>

* Re: [RFC][PATCH v2] procfs: Always expose /proc/<pid>/map_files/ and make it readable
From: Andy Lutomirski @ 2015-02-02 20:16 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Kees Cook, Andrew Morton, Cyrill Gorcunov, Kirill A. Shutemov,
	Alexey Dobriyan, Oleg Nesterov, Eric W. Biederman, Al Viro,
	Kirill A. Shutemov, Peter Feiner, Grant Likely,
	Siddhesh Poyarekar, LKML, kernel-team, Pavel Emelyanov, Linux API
In-Reply-To: <20150131015842.GA431662@mail.thefacebook.com>

On Fri, Jan 30, 2015 at 5:58 PM, Calvin Owens <calvinowens@fb.com> wrote:
> On Thursday 01/29 at 17:30 -0800, Kees Cook wrote:
>> On Tue, Jan 27, 2015 at 8:38 PM, Calvin Owens <calvinowens@fb.com> wrote:
>> > On Monday 01/26 at 15:43 -0800, Andrew Morton wrote:
>> >> On Tue, 27 Jan 2015 00:00:54 +0300 Cyrill Gorcunov <gorcunov@gmail.com> wrote:
>> >>
>> >> > On Mon, Jan 26, 2015 at 02:47:31PM +0200, Kirill A. Shutemov wrote:
>> >> > > On Fri, Jan 23, 2015 at 07:15:44PM -0800, Calvin Owens wrote:
>> >> > > > Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and
>> >> > > > is only exposed if CONFIG_CHECKPOINT_RESTORE is set. This interface
>> >> > > > is very useful for enumerating the files mapped into a process when
>> >> > > > the more verbose information in /proc/<pid>/maps is not needed.
>> >>
>> >> This is the main (actually only) justification for the patch, and it is
>> >> far too thin.  What does "not needed" mean.  Why can't people just use
>> >> /proc/pid/maps?
>> >
>> > The biggest difference is that if you do something like this:
>> >
>> >         fd = open("/stuff", O_BLAH);
>> >         map = mmap(NULL, 4096, PROT_BLAH, MAP_SHARED, fd, 0);
>> >         close(fd);
>> >         unlink("/stuff");
>> >
>> > ...then map_files/ gives you a way to get a file descriptor for
>> > "/stuff", which you couldn't do with /proc/pid/maps.
>> >
>> > It's also something of a win if you just want to see what is mapped at a
>> > specific address, since you can just readlink() the symlink for the
>> > address range you care about and it will go grab the appropriate VMA and
>> > give you the answer. /proc/pid/maps requires walking the VMA tree, which
>> > is quite expensive for processes with many thousands of threads, even
>> > without the O(N^2) issue.
>> >
>> > (You have to know what address range you want though, since readdir() on
>> > map_files/ obviously has to walk the VMA tree just like /proc/N/maps.)
>> >
>> >> > > > This patch moves the folder out from behind CHECKPOINT_RESTORE, and
>> >> > > > removes the CAP_SYS_ADMIN restrictions. Following the links requires
>> >> > > > the ability to ptrace the process in question, so this doesn't allow
>> >> > > > an attacker to do anything they couldn't already do before.
>> >> > > >
>> >> > > > Signed-off-by: Calvin Owens <calvinowens@fb.com>
>> >> > >
>> >> > > Cc +linux-api@
>> >> >
>> >> > Looks good to me, thanks! Though I would really appreciate if someone
>> >> > from security camp take a look as well.
>> >>
>> >> hm, who's that.  Kees comes to mind.
>> >>
>> >> And reviewers' task would be a heck of a lot easier if they knew what
>> >> /proc/pid/map_files actually does.  This:
>> >>
>> >> akpm3:/usr/src/25> grep -r map_files Documentation
>> >> akpm3:/usr/src/25>
>> >>
>> >> does not help.
>> >>
>> >> The 640708a2cff7f81 changelog says:
>> >>
>> >> :     This one behaves similarly to the /proc/<pid>/fd/ one - it contains
>> >> :     symlinks one for each mapping with file, the name of a symlink is
>> >> :     "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
>> >> :     results in a file that points to exactly the same inode as the vma's one.
>> >> :
>> >> :     For example the ls -l of some arbitrary /proc/<pid>/map_files/
>> >> :
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>> >> :      | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
>> >>
>> >> AFAICT this info is also available in /proc/pid/maps, so things
>> >> shouldn't get worse if the /proc/pid/map_files permissions are at least
>> >> as restrictive as the /proc/pid/maps permissions.  Is that the case?
>> >> (Please add to changelog).
>> >
>> > Yes, the only difference is that you can follow the link as per above.
>> > I'll resend with a new message explaining that and the deletion thing.
>> >
>> >> There's one other problem here: we're assuming that the map_files
>> >> implementation doesn't have bugs.  If it does have bugs then relaxing
>> >> permissions like this will create new vulnerabilities.  And the
>> >> map_files implementation is surprisingly complex.  Is it bug-free?
>> >
>> > While I was messing with it I used it a good bit and didn't see any
>> > issues, although I didn't actively try to fuzz it or anything. I'd be
>> > happy to write something to test hammering it in weird ways if you like.
>> > I'm also happy to write testcases for namespaces.
>> >
>> > So far as security issues, as others have pointed out you can't follow
>> > the links unless you can ptrace the process in question, which seems
>> > like a pretty solid guarantee. As Cyrill pointed out in the discussion
>> > about the documentation, that's the same protection as /proc/N/fd/*, and
>> > those links function in the same way.
>>
>> My concern here is that fd/* are connected as streams, and while that
>> has a certain level of badness as an external-to-the-process attacker,
>> PTRACE_MODE_READ is much weaker than PTRACE_MODE_ATTACH (which is
>> required for access to /proc/N/mem). Since these fds are the things
>> mapped into memory on a process, writing to them is a subset of access
>> to /proc/N/mem, and I don't feel that PTRACE_MODE_READ is sufficient.
>
> If you haven't done close() on a mmapped file, doesn't fd/* allow the
> same access to the corresponding regions of memory? Or am I missing
> something?
>

But if you have called close(), then you can't currently do things
like ftruncate or ioctl on the mapped file.  These things don't
persist across execve(), but they do persist across calls to setresuid,
etc that drop privileges.  The latter part makes me a tiny bit
nervous.

It also might be worth checking for drivers or arch code that creates
vmas that are backed by a different struct file than the struct file
that was mmapped in the first place.

--Andy

* Re: [PATCH 2/2] user_namespaces.7: Update the documention to reflect the fixes for negative groups
From: Alban Crequy @ 2015-02-02 21:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
	stable, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kenton Varda, LSM, Michael Kerrisk-manpages, Richard Weinberger,
	Casey Schaufler, Andrew Morton, Andy Lutomirski
In-Reply-To: <87ppbo1ql4.fsf_-_-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

Hello,

Thanks for updating the man page.

On 12 December 2014 at 22:54, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
(...)
> Furthermore to preserve in some form the useful applications that have
> been setting gid_map without privilege the file /proc/[pid]/setgroups
> was added to allow disabling setgroups.  With the setgroups system
> call permanently disabled in a user namespace it again becomes safe to
> allow writes to gid_map without privilege.
>
> Here is my meager attempt to update user_namespaces.7 to reflect these
> issues.

The program userns_child_exec.c in user_namespaces.7 should be updated
to write to /proc/.../setgroups, near the line:
/* Update the UID and GID maps in the child */

Otherwise, the example given in the manpage does not work:
$ ./userns_child_exec -p -m -U -M '0 1000 1' -G '0 1000 1' bash
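
As a rough sketch of the kind of update meant here (not the actual man
page code): the parent would write "deny" to the child's setgroups file
before writing gid_map. The helper names write_string() and
deny_setgroups_and_map_gids() are made up for illustration, and ignoring
ENOENT for kernels that lack the setgroups file is my assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a string to an existing file in one write(); 0 on success, -1 on error. */
int write_string(const char *path, const char *data)
{
	size_t len = strlen(data);
	int fd = open(path, O_WRONLY);
	int saved;
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, data, len);
	saved = errno;		/* keep write()'s errno across close() */
	close(fd);
	errno = saved;
	return n == (ssize_t)len ? 0 : -1;
}

/*
 * Hypothetical helper: disable setgroups(2) in the user namespace of
 * process 'pid', then install the GID map (e.g. "0 1000 1").
 */
int deny_setgroups_and_map_gids(pid_t pid, const char *gid_map)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/%ld/setgroups", (long)pid);
	/* Assumption: a missing file means an older kernel, so carry on */
	if (write_string(path, "deny") < 0 && errno != ENOENT)
		return -1;

	snprintf(path, sizeof(path), "/proc/%ld/gid_map", (long)pid);
	return write_string(path, gid_map);
}
```

The ordering is the point: setgroups has to be denied first, otherwise
the unprivileged write to gid_map is refused.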

Cheers,
Alban

^ permalink raw reply

* [PATCH v2] selftests/exec: Check if the syscall exists and bail if not
From: Michael Ellerman @ 2015-02-03  3:53 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, geert-Td1EMuHUCqxL1ZNQvxDV9g,
	drysdale-hpIqsD4AKlfQT0dZR+AlfA, shuahkh-JPH+aEBZ4P+UEJcrhfAQsw,
	davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL

On systems which don't implement sys_execveat(), this test produces a
lot of output.

Add a check at the beginning to see if the syscall is present, and if
not just note one error and return.

When we run on a system that doesn't implement the syscall we will get
ENOSYS back from the kernel, so change the logic that handles
__NR_execveat not being defined to also use ENOSYS rather than -ENOSYS.

Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>
---

v2: Switch to positive ENOSYS. Confirmed this works as expected in the
case where the syscall is defined, but then is not present at runtime.


 tools/testing/selftests/exec/execveat.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index e238c9559caf..8d5d1d2ee7c1 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -30,7 +30,7 @@ static int execveat_(int fd, const char *path, char **argv, char **envp,
 #ifdef __NR_execveat
 	return syscall(__NR_execveat, fd, path, argv, envp, flags);
 #else
-	errno = -ENOSYS;
+	errno = ENOSYS;
 	return -1;
 #endif
 }
@@ -234,6 +234,14 @@ static int run_tests(void)
 	int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
 	int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
 
+	/* Check if we have execveat at all, and bail early if not */
+	errno = 0;
+	execveat_(-1, NULL, NULL, NULL, 0);
+	if (errno == ENOSYS) {
+		printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
+		return 1;
+	}
+
 	/* Change file position to confirm it doesn't affect anything */
 	lseek(fd, 10, SEEK_SET);
 
-- 
2.1.0
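
For reference, the probe added above can be sketched as a standalone
helper. This is only an illustration, not part of the patch: the names
syscall_is_missing() and execveat_is_missing() are made up here. It
relies on syscall(2) returning -1 and setting errno to the positive
ENOSYS when the kernel lacks the syscall, which is why the #else branch
of execveat_() has to use ENOSYS rather than -ENOSYS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Returns 1 if the kernel reports ENOSYS for syscall number 'nr', else 0.
 * The arguments are deliberately invalid (fd -1, NULL pointers), so only
 * pass syscall numbers that fail harmlessly when called this way.
 */
int syscall_is_missing(long nr)
{
	long ret;

	errno = 0;
	ret = syscall(nr, -1L, NULL, NULL, NULL, 0L);
	return ret == -1 && errno == ENOSYS;
}

/* Returns 1 if execveat is unavailable at build time or at run time. */
int execveat_is_missing(void)
{
#ifdef __NR_execveat
	return syscall_is_missing(__NR_execveat);
#else
	return 1;	/* headers too old to know the syscall number */
#endif
}
```

With a present syscall the bogus arguments produce some other error
(or none), so only ENOSYS is treated as "not implemented".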

^ permalink raw reply related

* Re: [PATCH v2] selftests/exec: Check if the syscall exists and bail if not
From: David Drysdale @ 2015-02-03  7:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Geert Uytterhoeven, Shuah Khan,
	davej-rdkfGonbjUTCLXcRTR1eJlpr/1R2p/CL
In-Reply-To: <1422935588-9973-1-git-send-email-mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>

On Tue, Feb 3, 2015 at 3:53 AM, Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org> wrote:
> On systems which don't implement sys_execveat(), this test produces a
> lot of output.
>
> Add a check at the beginning to see if the syscall is present, and if
> not just note one error and return.
>
> When we run on a system that doesn't implement the syscall we will get
> ENOSYS back from the kernel, so change the logic that handles
> __NR_execveat not being defined to also use ENOSYS rather than -ENOSYS.
>
> Signed-off-by: Michael Ellerman <mpe-Gsx/Oe8HsFggBc27wqDAHg@public.gmane.org>

Acked-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

> ---
>
> v2: Switch to positive ENOSYS. Confirmed this works as expected in the
> case where the syscall is defined, but then is not present at runtime.

Thanks!

>
>  tools/testing/selftests/exec/execveat.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
> index e238c9559caf..8d5d1d2ee7c1 100644
> --- a/tools/testing/selftests/exec/execveat.c
> +++ b/tools/testing/selftests/exec/execveat.c
> @@ -30,7 +30,7 @@ static int execveat_(int fd, const char *path, char **argv, char **envp,
>  #ifdef __NR_execveat
>         return syscall(__NR_execveat, fd, path, argv, envp, flags);
>  #else
> -       errno = -ENOSYS;
> +       errno = ENOSYS;
>         return -1;
>  #endif
>  }
> @@ -234,6 +234,14 @@ static int run_tests(void)
>         int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
>         int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
>
> +       /* Check if we have execveat at all, and bail early if not */
> +       errno = 0;
> +       execveat_(-1, NULL, NULL, NULL, 0);
> +       if (errno == ENOSYS) {
> +               printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
> +               return 1;
> +       }
> +
>         /* Change file position to confirm it doesn't affect anything */
>         lseek(fd, 10, SEEK_SET);
>
> --
> 2.1.0
>

^ permalink raw reply

* MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Vlastimil Babka @ 2015-02-03  8:19 UTC (permalink / raw)
  To: Dave Hansen, Mel Gorman, linux-mm
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-api, mtk.manpages,
	linux-man
In-Reply-To: <54CFF8AC.6010102@intel.com>

[CC linux-api, man pages]

On 02/02/2015 11:22 PM, Dave Hansen wrote:
> On 02/02/2015 08:55 AM, Mel Gorman wrote:
>> This patch identifies when a thread is frequently calling MADV_DONTNEED
>> on the same region of memory and starts ignoring the hint. On an 8-core
>> single-socket machine this was the impact on ebizzy using glibc 2.19.
> 
> The manpage, at least, claims that we zero-fill after MADV_DONTNEED is
> called:
> 
>>      MADV_DONTNEED
>>               Do  not  expect  access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources
>>               associated with it.)  Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents  from  the
>>               underlying mapped file (see mmap(2)) or zero-fill-on-demand pages for mappings without an underlying file.
> 
> So if we have anything depending on the behavior that it's _always_
> zero-filled after an MADV_DONTNEED, this will break it.

OK, so that's a third person (including me) who understood it as a zero-fill
guarantee. I think the man page should be clarified (if it's indeed not
guaranteed), or we have a bug.
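
To make the behaviour in question concrete, here is a minimal sketch
(mine, not from the patch) of what an application relying on the
zero-fill reading would observe. The function name
first_byte_after_dontneed() is made up. For a private anonymous mapping
I would expect it to return 0 when the hint is honoured, and 0xAA if
the hint were ignored as the RFC proposes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#endif
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one private anonymous page with 0xAA, call MADV_DONTNEED on it,
 * and return the first byte read back afterwards (0..255), or -1 if the
 * setup or the madvise() call itself fails.
 */
int first_byte_after_dontneed(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	volatile unsigned char *vp;
	int value;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAA, (size_t)page);

	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	/* volatile read so the compiler cannot reuse the 0xAA it just stored */
	vp = p;
	value = vp[0];

	munmap(p, (size_t)page);
	return value;
}
```

Anything that reuses the page after the madvise() call and assumes it
starts out as zeroes depends on that return value being 0.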

The implementation actually skips MADV_DONTNEED for
VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma's.

I'm not sure about VM_PFNMAP; these are probably special enough. For mlock, one
could expect that mlocking and MADV_DONTNEED would be in some opposition, but
it's not documented in the manpage AFAIK. Neither is the hugetlb case, which
could be really unexpected by the user.

Next, what the man page says about guarantees:

"The kernel is free to ignore the advice."

- that would suggest that nothing is guaranteed

"This call does not influence the semantics of the application (except in the
case of MADV_DONTNEED)"

- that depends on whether the reader understands it as "MADV_DONTNEED does
influence the semantics" or "MADV_DONTNEED may influence the semantics"

- btw, isn't MADV_DONTFORK another exception that does influence the semantics?
And since it's mentioned as a workaround for some hardware, is it OK to ignore
this advice?

And the part you already cited:

"Subsequent accesses of pages in this range will succeed, but will result either
in reloading of the memory contents from the underlying mapped file (see
mmap(2)) or zero-fill on-demand pages for mappings without an underlying file."

- The words "will result" did sound like a guarantee, at least to me. So here it
could be changed to "may result (unless the advice is ignored)"?

And if we agree that there is indeed no guarantee, what's the actual semantic
difference from MADV_FREE? I guess none? So there's only a possible performance
difference?

Vlastimil


^ permalink raw reply

* Re: [PATCH v5] perf: Use monotonic clock as a source for timestamps
From: Pawel Moll @ 2015-02-03  9:20 UTC (permalink / raw)
  To: ajh mls
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	adrian.hunter-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
In-Reply-To: <CAN+dfcT_6zZZ4oeyngUE5N0Wtx2B9CvXsfU71m+cuyXpq2KBdw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, 2015-02-03 at 08:30 +0000, ajh mls wrote:
> There is still
>
> http://marc.info/?l=linux-kernel&m=142141223902303

Uh. I have no idea why, but I haven't got this mail at all :-(

Thanks for pointing it out!

Pawel

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Daniel Mack @ 2015-02-03 10:09 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnd Bergmann, Ted Ts'o, Michael Kerrisk, Linux API,
	One Thousand Gnomes, Austin S Hemmelgarn, Tom Gundersen,
	Greg Kroah-Hartman, linux-kernel, Eric W. Biederman,
	David Herrmann, Djalal Harouni, Johannes Stezenbach,
	Christoph Hellwig
In-Reply-To: <CALCETrUh1Mse4CBQ4bfkJf+ew=kdpn46hMLS2QafLhfRTzQoBQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Andy,

On 02/02/2015 09:12 PM, Andy Lutomirski wrote:
> On Feb 2, 2015 1:34 AM, "Daniel Mack" <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:

>> That's right, but again - if an application wants to gather this kind of
>> information about tasks it interacts with, it can do so today by looking
>> at /proc or similar sources. Desktop machines do exactly that already,
>> and the kernel code executed in such cases very much resembles that in
>> metadata.c, and is certainly not cheaper. kdbus just makes such
>> information more accessible when requested. Which information is
>> collected is defined by bit-masks on both the sender and the receiver
>> connection, and most applications will effectively only use a very
>> limited set by default if they go through one of the more high-level
>> libraries.
> 
> I should rephrase a bit.  Kdbus doesn't require use of send-time
> metadata.  It does, however, strongly encourage it, and it sounds like

On the kernel level, kdbus just *offers* that, just like sockets offer
SO_PASSCRED. On the userland level, kdbus helps applications get that
information race-free, easier and faster than they would otherwise.
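
For comparison, the socket-level analogue mentioned above can be
sketched as follows. This only illustrates SO_PASSCRED on an AF_UNIX
socketpair and has nothing to do with the kdbus API itself; the
function name sender_pid_via_passcred() is made up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for struct ucred and SCM_CREDENTIALS */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Send one byte over an AF_UNIX socketpair and return the sender PID
 * that the kernel attached as SCM_CREDENTIALS, or -1 on error.
 */
pid_t sender_pid_via_passcred(void)
{
	int sv[2];
	int on = 1;
	pid_t result = -1;
	char data;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(struct ucred))];
		struct cmsghdr align;	/* forces correct alignment */
	} ctrl;
	struct msghdr msg;
	struct cmsghdr *c;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	/* The receiver opts in; the sender does nothing special */
	if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) < 0)
		goto out;

	if (write(sv[0], "x", 1) != 1)
		goto out;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sv[1], &msg, 0) < 0)
		goto out;

	/* Walk the ancillary data looking for the credentials record */
	for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
		if (c->cmsg_level == SOL_SOCKET &&
		    c->cmsg_type == SCM_CREDENTIALS) {
			struct ucred cred;

			memcpy(&cred, CMSG_DATA(c), sizeof(cred));
			result = cred.pid;
		}
	}
out:
	close(sv[0]);
	close(sv[1]);
	return result;
}
```

Here both ends live in the same process, so the PID that comes back
should simply be the caller's own.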

> systemd and other major users will use send-time metadata.  Once that
> happens, it's ABI (even if it's purely in userspace), and changing it
> is asking for security holes to pop up.  So you'll be mostly stuck
> with it.

We know we can't break the ABI. At most, we could deprecate item types
and introduce new ones, but we want to avoid that by all means of
course. However, I fail to see how that is related to send time
metadata, or even to kdbus in general, as all ABIs have to be kept stable.

> Do you have some simple benchmark code you can share?  I'd like to
> play with it a bit.

Sure, it's part of the self-test suite. Call it with "-t benchmark" to
run the benchmark as an isolated test with verbose output. The code for
that lives in test-benchmark.c.


Thanks,
Daniel

^ permalink raw reply
