[PATCH v4] ext4: reduce lock contention in __ext4_new_inode

public inbox for linux-ext4@vger.kernel.org

* [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
@ 2017-08-08  5:05 Wang Shilong
  2017-08-16 16:42 ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Wang Shilong @ 2017-08-08  5:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, wshilong, adilger, sihara, lixi

From: Wang Shilong <wshilong@ddn.com>

While running a number of file-creating threads concurrently,
we found heavy lock contention on the group spinlock:

FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)
ext4_create                    1707443399           1440000      1185.72
_raw_spin_lock                 1317641501           180899929    7.28
jbd2__journal_start            287821030            1453950      197.96
jbd2_journal_get_write_access  33441470             73077185     0.46
ext4_add_nondir                29435963             1440000      20.44
ext4_add_entry                 26015166             1440049      18.07
ext4_dx_add_entry              25729337             1432814      17.96
ext4_mark_inode_dirty          12302433             5774407      2.13

Most of the CPU time is spent in _raw_spin_lock. Here are some test
numbers with and without the patch.

Test environment:
Server : SuperMicro Server (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
         DDR4 Memory, 8GbFC)
Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
          Read Intensive SSD)

format command:
        mkfs.ext4 -J size=4096

test command:
        mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
                -r -i 1 -v -p 10 -u #first run to load inode

        mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
                -r -i 5 -v -p 10 -u

Kernel version: 4.13.0-rc3

Test: 1,440,000 files in 48 directories, created by 48 processes:

Without patch:

File Creation   File removal   (ops per second)
79,033          289,569
81,463          285,359
79,875          288,475
79,917          284,624
79,420          290,91

With patch:

File Creation   File removal   (ops per second)
691,528         296,574
691,946         297,106
692,030         296,238
691,005         299,249
692,871         300,664

Creation performance improves by more than 8x with a large
journal. The main problem is that we find a free bit in the
bitmap, then do some checks and journal operations that may
sleep, and only then test-and-set the bit with the lock held.
This is racy: another process can steal the inode in between.

However, after the first attempt we know that the handle has
been started and the inode bitmap has been journaled, so
we can find and set the bit directly with the lock held. This
will mostly guarantee success on the second attempt.

This patch doesn't change the logic in no-journal mode;
luckily, I believe that is not a common use case.

Tested-by: Shuichi Ihara <sihara@ddn.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
---
v3->v4: code cleanup and avoid sleeping.
---
 fs/ext4/ialloc.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 507bfb3..23380f39 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -761,6 +761,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	ext4_group_t flex_group;
 	struct ext4_group_info *grp;
 	int encrypt = 0;
+	bool hold_lock;
 
 	/* Cannot create files in a deleted directory */
 	if (!dir || !dir->i_nlink)
@@ -917,17 +918,40 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			continue;
 		}
 
+		hold_lock = false;
 repeat_in_this_group:
+		/* If @hold_lock is true, the journal
+		 * is properly set up and the inode bitmap buffer has
+		 * already been journaled, so we can take the lock
+		 * and set the bit directly if one is found; this will
+		 * mostly guarantee forward progress for each thread.
+		 */
+		if (hold_lock)
+			ext4_lock_group(sb, group);
+
 		ino = ext4_find_next_zero_bit((unsigned long *)
 					      inode_bitmap_bh->b_data,
 					      EXT4_INODES_PER_GROUP(sb), ino);
-		if (ino >= EXT4_INODES_PER_GROUP(sb))
+		if (ino >= EXT4_INODES_PER_GROUP(sb)) {
+			if (hold_lock)
+				ext4_unlock_group(sb, group);
 			goto next_group;
+		}
 		if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
+			if (hold_lock)
+				ext4_unlock_group(sb, group);
 			ext4_error(sb, "reserved inode found cleared - "
 				   "inode=%lu", ino + 1);
 			continue;
 		}
+
+		if (hold_lock) {
+			ext4_set_bit(ino, inode_bitmap_bh->b_data);
+			ext4_unlock_group(sb, group);
+			ino++;
+			goto got;
+		}
+
 		if ((EXT4_SB(sb)->s_journal == NULL) &&
 		    recently_deleted(sb, group, ino)) {
 			ino++;
@@ -950,6 +974,10 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			ext4_std_error(sb, err);
 			goto out;
 		}
+
+		if (EXT4_SB(sb)->s_journal)
+			hold_lock = true;
+
 		ext4_lock_group(sb, group);
 		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
 		ext4_unlock_group(sb, group);
-- 
2.9.3


* Re: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
  2017-08-08  5:05 [PATCH v4] ext4: reduce lock contention in __ext4_new_inode Wang Shilong
@ 2017-08-16 16:42 ` Jan Kara
  2017-08-17  6:23   ` Wang Shilong
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2017-08-16 16:42 UTC (permalink / raw)
  To: Wang Shilong; +Cc: linux-ext4, tytso, wshilong, adilger, sihara, lixi

On Tue 08-08-17 13:05:17, Wang Shilong wrote:
> From: Wang Shilong <wshilong@ddn.com>
> 
> While running number of creating file threads concurrently,
> we found heavy lock contention on group spinlock:
> 
> FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)
> ext4_create                    1707443399           1440000      1185.72
> _raw_spin_lock                 1317641501           180899929    7.28
> jbd2__journal_start            287821030            1453950      197.96
> jbd2_journal_get_write_access  33441470             73077185     0.46
> ext4_add_nondir                29435963             1440000      20.44
> ext4_add_entry                 26015166             1440049      18.07
> ext4_dx_add_entry              25729337             1432814      17.96
> ext4_mark_inode_dirty          12302433             5774407      2.13
> 
> most of cpu time blames to _raw_spin_lock, here is some testing
> numbers with/without patch.
> 
> Test environment:
> Server : SuperMicro Sever (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
>          DDR4 Memory, 8GbFC)
> Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
>           Read Intensive SSD)
> 
> format command:
>         mkfs.ext4 -J size=4096
> 
> test command:
>         mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
>                 -r -i 1 -v -p 10 -u #first run to load inode
> 
>         mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
>                 -r -i 5 -v -p 10 -u
> 
> Kernel version: 4.13.0-rc3
> 
> Test  1,440,000 files with 48 directories by 48 processes:
> 
> Without patch:
> 
> File Creation   File removal
> 79,033          289,569 ops/per second
> 81,463          285,359
> 79,875          288,475
> 79,917          284,624
> 79,420          290,91
> 
> with patch:
> File Creation   File removal
> 691,528		296,574 ops/per second
> 691,946		297,106
> 692,030		296,238
> 691,005		299,249
> 692,871		300,664
> 
> Creation performance is improved more than 8X with large
> journal size. The main problem here is we test bitmap
> and do some check and journal operations which could be
> slept, then we test and set with lock hold, this could
> be racy, and make 'inode' steal by other process.
> 
> However, after first try, we could confirm handle has
> been started and inode bitmap journaled too, then
> we could find and set bit with lock hold directly, this
> will mostly gurateee success with second try.
> 
> This patch dosen't change logic if it comes to
> no journal mode, luckily this is not normal
> use cases i believe.
> 
> Tested-by: Shuichi Ihara <sihara@ddn.com>
> Signed-off-by: Wang Shilong <wshilong@ddn.com>

The results look great and the code looks correct; however, I dislike the
somewhat complex code flow with your hold_lock variable. So how about
cleaning up the code as follows:

Create a function like

unsigned long find_inode_bit(struct super_block *sb, ext4_group_t group,
		struct buffer_head *bitmap, unsigned long start_ino)
{
	unsigned long ino;

next:
	ino = ext4_find_next_zero_bit(...);
	if (ino >= EXT4_INODES_PER_GROUP(sb))
		return 0;
	if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
		...
		return 0;
	}
	if ((EXT4_SB(sb)->s_journal == NULL) &&
                    recently_deleted(sb, group, ino)) {
		start_ino = ino + 1;
		if (start_ino < EXT4_INODES_PER_GROUP(sb))
			goto next;
	}
	return ino;
}

Then you can use this function from __ext4_new_inode() when looking for
free ino and also in case test_and_set_bit() fails you could just do:

ext4_lock_group(sb, group);
ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
if (ret2) {
	/* Someone already took the bit. Repeat the search with lock held.*/
	ino = find_inode_bit(sb, group, inode_bitmap_bh, ino);
	if (ino) {
		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
		WARN_ON_ONCE(!ret2);
	}
}
ext4_unlock_group(sb, group);

And that's it, no strange bool variables and conditional locking. And as a
bonus it also works for nojournal mode in the same way.
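
For readers following the thread, the "retry the search with the lock held"
idea above can be modelled in userspace. This is only a sketch under stated
assumptions, not the kernel code: a pthread mutex stands in for the ext4
group lock, the group size is an arbitrary 64 bits, and group_model,
claim_bit() and the *_model helpers are hypothetical names invented for
illustration. The model reports failure through a separate return value
rather than returning 0, since bit 0 is a legitimate index in a zero-based
bitmap.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define INODES_PER_GROUP 64UL  /* hypothetical group size for this model */
#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS     ((INODES_PER_GROUP + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Userspace stand-in for one block group: a lock plus its inode bitmap. */
struct group_model {
	pthread_mutex_t lock;  /* plays the role of the ext4 group lock */
	unsigned long bitmap[BITMAP_WORDS];
};

/* Return the first clear bit at or after @start, or INODES_PER_GROUP if none. */
static unsigned long find_next_zero_bit_model(const unsigned long *map,
					      unsigned long start)
{
	for (unsigned long bit = start; bit < INODES_PER_GROUP; bit++)
		if (!(map[bit / BITS_PER_WORD] & (1UL << (bit % BITS_PER_WORD))))
			return bit;
	return INODES_PER_GROUP;
}

/* Set @bit and return its previous value. The caller must hold the lock. */
static bool test_and_set_bit_model(unsigned long *map, unsigned long bit)
{
	unsigned long mask = 1UL << (bit % BITS_PER_WORD);
	bool was_set = (map[bit / BITS_PER_WORD] & mask) != 0;

	map[bit / BITS_PER_WORD] |= mask;
	return was_set;
}

/*
 * Try to claim @candidate, which the caller found earlier without the lock.
 * If another thread took it in the meantime, search again from @candidate
 * while still holding the lock, so the newly found bit cannot be stolen.
 * Returns true and stores the claimed bit in @out on success.
 */
static bool claim_bit(struct group_model *g, unsigned long candidate,
		      unsigned long *out)
{
	bool claimed = true;

	assert(candidate < INODES_PER_GROUP);

	pthread_mutex_lock(&g->lock);
	if (test_and_set_bit_model(g->bitmap, candidate)) {
		/* Lost the race: repeat the search with the lock held. */
		candidate = find_next_zero_bit_model(g->bitmap, candidate);
		if (candidate < INODES_PER_GROUP)
			test_and_set_bit_model(g->bitmap, candidate);
		else
			claimed = false;  /* nothing free from here to the end */
	}
	pthread_mutex_unlock(&g->lock);

	if (claimed)
		*out = candidate;
	return claimed;
}
```

In this model a caller that loses the race pays for one extra search under
the lock instead of dropping the lock and starting over, which is the
property the suggestion above relies on.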

								Honza

> ---
> v3->v4: codes cleanup and avoid sleep.
> ---
>  fs/ext4/ialloc.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 507bfb3..23380f39 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -761,6 +761,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  	ext4_group_t flex_group;
>  	struct ext4_group_info *grp;
>  	int encrypt = 0;
> +	bool hold_lock;
>  
>  	/* Cannot create files in a deleted directory */
>  	if (!dir || !dir->i_nlink)
> @@ -917,17 +918,40 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  			continue;
>  		}
>  
> +		hold_lock = false;
>  repeat_in_this_group:
> +		/* if @hold_lock is ture, that means, journal
> +		 * is properly setup and inode bitmap buffer has
> +		 * been journaled already, we can directly hold
> +		 * lock and set bit if found, this will mostly
> +		 * gurantee forward progress for each thread.
> +		 */
> +		if (hold_lock)
> +			ext4_lock_group(sb, group);
> +
>  		ino = ext4_find_next_zero_bit((unsigned long *)
>  					      inode_bitmap_bh->b_data,
>  					      EXT4_INODES_PER_GROUP(sb), ino);
> -		if (ino >= EXT4_INODES_PER_GROUP(sb))
> +		if (ino >= EXT4_INODES_PER_GROUP(sb)) {
> +			if (hold_lock)
> +				ext4_unlock_group(sb, group);
>  			goto next_group;
> +		}
>  		if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
> +			if (hold_lock)
> +				ext4_unlock_group(sb, group);
>  			ext4_error(sb, "reserved inode found cleared - "
>  				   "inode=%lu", ino + 1);
>  			continue;
>  		}
> +
> +		if (hold_lock) {
> +			ext4_set_bit(ino, inode_bitmap_bh->b_data);
> +			ext4_unlock_group(sb, group);
> +			ino++;
> +			goto got;
> +		}
> +
>  		if ((EXT4_SB(sb)->s_journal == NULL) &&
>  		    recently_deleted(sb, group, ino)) {
>  			ino++;
> @@ -950,6 +974,10 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  			ext4_std_error(sb, err);
>  			goto out;
>  		}
> +
> +		if (EXT4_SB(sb)->s_journal)
> +			hold_lock = true;
> +
>  		ext4_lock_group(sb, group);
>  		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>  		ext4_unlock_group(sb, group);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* RE: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
  2017-08-16 16:42 ` Jan Kara
@ 2017-08-17  6:23   ` Wang Shilong
  2017-08-17  9:19     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Wang Shilong @ 2017-08-17  6:23 UTC (permalink / raw)
  To: Jan Kara, Wang Shilong
  Cc: linux-ext4@vger.kernel.org, tytso@mit.edu, adilger@dilger.ca,
	Shuichi Ihara, Li Xi

[-- Attachment #1: Type: text/plain, Size: 8057 bytes --]

Hi Jan,

Thanks for the good suggestion. Just one issue: we cannot hold the lock
in nojournal mode. How about something like the attached patch?

Please let me know if you have a better approach; much appreciated!


Thanks,
Shilong


[-- Attachment #2: modfied.patch --]
[-- Type: application/octet-stream, Size: 3156 bytes --]

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 0a9b48f..a3912cf 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -730,6 +730,39 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
 	return ret;
 }
 
+static unsigned long find_inode_bit(struct super_block *sb,
+				    ext4_group_t group,
+				    struct buffer_head *bitmap,
+				    unsigned long ino, bool grp_locked)
+{
+next:
+	ino = ext4_find_next_zero_bit((unsigned long *)
+				      bitmap->b_data,
+				      EXT4_INODES_PER_GROUP(sb), ino);
+	if (ino >= EXT4_INODES_PER_GROUP(sb))
+		return 0;
+
+	if (group == 0 && (ino + 1) < EXT4_FIRST_INO(sb)) {
+		if (grp_locked)
+			ext4_unlock_group(sb, group);
+		ext4_error(sb, "reserved inode found cleared - "
+			       "inode=%lu", ino + 1);
+		if (grp_locked)
+			ext4_lock_group(sb, group);
+		return 0;
+	}
+
+	if ((EXT4_SB(sb)->s_journal == NULL) &&
+	    recently_deleted(sb, group, ino)) {
+		ino++;
+		if (ino < EXT4_INODES_PER_GROUP(sb))
+			goto next;
+		return 0;
+	}
+
+	return ino;
+}
+
 /*
  * There are two policies for allocating an inode.  If the new inode is
  * a directory, then a forward search is made for a block group with both
@@ -910,21 +943,11 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		}
 
 repeat_in_this_group:
-		ino = ext4_find_next_zero_bit((unsigned long *)
-					      inode_bitmap_bh->b_data,
-					      EXT4_INODES_PER_GROUP(sb), ino);
-		if (ino >= EXT4_INODES_PER_GROUP(sb))
-			goto next_group;
-		if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
-			ext4_error(sb, "reserved inode found cleared - "
-				   "inode=%lu", ino + 1);
+		ino = find_inode_bit(sb, group, inode_bitmap_bh,
+				     ino, false);
+		if (!ino)
 			goto next_group;
-		}
-		if ((EXT4_SB(sb)->s_journal == NULL) &&
-		    recently_deleted(sb, group, ino)) {
-			ino++;
-			goto next_inode;
-		}
+
 		if (!handle) {
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
@@ -936,19 +959,37 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 				goto out;
 			}
 		}
+
 		BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
 		err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
 		if (err) {
 			ext4_std_error(sb, err);
 			goto out;
 		}
+
 		ext4_lock_group(sb, group);
 		ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
+		if (ret2 && EXT4_SB(sb)->s_journal == NULL) {
+			/* Someone already took the bit. Repeat the search
+			 * with lock held, function might sleep in two cases:
+			 * 1) no journal mode.
+			 * 2) journal mode but hit logic error.
+			 * only hold lock in journal mode, but still need
+			 * take care of error case.
+			 */
+			ino = find_inode_bit(sb, group, inode_bitmap_bh,
+					     ino, true);
+			if (ino) {
+				ret2 = ext4_test_and_set_bit(ino,
+						inode_bitmap_bh->b_data);
+				WARN_ON_ONCE(!ret2);
+			}
+		}
 		ext4_unlock_group(sb, group);
 		ino++;		/* the inode bitmap is zero-based */
 		if (!ret2)
 			goto got; /* we grabbed the inode! */
-next_inode:
+
 		if (ino < EXT4_INODES_PER_GROUP(sb))
 			goto repeat_in_this_group;
 next_group:


* Re: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
  2017-08-17  6:23   ` Wang Shilong
@ 2017-08-17  9:19     ` Jan Kara
  2017-08-17  9:21       ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2017-08-17  9:19 UTC (permalink / raw)
  To: Wang Shilong
  Cc: Jan Kara, Wang Shilong, linux-ext4@vger.kernel.org, tytso@mit.edu,
	adilger@dilger.ca, Shuichi Ihara, Li Xi

Hi Shilong!

On Thu 17-08-17 06:23:26, Wang Shilong wrote:
>      thanks for good suggestion, just one question we could not hold lock
> with nojounal mode, how about something attached one?
> 
> please let me know if you have better taste for it, much appreciated!

Thanks for quickly updating the patch! Is the only reason why you cannot
hold the lock in the nojournal mode that sb_getblk() might sleep? The
attached patch should fix that so that you don't have to special-case the
nojournal mode anymore.

Also, looking at your patch, I'd just move the check for EXT4_FIRST_INO() out
of find_inode_bit() - that way you can avoid special-casing the error as well
and the check makes sense only when using find_next_zero_bit() for the
first time anyway (after that we are guaranteed that we start searching at
inode number that is big enough).
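
As a rough illustration of the caller-side check described above, here is a
minimal userspace sketch. It is not the kernel code: FIRST_INO is a
hypothetical stand-in for EXT4_FIRST_INO(sb) with an assumed value of 11,
and check_reserved() and the enum are invented names. The point is only
that the search helper can stay free of error handling while the caller
validates the result once, after the first search.

```c
#include <assert.h>

#define FIRST_INO 11UL  /* hypothetical: first non-reserved inode number */

enum search_result { SEARCH_OK, SEARCH_CORRUPT };

/*
 * Caller-side validation of a zero-based bit returned by the bitmap search.
 * Inode numbers are one-based, so the inode number is bit + 1. Only group 0
 * contains reserved inodes; a clear bit below FIRST_INO there means the
 * bitmap is corrupted.
 */
static enum search_result check_reserved(unsigned int group, unsigned long bit)
{
	if (group == 0 && bit + 1 < FIRST_INO)
		return SEARCH_CORRUPT;  /* a reserved inode was found cleared */
	return SEARCH_OK;
}
```

Because the check lives in the caller, the helper never needs to drop and
retake the lock in order to report the error.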

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
  2017-08-17  9:19     ` Jan Kara
@ 2017-08-17  9:21       ` Jan Kara
  2017-08-17 21:51         ` Y2038 bug in ext4 recently_deleted() function Andreas Dilger
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2017-08-17  9:21 UTC (permalink / raw)
  To: Wang Shilong
  Cc: Jan Kara, Wang Shilong, linux-ext4@vger.kernel.org, tytso@mit.edu,
	adilger@dilger.ca, Shuichi Ihara, Li Xi

On Thu 17-08-17 11:19:59, Jan Kara wrote:
> Hi Shilong!
> 
> On Thu 17-08-17 06:23:26, Wang Shilong wrote:
> >      thanks for good suggestion, just one question we could not hold lock
> > with nojounal mode, how about something attached one?
> > 
> > please let me know if you have better taste for it, much appreciated!
> 
> Thanks for quickly updating the patch! Is the only reason why you cannot
> hold the lock in the nojournal mode that sb_getblk() might sleep? The
> attached patch should fix that so that you don't have to special-case the
> nojournal mode anymore.

Forgot to attach the patch - here it is. Feel free to include it in your
series as a preparatory patch.

								Honza

> > ________________________________________
> > From: Jan Kara [jack@suse.cz]
> > Sent: Thursday, August 17, 2017 0:42
> > To: Wang Shilong
> > Cc: linux-ext4@vger.kernel.org; tytso@mit.edu; Wang Shilong; adilger@dilger.ca; Shuichi Ihara; Li Xi
> > Subject: Re: [PATCH v4] ext4: reduce lock contention in __ext4_new_inode
> > 
> > On Tue 08-08-17 13:05:17, Wang Shilong wrote:
> > > From: Wang Shilong <wshilong@ddn.com>
> > >
> > > While running number of creating file threads concurrently,
> > > we found heavy lock contention on group spinlock:
> > >
> > > FUNC                           TOTAL_TIME(us)       COUNT        AVG(us)
> > > ext4_create                    1707443399           1440000      1185.72
> > > _raw_spin_lock                 1317641501           180899929    7.28
> > > jbd2__journal_start            287821030            1453950      197.96
> > > jbd2_journal_get_write_access  33441470             73077185     0.46
> > > ext4_add_nondir                29435963             1440000      20.44
> > > ext4_add_entry                 26015166             1440049      18.07
> > > ext4_dx_add_entry              25729337             1432814      17.96
> > > ext4_mark_inode_dirty          12302433             5774407      2.13
> > >
> > > most of cpu time blames to _raw_spin_lock, here is some testing
> > > numbers with/without patch.
> > >
> > > Test environment:
> > > Server : SuperMicro Sever (2 x E5-2690 v3@2.60GHz, 128GB 2133MHz
> > >          DDR4 Memory, 8GbFC)
> > > Storage : 2 x RAID1 (DDN SFA7700X, 4 x Toshiba PX02SMU020 200GB
> > >           Read Intensive SSD)
> > >
> > > format command:
> > >         mkfs.ext4 -J size=4096
> > >
> > > test command:
> > >         mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
> > >                 -r -i 1 -v -p 10 -u #first run to load inode
> > >
> > >         mpirun -np 48 mdtest -n 30000 -d /ext4/mdtest.out -F -C \
> > >                 -r -i 5 -v -p 10 -u
> > >
> > > Kernel version: 4.13.0-rc3
> > >
> > > Test  1,440,000 files with 48 directories by 48 processes:
> > >
> > > Without patch:
> > >
> > > File Creation   File removal
> > > 79,033          289,569 ops/per second
> > > 81,463          285,359
> > > 79,875          288,475
> > > 79,917          284,624
> > > 79,420          290,91
> > >
> > > with patch:
> > > File Creation   File removal
> > > 691,528               296,574 ops/per second
> > > 691,946               297,106
> > > 692,030               296,238
> > > 691,005               299,249
> > > 692,871               300,664
> > >
> > > Creation performance is improved more than 8X with large
> > > journal size. The main problem here is we test bitmap
> > > and do some check and journal operations which could be
> > > slept, then we test and set with lock hold, this could
> > > be racy, and make 'inode' steal by other process.
> > >
> > > However, after first try, we could confirm handle has
> > > been started and inode bitmap journaled too, then
> > > we could find and set bit with lock hold directly, this
> > > will mostly gurateee success with second try.
> > >
> > > This patch dosen't change logic if it comes to
> > > no journal mode, luckily this is not normal
> > > use cases i believe.
> > >
> > > Tested-by: Shuichi Ihara <sihara@ddn.com>
> > > Signed-off-by: Wang Shilong <wshilong@ddn.com>
> > 
> > The results look great and the code looks correct however I dislike the
> > somewhat complex codeflow with your hold_lock variable. So how about
> > cleaning up the code as follows:
> > 
> > Create function like
> > 
> > unsigned long find_inode_bit(struct super_block *sb, ext4_group_t group,
> >                 struct buffer_head *bitmap, unsigned long start_ino)
> > {
> >         unsigned long ino;
> > 
> > next:
> >         ino = ext4_find_next_zero_bit(...);
> >         if (ino >= EXT4_INODES_PER_GROUP(sb))
> >                 return 0;
> >         if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
> >                 ...
> >                 return 0;
> >         }
> >         if ((EXT4_SB(sb)->s_journal == NULL) &&
> >                     recently_deleted(sb, group, ino)) {
> >                 start_ino = ino + 1;
> >                 if (start_ino < EXT4_INODES_PER_GROUP(sb))
> >                         goto next;
> >         }
> >         return ino;
> > }
> > 
> > Then you can use this function from __ext4_new_inode() when looking for
> > free ino and also in case test_and_set_bit() fails you could just do:
> > 
> > ext4_lock_group(sb, group);
> > ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> > if (ret2) {
> >         /* Someone already took the bit. Repeat the search with lock held.*/
> >         ino = find_inode_bit(sb, group, inode_bitmap_bh, ino);
> >         if (ino) {
> >                 ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> >                 WARN_ON_ONCE(!ret2);
> >         }
> > }
> > ext4_unlock_group(sb, group);
> > 
> > And that's it, no strange bool variables and conditional locking. And as a
> > bonus it also works for nojournal mode in the same way.
> > 
> >                                                                 Honza
> > 
> > > ---
> > > v3->v4: codes cleanup and avoid sleep.
> > > ---
> > >  fs/ext4/ialloc.c | 30 +++++++++++++++++++++++++++++-
> > >  1 file changed, 29 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > > index 507bfb3..23380f39 100644
> > > --- a/fs/ext4/ialloc.c
> > > +++ b/fs/ext4/ialloc.c
> > > @@ -761,6 +761,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> > >       ext4_group_t flex_group;
> > >       struct ext4_group_info *grp;
> > >       int encrypt = 0;
> > > +     bool hold_lock;
> > >
> > >       /* Cannot create files in a deleted directory */
> > >       if (!dir || !dir->i_nlink)
> > > @@ -917,17 +918,40 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> > >                       continue;
> > >               }
> > >
> > > +             hold_lock = false;
> > >  repeat_in_this_group:
> > > +             /* if @hold_lock is ture, that means, journal
> > > +              * is properly setup and inode bitmap buffer has
> > > +              * been journaled already, we can directly hold
> > > +              * lock and set bit if found, this will mostly
> > > +              * gurantee forward progress for each thread.
> > > +              */
> > > +             if (hold_lock)
> > > +                     ext4_lock_group(sb, group);
> > > +
> > >               ino = ext4_find_next_zero_bit((unsigned long *)
> > >                                             inode_bitmap_bh->b_data,
> > >                                             EXT4_INODES_PER_GROUP(sb), ino);
> > > -             if (ino >= EXT4_INODES_PER_GROUP(sb))
> > > +             if (ino >= EXT4_INODES_PER_GROUP(sb)) {
> > > +                     if (hold_lock)
> > > +                             ext4_unlock_group(sb, group);
> > >                       goto next_group;
> > > +             }
> > >               if (group == 0 && (ino+1) < EXT4_FIRST_INO(sb)) {
> > > +                     if (hold_lock)
> > > +                             ext4_unlock_group(sb, group);
> > >                       ext4_error(sb, "reserved inode found cleared - "
> > >                                  "inode=%lu", ino + 1);
> > >                       continue;
> > >               }
> > > +
> > > +             if (hold_lock) {
> > > +                     ext4_set_bit(ino, inode_bitmap_bh->b_data);
> > > +                     ext4_unlock_group(sb, group);
> > > +                     ino++;
> > > +                     goto got;
> > > +             }
> > > +
> > >               if ((EXT4_SB(sb)->s_journal == NULL) &&
> > >                   recently_deleted(sb, group, ino)) {
> > >                       ino++;
> > > @@ -950,6 +974,10 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> > >                       ext4_std_error(sb, err);
> > >                       goto out;
> > >               }
> > > +
> > > +             if (EXT4_SB(sb)->s_journal)
> > > +                     hold_lock = true;
> > > +
> > >               ext4_lock_group(sb, group);
> > >               ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
> > >               ext4_unlock_group(sb, group);
> > > --
> > > 2.9.3
> > >
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
> 
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
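
As a rough user-space illustration of the pattern suggested in the quoted
review (try the bit that was found without the lock, and only if somebody
else took it repeat the search with the lock held), the sketch below may
help. It is not ext4 code: struct group, claim_bit() and the other helpers
are hypothetical names, a C11 atomic_flag stands in for the group spinlock,
and returning NBITS as the "no free bit" value is an assumption of this
sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NBITS 64u	/* bits per group in this toy model (assumption) */

struct group {
	atomic_flag lock;			/* stand-in for the group spinlock */
	unsigned char bitmap[NBITS / 8];	/* one bit per inode */
};

static void group_lock(struct group *g)
{
	while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void group_unlock(struct group *g)
{
	atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/* Set bit n and return its previous value. Caller holds the lock. */
static bool test_and_set_bit(struct group *g, unsigned int n)
{
	unsigned char mask = (unsigned char)(1u << (n % 8));
	bool old = (g->bitmap[n / 8] & mask) != 0;

	g->bitmap[n / 8] |= mask;
	return old;
}

/* Return the first clear bit at or after start, or NBITS if there is none. */
static unsigned int find_next_zero_bit(const struct group *g, unsigned int start)
{
	for (unsigned int n = start; n < NBITS; n++)
		if (!(g->bitmap[n / 8] & (1u << (n % 8))))
			return n;
	return NBITS;
}

/*
 * Claim a bit. hint was found earlier without the lock and may be stale.
 * Returns the claimed bit, or NBITS if nothing is free from hint onwards.
 */
static unsigned int claim_bit(struct group *g, unsigned int hint)
{
	unsigned int bit = hint;

	if (hint >= NBITS)
		return NBITS;

	group_lock(g);
	if (test_and_set_bit(g, bit)) {
		/* Someone already took the bit: repeat the search with the lock held. */
		bit = find_next_zero_bit(g, hint);
		if (bit < NBITS)
			test_and_set_bit(g, bit);
	}
	group_unlock(g);
	return bit;
}
```

The common case costs one locked test-and-set; the locked search only runs
after a lost race, and it cannot lose again because the lock is still held.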

[-- Attachment #2: 0001-ext4-Do-not-unnecessarily-allocate-buffer-in-recentl.patch --]
[-- Type: text/x-patch, Size: 1162 bytes --]

>From c9e9550fe6e2a7e498c1a8b709b570f4c5ed8e2b Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 17 Aug 2017 11:07:10 +0200
Subject: [PATCH] ext4: Do not unnecessarily allocate buffer in
 recently_deleted()

In recently_deleted() function we want to check whether inode is still
cached in buffer cache. Use sb_find_get_block() for that instead of
sb_getblk() to avoid unnecessary allocation of bdev page and buffer
heads.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ialloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 507bfb3344d4..0d03e73dccaf 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -707,9 +707,9 @@ static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
 	if (unlikely(!gdp))
 		return 0;
 
-	bh = sb_getblk(sb, ext4_inode_table(sb, gdp) +
+	bh = sb_find_get_block(sb, ext4_inode_table(sb, gdp) +
 		       (ino / inodes_per_block));
-	if (unlikely(!bh) || !buffer_uptodate(bh))
+	if (!bh || !buffer_uptodate(bh))
 		/*
 		 * If the block is not in the buffer cache, then it
 		 * must have been written out.
-- 
2.12.3



* Y2038 bug in ext4 recently_deleted() function
  2017-08-17  9:21       ` Jan Kara
@ 2017-08-17 21:51         ` Andreas Dilger
  2017-08-18  1:23           ` Deepa Dinamani
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2017-08-17 21:51 UTC (permalink / raw)
  To: Theodore Ts'o, Deepa Dinamani, Arnd Bergmann
  Cc: Wang Shilong, Wang Shilong, linux-ext4@vger.kernel.org,
	Shuichi Ihara, Li Xi, Jan Kara

On Aug 17, 2017, at 3:21 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Thu 17-08-17 11:19:59, Jan Kara wrote:
>> Hi Shilong!
>> 
>> On Thu 17-08-17 06:23:26, Wang Shilong wrote:
>>>     thanks for good suggestion, just one question we could not hold lock
>>> with nojounal mode, how about something attached one?
>>> 
>>> please let me know if you have better taste for it, much appreciated!
>> 
>> Thanks for quickly updating the patch! Is the only reason why you cannot
>> hold the lock in the nojournal mode that sb_getblk() might sleep? The
>> attached patch should fix that so that you don't have to special-case the
>> nojournal mode anymore.
> 
> Forgot to attach the patch - here it is. Feel free to include it in your
> series as a preparatory patch.

Strange, I never even knew recently_deleted() existed, even though it was
added to the tree 4 years ago yesterday.  It looks like this is only used
with the no-journal code, which I don't really interact with.

One thing I did notice when looking at it is that there is a Y2038 bug in
recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
get_seconds().  To fix this, it would be possible to either use a wrapped
32-bit comparison, like time_after() for jiffies, something like:

	u32 now, dtime;

	/* assume dtime is within the past 30 years, see time_after() */
	now = get_seconds();
	if (dtime && (dtime - now < 0) && (dtime + recentcy - now < 0))
		ret = 1;

or use i_ctime_extra to implicitly extend i_dtime beyond 2038, something like:

	/* assume dtime epoch same as ctime, see EXT4_INODE_GET_XTIME() */
	dtime = le32_to_cpu(raw_inode->i_dtime);
	if (EXT4_INODE_SIZE(sb) > EXT4_GOOD_OLD_INODE_SIZE &&
	    offsetof(typeof(*raw_inode), i_ctime_extra) + 4 <=
	    EXT4_GOOD_OLD_INODE_SIZE + le32_to_cpu(raw_inode->i_extra_isize))
		dtime += (long)(le32_to_cpu(raw_inode->i_ctime_extra) &
				EXT4_EPOCH_MASK) << 32;

Cheers, Andreas


* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-17 21:51         ` Y2038 bug in ext4 recently_deleted() function Andreas Dilger
@ 2017-08-18  1:23           ` Deepa Dinamani
  2017-08-18  9:31             ` Arnd Bergmann
  2017-08-18 13:41             ` Theodore Ts'o
  0 siblings, 2 replies; 14+ messages in thread
From: Deepa Dinamani @ 2017-08-18  1:23 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Arnd Bergmann, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

> Strange, I never even knew recently_deleted() existed, even though it was
> added to the tree 4 years ago yesterday.  It looks like this is only used
> with the no-journal code, which I don't really interact with.
>
> One thing I did notice when looking at it is that there is a Y2038 bug in
> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
> get_seconds().

I don't think dtime has widened on the disk layout for ext4 according
to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
not sure how fixing the internal implementation would be useful until
we do that. Is there a plan for that?

As far as get_seconds() is concerned, get_seconds() returns unsigned
long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
Since dtime variable is declared as unsigned long in this function,
same holds for the size of this variable.

There is no y2038 problem on a 64 bit machine.

So moving to the case of a 32 bit machine:

get_seconds() can return values until year 2106. And, recentcy at max
can only be 35. Analyzing the current line:

if (dtime && (dtime < now) && (now < dtime + recentcy))

The above equation should work fine at least until 35 seconds before
y2038 deadline.

-Deepa


* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-18  1:23           ` Deepa Dinamani
@ 2017-08-18  9:31             ` Arnd Bergmann
  2017-08-18 15:38               ` Deepa Dinamani
  2017-08-18 13:41             ` Theodore Ts'o
  1 sibling, 1 reply; 14+ messages in thread
From: Arnd Bergmann @ 2017-08-18  9:31 UTC (permalink / raw)
  To: Deepa Dinamani
  Cc: Andreas Dilger, Theodore Ts'o, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

On Fri, Aug 18, 2017 at 3:23 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>> Strange, I never even knew recently_deleted() existed, even though it was
>> added to the tree 4 years ago yesterday.  It looks like this is only used
>> with the no-journal code, which I don't really interact with.
>>
>> One thing I did notice when looking at it is that there is a Y2038 bug in
>> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
>> get_seconds().
>
> I don't think dtime has widened on the disk layout for ext4 according
> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
> not sure how fixing the internal implementation would be useful until
> we do that. Is there a plan for that?
>
> As far as get_seconds() is concerned, get_seconds() returns unsigned
> long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
> Since dtime variable is declared as unsigned long in this function,
> same holds for the size of this variable.
>
> There is no y2038 problem on a 64 bit machine.

I think what Andreas was saying is that it's actually the opposite:
on a 32-bit machine, the code will work correctly for 32-bit unsigned
long values as long as 'dtime' and 'now' are in the same epoch,
e.g. both are before 2106 or both are after.

On 64-bit systems it's always wrong after 2106.
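
To make the 64-bit case concrete, here is a small user-space model of the
current comparison. It is only a sketch: current_check() is a hypothetical
name, uint64_t stands in for a 64-bit unsigned long, and the on-disk i_dtime
is modelled as a uint32_t that simply wraps at 2^32 seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of "dtime && (dtime < now) && (now < dtime + recentcy)" on a
 * 64-bit machine: dtime comes from a 32-bit on-disk field and is
 * zero-extended, while now is a full 64-bit seconds count.
 */
static bool current_check(uint32_t disk_dtime, uint64_t now, uint32_t recentcy)
{
	uint64_t dtime = disk_dtime;

	return dtime && dtime < now && now < dtime + recentcy;
}
```

With now = 2^32 + 100 and an inode deleted 10 seconds earlier, the disk
field holds 90, so current_check(90, now, 35) is false although the deletion
was recent. Before the 2^32 boundary the same 10-second gap gives true.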

> So moving to the case of a 32 bit machine:
>
> get_seconds() can return values until year 2106. And, recentcy at max
> can only be 35. Analyzing the current line:
>
> if (dtime && (dtime < now) && (now < dtime + recentcy))
>
> The above equation should work fine at least until 35 seconds before
> y2038 deadline.

Since it's all unsigned arithmetic, it should be fine until 2106.
However, we should get rid of get_seconds() long before then
and use ktime_get_real_seconds() instead, as most other users
of get_seconds() are (more) broken.

Looking at the two suggested approaches:

>>        u32 now, dtime;
>>
>>        /* assume dtime is within the past 30 years, see time_after() */
>>        now = get_seconds();
>>        if (dtime && (dtime - now < 0) && (dtime + recentcy - now < 0))
>>                ret = 1;

* As 'dtime' and 'now' are both unsigned, subtracting them will also result
  in an unsigned value that is never less than zero, so it won't work.
  Adding a cast to 's32' would fix that the same way that time_after() does.

* Please use ktime_get_real_seconds() instead of get_seconds(), so we
  don't have to replace it later.

* The comment should say '68 years', not 30.
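
A user-space sketch of the first approach with these points applied might
look like the following. The names before32() and recently_deleted32() are
made up for illustration, now64 stands in for whatever
ktime_get_real_seconds() would return, and the test keeps the meaning of the
existing check (dtime < now < dtime + recentcy) rather than the exact
expression quoted above. It also assumes the usual two's-complement result
when converting to int32_t.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a is before b modulo 2^32. Valid as long as the two values
 * are less than 2^31 seconds (about 68 years) apart.
 */
static bool before32(uint32_t a, uint32_t b)
{
	/* The signed cast is what makes a "negative" difference visible. */
	return (int32_t)(a - b) < 0;
}

/* Wrapped 32-bit version of: dtime && dtime < now && now < dtime + recentcy */
static bool recently_deleted32(uint32_t dtime, uint64_t now64, uint32_t recentcy)
{
	uint32_t now = (uint32_t)now64;	/* truncate off the high bits */

	return dtime && before32(dtime, now) &&
	       before32(now, dtime + recentcy);
}
```

Because only differences are compared, a deletion at 0xFFFFFFF0 is still
seen as recent when the truncated clock has wrapped around to 5.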

> or use i_ctime_extra to implicitly extend i_dtime beyond 2038, something like:
>
>        /* assume dtime epoch same as ctime, see EXT4_INODE_GET_XTIME() */
>        dtime = le32_to_cpu(raw_inode->i_dtime);
>        if (EXT4_INODE_SIZE(sb) > EXT4_GOOD_OLD_INODE_SIZE &&
>            offsetof(typeof(*raw_inode), i_ctime_extra) + 4 <=
>            EXT4_GOOD_OLD_INODE_SIZE + le32_to_cpu(raw_inode->i_extra_isize))
>                dtime += (long)(le32_to_cpu(raw_inode->i_ctime_extra) &
>                                EXT4_EPOCH_MASK) << 32;

* This is slightly incorrect when we are close to the epoch boundary, as i_ctime
  and i_dtime might end up being in different epochs. I would not go there.

* If we were to pick this approach, a cast to 'long' is obviously wrong on
  32-bit systems, better use 'u64' or 'time64_t'.
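
If the second approach were pursued anyway, the widening would have to
happen in a 64-bit type. A minimal sketch, assuming two epoch bits in the
low bits of the extra field (as the use of EXT4_EPOCH_MASK suggests) and
treating dtime as an unsigned 32-bit count for simplicity; extend_dtime()
and the SKETCH_ macros are hypothetical names, not ext4 helpers:

```c
#include <stdint.h>

#define SKETCH_EPOCH_BITS 2
#define SKETCH_EPOCH_MASK ((1u << SKETCH_EPOCH_BITS) - 1)

/* Combine a 32-bit dtime with the epoch bits borrowed from ctime_extra. */
static uint64_t extend_dtime(uint32_t dtime, uint32_t ctime_extra)
{
	/* Widen before shifting: shifting a 32-bit long by 32 is not valid. */
	return (uint64_t)dtime +
	       ((uint64_t)(ctime_extra & SKETCH_EPOCH_MASK) << 32);
}
```

The boundary problem shows up directly: if ctime was recorded just before
the wrap (epoch bits 0) and the deletion happened just after it (low 32 bits
equal to 3), extend_dtime(3, 0) yields 3 rather than 2^32 + 3.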

     Arnd


* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-18  1:23           ` Deepa Dinamani
  2017-08-18  9:31             ` Arnd Bergmann
@ 2017-08-18 13:41             ` Theodore Ts'o
  1 sibling, 0 replies; 14+ messages in thread
From: Theodore Ts'o @ 2017-08-18 13:41 UTC (permalink / raw)
  To: Deepa Dinamani
  Cc: Andreas Dilger, Arnd Bergmann, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

On Thu, Aug 17, 2017 at 06:23:26PM -0700, Deepa Dinamani wrote:
> 
> I don't think dtime has widened on the disk layout for ext4 according
> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
> not sure how fixing the internal implementation would be useful until
> we do that. Is there a plan for that?

The dtime field is not visible to the user; it's mostly for debugging
purposes.  For debugfs we are just using i_ctime_extra to compose
the time.  (Perhaps we should be using i_mtime_extra, or the max of
the ctime, mtime, and atime extra fields; but it's not really that
important.)

The issue which Andreas pointed out is the only place where we
actually use the dtime field, and that's so we can avoid re-using a
freshly deleted inode until at least N seconds have gone by in
no-journal mode.  That's because if we don't, there are some
unfortunate effects that can take place if we crash and not all of the
metadata gets updated.  Even after running e2fsck -fy, we can end up
having a directory or an immutable file show up where ntp or timed
expects to find a time adjustment file, or some such, that can cause
various system daemons to crash and burn because they aren't expecting
to find a file at a particular pathname they own which they can't delete.

There are a number of ways we could solve it; one is to just use a new
in-memory variable which can be 64 bits wide.  This burns an extra 8
bytes for each inode in the inode cache, which is why we didn't do
that.

It doesn't really have to be super exact; if we actually have an inode
that avoids getting reused for 136 years (2**32 seconds), it will have
disappeared from the in-memory inode cache.  We just need something
which is valid for N seconds after the deletion time.  (I think we may
have upped N to a larger value on our data center kernels --- 300
seconds if I recall correctly --- because there were some edge cases
where 35 seconds wasn't enough.)
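
For reference, the 136-year figure (and the 68 years for a signed 32-bit
difference mentioned earlier in the thread) follow from simple division. A
tiny sketch, assuming a 365.25-day year; whole_years() is a made-up helper:

```c
#include <stdint.h>

#define SECONDS_PER_YEAR 31557600ull	/* 365.25 days * 86400 s (assumption) */

/* Whole years contained in the given number of seconds. */
static uint64_t whole_years(uint64_t seconds)
{
	return seconds / SECONDS_PER_YEAR;
}
```

whole_years(1ull << 32) is 136 and whole_years(1ull << 31) is 68, so
counting from 1970 gives the 2106 and 2038 limits discussed in this thread.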

						- Ted


* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-18  9:31             ` Arnd Bergmann
@ 2017-08-18 15:38               ` Deepa Dinamani
  2017-08-18 16:09                 ` Andreas Dilger
  0 siblings, 1 reply; 14+ messages in thread
From: Deepa Dinamani @ 2017-08-18 15:38 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andreas Dilger, Theodore Ts'o, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

On Fri, Aug 18, 2017 at 2:31 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Fri, Aug 18, 2017 at 3:23 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>> Strange, I never even knew recently_deleted() existed, even though it was
>>> added to the tree 4 years ago yesterday.  It looks like this is only used
>>> with the no-journal code, which I don't really interact with.
>>>
>>> One thing I did notice when looking at it is that there is a Y2038 bug in
>>> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
>>> get_seconds().
>>
>> I don't think dtime has widened on the disk layout for ext4 according
>> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
>> not sure how fixing the internal implementation would be useful until
>> we do that. Is there a plan for that?
>>
>> As far as get_seconds() is concerned, get_seconds() returns unsigned
>> long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
>> Since dtime variable is declared as unsigned long in this function,
>> same holds for the size of this variable.
>>
>> There is no y2038 problem on a 64 bit machine.
>
> I think what Andreas was saying is that it's actually the opposite:
> on a 32-bit machine, the code will work correctly for 32-bit unsigned
> long values as long as 'dtime' and 'now' are in the same epoch,
> e.g. both are before 2106 or both are after.
> On 64-bit systems it's always wrong after 2106.

There is some confusion here.
I was only referring to the current implementation:

static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
{
.
.
.
    unsigned long dtime, now;
    int offset, ret = 0, recentcy = RECENTCY_MIN;
.
.
.
    offset = (ino % inodes_per_block) * EXT4_INODE_SIZE(sb);
    raw_inode = (struct ext4_inode *) (bh->b_data + offset);
    dtime = le32_to_cpu(raw_inode->i_dtime);
    now = get_seconds();
    if (buffer_dirty(bh))
        recentcy += RECENTCY_DIRTY;

    if (dtime && (dtime < now) && (now < dtime + recentcy))
        ret = 1;
.
.
.
}

In the above implementation, I do not see any problem on a 64 bit machine.
The only problem is that the on-disk representation of dtime is signed 32 bits only.
If that were not a problem then this would be fine from a time perspective.

On 32 bit machine, dtime on disk representation again prevents it from
being able to represent times beyond 2038 unless one of the approaches
Ted mentioned is used to extend/ interpret it.

>> So moving to the case of a 32 bit machine:
>>
>> get_seconds() can return values until year 2106. And, recentcy at max
>> can only be 35. Analyzing the current line:
>>
>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>
>> The above equation should work fine at least until 35 seconds before
>> y2038 deadline.
>
> Since it's all unsigned arithmetic, it should be fine until 2106.
> However, we should get rid of get_seconds() long before then
> and use ktime_get_real_seconds() instead, as most other users
> of get_seconds() are (more) broken.

Dtime on disk representation again breaks this for certain values in
2038 even though everything is unsigned.

I was just saying that whatever we do here depends on how dtime on
disk is interpreted.

Agree that ktime_get_real_seconds() should be used here. But the way
we handle new values would rely on this new interpretation of dtime.
Also, using time64_t variables on the stack only matters after this. Once
the types are corrected, maybe the comparison expression need not
change at all (after the new dtime interpretation is in place).

Let me know if I am missing something here.

-Deepa


* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-18 15:38               ` Deepa Dinamani
@ 2017-08-18 16:09                 ` Andreas Dilger
  2017-08-22 15:18                   ` Arnd Bergmann
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2017-08-18 16:09 UTC (permalink / raw)
  To: Deepa Dinamani
  Cc: Arnd Bergmann, Theodore Ts'o, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara


> On Aug 18, 2017, at 9:38 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
> 
> On Fri, Aug 18, 2017 at 2:31 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Fri, Aug 18, 2017 at 3:23 AM, Deepa Dinamani <deepa.kernel@gmail.com> wrote:
>>> 
>>>> One thing I did notice when looking at it is that there is a Y2038 bug in
>>>> recently_deleted(), as it is comparing 32-bit i_dtime directly with 64-bit
>>>> get_seconds().
>>> 
>>> I don't think dtime has widened on the disk layout for ext4 according
>>> to https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout. So I am
>>> not sure how fixing the internal implementation would be useful until
>>> we do that. Is there a plan for that?
>>> 
>>> As far as get_seconds() is concerned, get_seconds() returns unsigned
>>> long which is 64 bits on a 64 bit arch and 32 bit on a 32 bit arch.
>>> Since dtime variable is declared as unsigned long in this function,
>>> same holds for the size of this variable.
>>> 
>>> There is no y2038 problem on a 64 bit machine.
>> 
>> I think what Andreas was saying is that it's actually the opposite:
>> on a 32-bit machine, the code will work correctly for 32-bit unsigned
>> long values as long as 'dtime' and 'now' are in the same epoch,
>> e.g. both are before 2106 or both are after.
>> On 64-bit systems it's always wrong after 2106.
> 
> There is some confusion here.
> I was only referring to the current implementation:
> 
> static int recently_deleted(struct super_block *sb, ext4_group_t group, int ino)
> {
> .
> .
> .
>   unsigned long dtime, now;
>   int offset, ret = 0, recentcy = RECENTCY_MIN;
> .
> .
> .
>    offset = (ino % inodes_per_block) * EXT4_INODE_SIZE(sb);
>    raw_inode = (struct ext4_inode *) (bh->b_data + offset);
>    dtime = le32_to_cpu(raw_inode->i_dtime);
>    now = get_seconds();
>    if (buffer_dirty(bh))
>    recentcy += RECENTCY_DIRTY;
> 
>    if (dtime && (dtime < now) && (now < dtime + recentcy))
>         ret = 1;
> .
> .
> .
> }
> 
> In the above implementation, I do not see any problem on a 64 bit machine.
> The only problem is that dtime on disk representation is signed 32 bits only.
> If that were not a problem then this would be fine from time prespective.

The 32-bit dtime is the root of the problem.  There is no plan to extend
the dtime field on disk, because it is used so little (mostly as a boolean
value, and for forensics).

>>> So moving to the case of a 32 bit machine:
>>> 
>>> get_seconds() can return values until year 2106. And, recentcy at max
>>> can only be 35. Analyzing the current line:
>>> 
>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>> 
>>> The above equation should work fine at least until 35 seconds before
>>> y2038 deadline.
>> 
>> Since it's all unsigned arithmetic, it should be fine until 2106.
>> However, we should get rid of get_seconds() long before then
>> and use ktime_get_real_seconds() instead, as most other users
>> of get_seconds() are (more) broken.
> 
> Dtime on disk representation again breaks this for certain values in
> 2038 even though everything is unsigned.
> 
> I was just saying that whatever we do here depends on how dtime on
> disk is interpreted.
> 
> Agree that ktime_get_real_seconds() should be used here. But, the way
> we handle new values would rely on this new interpretation of dtime.
> Also, using time64_t variables on stack only matters after this. Once
> the types are corrected, maybe the comparison expression need not
> change at all (after new dtime interpretation is in place).

There will not be a new dtime format on disk, but since the calculation
here only depends on relative times (within a few minutes), then it would
be fine to use only 32-bit timestamps, and truncate off the high bits
from get_seconds()/ktime_get_real_seconds().

Cheers, Andreas


* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-18 16:09                 ` Andreas Dilger
@ 2017-08-22 15:18                   ` Arnd Bergmann
  2017-08-22 16:20                     ` Andreas Dilger
  0 siblings, 1 reply; 14+ messages in thread
From: Arnd Bergmann @ 2017-08-22 15:18 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Deepa Dinamani, Theodore Ts'o, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

On Fri, Aug 18, 2017 at 6:09 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>
>>>> So moving to the case of a 32 bit machine:
>>>>
>>>> get_seconds() can return values until year 2106. And, recentcy at max
>>>> can only be 35. Analyzing the current line:
>>>>
>>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>>
>>>> The above equation should work fine at least until 35 seconds before
>>>> y2038 deadline.
>>>
>>> Since it's all unsigned arithmetic, it should be fine until 2106.
>>> However, we should get rid of get_seconds() long before then
>>> and use ktime_get_real_seconds() instead, as most other users
>>> of get_seconds() are (more) broken.
>>
>> Dtime on disk representation again breaks this for certain values in
>> 2038 even though everything is unsigned.
>>
>> I was just saying that whatever we do here depends on how dtime on
>> disk is interpreted.
>>
>> Agree that ktime_get_real_seconds() should be used here. But, the way
>> we handle new values would rely on this new interpretation of dtime.
>> Also, using time64_t variables on stack only matters after this. Once
>> the types are corrected, maybe the comparison expression need not
>> change at all (after new dtime interpretation is in place).
>
> There will not be a new dtime format on disk, but since the calculation
> here only depends on relative times (within a few minutes), then it would
> be fine to use only 32-bit timestamps, and truncate off the high bits
> from get_seconds()/ktime_get_real_seconds().

Agreed.

Are you planning to apply your fix for it then? I think your first
suggestion is all we need, aside from the three minor comments
I had.

       Arnd

* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-22 15:18                   ` Arnd Bergmann
@ 2017-08-22 16:20                     ` Andreas Dilger
  2017-08-22 19:35                       ` Arnd Bergmann
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2017-08-22 16:20 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Deepa Dinamani, Theodore Ts'o, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

On Aug 22, 2017, at 9:18 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> 
> On Fri, Aug 18, 2017 at 6:09 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>> 
>>>>> So moving to the case of a 32 bit machine:
>>>>> 
>>>>> get_seconds() can return values until year 2106. And, recentcy at max
>>>>> can only be 35. Analyzing the current line:
>>>>> 
>>>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>>> 
>>>>> The above equation should work fine at least until 35 seconds before
>>>>> y2038 deadline.
>>>> 
>>>> Since it's all unsigned arithmetic, it should be fine until 2106.
>>>> However, we should get rid of get_seconds() long before then
>>>> and use ktime_get_real_seconds() instead, as most other users
>>>> of get_seconds() are (more) broken.
>>> 
>>> Dtime on disk representation again breaks this for certain values in
>>> 2038 even though everything is unsigned.
>>> 
>>> I was just saying that whatever we do here depends on how dtime on
>>> disk is interpreted.
>>> 
>>> Agree that ktime_get_real_seconds() should be used here. But, the way
>>> we handle new values would rely on this new interpretation of dtime.
>>> Also, using time64_t variables on stack only matters after this. Once
>>> the types are corrected, maybe the comparison expression need not
>>> change at all (after new dtime interpretation is in place).
>> 
>> There will not be a new dtime format on disk, but since the calculation
>> here only depends on relative times (within a few minutes), then it would
>> be fine to use only 32-bit timestamps, and truncate off the high bits
>> from get_seconds()/ktime_get_real_seconds().
> 
> Agreed.
> 
> Are you planning to apply your fix for it then? I think your first
> suggestion is all we need, aside from the three minor comments
> I had.

Do you think it is worthwhile to introduce a "time_after32()" helper for this?
I suspect that this will also be useful for other parts of the kernel that
deal with relative 32-bit timestamps.


Cheers, Andreas

* Re: Y2038 bug in ext4 recently_deleted() function
  2017-08-22 16:20                     ` Andreas Dilger
@ 2017-08-22 19:35                       ` Arnd Bergmann
  0 siblings, 0 replies; 14+ messages in thread
From: Arnd Bergmann @ 2017-08-22 19:35 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Deepa Dinamani, Theodore Ts'o, Wang Shilong, Wang Shilong,
	linux-ext4@vger.kernel.org, Shuichi Ihara, Li Xi, Jan Kara

On Tue, Aug 22, 2017 at 6:20 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Aug 22, 2017, at 9:18 AM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Fri, Aug 18, 2017 at 6:09 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>>>
>>>>>> So moving to the case of a 32 bit machine:
>>>>>>
>>>>>> get_seconds() can return values until year 2106. And, recentcy at max
>>>>>> can only be 35. Analyzing the current line:
>>>>>>
>>>>>> if (dtime && (dtime < now) && (now < dtime + recentcy))
>>>>>>
>>>>>> The above equation should work fine at least until 35 seconds before
>>>>>> y2038 deadline.
>>>>>
>>>>> Since it's all unsigned arithmetic, it should be fine until 2106.
>>>>> However, we should get rid of get_seconds() long before then
>>>>> and use ktime_get_real_seconds() instead, as most other users
>>>>> of get_seconds() are (more) broken.
>>>>
>>>> Dtime on disk representation again breaks this for certain values in
>>>> 2038 even though everything is unsigned.
>>>>
>>>> I was just saying that whatever we do here depends on how dtime on
>>>> disk is interpreted.
>>>>
>>>> Agree that ktime_get_real_seconds() should be used here. But, the way
>>>> we handle new values would rely on this new interpretation of dtime.
>>>> Also, using time64_t variables on stack only matters after this. Once
>>>> the types are corrected, maybe the comparison expression need not
>>>> change at all (after new dtime interpretation is in place).
>>>
>>> There will not be a new dtime format on disk, but since the calculation
>>> here only depends on relative times (within a few minutes), then it would
>>> be fine to use only 32-bit timestamps, and truncate off the high bits
>>> from get_seconds()/ktime_get_real_seconds().
>>
>> Agreed.
>>
>> Are you planning to apply your fix for it then? I think your first
>> suggestion is all we need, aside from the three minor comments
>> I had.
>
> Do you think it is worthwhile to introduce a "time_after32()" helper for this?
> I suspect that this will also be useful for other parts of the kernel that
> deal with relative 32-bit timestamps.

I can't think of any other one at the moment. The RTC code may need a
similar check somewhere but it's more likely that they want something
slightly different.

No objections to introducing a time_after32() from my side if only
for documentation purposes, but we probably won't use it elsewhere.

       Arnd
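
For reference, a sketch of what such a helper could look like, modeled on the jiffies time_after() idiom. This is a userspace illustration with stdint types, not a quote of any kernel header; the inline-function form and the companion time_before32() are assumptions made for the example. The comparison is only meaningful when the two timestamps are less than 2^31 seconds (about 68 years) apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only: returns true if 32-bit timestamp a is later than b,
 * tolerating wraparound of the 32-bit counter.
 *
 * The cast of the wrapped difference to int32_t relies on the usual
 * two's-complement conversion, which mainstream compilers provide.
 */
static inline bool time_after32(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(b - a) < 0;
}

/* Companion helper: true if a is earlier than b. */
static inline bool time_before32(uint32_t a, uint32_t b)
{
	return time_after32(b, a);
}
```

Under these assumptions, the check discussed in the thread could be written as time_before32(dtime, now) && time_before32(now, dtime + recentcy), with now truncated to 32 bits.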

end of thread, other threads:[~2017-08-22 19:35 UTC | newest]

Thread overview: 14+ messages
2017-08-08  5:05 [PATCH v4] ext4: reduce lock contention in __ext4_new_inode Wang Shilong
2017-08-16 16:42 ` Jan Kara
2017-08-17  6:23   ` Wang Shilong
2017-08-17  9:19     ` Jan Kara
2017-08-17  9:21       ` Jan Kara
2017-08-17 21:51         ` Y2038 bug in ext4 recently_deleted() function Andreas Dilger
2017-08-18  1:23           ` Deepa Dinamani
2017-08-18  9:31             ` Arnd Bergmann
2017-08-18 15:38               ` Deepa Dinamani
2017-08-18 16:09                 ` Andreas Dilger
2017-08-22 15:18                   ` Arnd Bergmann
2017-08-22 16:20                     ` Andreas Dilger
2017-08-22 19:35                       ` Arnd Bergmann
2017-08-18 13:41             ` Theodore Ts'o
