Linux RAID subsystem development

Linux RAID subsystem development
 help / color / mirror / Atom feed

* Re: [md PATCH] md: handle read-only member devices better.
From: Shaohua Li @ 2017-04-13  5:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Linux-RAID, Nanda Kishore Chinnaram
In-Reply-To: <87a87lutj7.fsf@notabene.neil.brown.name>

On Thu, Apr 13, 2017 at 08:53:48AM +1000, Neil Brown wrote:
> 
> 1/ If an array has any read-only devices when it is started,
>    the array itself must be read-only
> 2/ A read-only device cannot be added to an array after it is
>    started.
> 3/ Setting an array to read-write should not succeed
>    if any member devices are read-only

Didn't get these. We call md_import_device() first to open under layer disk. We
always use FMOD_READ|FMOD_WRITE to open the disk. So if the disk is ro,
md_import_device should fail, we don't add the disk to the array. Why would we
have such issues?

Thanks,
Shaohua
 
> Reported-and-Tested-by: Nanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  drivers/md/md.c | 41 ++++++++++++++++++++++++++---------------
>  1 file changed, 26 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 22894303d335..9fe930109012 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2093,6 +2093,10 @@ static int bind_rdev_to_array(struct md_rdev *rdev, struct mddev *mddev)
>  	if (find_rdev(mddev, rdev->bdev->bd_dev))
>  		return -EEXIST;
>  
> +	if ((bdev_read_only(rdev->bdev) || bdev_read_only(rdev->meta_bdev)) &&
> +	    mddev->pers)
> +		return -EROFS;
> +
>  	/* make sure rdev->sectors exceeds mddev->dev_sectors */
>  	if (!test_bit(Journal, &rdev->flags) &&
>  	    rdev->sectors &&
> @@ -5345,6 +5349,13 @@ int md_run(struct mddev *mddev)
>  			continue;
>  		sync_blockdev(rdev->bdev);
>  		invalidate_bdev(rdev->bdev);
> +		if (mddev->ro != 1 &&
> +		    (bdev_read_only(rdev->bdev) ||
> +		     bdev_read_only(rdev->meta_bdev))) {
> +			mddev->ro = 1;
> +			if (mddev->gendisk)
> +				set_disk_ro(mddev->gendisk, 1);
> +		}
>  
>  		/* perform some consistency tests on the device.
>  		 * We don't want the data to overlap the metadata,
> @@ -5569,6 +5580,9 @@ static int do_md_run(struct mddev *mddev)
>  static int restart_array(struct mddev *mddev)
>  {
>  	struct gendisk *disk = mddev->gendisk;
> +	struct md_rdev *rdev;
> +	bool has_journal = false;
> +	bool has_readonly = false;
>  
>  	/* Complain if it has no devices */
>  	if (list_empty(&mddev->disks))
> @@ -5577,24 +5591,21 @@ static int restart_array(struct mddev *mddev)
>  		return -EINVAL;
>  	if (!mddev->ro)
>  		return -EBUSY;
> -	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> -		struct md_rdev *rdev;
> -		bool has_journal = false;
> -
> -		rcu_read_lock();
> -		rdev_for_each_rcu(rdev, mddev) {
> -			if (test_bit(Journal, &rdev->flags) &&
> -			    !test_bit(Faulty, &rdev->flags)) {
> -				has_journal = true;
> -				break;
> -			}
> -		}
> -		rcu_read_unlock();
>  
> +	rcu_read_lock();
> +	rdev_for_each_rcu(rdev, mddev) {
> +		if (test_bit(Journal, &rdev->flags) &&
> +		    !test_bit(Faulty, &rdev->flags))
> +			has_journal = true;
> +		if (bdev_read_only(rdev->bdev))
> +			has_readonly = true;
> +	}
> +	rcu_read_unlock();
> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) && !has_journal)
>  		/* Don't restart rw with journal missing/faulty */
> -		if (!has_journal)
>  			return -EINVAL;
> -	}
> +	if (has_readonly)
> +		return -EROFS;
>  
>  	mddev->safemode = 0;
>  	mddev->ro = 0;
> -- 
> 2.12.2
> 



^ permalink raw reply

* disk initialize priority
From: d tbsky @ 2017-04-13  8:13 UTC (permalink / raw)
  To: linux-raid

Hi:
     I just tried to install rhel 7.3 on a 4 disk raid6. it's a
minimal install but took 3 hours to finish. I found the disk
initialize speed is around 150MB/sec. in past the disk initialize
speed will down to slow speed when there are other disk IO, but it
seems not the case in rhel 7.3.

    I have check the below parameter:
    cat /proc/sys/dev/raid/speed_limit_min  -> 1000
    cat /proc/sys/dev/raid/speed_limit_max -> 200000

    when the system is idle, the init speed is also about 150M/sec, so
the system is use all the power to init when there are other disk IO.

     is it normal? where should I check?

     thanks a lot for help!!


Regards,
tbskyd

^ permalink raw reply

* Re: [RFC PATCH v5] IV Generation algorithms for dm-crypt
From: Binoy Jayan @ 2017-04-13  8:46 UTC (permalink / raw)
  To: Milan Broz
  Cc: Oded, Ofir, Herbert Xu, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra, Gilad
In-Reply-To: <774a9713-05ef-0162-0203-461254a04f6e@gmail.com>

Hi Milan,

On 10 April 2017 at 19:30, Milan Broz <gmazyland@gmail.com> wrote:

Thank you for the reply.

> Well, it is good that there is no performance degradation but it
> would be nice to have some user of it that proves it is really
> working for your hw.

I have been able to get access to a hardware with IV generation support
a few days back. The hardware I was having before did not have IV
generation support. Will be able to come up with numbers after making
it work with the new one.

> FYI - with patch that increases dmcrypt sector size to 4k
> I can see improvement in speed usually in 5-15% with sync AES-NI
> (depends on access pattern), with dmcrypt mapped to memory
> it is even close to 20% speed up (but such a configuration is
> completely artificial).
>
> I wonder why increased dmcrypt sector size does not work for your hw,
> it should help as well (and can be combiuned later with this IV approach).
> (For native 4k drives this should be used in future anyway...)

I think it should work well too with backward incompatibility.

Thanks,
Binoy

^ permalink raw reply

* Re: disk initialize priority
From: Reindl Harald @ 2017-04-13  9:09 UTC (permalink / raw)
  To: d tbsky, linux-raid
In-Reply-To: <CAC6SzHJAM-AbCiK7rqhgk4CJcyZ4C-yW0kt_rr3hgk+F-WMCiQ@mail.gmail.com>



Am 13.04.2017 um 10:13 schrieb d tbsky:
> Hi:
>       I just tried to install rhel 7.3 on a 4 disk raid6. it's a
> minimal install but took 3 hours to finish. I found the disk
> initialize speed is around 150MB/sec. in past the disk initialize
> speed will down to slow speed when there are other disk IO, but it
> seems not the case in rhel 7.3.
> 
>      I have check the below parameter:
>      cat /proc/sys/dev/raid/speed_limit_min  -> 1000
>      cat /proc/sys/dev/raid/speed_limit_max -> 200000
> 
>      when the system is idle, the init speed is also about 150M/sec, so
> the system is use all the power to init when there are other disk IO.
> 
>       is it normal? where should I check?

sysctl.conf:
dev.raid.speed_limit_min = 10000
dev.raid.speed_limit_max = 50000

sysctl -p

and it will adjust itself within seconds - the same works if that values 
are too low and raid-check would take forever

^ permalink raw reply

* [PATCH] md/raid10: introduce handle_badblocks
From: Guoqing Jiang @ 2017-04-13  9:28 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, Guoqing Jiang

If the state of r10bio is either R10BIO_IsSync or
R10BIO_IsRecover, then handle_write_completed calls
the same logic to clear or set badblock for bio and
repl_bio, so refactor these codes to a new function
named handle_badblocks.

Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid10.c | 42 +++++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 4167091..a707772 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2620,6 +2620,20 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
 	raid10_read_request(mddev, r10_bio->master_bio, r10_bio);
 }
 
+static void handle_badblocks(struct r10bio *r10_bio, struct md_rdev *rdev,
+			     struct r10conf *conf, int slot, bool repl)
+{
+	struct bio *bio = repl ? r10_bio->devs[slot].repl_bio :
+				 r10_bio->devs[slot].bio;
+	sector_t addr = r10_bio->devs[slot].addr;
+
+	if (!bio->bi_error)
+		rdev_clear_badblocks(rdev, addr, r10_bio->sectors, 0);
+	else
+		if (!rdev_set_badblocks(rdev, addr, r10_bio->sectors, 0))
+			md_error(conf->mddev, rdev);
+}
+
 static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 {
 	/* Some sort of write request has finished and it
@@ -2638,34 +2652,12 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 			rdev = conf->mirrors[dev].rdev;
 			if (r10_bio->devs[m].bio == NULL)
 				continue;
-			if (!r10_bio->devs[m].bio->bi_error) {
-				rdev_clear_badblocks(
-					rdev,
-					r10_bio->devs[m].addr,
-					r10_bio->sectors, 0);
-			} else {
-				if (!rdev_set_badblocks(
-					    rdev,
-					    r10_bio->devs[m].addr,
-					    r10_bio->sectors, 0))
-					md_error(conf->mddev, rdev);
-			}
+			handle_badblocks(r10_bio, rdev, conf, m, false);
+
 			rdev = conf->mirrors[dev].replacement;
 			if (r10_bio->devs[m].repl_bio == NULL)
 				continue;
-
-			if (!r10_bio->devs[m].repl_bio->bi_error) {
-				rdev_clear_badblocks(
-					rdev,
-					r10_bio->devs[m].addr,
-					r10_bio->sectors, 0);
-			} else {
-				if (!rdev_set_badblocks(
-					    rdev,
-					    r10_bio->devs[m].addr,
-					    r10_bio->sectors, 0))
-					md_error(conf->mddev, rdev);
-			}
+			handle_badblocks(r10_bio, rdev, conf, m, true);
 		}
 		put_buf(r10_bio);
 	} else {
-- 
2.6.6


^ permalink raw reply related

* Re: disk initialize priority
From: Roman Mamedov @ 2017-04-13  9:37 UTC (permalink / raw)
  To: d tbsky; +Cc: linux-raid
In-Reply-To: <CAC6SzHJAM-AbCiK7rqhgk4CJcyZ4C-yW0kt_rr3hgk+F-WMCiQ@mail.gmail.com>

On Thu, 13 Apr 2017 16:13:41 +0800
d tbsky <tbskyd@gmail.com> wrote:

> in past the disk initialize
> speed will down to slow speed when there are other disk IO, but it
> seems not the case in rhel 7.3.

It was changed some time ago with:

  "md: allow resync to go faster when there is competing IO"
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ac8fa4196d205ac8fff3f8932bddbad4f16e4110

and a problem like you describe has been reported:

  https://www.spinics.net/lists/raid/msg51363.html

Some patches have been discussed, but apparently there is no fix for it up to
this day, as the current code in question looks the same as in the above patch.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/md/md.c#n8138

-- 
With respect,
Roman

^ permalink raw reply

* Re: disk initialize priority
From: d tbsky @ 2017-04-13  9:44 UTC (permalink / raw)
  To: Reindl Harald; +Cc: linux-raid
In-Reply-To: <43e44c4b-37d6-9734-b323-5b432ef27d52@thelounge.net>

2017-04-13 17:09 GMT+08:00 Reindl Harald <h.reindl@thelounge.net>:
>
>
> Am 13.04.2017 um 10:13 schrieb d tbsky:
>>
>> Hi:
>>       I just tried to install rhel 7.3 on a 4 disk raid6. it's a
>> minimal install but took 3 hours to finish. I found the disk
>> initialize speed is around 150MB/sec. in past the disk initialize
>> speed will down to slow speed when there are other disk IO, but it
>> seems not the case in rhel 7.3.
>>
>>      I have check the below parameter:
>>      cat /proc/sys/dev/raid/speed_limit_min  -> 1000
>>      cat /proc/sys/dev/raid/speed_limit_max -> 200000
>>
>>      when the system is idle, the init speed is also about 150M/sec, so
>> the system is use all the power to init when there are other disk IO.
>>
>>       is it normal? where should I check?
>
>
> sysctl.conf:
> dev.raid.speed_limit_min = 10000
> dev.raid.speed_limit_max = 50000
>
> sysctl -p
>
> and it will adjust itself within seconds - the same works if that values are
> too low and raid-check would take forever

hi:
    my problem is the speed is too high, not too low...

Regards,
tbskyd

^ permalink raw reply

* Re: disk initialize priority
From: Reindl Harald @ 2017-04-13  9:47 UTC (permalink / raw)
  To: d tbsky; +Cc: linux-raid
In-Reply-To: <CAC6SzHKWeLMSn1Z3W0KyBaUw+5MLmnMxLbh130gxM071t8VeFw@mail.gmail.com>



Am 13.04.2017 um 11:44 schrieb d tbsky:
> 2017-04-13 17:09 GMT+08:00 Reindl Harald <h.reindl@thelounge.net>:
>>
>>
>> Am 13.04.2017 um 10:13 schrieb d tbsky:
>>>
>>> Hi:
>>>        I just tried to install rhel 7.3 on a 4 disk raid6. it's a
>>> minimal install but took 3 hours to finish. I found the disk
>>> initialize speed is around 150MB/sec. in past the disk initialize
>>> speed will down to slow speed when there are other disk IO, but it
>>> seems not the case in rhel 7.3.
>>>
>>>       I have check the below parameter:
>>>       cat /proc/sys/dev/raid/speed_limit_min  -> 1000
>>>       cat /proc/sys/dev/raid/speed_limit_max -> 200000
>>>
>>>       when the system is idle, the init speed is also about 150M/sec, so
>>> the system is use all the power to init when there are other disk IO.
>>>
>>>        is it normal? where should I check?
>>
>>
>> sysctl.conf:
>> dev.raid.speed_limit_min = 10000
>> dev.raid.speed_limit_max = 50000
>>
>> sysctl -p
>>
>> and it will adjust itself within seconds - the same works if that values are
>> too low and raid-check would take forever
> 
> hi: my problem is the speed is too high, not too low...

and?

you said it is at 200000
i showed you how to reduce it to 50000 or whatever you want

^ permalink raw reply

* Re: disk initialize priority
From: d tbsky @ 2017-04-13  9:56 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid
In-Reply-To: <20170413143736.5ac148d0@natsu>

2017-04-13 17:37 GMT+08:00 Roman Mamedov <rm@romanrm.net>:
> On Thu, 13 Apr 2017 16:13:41 +0800
> d tbsky <tbskyd@gmail.com> wrote:
>
>> in past the disk initialize
>> speed will down to slow speed when there are other disk IO, but it
>> seems not the case in rhel 7.3.
>
> It was changed some time ago with:
>
>   "md: allow resync to go faster when there is competing IO"
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ac8fa4196d205ac8fff3f8932bddbad4f16e4110
>
> and a problem like you describe has been reported:
>
>   https://www.spinics.net/lists/raid/msg51363.html
>
> Some patches have been discussed, but apparently there is no fix for it up to
> this day, as the current code in question looks the same as in the above patch.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/md/md.c#n8138
>
> --
> With respect,
> Roman

hi:
    thanks for the hint. I will try to report RHEL to see what
happened. I didn't notice these in previous versions.

Regards,
tbskyd

^ permalink raw reply

* Re: disk initialize priority
From: d tbsky @ 2017-04-13  9:58 UTC (permalink / raw)
  To: Reindl Harald; +Cc: linux-raid
In-Reply-To: <fc31c172-f988-32a4-c9b2-93bc7f3bfc90@thelounge.net>

2017-04-13 17:47 GMT+08:00 Reindl Harald <h.reindl@thelounge.net>:
>
>
> Am 13.04.2017 um 11:44 schrieb d tbsky:
>>
>> 2017-04-13 17:09 GMT+08:00 Reindl Harald <h.reindl@thelounge.net>:
>>>
>>>
>>>
>>> Am 13.04.2017 um 10:13 schrieb d tbsky:
>>>>
>>>>
>>>> Hi:
>>>>        I just tried to install rhel 7.3 on a 4 disk raid6. it's a
>>>> minimal install but took 3 hours to finish. I found the disk
>>>> initialize speed is around 150MB/sec. in past the disk initialize
>>>> speed will down to slow speed when there are other disk IO, but it
>>>> seems not the case in rhel 7.3.
>>>>
>>>>       I have check the below parameter:
>>>>       cat /proc/sys/dev/raid/speed_limit_min  -> 1000
>>>>       cat /proc/sys/dev/raid/speed_limit_max -> 200000
>>>>
>>>>       when the system is idle, the init speed is also about 150M/sec, so
>>>> the system is use all the power to init when there are other disk IO.
>>>>
>>>>        is it normal? where should I check?
>>>
>>>
>>>
>>> sysctl.conf:
>>> dev.raid.speed_limit_min = 10000
>>> dev.raid.speed_limit_max = 50000
>>>
>>> sysctl -p
>>>
>>> and it will adjust itself within seconds - the same works if that values
>>> are
>>> too low and raid-check would take forever
>>
>>
>> hi: my problem is the speed is too high, not too low...
>
>
> and?
>
> you said it is at 200000
> i showed you how to reduce it to 50000 or whatever you want

hi:
   sorry I didn't understand you correctly. thanks for the hint. I
will try to resolve the root cause of the problem.

Regards,
tbskyd

^ permalink raw reply

* Re: disk initialize priority
From: Reindl Harald @ 2017-04-13 10:05 UTC (permalink / raw)
  To: d tbsky; +Cc: linux-raid
In-Reply-To: <CAC6SzHKhn5j6f8DOcebFT2H5zRgnBbpJsnsncQocWRVS7DLWXg@mail.gmail.com>



Am 13.04.2017 um 11:58 schrieb d tbsky:
> 2017-04-13 17:47 GMT+08:00 Reindl Harald <h.reindl@thelounge.net>:
>>
>>
>> Am 13.04.2017 um 11:44 schrieb d tbsky:
>>>
>>> 2017-04-13 17:09 GMT+08:00 Reindl Harald <h.reindl@thelounge.net>:
>>>>
>>>>
>>>>
>>>> Am 13.04.2017 um 10:13 schrieb d tbsky:
>>>>>
>>>>>
>>>>> Hi:
>>>>>         I just tried to install rhel 7.3 on a 4 disk raid6. it's a
>>>>> minimal install but took 3 hours to finish. I found the disk
>>>>> initialize speed is around 150MB/sec. in past the disk initialize
>>>>> speed will down to slow speed when there are other disk IO, but it
>>>>> seems not the case in rhel 7.3.
>>>>>
>>>>>        I have check the below parameter:
>>>>>        cat /proc/sys/dev/raid/speed_limit_min  -> 1000
>>>>>        cat /proc/sys/dev/raid/speed_limit_max -> 200000
>>>>>
>>>>>        when the system is idle, the init speed is also about 150M/sec, so
>>>>> the system is use all the power to init when there are other disk IO.
>>>>>
>>>>>         is it normal? where should I check?
>>>>
>>>>
>>>>
>>>> sysctl.conf:
>>>> dev.raid.speed_limit_min = 10000
>>>> dev.raid.speed_limit_max = 50000
>>>>
>>>> sysctl -p
>>>>
>>>> and it will adjust itself within seconds - the same works if that values
>>>> are
>>>> too low and raid-check would take forever
>>>
>>>
>>> hi: my problem is the speed is too high, not too low...
>>
>> and?
>>
>> you said it is at 200000
>> i showed you how to reduce it to 50000 or whatever you want
> 
> hi:
> sorry I didn't understand you correctly. thanks for the hint. I
> will try to resolve the root cause of the problem

the root cause is a default which is too large for your disks wqhile on 
the other hand if someone has a RAID10 with SSD disks it's even too low 
and you should just configure that for your needs and workload

it's not only about how fast are your disks, it's also about your 
workload, there are times where i set it to 500000 to get the weekly 
raid-scrub as fast as possible when i don't heavily use the machine and 
on other days i slow it down dramatically because i need performance and 
don't care when the background task is finished

it's also a difference between RAID1 or RAID10 when concurrency reads 
can be served by the other disk and the impact is not so big than on a RAID1

^ permalink raw reply

* Re: disk initialize priority
From: Roman Mamedov @ 2017-04-13 10:05 UTC (permalink / raw)
  To: Reindl Harald; +Cc: d tbsky, linux-raid
In-Reply-To: <fc31c172-f988-32a4-c9b2-93bc7f3bfc90@thelounge.net>

On Thu, 13 Apr 2017 11:47:29 +0200
Reindl Harald <h.reindl@thelounge.net> wrote:

> and?
> 
> you said it is at 200000
> i showed you how to reduce it to 50000 or whatever you want

The point is that this should automatically adjust between 1000 and 200000 (on
default settings) according to other IO in the system. But since April 2015
this automatic tuning got broken and has worsened significantly for some
scenarios. The solution should be not that "well you're SOL and now always
have to hand-tune it", but to keep reporting your experience and try out the
proposed fixes so the prior automated adjustment is reimplemented, or baring
that, to revert the behavior change on your systems (already tried that just
now on my kernel build, it still reverts cleanly).

-- 
With respect,
Roman

^ permalink raw reply

* Re: disk initialize priority
From: d tbsky @ 2017-04-13 10:25 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Reindl Harald, linux-raid
In-Reply-To: <20170413150514.6b92d81f@natsu>

2017-04-13 18:05 GMT+08:00 Roman Mamedov <rm@romanrm.net>:
> On Thu, 13 Apr 2017 11:47:29 +0200
> Reindl Harald <h.reindl@thelounge.net> wrote:
>
>> and?
>>
>> you said it is at 200000
>> i showed you how to reduce it to 50000 or whatever you want
>
> The point is that this should automatically adjust between 1000 and 200000 (on
> default settings) according to other IO in the system. But since April 2015
> this automatic tuning got broken and has worsened significantly for some
> scenarios. The solution should be not that "well you're SOL and now always
> have to hand-tune it", but to keep reporting your experience and try out the
> proposed fixes so the prior automated adjustment is reimplemented, or baring
> that, to revert the behavior change on your systems (already tried that just
> now on my kernel build, it still reverts cleanly).
>
> --
> With respect,
> Roman

hi:
     RHEL 7 use kernel 3.10 so it should not be affected. at least
RHEL 7.2 I didn't notice this kind of problem. maybe RHEL 7.3 pull
some feature/patch from upstream kernel. I will try to report to
redhat bugzilla to see if they can fix these.

     thanks again for all the hints and help!!

Regards,
tbskyd

^ permalink raw reply

* Re: disk initialize priority
From: Roman Mamedov @ 2017-04-13 10:34 UTC (permalink / raw)
  To: d tbsky; +Cc: linux-raid
In-Reply-To: <CAC6SzHLj4FL_=7FwSwvaziYAOs95hwhRgoG6qaZPvgMBsXJBsg@mail.gmail.com>

On Thu, 13 Apr 2017 18:25:44 +0800
d tbsky <tbskyd@gmail.com> wrote:

>      RHEL 7 use kernel 3.10 so it should not be affected. at least
> RHEL 7.2 I didn't notice this kind of problem. maybe RHEL 7.3 pull
> some feature/patch from upstream kernel. I will try to report to
> redhat bugzilla to see if they can fix these.

Out of curiousity I checked the RHEL 7.3 kernel 3.10 source as available at:
http://vault.centos.org/7.3.1611/updates/Source/SPackages/kernel-3.10.0-514.10.2.el7.src.rpm
and yes indeed they have included that patch from the upstream 4.x kernel.

-- 
With respect,
Roman

^ permalink raw reply

* Re: disk initialize priority
From: d tbsky @ 2017-04-13 10:38 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid
In-Reply-To: <20170413153419.73377864@natsu>

2017-04-13 18:34 GMT+08:00 Roman Mamedov <rm@romanrm.net>:
> On Thu, 13 Apr 2017 18:25:44 +0800
> d tbsky <tbskyd@gmail.com> wrote:
>
>>      RHEL 7 use kernel 3.10 so it should not be affected. at least
>> RHEL 7.2 I didn't notice this kind of problem. maybe RHEL 7.3 pull
>> some feature/patch from upstream kernel. I will try to report to
>> redhat bugzilla to see if they can fix these.
>
> Out of curiousity I checked the RHEL 7.3 kernel 3.10 source as available at:
> http://vault.centos.org/7.3.1611/updates/Source/SPackages/kernel-3.10.0-514.10.2.el7.src.rpm
> and yes indeed they have included that patch from the upstream 4.x kernel.
>
> --
> With respect,
> Roman

hi:
     you are really life saver. now I can easily report the bugs to redhat.
     thanks again for your patient and help!!

Regards,
tbskyd

^ permalink raw reply

* (unknown), 
From: Scott Ellentuch @ 2017-04-13 15:58 UTC (permalink / raw)
  To: linux-raid

for disk in a b c d g h i j k l m n
do

  disklist="${disklist} /dev/sd${disk}1"

done

mdadm --create --verbose /dev/md2 --level=5 --raid=devices=12  ${disklist}

But its telling me :

mdadm: invalid number of raid devices: devices=12


I can't find any definition of a limit anywhere.

Thank you, Tuc

^ permalink raw reply

* Re:
From: Scott Ellentuch @ 2017-04-13 16:38 UTC (permalink / raw)
  To: Mark Knecht; +Cc: Linux-RAID
In-Reply-To: <CAK2H+efb3iKA5P3yd7uRqJomci6ENvrB1JRBBmtQEpEvyPMe7w@mail.gmail.com>

DOH! Stared at it for a while... Thanks.

Tuc

On Thu, Apr 13, 2017 at 12:22 PM, Mark Knecht <markknecht@gmail.com> wrote:
>
>
> On Thu, Apr 13, 2017 at 8:58 AM, Scott Ellentuch <tuctboh@gmail.com> wrote:
>>
>> for disk in a b c d g h i j k l m n
>> do
>>
>>   disklist="${disklist} /dev/sd${disk}1"
>>
>> done
>>
>> mdadm --create --verbose /dev/md2 --level=5 --raid=devices=12  ${disklist}
>>
>> But its telling me :
>>
>> mdadm: invalid number of raid devices: devices=12
>>
>>
>> I can't find any definition of a limit anywhere.
>>
>> Thank you, Tuc
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> Try
>
> --raid-devices=12
>
> not
>
> --raid=devices=12

^ permalink raw reply

* Re: disk initialize priority
From: Stephane Thiell @ 2017-04-13 17:44 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: d tbsky, linux-raid@vger.kernel.org
In-Reply-To: <20170413153419.73377864@natsu>

> On Apr 13, 2017, at 3:34 AM, Roman Mamedov <rm@romanrm.net> wrote:
> 
> On Thu, 13 Apr 2017 18:25:44 +0800
> d tbsky <tbskyd@gmail.com> wrote:
> 
>>     RHEL 7 use kernel 3.10 so it should not be affected. at least
>> RHEL 7.2 I didn't notice this kind of problem. maybe RHEL 7.3 pull
>> some feature/patch from upstream kernel. I will try to report to
>> redhat bugzilla to see if they can fix these.
> 
> Out of curiousity I checked the RHEL 7.3 kernel 3.10 source as available at:
> http://vault.centos.org/7.3.1611/updates/Source/SPackages/kernel-3.10.0-514.10.2.el7.src.rpm
> and yes indeed they have included that patch from the upstream 4.x kernel.
> 

Interesting.

Just to say that I have noticed a similar issue on CentOS 7.3 (using 3.10.0-514.10.2.el7_lustre.x86_64). I am using multiple raid6 md (8+2) and I had to deploy a cron script that checks if each raid is degraded or not, and set speed_limit_max accordingly (eg. higher “priority” for rebuild if array degraded) but always below the max throughput… or other user I/Os become too slow. Not ideal but works fine this way.

Best regards,

Stéphane

^ permalink raw reply

* GET_ARRAY_INFO assumptions?
From: Jes Sorensen @ 2017-04-13 17:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Hi Neil,

Looking at trying to phase out the ioctl usage, I am trying to introduce 
a helper for the 'is the array valid' situation.

Now looking at places like Incremental.c (around like 557 in my current 
tree):
	/* 7b/ if yes, */
	/* - if number of OK devices match expected, or -R and there */
	/*             are enough, */
	/*   + add any bitmap file  */
	/*   + start the array (auto-readonly). */

	if (md_get_array_info(mdfd, &ainf) == 0) {
		if (c->export) {
			printf("MD_STARTED=already\n");
		} else if (c->verbose >= 0)
			pr_err("%s attached to %s which is already active.\n",
			       devname, chosen_name);
		rv = 0;
		goto out_unlock;
	}

I am wondering if there are any side effects/assumptions about 
GET_ARRAY_INFO that I am not considering? Basically I am making the 
assumption that if /sys/block/md<X>/md exists, the array is valid.

The code in Incremental.c already deals with sysfs higher up in the 
code, so I guess the question is if the above test is even relevant anymore?

Alternative, do we need export a new state in sysfs 'running'?

Thoughts?

Jes

diff --git a/util.c b/util.c
index a695c45..99ed015 100644
--- a/util.c
+++ b/util.c
@@ -200,6 +200,22 @@ out:
         return ret;
  }

+int md_valid_array(int fd)
+{
+       struct mdinfo *sra;
+       struct mdu_array_info_s array;
+       int ret;
+
+       sra = xcalloc(1, sizeof(*sra));
+       ret = sysfs_init(sra, fd, NULL);
+       free(sra);
+
+       if (ret)
+               ret = ioctl(fd, GET_ARRAY_INFO, &array);
+
+       return !ret;
+}
+
  /*
   * Get array info from the kernel. Longer term we want to deprecate the
   * ioctl and get it from sysfs.

^ permalink raw reply related

* Re: GET_ARRAY_INFO assumptions?
From: Shaohua Li @ 2017-04-13 20:37 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: NeilBrown, linux-raid
In-Reply-To: <ab07d303-85fb-e912-e4fa-9ee66a1277cc@gmail.com>

On Thu, Apr 13, 2017 at 01:50:06PM -0400, Jes Sorensen wrote:
> Hi Neil,
> 
> Looking at trying to phase out the ioctl usage, I am trying to introduce a
> helper for the 'is the array valid' situation.
> 
> Now looking at places like Incremental.c (around like 557 in my current
> tree):
> 	/* 7b/ if yes, */
> 	/* - if number of OK devices match expected, or -R and there */
> 	/*             are enough, */
> 	/*   + add any bitmap file  */
> 	/*   + start the array (auto-readonly). */
> 
> 	if (md_get_array_info(mdfd, &ainf) == 0) {
> 		if (c->export) {
> 			printf("MD_STARTED=already\n");
> 		} else if (c->verbose >= 0)
> 			pr_err("%s attached to %s which is already active.\n",
> 			       devname, chosen_name);
> 		rv = 0;
> 		goto out_unlock;
> 	}
> 
> I am wondering if there are any side effects/assumptions about
> GET_ARRAY_INFO that I am not considering? Basically I am making the
> assumption that if /sys/block/md<X>/md exists, the array is valid.

what does 'valid' really mean? md<x>/md exists after a md device is allocated,
the md device might not have any under layer disks bound yet.

> The code in Incremental.c already deals with sysfs higher up in the code, so
> I guess the question is if the above test is even relevant anymore?
> 
> Alternative, do we need export a new state in sysfs 'running'?

I'd assume 'running' means the md device has a personality attached. See
array_state_show(), !running == 'clear' or 'inactive'.


Thanks,
Shaohua
> Thoughts?
> 
> Jes
> 
> diff --git a/util.c b/util.c
> index a695c45..99ed015 100644
> --- a/util.c
> +++ b/util.c
> @@ -200,6 +200,22 @@ out:
>         return ret;
>  }
> 
> +int md_valid_array(int fd)
> +{
> +       struct mdinfo *sra;
> +       struct mdu_array_info_s array;
> +       int ret;
> +
> +       sra = xcalloc(1, sizeof(*sra));
> +       ret = sysfs_init(sra, fd, NULL);
> +       free(sra);
> +
> +       if (ret)
> +               ret = ioctl(fd, GET_ARRAY_INFO, &array);
> +
> +       return !ret;
> +}
> +
>  /*
>   * Get array info from the kernel. Longer term we want to deprecate the
>   * ioctl and get it from sysfs.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] md/raid10: introduce handle_badblocks
From: Shaohua Li @ 2017-04-13 20:50 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: linux-raid, neilb
In-Reply-To: <1492075708-4412-1-git-send-email-gqjiang@suse.com>

On Thu, Apr 13, 2017 at 05:28:28PM +0800, Guoqing Jiang wrote:
> If the state of r10bio is either R10BIO_IsSync or
> R10BIO_IsRecover, then handle_write_completed calls
> the same logic to clear or set badblock for bio and
> repl_bio, so refactor these codes to a new function
> named handle_badblocks.
> 
> Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid10.c | 42 +++++++++++++++++-------------------------
>  1 file changed, 17 insertions(+), 25 deletions(-)

I'm reluctant to accept this one. Don't get me wrong, this does reduce the code
lines a little bit, but not much. The problem, I think, is the name.
handle_badblocks describes a big scope, it actually makes coding reading
harder. So if you can find a better name, I'll merge this.

Thanks,
Shaohua

> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 4167091..a707772 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2620,6 +2620,20 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
>  	raid10_read_request(mddev, r10_bio->master_bio, r10_bio);
>  }
>  
> +static void handle_badblocks(struct r10bio *r10_bio, struct md_rdev *rdev,
> +			     struct r10conf *conf, int slot, bool repl)
> +{
> +	struct bio *bio = repl ? r10_bio->devs[slot].repl_bio :
> +				 r10_bio->devs[slot].bio;
> +	sector_t addr = r10_bio->devs[slot].addr;
> +
> +	if (!bio->bi_error)
> +		rdev_clear_badblocks(rdev, addr, r10_bio->sectors, 0);
> +	else
> +		if (!rdev_set_badblocks(rdev, addr, r10_bio->sectors, 0))
> +			md_error(conf->mddev, rdev);
> +}
> +
>  static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
>  {
>  	/* Some sort of write request has finished and it
> @@ -2638,34 +2652,12 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
>  			rdev = conf->mirrors[dev].rdev;
>  			if (r10_bio->devs[m].bio == NULL)
>  				continue;
> -			if (!r10_bio->devs[m].bio->bi_error) {
> -				rdev_clear_badblocks(
> -					rdev,
> -					r10_bio->devs[m].addr,
> -					r10_bio->sectors, 0);
> -			} else {
> -				if (!rdev_set_badblocks(
> -					    rdev,
> -					    r10_bio->devs[m].addr,
> -					    r10_bio->sectors, 0))
> -					md_error(conf->mddev, rdev);
> -			}
> +			handle_badblocks(r10_bio, rdev, conf, m, false);
> +
>  			rdev = conf->mirrors[dev].replacement;
>  			if (r10_bio->devs[m].repl_bio == NULL)
>  				continue;
> -
> -			if (!r10_bio->devs[m].repl_bio->bi_error) {
> -				rdev_clear_badblocks(
> -					rdev,
> -					r10_bio->devs[m].addr,
> -					r10_bio->sectors, 0);
> -			} else {
> -				if (!rdev_set_badblocks(
> -					    rdev,
> -					    r10_bio->devs[m].addr,
> -					    r10_bio->sectors, 0))
> -					md_error(conf->mddev, rdev);
> -			}
> +			handle_badblocks(r10_bio, rdev, conf, m, true);
>  		}
>  		put_buf(r10_bio);
>  	} else {
> -- 
> 2.6.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: GET_ARRAY_INFO assumptions?
From: Jes Sorensen @ 2017-04-13 21:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid
In-Reply-To: <20170413203742.wg6mrnzedw7ew5ky@kernel.org>

On 04/13/2017 04:37 PM, Shaohua Li wrote:
> On Thu, Apr 13, 2017 at 01:50:06PM -0400, Jes Sorensen wrote:
>> Hi Neil,
>>
>> Looking at trying to phase out the ioctl usage, I am trying to introduce a
>> helper for the 'is the array valid' situation.
>>
>> Now looking at places like Incremental.c (around like 557 in my current
>> tree):
>> 	/* 7b/ if yes, */
>> 	/* - if number of OK devices match expected, or -R and there */
>> 	/*             are enough, */
>> 	/*   + add any bitmap file  */
>> 	/*   + start the array (auto-readonly). */
>>
>> 	if (md_get_array_info(mdfd, &ainf) == 0) {
>> 		if (c->export) {
>> 			printf("MD_STARTED=already\n");
>> 		} else if (c->verbose >= 0)
>> 			pr_err("%s attached to %s which is already active.\n",
>> 			       devname, chosen_name);
>> 		rv = 0;
>> 		goto out_unlock;
>> 	}
>>
>> I am wondering if there are any side effects/assumptions about
>> GET_ARRAY_INFO that I am not considering? Basically I am making the
>> assumption that if /sys/block/md<X>/md exists, the array is valid.
>
> what does 'valid' really mean? md<x>/md exists after a md device is allocated,
> the md device might not have any under layer disks bound yet.
>
>> The code in Incremental.c already deals with sysfs higher up in the code, so
>> I guess the question is if the above test is even relevant anymore?
>>
>> Alternative, do we need export a new state in sysfs 'running'?
>
> I'd assume 'running' means the md device has a personality attached. See
> array_state_show(), !running == 'clear' or 'inactive'.

Good point, I guess what I am trying to figure out is what is assumed 
when ioctl(GET_ARRAY_INFO) returns 0 and how do we map it to sysfs?

Jes


^ permalink raw reply

* [PATCH 00/22] Introduce common scatterlist map function
From: Logan Gunthorpe @ 2017-04-13 22:05 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen, Sagi Grimberg, Jens Axboe,
	Tejun Heo, Greg Kroah-Hartman, Dan Williams, Ross Zwisler,
	Matthew Wilcox, Sumit Semwal, Ming Lin,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linaro-mm-sig-cunTk1MwBs8s++Sfvej+rw,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA, devel
  Cc: Steve Wise

Hi Everyone,

As part of my effort to enable P2P DMA transactions with PCI cards,
we've identified the need to be able to safely put IO memory into
scatterlists (and eventually other spots). This probably involves a
conversion from struct page to pfn_t but that migration is a ways off
and those decisions are yet to be made.

As an initial step in that direction, I've started cleaning up some of the
scatterlist code by trying to carve out a better defined layer between it
and it's users. The longer term goal would be to remove sg_page or replace
it with something that can potentially fail.

This patchset is the first step in that effort. I've introduced
a common function to map scatterlist memory and converted all the common
kmap(sg_page()) cases. This removes about 66 sg_page calls (of ~331).

Seeing this is a fairly large cleanup set that touches a wide swath of
the kernel I have limited the people I've sent this to. I'd suggest we look
toward merging the first patch and then I can send the individual subsystem
patches on to their respective maintainers and get them merged
independantly. (This is to avoid the conflicts I created with my last
cleanup set... Sorry) Though, I'm certainly open to other suggestions to get
it merged.

The patchset is based on v4.11-rc6 and can be found in the sg_map
branch from this git tree:

https://github.com/sbates130272/linux-p2pmem.git

Thanks,

Logan


Logan Gunthorpe (22):
  scatterlist: Introduce sg_map helper functions
  nvmet: Make use of the new sg_map helper function
  libiscsi: Make use of new the sg_map helper function
  target: Make use of the new sg_map function at 16 call sites
  drm/i915: Make use of the new sg_map helper function
  crypto: hifn_795x: Make use of the new sg_map helper function
  crypto: shash, caam: Make use of the new sg_map helper function
  crypto: chcr: Make use of the new sg_map helper function
  dm-crypt: Make use of the new sg_map helper in 4 call sites
  staging: unisys: visorbus: Make use of the new sg_map helper function
  RDS: Make use of the new sg_map helper function
  scsi: ipr, pmcraid, isci: Make use of the new sg_map helper in 4 call
    sites
  scsi: hisi_sas, mvsas, gdth: Make use of the new sg_map helper
    function
  scsi: arcmsr, ips, megaraid: Make use of the new sg_map helper
    function
  scsi: libfc, csiostor: Change to sg_copy_buffer in two drivers
  xen-blkfront: Make use of the new sg_map helper function
  mmc: sdhci: Make use of the new sg_map helper function
  mmc: spi: Make use of the new sg_map helper function
  mmc: tmio: Make use of the new sg_map helper function
  mmc: sdricoh_cs: Make use of the new sg_map helper function
  mmc: tifm_sd: Make use of the new sg_map helper function
  memstick: Make use of the new sg_map helper function

 crypto/shash.c                                  |   9 +-
 drivers/block/xen-blkfront.c                    |  33 +++++--
 drivers/crypto/caam/caamalg.c                   |   8 +-
 drivers/crypto/chelsio/chcr_algo.c              |  28 +++---
 drivers/crypto/hifn_795x.c                      |  32 ++++---
 drivers/dma-buf/dma-buf.c                       |   3 +
 drivers/gpu/drm/i915/i915_gem.c                 |  27 +++---
 drivers/md/dm-crypt.c                           |  38 +++++---
 drivers/memstick/host/jmb38x_ms.c               |  23 ++++-
 drivers/memstick/host/tifm_ms.c                 |  22 ++++-
 drivers/mmc/host/mmc_spi.c                      |  26 +++--
 drivers/mmc/host/sdhci.c                        |  35 ++++++-
 drivers/mmc/host/sdricoh_cs.c                   |  14 ++-
 drivers/mmc/host/tifm_sd.c                      |  88 +++++++++++++----
 drivers/mmc/host/tmio_mmc.h                     |  12 ++-
 drivers/mmc/host/tmio_mmc_dma.c                 |   5 +
 drivers/mmc/host/tmio_mmc_pio.c                 |  24 +++++
 drivers/nvme/target/fabrics-cmd.c               |  16 +++-
 drivers/scsi/arcmsr/arcmsr_hba.c                |  16 +++-
 drivers/scsi/csiostor/csio_scsi.c               |  54 +----------
 drivers/scsi/cxgbi/libcxgbi.c                   |   5 +
 drivers/scsi/gdth.c                             |   9 +-
 drivers/scsi/hisi_sas/hisi_sas_v1_hw.c          |  14 ++-
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c          |  13 ++-
 drivers/scsi/ipr.c                              |  27 +++---
 drivers/scsi/ips.c                              |   8 +-
 drivers/scsi/isci/request.c                     |  42 ++++----
 drivers/scsi/libfc/fc_libfc.c                   |  49 ++--------
 drivers/scsi/libiscsi_tcp.c                     |  32 ++++---
 drivers/scsi/megaraid.c                         |   9 +-
 drivers/scsi/mvsas/mv_sas.c                     |  10 +-
 drivers/scsi/pmcraid.c                          |  19 ++--
 drivers/staging/unisys/visorhba/visorhba_main.c |  12 ++-
 drivers/target/iscsi/iscsi_target.c             |  27 ++++--
 drivers/target/target_core_rd.c                 |   3 +-
 drivers/target/target_core_sbc.c                | 122 +++++++++++++++++-------
 drivers/target/target_core_transport.c          |  18 ++--
 drivers/target/target_core_user.c               |  43 ++++++---
 include/linux/scatterlist.h                     |  97 +++++++++++++++++++
 include/scsi/libiscsi_tcp.h                     |   3 +-
 include/target/target_core_backend.h            |   4 +-
 net/rds/ib_recv.c                               |  17 +++-
 42 files changed, 739 insertions(+), 357 deletions(-)

--
2.1.4

^ permalink raw reply

* [PATCH 01/22] scatterlist: Introduce sg_map helper functions
From: Logan Gunthorpe @ 2017-04-13 22:05 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen, Sagi Grimberg, Jens Axboe,
	Tejun Heo, Greg Kroah-Hartman, Dan Williams, Ross Zwisler,
	Matthew Wilcox, Sumit Semwal, Ming Lin,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linaro-mm-sig-cunTk1MwBs8s++Sfvej+rw,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA, devel
  Cc: Steve Wise
In-Reply-To: <1492121135-4437-1-git-send-email-logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

This patch introduces functions which kmap the pages inside an sgl. Two
variants are provided: one if an offset is required and one if the
offset is zero. These functions replace a common pattern of
kmap(sg_page(sg)) that is used in about 50 places within the kernel.

The motivation for this work is to eventually safely support sgls that
contain io memory. In order for that to work, any access to the contents
of an iomem SGL will need to be done with iomemcpy or hit some warning.
(The exact details of how this will work have yet to be worked out.)
Having all the kmaps in one place is just a first step in that
direction. Additionally, seeing this helps cut down the users of sg_page,
it should make any effort to go to struct-page-less DMAs a little
easier (should that idea ever swing back into favour again).

A flags option is added to select between a regular or atomic mapping so
these functions can replace kmap(sg_page or kmap_atomic(sg_page.
Future work may expand this to have flags for using page_address or
vmap. Much further in the future, there may be a flag to allocate memory
and copy the data from/to iomem.

We also add the semantic that sg_map can fail to create a mapping,
despite the fact that the current code this is replacing is assumed to
never fail and the current version of these functions cannot fail. This
is to support iomem which either have to fail to create the mapping or
allocate memory as a bounce buffer which itself can fail.

Also, in terms of cleanup, a few of the existing kmap(sg_page) users
play things a bit loose in terms of whether they apply sg->offset
so using these helper functions should help avoid such issues.

Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
---
 drivers/dma-buf/dma-buf.c   |  3 ++
 include/linux/scatterlist.h | 97 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0007b79..b95934b 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -37,6 +37,9 @@
 
 #include <uapi/linux/dma-buf.h>
 
+/* Prevent the highmem.h macro from aliasing ops->kunmap_atomic */
+#undef kunmap_atomic
+
 static inline int is_dma_buf_file(struct file *);
 
 struct dma_buf_list {
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index cb3c8fe..acd4d73 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <asm/io.h>
 
 struct scatterlist {
@@ -126,6 +127,102 @@ static inline struct page *sg_page(struct scatterlist *sg)
 	return (struct page *)((sg)->page_link & ~0x3);
 }
 
+#define SG_KMAP		(1 << 0)	/* create a mapping with kmap */
+#define SG_KMAP_ATOMIC	(1 << 1)	/* create a mapping with kmap_atomic */
+
+/**
+ * sg_map_offset - kmap a page inside an sgl
+ * @sg:		SG entry
+ * @offset:	Offset into entry
+ * @flags:	Flags for creating the mapping
+ *
+ * Description:
+ *   Use this function to map a page in the scatterlist at the specified
+ *   offset. sg->offset is already added for you. Note: the semantics of
+ *   this function are that it may fail. Thus, its output should be checked
+ *   with IS_ERR and PTR_ERR. Otherwise, a pointer to the specified offset
+ *   in the mapped page is returned.
+ *
+ *   Flags can be any of:
+ *	* SG_KMAP	 - Use kmap to create the mapping
+ *	* SG_KMAP_ATOMIC - Use kmap_atomic to map the page atommically.
+ *			   Thus, the rules of that function apply: the cpu
+ *			   may not sleep until it is unmaped.
+ *
+ *   Also, consider carefully whether this function is appropriate. It is
+ *   largely not recommended for new code and if the sgl came from another
+ *   subsystem and you don't know what kind of memory might be in the list
+ *   then you definitely should not call it. Non-mappable memory may be in
+ *   the sgl and thus this function may fail unexpectedly.
+ **/
+static inline void *sg_map_offset(struct scatterlist *sg, size_t offset,
+				   int flags)
+{
+	struct page *pg;
+	unsigned int pg_off;
+
+	offset += sg->offset;
+	pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+	pg_off = offset_in_page(offset);
+
+	if (flags & SG_KMAP_ATOMIC)
+		return kmap_atomic(pg) + pg_off;
+	else
+		return kmap(pg) + pg_off;
+}
+
+/**
+ * sg_unkmap_offset - unmap a page that was mapped with sg_map_offset
+ * @sg:		SG entry
+ * @addr:	address returned by sg_map_offset
+ * @offset:	Offset into entry (same as specified for sg_map_offset)
+ * @flags:	Flags, which are the same specified for sg_map_offset
+ *
+ * Description:
+ *   Unmap the page that was mapped with sg_map_offset
+ *
+ **/
+static inline void sg_unmap_offset(struct scatterlist *sg, void *addr,
+				    size_t offset, int flags)
+{
+	struct page *pg = nth_page(sg_page(sg), offset >> PAGE_SHIFT);
+	unsigned int pg_off = offset_in_page(offset);
+
+	if (flags & SG_KMAP_ATOMIC)
+		kunmap_atomic(addr - sg->offset - pg_off);
+	else
+		kunmap(pg);
+}
+
+/**
+ * sg_map - map the first page in the scatterlist entry
+ * @sg:		SG entry
+ * @flags:	Flags, see sg_map_offset for a description
+ *
+ * Description:
+ *   Same as sg_map_offset(sg, 0, flags);
+ *
+ **/
+static inline void *sg_map(struct scatterlist *sg, int flags)
+{
+	return sg_map_offset(sg, 0, flags);
+}
+
+/**
+ * sg_unmap - unmap a page mapped with sg_map
+ * @sg:		SG entry
+ * @addr:       address returned by sg_map
+ * @flags:	Flags, see sg_map_offset for a description
+ *
+ * Description:
+ *   Same as sg_map_offset(sg, 0, flags);
+ *
+ **/
+static inline void sg_unmap(struct scatterlist *sg, void *addr, int flags)
+{
+	sg_unmap_offset(sg, addr, 0, flags);
+}
+
 /**
  * sg_set_buf - Set sg entry to point at given data
  * @sg:		 SG entry
-- 
2.1.4

^ permalink raw reply related

* [PATCH 02/22] nvmet: Make use of the new sg_map helper function
From: Logan Gunthorpe @ 2017-04-13 22:05 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen, Sagi Grimberg, Jens Axboe,
	Tejun Heo, Greg Kroah-Hartman, Dan Williams, Ross Zwisler,
	Matthew Wilcox, Sumit Semwal, Ming Lin,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA,
	linux-media-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linaro-mm-sig-cunTk1MwBs8s++Sfvej+rw,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-raid-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	fcoe-devel-s9riP+hp16TNLxjTenLetw,
	open-iscsi-/JYPxA39Uh5TLH3MbocFFw,
	megaraidlinux.pdl-dY08KVG/lbpWk0Htik3J/w,
	sparmaintainer-GLv8BlqOqDDQT0dZR+AlfA, devel
  Cc: Steve Wise
In-Reply-To: <1492121135-4437-1-git-send-email-logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

This is a straight forward conversion in two places. Should kmap fail,
the code will return an INVALD_DATA error in the completion.

Signed-off-by: Logan Gunthorpe <logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>
---
 drivers/nvme/target/fabrics-cmd.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 8bd022af..f62a634 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -122,7 +122,11 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 status = 0;
 
-	d = kmap(sg_page(req->sg)) + req->sg->offset;
+	d = sg_map(req->sg, SG_KMAP);
+	if (IS_ERR(d)) {
+		status = NVME_SC_SGL_INVALID_DATA;
+		goto out;
+	}
 
 	/* zero out initial completion result, assign values as needed */
 	req->rsp->result.u32 = 0;
@@ -158,7 +162,7 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid);
 
 out:
-	kunmap(sg_page(req->sg));
+	sg_unmap(req->sg, d, SG_KMAP);
 	nvmet_req_complete(req, status);
 }
 
@@ -170,7 +174,11 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
 	u16 qid = le16_to_cpu(c->qid);
 	u16 status = 0;
 
-	d = kmap(sg_page(req->sg)) + req->sg->offset;
+	d = sg_map(req->sg, SG_KMAP);
+	if (IS_ERR(d)) {
+		status = NVME_SC_SGL_INVALID_DATA;
+		goto out;
+	}
 
 	/* zero out initial completion result, assign values as needed */
 	req->rsp->result.u32 = 0;
@@ -205,7 +213,7 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
 	pr_info("adding queue %d to ctrl %d.\n", qid, ctrl->cntlid);
 
 out:
-	kunmap(sg_page(req->sg));
+	sg_unmap(req->sg, d, SG_KMAP);
 	nvmet_req_complete(req, status);
 	return;
 
-- 
2.1.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox