Linux RAID subsystem development


* Re: [PATCH v1 08/54] block: comment on bio_alloc_pages()
From: Coly Li @ 2016-12-30 10:40 UTC
  To: Ming Lei, Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Jens Axboe, Kent Overstreet,
	Shaohua Li, Mike Christie, Guoqing Jiang, Hannes Reinecke,
	open list:BCACHE (BLOCK LAYER CACHE),
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT
In-Reply-To: <1482854250-13481-9-git-send-email-tom.leiming@gmail.com>

On 2016/12/27 11:55 PM, Ming Lei wrote:
> This patch adds comment on usage of bio_alloc_pages(),
> also comments on one special case of bch_data_verify().
> 
> Signed-off-by: Ming Lei <tom.leiming@gmail.com>
> ---
>  block/bio.c               | 4 +++-
>  drivers/md/bcache/debug.c | 6 ++++++
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 2b375020fc49..d4a1e0b63ea0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -961,7 +961,9 @@ EXPORT_SYMBOL(bio_advance);
>   * @bio: bio to allocate pages for
>   * @gfp_mask: flags for allocation
>   *
> - * Allocates pages up to @bio->bi_vcnt.
> + * Allocates pages up to @bio->bi_vcnt, and this function should only
> + * be called on a new initialized bio, which means all pages aren't added
> + * to the bio via bio_add_page() yet.
>   *
>   * Returns 0 on success, -ENOMEM on failure. On failure, any allocated pages are
>   * freed.
> diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
> index 06f55056aaae..48d03e8b3385 100644
> --- a/drivers/md/bcache/debug.c
> +++ b/drivers/md/bcache/debug.c
> @@ -110,6 +110,12 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
>  	struct bio_vec bv, cbv;
>  	struct bvec_iter iter, citer = { 0 };
>  
> +	/*
> +	 * Once multipage bvec is supported, the bio_clone()
> +	 * has to make sure page count in this bio can be held
> +	 * in the new cloned bio because each single page need
> +	 * to assign to each bvec of the new bio.
> +	 */
>  	check = bio_clone(bio, GFP_NOIO);
>  	if (!check)
>  		return;
> 
Acked-by: Coly Li <colyli@suse.de>

-- 
Coly Li


* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-12-30 10:27 UTC
  To: Binoy Jayan
  Cc: Milan Broz, Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <CAHv-k_9SynECq7qDbrW59=LsV_WNj+9Ffa=6tATyNKOt36he6Q@mail.gmail.com>

On Thu, Dec 29, 2016 at 02:53:25PM +0530, Binoy Jayan wrote:
>
> When we keep these in dm-crypt and if more than one key is used
> (it is actually more than one parts of the original key),
> there are more than one cipher instance created - one for each
> unique part of the key. Since the crypto requests are modelled
> to go through the template ciphers in the order:
> 
> "essiv -> cbc -> aes"
> 
> a particular cipher instance of the IV (essiv in this example) is
> responsible to encrypt an entire bigger block. If this bigger block
> is to be later split into 512 bytes blocks and then encrypted using
> the other cipher instance depending on the following formula:
> 
> key_index = sector & (key_count - 1)

This is just a matter of structuring the key for the IV generator.
The IV generator's key in this case should be a combination of the
key to the underlying CBC plus the set of all keys for the IV
generator itself.  It should then allocate the required number of
tfms as is currently done by crypt_alloc_tfms in dm-crypt.
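
The key-selection formula quoted above can be illustrated with a small sketch. This is not dm-crypt code; the function name `key_index` and the power-of-two check are assumptions added for the example. The mask only behaves like a modulo when `key_count` is a power of two.

```python
def key_index(sector: int, key_count: int) -> int:
    """Pick which key a 512-byte sector uses: sector & (key_count - 1)."""
    # The mask equals sector % key_count only when key_count is a power of two.
    if key_count <= 0 or key_count & (key_count - 1):
        raise ValueError("key_count must be a power of two")
    return sector & (key_count - 1)


if __name__ == "__main__":
    # Hypothetical setup with 4 keys: consecutive sectors cycle through them.
    print([key_index(s, 4) for s in range(8)])
```

With this mapping, an IV generator that owns all the keys could look up the right tfm per 512-byte sector instead of having the caller split the request.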

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Intel SSD or other brands
From: Doug Dumitru @ 2016-12-30  1:24 UTC
  To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <6199a648-eebe-687d-1a13-3f2dd954ae54@websitemanagers.com.au>

On Thu, Dec 29, 2016 at 2:51 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 30/12/16 05:50, Doug Dumitru wrote:
>>
>> Mr. Goryachev,
>>
>> I find it easier to look at these numbers in terms of IOPS.  You are
>> dealing with 17,500 IOPS vs 100 IOPS.  This pretty much has to be the
>> "commit" from the drive.  The benchmark is basically waiting for the
>> data to actually hit "recorded media" before the IO completes.  The
>> "better" drive is returning this "write ACK" when the data is in RAM,
>> while the "worse" drive is returning this "write ACK" after the data is
>> somewhere much slower (probably flash).  I would note that 17,500 IOPS
>> is a "good" but not "great" number.
>
>
> So what would you consider a great number? I guess in practice the
> environment isn't really that massive, it shouldn't really *need* great
> numbers, but it seems no matter how hard I try to "over-architect", it is
> still not performing to end user expectation.

The Intel drives run the same IOPS regardless of pre-conditioning.
They do this mostly by intentionally slowing down random writes so
that the worst case does not actually look any worse.  You can pretty
much dial in any level of random write IOPS by manipulating
over-provisioning.  With 8% OP, a drive might get 5K IOPS, but at 20%, this
goes up to 15K.  So if you want to keep an SSD fast, don't fill it up.

>>
>> Doing commit writes to Flash is expensive.  Not only do you have to
>> wait for the flash, but you have to update the mapping tables to get
>> to the data.  Flash also does not typically allow 4K updates (even
>> given the erase rules), so your 4K sync update probably has to update
>> a 16K "page", which is probably causing a lot of flash wear.  Maybe as much
>> as 10:1 write amplification.  Maybe more.
>
> Wear doesn't seem to have been a problem so far.
>
>   9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       914339h+46m+34.180s
> This is obviously wrong, I haven't had the drive for >100 years, but it is
> at least almost 4 years old (early 2013 I suspect).
> 233 Media_Wearout_Indicator 0x0032   095   095   000    Old_age   Always       -       0
> This is the worst drive out of the whole array, the best is 99, but either
> way it suggests these drives could easily last >10 years, which would be
> well and truly longer than their expected/useful life (based on capacity).
>
> 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       3329773
> This is the drive with the highest number of writes.... Obviously most
> writes are smaller than 32MB, so I'm not entirely sure what this means, but
> I suspect we are not doing a lot of writes per day compared to the total
> storage capacity...
> 3329773 * 32MB / 3 years / 365 days = 97308MB/day. Total capacity is 480*7 =
> 3360GB, or approx 0.03 drive writes per day.
>
> I've actually asked this question before, but here again we find what
> appears to be an anomaly... some drives have significantly more writes than
> others, and I don't understand why in a RAID5 array this would be the case,
> I would have expected the writes to be split approx equally across all
> drives...
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1501762
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1712480
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1684811
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1781849
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2282764
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2269957
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2154155
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2163563
> 225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       3329774
>
> It's hard to calculate whether some drives were replaced or similar due to
> the nonsense power on hours values.... but generally all these drives were
> purchased at the same time, and so should have been used mostly equally.
>
>> In the case of RAID-5 in a "traditional" Linux deployment, and
>> especially with DRBD protecting you on another node, you are probably
>> fine without having every last ACK "perfect".  After all, if you power
>> fail the primary node, the secondary will become the primary, and any
>> "tail writes" that are missing will get re-sync'd by DRBD's hash
>> checks.  And because the amount of data being re-synced is small, it
>> will happen very quickly and you might not even notice it.
>
> Right, and at this stage I'm not even looking at data integrity, I'm only
> examining "performance". In fact, it would be within the "acceptable"
> parameters to lose some data under a "disaster" scenario (where disaster
> means losing both primary and secondary in an unclean shutdown). Of course,
> I wouldn't design the system to do that, but it isn't a strict requirement,
> as long as "normal" processes mean no data loss/corruption, and any drive
> should (eventually) write all the data it has told you it will.
>>
>> Back to performance, you should also consider what your array is doing
>> to you.  You are running an 8 drive raid-5 array.  This will limit
>> performance even more because every write becomes 2 sync writes, plus
>> 6 reads.  With q=1 latencies, if you run this test on the array with
>> "good" drives, you should probably get about 15K IOPS max, but it
>> might be a bit worse as the read and write latencies add for each OP.
>
> Right, and one thing I've considered was moving to RAID10 to avoid this, but
> even RAID10 means 2 writes. Assuming reads are relatively quick, then that
> should reduce the impact of the RAID5 as well. At this stage, converting to
> RAID10 is still something I'm holding up my sleeve as a last resort (due to
> the additional wasted capacity).
>
> Note that my tests are all on single drives, not the array. I can't afford
> to be doing testing on the full array due to the destructive nature, and
> also it is almost impossible to get a quiet moment where the tests wouldn't
> be affected by workload.
>>
>> I tried your test on our "in house" "server-side FTL" mapping layer on
>> 8 drives raid-5.  This is an E5-1650 v3 w/ an LSI 3008 SAS controller
>> and 8 Samsung 256GB 850 Pro SSDs.  The array is "new" so it will slow
>> down somewhat as it fills.  439K IOPS is actually quite a bit under
>> the array's bandwidth, but at q=1, you end up benchmarking the
>> benchmark program.  (at q=10, the array saturates the drives linear
>> performance at about 900K IOPS or 3518 MB/sec).
>>
>> root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0
>> --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60
>> --time_based --group_reporting --name=ess-raid5 --numjobs=1
>
> Would it be possible for you to run the test on a single drive directly
> instead?
>>
>>
>> Run status group 0 (all jobs):
>>    WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s,
>> maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
>
> I might be looking at the wrong value, but you are getting 1711MB/s out of
> an 8 drive array, I got a max of 99MB/s on a single drive, even if I
> multiply that by 7 (8 drives - 1 redundancy), it's still less than half. I'd
> be pretty keen to see your single drive results. Also whether those results
> will change when using the 800GB model.

My test is of a "managed" array with a "host side Flash Translation
Layer".  This means that software is linearizing the writes before
RAID-5 sees them.  This is how the major "storage appliance" vendors
get really fast performance.  One vendor, running an earlier version
of the software I am running here, was able to support 5000 ESXi VDI
clients from a single 2U storage server (with a lot of FC cards).  The
boot storm took about 3 minutes to settle.

Single drives are around 500 MB/sec which is 125K IOPS through our
engine.  Eight drives are (8-1)x500=3500 MB/sec or 900K IOPS.  This is
actually faster than FIO can generate a test pattern from a single
job.  It is also faster than stock RAID-5 can linearly write without
patches.
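
The IOPS figures in this thread follow from dividing throughput by the 4 KB block size used in the fio runs. A minimal sketch of that conversion, assuming 1 MB = 1000 kB to match the rounded numbers quoted here (the helper name `iops` is made up for the example):

```python
BLOCK_KB = 4  # the fio tests in this thread use --bs=4k


def iops(kb_per_s: float, block_kb: float = BLOCK_KB) -> float:
    """Convert throughput in kB/s to I/O operations per second."""
    return kb_per_s / block_kb


if __name__ == "__main__":
    samples = [
        ("520s (70217 kB/s)", 70217),
        ("530s (391 kB/s)", 391),
        ("single drive (500 MB/s)", 500_000),
        ("8-drive RAID-5 (3500 MB/s)", 3_500_000),
    ]
    for name, kbps in samples:
        print(f"{name}: {iops(kbps):,.0f} IOPS")
```

This reproduces the roughly 17,500 vs 100 IOPS comparison and the 125K IOPS single-drive figure; 3500 MB/s works out to 875K IOPS, which the text above rounds to 900K.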

In terms of wear, lots of users are running very light write
environments.  This is good as many configurations are > 50:1 write
amp if you measure "end to end".  By end to end, I mean, how many
flash writes happen when you create a small file inside of a file
system.  This leads to "file system write amp" x "raid write amp" x
"SSD write amp".  Some people don't like this approach as the file
system is often "off limits" and a black box.  Then again, some file
systems are better than others (for 10K sync creates, EXT4 and XFS are
both about 4.4:1 whereas ZFS is a lot worse).  And with EXT4/XFS, you
can mitigate some of this with an SSD or mapping layer that compresses
blocks.
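
The "end to end" product described above can be written out as a short sketch. Only the 4.4:1 file system figure comes from this email; the RAID and SSD factors below are hypothetical placeholders chosen to show how the product can exceed 50:1, not measurements.

```python
def end_to_end_write_amp(fs_wa: float, raid_wa: float, ssd_wa: float) -> float:
    """Multiply the per-layer write amplification factors together."""
    return fs_wa * raid_wa * ssd_wa


if __name__ == "__main__":
    fs_wa = 4.4    # EXT4/XFS, 10K sync creates (figure from the email)
    raid_wa = 2.0  # hypothetical placeholder for the RAID layer
    ssd_wa = 6.0   # hypothetical placeholder for the SSD's internal FTL
    total = end_to_end_write_amp(fs_wa, raid_wa, ssd_wa)
    print(f"end-to-end write amplification: about {total:.1f}:1")
```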

Doug Dumitru



>
> Thank you for your advice, I'll see whether I can find a way to purchase one
> of the Samsung drives for testing/evaluation; they seem to be a similar
> price to the Intel S3510 that I was looking at.
>
> Regards,
> Adam
>
>
>> On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>> Apologies for my prematurely sent email (if it gets through), this one is
>>> complete...
>>>
>>> Hi all,
>>>
>>> I've spent a number of years trying to build up a nice RAID array for my
>>> SAN, but I seem to be slowly solving one bottleneck only to find another
>>> one. Right now, I've identified the underlying SSD's as being a major
>>> factor
>>> in that performance issue.
>>>
>>> I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this
>>> performed really well.
>>> I added 1 x 480GB 530s SSD
>>> I added 2 x 480GB 530s SSD
>>>
>>> I now found out that performance of a 520s SSD is around 180 times faster
>>> than a 530s SSD. I had to run many tests, but eventually I found the
>>> right
>>> things to test for (which matched my real life results), and the numbers
>>> were nothing short of crazy.
>>> Running each test 5 times and average the results...
>>> 520s: 70MB/s
>>> 530s: 0.4MB/s
>>>
>>> OK, so before I could remove and test the 520s, I removed/tested one of
>>> the
>>> 530s and saw the horrible performance, so I bought and tested a 540s and
>>> found:
>>> 540s: 6.7MB/s
>>> So, around 20 times better than the 530, so I replaced all the drives
>>> with
>>> the 540, but I still have worse performance than the original 5 x 520s
>>> array.
>>>
>>> Working with Intel, they swapped a 530s drive for a DC3510, and I then
>>> found
>>> the DC3510 was awesome:
>>> DC3510: 99MB/s
>>> Except, a few weeks back when I placed the order, I was told there is no
>>> longer any stock of this drive, (I wanted 16 x 800GB model), and that the
>>> replacement model is the DC3520. So I figure I won't just blindly buy the
>>> DC3520 assuming its performance will be similar to the previous model,
>>> so I
>>> buy 4 x 480GB DC3520 and start testing.
>>> DC3520: 37MB/s
>>>
>>> So, 1/3rd of a DC3510, but still better than the current live 540s
>>> drives,
>>> but also still half the original 520s drives.
>>>
>>> Summary:
>>> 520s:   70217kB/s
>>> 530s:     391kB/s
>>> 540s:    6712kB/s
>>> 330s:      24kB/s
>>> DC3510: 99313kB/s
>>> DC3520: 37051kB/s
>>> WD2TBCB:  475kB/s
>>>
>>> * For comparison, I had an older Western Digital Black 2TB spare, and ran
>>> the
>>> same test on it. Got a better result than some of the SSD's which was
>>> really
>>> surprising, but it's certainly not an option.
>>> FYI, the test I'm running is this:
>>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
>>> --iodepth=1
>>> --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1
>>> --numjobs=1
>>> All drives were tested on the same machine/SATA3 port (basic intel
>>> desktop
>>> motherboard), with nothing on the drive (no fs, no partition, nothing
>>> trying
>>> to access it, etc..).
>>> In reality, I tested iodepth from 1..10, but in my use case,
>>> iodepth=1 is the relevant number. At higher iodepth, we see performance on
>>> all
>>> the drives improve, if interested, I can provide a full set of my
>>> results/analysis.
>>>
>>> So, my actual question... Can you suggest or have you tested any Intel
>>> (or
>>> other brand) SSD which has good performance (similar to the DC3510 or the
>>> 520s)? (I can't buy and test every single variant out there, my budget
>>> doesn't go anywhere close to that).
>>> It needs to be SATA, since I don't have enough PCIe slots to get the
>>> needed
>>> capacity (nor enough budget). I need around 8 x drives with around 6TB
>>> capacity in RAID5.
>>>
>>> FYI, my storage stack is like this:
>>> 8 x SSD's
>>> mdadm - RAID5
>>> LVM
>>> DRBD
>>> iSCSI
>>>
>>> From my understanding, it is DRBD that makes everything an iodepth=1
>>> issue.
>>> It is possible to reach iodepth=2 if I have 2 x VM's both doing a lot of
>>> IO
>>> at the same time, but it is usually a single VM's performance that is too
>>> limited.
>>>
>>> Regards,
>>> Adam
>>>
>>>
>>> --
>>> Adam Goryachev Website Managers www.websitemanagers.com.au
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>
>
> --
> Adam Goryachev Website Managers www.websitemanagers.com.au



-- 
Doug Dumitru
EasyCo LLC


* Re: Intel SSD or other brands
From: Robert LeBlanc @ 2016-12-29 23:20 UTC
  To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <65bda7a8-f74f-be87-2490-f6c0b1c5c3f1@websitemanagers.com.au>

On Thu, Dec 29, 2016 at 4:04 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 30/12/16 03:56, Robert LeBlanc wrote:
>>
>> This is a similar workload as Ceph and you may find more information
>> from their mailing lists. When I was working with Ceph about a year
>> ago, we tested a bunch of SSDs and found that sync=1 really
>> differentiates drives and you really find which drives are better. In
>> our testing, we found that the 35xx, 36xx, and 37xx drives handled the
>> workloads the best. The 3x00 drives were close to EOL, so we focused
>> on the 3x10 drives. I don't have the data anymore, but the 3610 had
>> the best performance, the 3710 had the best data integrity in the case
>> of power failure, and the 3510 had the best price.
>
> So it seems that my "good/best" results were based on the 3510, which was
> the cheapest out of the options you tested. Any chance you could find the
> raw data again? Or do you recall the relative performance difference between
> these three drives?

This was done at another job and the data stayed when I left. The
performance of the three drives was pretty close, I think within
10% of each other, but I can't remember exactly.

>> The 3510 had about
>> ~.1 drive writes per day, the 3610 had ~1 DWPD and the 3710 had ~3
>> DWPD.
>
> We seem to be around 0.03 DWPD, so I don't think any of these drives would
> be a problem for us. Lifetime seems much longer than the useful life, given
> capacity/etc.

We had really good wear on the 35xx drives. I think their endurance
ratings are understated, but I don't have the data to back that up.

>> Due to the fault tolerance of Ceph, we felt comfortable with the
>> 3610s.
>
> Equally, we have fault tolerance (RAID5) as well as DRBD onto the other node
> which also has RAID5. I also monitor the drive lifetime, I'm not sure what
> value I would consider urgent replacement, but probably around 20% remaining
> life....

You may never even get there at 0.03 DWPD.
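
As a rough check on that, here is a linear extrapolation using the Media_Wearout_Indicator value of 95 after about 4 years that Adam reported elsewhere in this thread. It assumes wear accrues at a constant rate, which real drives need not follow; the function name and the 20% threshold are taken from the discussion only for illustration.

```python
def years_until(threshold: float, current: float, age_years: float,
                start: float = 100.0) -> float:
    """Linearly extrapolate how long until the indicator falls to `threshold`."""
    used = start - current
    if used <= 0:
        raise ValueError("no wear recorded yet, cannot extrapolate")
    rate_per_year = used / age_years
    return (current - threshold) / rate_per_year


if __name__ == "__main__":
    # Indicator at 95 after roughly 4 years; replacement considered at 20.
    print(f"about {years_until(20, 95, 4):.0f} more years at the current rate")
```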

>> In our testing, we exceeded the performance numbers listed for
>> the drives on their data sheets when running up to 8 jobs even with
>> sync=1 which no other manufacturer did. For Ceph, we could put multiple
>> OSDs on a disk and take advantage of this performance gain. You may be
>> able to do something similar by partitioning your RAID 5 and putting
>> multiple DRBDs on it.
>
>
> We do this already... we use a single RAID5 which is split with LVM2 (20
> LV's), and each LV is then a DRBD device (so 20 DRBD's). This was one of the
> optimisations linbit advised us to do way back at the beginning.
>
> The problem I'm having is that a single DRBD will reach saturation because
> the underlying devices are saturated. So I'm trying to improve the
> underlying device performance, and expect to be able to "move" the
> bottleneck to DRBD or hopefully, the ethernet of the iSCSI interface.
>
> Regards,
> Adam

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


* Re: Intel SSD or other brands
From: Adam Goryachev @ 2016-12-29 23:04 UTC
  To: Robert LeBlanc; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <CAANLjFo4tP9Xdv8JX5waNOu_34zWDKsQqOru3fpt2wOAJuZYcQ@mail.gmail.com>

On 30/12/16 03:56, Robert LeBlanc wrote:
> This is a similar workload as Ceph and you may find more information
> from their mailing lists. When I was working with Ceph about a year
> ago, we tested a bunch of SSDs and found that sync=1 really
> differentiates drives and you really find which drives are better. In
> our testing, we found that the 35xx, 36xx, and 37xx drives handled the
> workloads the best. The 3x00 drives were close to EOL, so we focused
> on the 3x10 drives. I don't have the data anymore, but the 3610 had
> the best performance, the 3710 had the best data integrity in the case
> of power failure, and the 3510 had the best price.
So it seems that my "good/best" results were based on the 3510, which 
was the cheapest out of the options you tested. Any chance you could 
find the raw data again? Or do you recall the relative performance 
difference between these three drives?

> The 3510 had about
> ~.1 drive writes per day, the 3610 had ~1 DWPD and the 3710 had ~3
> DWPD.
We seem to be around 0.03 DWPD, so I don't think any of these drives 
would be a problem for us. Lifetime seems much longer than the useful 
life, given capacity/etc.
> Due to the fault tolerance of Ceph, we felt comfortable with the
> 3610s.
Equally, we have fault tolerance (RAID5) as well as DRBD onto the other 
node which also has RAID5. I also monitor the drive lifetime, I'm not 
sure what value I would consider urgent replacement, but probably around 
20% remaining life....
> In our testing, we exceeded the performance numbers listed for
> the drives on their data sheets when running up to 8 jobs even with
> sync=1 which no other manufacturer did. For Ceph, we could put multiple
> OSDs on a disk and take advantage of this performance gain. You may be
> able to do something similar by partitioning your RAID 5 and putting
> multiple DRBDs on it.

We do this already... we use a single RAID5 which is split with LVM2 (20 
LV's), and each LV is then a DRBD device (so 20 DRBD's). This was one of 
the optimisations linbit advised us to do way back at the beginning.

The problem I'm having is that a single DRBD will reach saturation 
because the underlying devices are saturated. So I'm trying to improve 
the underlying device performance, and expect to be able to "move" the 
bottleneck to DRBD or hopefully, the ethernet of the iSCSI interface.

Regards,
Adam



-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: Intel SSD or other brands
From: Adam Goryachev @ 2016-12-29 22:51 UTC
  To: doug; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <CAFx4rwT+j3mx3B_bj803Zs1kX=x-1w4Nwwsyb=a=NumFo8-Ufw@mail.gmail.com>

On 30/12/16 05:50, Doug Dumitru wrote:
> Mr. Goryachev,
>
> I find it easier to look at these numbers in terms of IOPS.  You are
> dealing with 17,500 IOPS vs 100 IOPS.  This pretty much has to be the
> "commit" from the drive.  The benchmark is basically waiting for the
> data to actually hit "recorded media" before the IO completes.  The
> "better" drive is returning this "write ACK" when the data is in RAM,
> while the "worse" drive is returning this "write ACK" after the data is
> somewhere much slower (probably flash).  I would note that 17,500 IOPS
> is a "good" but not "great" number.

So what would you consider a great number? I guess in practice the 
environment isn't really that massive, it shouldn't really *need* great 
numbers, but it seems no matter how hard I try to "over-architect", it 
is still not performing to end user expectation.
> Doing commit writes to Flash is expensive.  Not only do you have to
> wait for the flash, but you have to update the mapping tables to get
> to the data.  Flash also does not typically allow 4K updates (even
> given the erase rules), so your 4K sync update probably has to update
> a 16K "page", which is probably causing a lot of flash wear.  Maybe as much
> as 10:1 write amplification.  Maybe more.
Wear doesn't seem to have been a problem so far.

  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       914339h+46m+34.180s
This is obviously wrong, I haven't had the drive for >100 years, but it 
is at least almost 4 years old (early 2013 I suspect).
233 Media_Wearout_Indicator 0x0032   095   095   000    Old_age   Always       -       0
This is the worst drive out of the whole array, the best is 99, but 
either way it suggests these drives could easily last >10 years, which 
would be well and truly longer than their expected/useful life (based on 
capacity).

241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       3329773
This is the drive with the highest number of writes.... Obviously most 
writes are smaller than 32MB, so I'm not entirely sure what this means, 
but I suspect we are not doing a lot of writes per day compared to the 
total storage capacity...
3329773 * 32MB / 3 years / 365 days = 97308MB/day. Total capacity is 
480*7 = 3360GB, or approx 0.03 drive writes per day.
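
The arithmetic above can be reproduced with a short sketch. It treats each 32MiB unit as 32MB, as the calculation above does, and the function names are made up for the example. It also shows a second reading as an assumption: if the SMART counter is per drive rather than per array, dividing by a single 480GB drive gives a noticeably higher figure.

```python
def mb_per_day(units_32mb: int, years: float) -> float:
    """Average MB written per day from a Host_Writes_32MiB raw value."""
    return units_32mb * 32 / (years * 365)


def dwpd(units_32mb: int, years: float, capacity_gb: float) -> float:
    """Drive writes per day relative to the given capacity (1 GB = 1000 MB)."""
    return mb_per_day(units_32mb, years) / 1000 / capacity_gb


if __name__ == "__main__":
    raw = 3329773  # highest Host_Writes_32MiB value reported above
    print(f"{mb_per_day(raw, 3):.0f} MB/day")
    # Relative to the array's usable capacity, as calculated above.
    print(f"{dwpd(raw, 3, 480 * 7):.3f} DWPD against 3360GB")
    # Assumption: the counter covers one drive, so compare to one drive's size.
    print(f"{dwpd(raw, 3, 480):.3f} DWPD against a single 480GB drive")
```

Either way the result is at or below the roughly 0.1 to 0.3 DWPD range, so the conclusion about wear is unlikely to change.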

I've actually asked this question before, but here again we find what 
appears to be an anomaly... some drives have significantly more writes 
than others, and I don't understand why in a RAID5 array this would be 
the case, I would have expected the writes to be split approx equally 
across all drives...
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1501762
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1712480
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1684811
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1781849
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2282764
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2269957
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2154155
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       2163563
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       3329774

It's hard to calculate whether some drives were replaced or similar due 
to the nonsense power on hours values.... but generally all these drives 
were purchased at the same time, and so should have been used mostly 
equally.

> In the case of RAID-5 in a "traditional" Linux deployment, and
> especially with DRBD protecting you on another node, you are probably
> fine without having every last ACK "perfect".  After all, if you power
> fail the primary node, the secondary will become the primary, and any
> "tail writes" that are missing will get re-sync'd by DRBD's hash
> checks.  And because the amount of data being re-synced is small, it
> will happen very quickly and you might not even notice it.
Right, and at this stage I'm not even looking at data integrity, I'm 
only examining "performance". In fact, it would be within the 
"acceptable" parameters to lose some data under a "disaster" scenario 
(where disaster means losing both primary and secondary in an unclean 
shutdown). Of course, I wouldn't design the system to do that, but it 
isn't a strict requirement, as long as "normal" processes mean no data 
loss/corruption, and any drive should (eventually) write all the data it 
has told you it will.
> Back to performance, you should also consider what your array is doing
> to you.  You are running an 8 drive raid-5 array.  This will limit
> performance even more because every write becomes 2 sync writes, plus
> 6 reads.  With q=1 latencies, if you run this test on the array with
> "good" drives, you should probably get about 15K IOPS max, but it
> might be a bit worse as the read and write latencies add for each OP.
Right, and one thing I've considered was moving to RAID10 to avoid this, 
but even RAID10 means 2 writes. Assuming reads are relatively quick, 
then that should reduce the impact of the RAID5 as well. At this stage, 
converting to RAID10 is still something I'm keeping up my sleeve as a 
last resort (due to the additional wasted capacity).

Note that my tests are all on single drives, not the array. I can't 
afford to be doing testing on the full array due to the destructive 
nature, and also it is almost impossible to get a quiet moment where the 
tests wouldn't be affected by workload.
> I tried your test on our "in house" "server-side FTL" mapping layer on
> 8 drives raid-5.  This is an E5-1650 v3 w/ an LSI 3008 SAS controller
> and 8 Samsung 256GB 850 Pro SSDs.  The array is "new" so it will slow
> down somewhat as it fills.  439K IOPS is actually quite a bit under
> the array's bandwidth, but at q=1, you end up benchmarking the
> benchmark program.  (at q=10, the array saturates the drives linear
> performance at about 900K IOPS or 3518 MB/sec).
>
> root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0
> --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60
> --time_based --group_reporting --name=ess-raid5 --numjobs=1
Would it be possible for you to run the test on a single drive directly 
instead?
>
> Run status group 0 (all jobs):
>    WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s,
> maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
I might be looking at the wrong value, but you are getting 1711MB/s out 
of an 8 drive array, I got a max of 99MB/s on a single drive, even if I 
multiply that by 7 (8 drives - 1 redundancy), it's still less than half. 
I'd be pretty keen to see your single drive results. Also whether those 
results will change when using the 800GB model.
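
The comparison above can be written out as a small Python sketch. The 99MB/s and 1711MB/s figures and the "8 drives - 1 redundancy" scaling are from the text; treating single-drive throughput as scaling linearly with the number of data drives is a simplifying assumption.

```python
single_drive_mb_s = 99      # best single-drive result (DC3510)
array_mb_s = 1711           # reported result for the 8 drive array
data_drives = 8 - 1         # 8 drives minus 1 for redundancy

# Naive linear scaling of the single-drive result.
scaled_mb_s = single_drive_mb_s * data_drives

print(f"scaled single drive: {scaled_mb_s} MB/s")
print(f"fraction of array result: {scaled_mb_s / array_mb_s:.2f}")
```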

Thank you for your advice, I'll see whether I can find a way to purchase 
one of the Samsung drives for testing/evaluation, they seem to be a 
similar price to the Intel S3510 that I was looking at.

Regards,
Adam

> On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev
> <mailinglists@websitemanagers.com.au> wrote:
>> Apologies for my prematurely sent email (if it gets through), this one is
>> complete...
>>
>> Hi all,
>>
>> I've spent a number of years trying to build up a nice RAID array for my
>> SAN, but I seem to be slowly solving one bottle neck only to find another
>> one. Right now, I've identified the underlying SSD's as being a major factor
>> in that performance issue.
>>
>> I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this
>> performed really well.
>> I added 1 x 480GB 530s SSD
>> I added 2 x 480GB 530s SSD
>>
>> I now found out that performance of a 520s SSD is around 180 times faster
>> than a 530s SSD. I had to run many tests, but eventually I found the right
>> things to test for (which matched my real life results), and the numbers
>> were nothing short of crazy.
>> Running each test 5 times and average the results...
>> 520s: 70MB/s
>> 530s: 0.4MB/s
>>
>> OK, so before I could remove and test the 520s, I removed/tested one of the
>> 530s and saw the horrible performance, so I bought and tested a 540s and
>> found:
>> 540s: 6.7MB/s
>> So, around 20 times better than the 530, so I replaced all the drives with
>> the 540, but I still have worse performance than the original 5 x 520s
>> array.
>>
>> Working with Intel, they swapped a 530s drive for a DC3510, and I then found
>> the DC3510 was awesome:
>> DC3510: 99MB/s
>> Except, a few weeks back when I placed the order, I was told there is no
>> longer any stock of this drive, (I wanted 16 x 800GB model), and that the
>> replacement model is the DC3520. So I figure I won't just blindly buy the
>> DC3520 assuming it's performance will be similar to the previous model, so I
>> buy 4 x 480GB DC3520 and start testing.
>> DC3520: 37MB/s
>>
>> So, 1/3rd of a DC3510, but still better than the current live 540s drives,
>> but also still half the original 520s drives.
>>
>> Summary:
>> 520s:   70217kB/s
>> 530s:     391kB/s
>> 540s:    6712kB/s
>> 330s:      24kB/s
>> DC3510: 99313kB/s
>> DC3520: 37051kB/s
>> WD2TBCB:  475kB/s
>>
>> * For comparison, I had a older Western Digital Black 2TB spare, and ran the
>> same test on it. Got a better result than some of the SSD's which was really
>> surprising, but it's certainly not an option.
>> FYI, the test I'm running is this:
>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1
>> --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1
>> --numjobs=1
>> All drives were tested on the same machine/SATA3 port (basic intel desktop
>> motherboard), with nothing on the drive (no fs, no partition, nothing trying
>> to access it, etc..).
>> In reality, I tested iodepth from 1..10, but in my use case, the iodepth=1
>> matches is the relevant number. At higher iodepth, we see performance on all
>> the drives improve, if interested, I can provide a full set of my
>> results/analysis.
>>
>> So, my actual question... Can you suggest or have you tested any Intel (or
>> other brand) SSD which has good performance (similar to the DC3510 or the
>> 520s)? (I can't buy and test every single variant out there, my budget
>> doesn't go anywhere close to that).
>> It needs to be SATA, since I don't have enough PCIe slots to get the needed
>> capacity (nor enough budget). I need around 8 x drives with around 6TB
>> capacity in RAID5.
>>
>> FYI, my storage stack is like this:
>> 8 x SSD's
>> mdadm - RAID5
>> LVM
>> DRBD
>> iSCSI
>>
>>  From my understanding, it is DRBD that makes everything a iodepth=1 issue.
>> It is possible to reach iodepth=2 if I have 2 x VM's both doing a lot of IO
>> at the same time, but it usually a single VM performance that is too
>> limited.
>>
>> Regards,
>> Adam


-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

^ permalink raw reply

* Re: [PATCH] imsm: enable bad block support for imsm metadata
From: Jes Sorensen @ 2016-12-29 19:29 UTC (permalink / raw)
  To: Tomasz Majchrzak; +Cc: linux-raid
In-Reply-To: <1482914287-17630-1-git-send-email-tomasz.majchrzak@intel.com>

Tomasz Majchrzak <tomasz.majchrzak@intel.com> writes:
> Enable bad block support for imsm metadata as commit e522751d605d
> ("seq_file: reset iterator to first record for zero offset") has been
> accepted in upstream kernel. Prior to that patch mdmon had not been able
> to read bad blocks sysfs file.
>
> Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
> ---
>  super-intel.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

Applied!

Thanks,
Jes

^ permalink raw reply

* Re: Intel SSD or other brands
From: Doug Dumitru @ 2016-12-29 18:50 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <0329d841-984b-fa25-0bf2-0aba4d55b6de@websitemanagers.com.au>

Mr. Goryachev,

I find it easier to look at these numbers in terms of IOPS.  You are
dealing with 17,500 IOPS vs 100 IOPS.  This pretty much has to be the
"commit" from the drive.  The benchmark is basically waiting for the
data to actually hit "recorded media" before the IO completes.  The
"better" drive is returning this "write ACK" when the data is in RAM,
while the "worse" drive is returning this "write ACK" after the data is
somewhere much slower (probably flash).  I would note that 17,500 IOPS
is a "good" but not "great" number.
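
For reference, the IOPS figures follow directly from the fio throughput numbers and the 4k block size. A minimal Python sketch, using the 520s and 530s results from the summary in the original post (the function name is illustrative):

```python
def kb_per_s_to_iops(throughput_kb_s: float, block_kb: float = 4) -> float:
    """Convert throughput to IOPS for a fixed block size."""
    return throughput_kb_s / block_kb


iops_520s = kb_per_s_to_iops(70217)
iops_530s = kb_per_s_to_iops(391)

print(f"520s: {iops_520s:.0f} IOPS")
print(f"530s: {iops_530s:.0f} IOPS")
```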

Doing commit writes to Flash is expensive.  Not only do you have to
wait for the flash, but you have to update the mapping tables to get
to the data.  Flash also does not typically allow 4K updates (even
given the erase rules), so your 4K sync update probably has to update
a 16K "page", which is probably causing a lot of flash wear.  Maybe as
much as 10:1 write amplification.  Maybe more.

There are a bunch of things to consider when looking at "sync"
performance.  The easiest way to look at this is that the drive
"absolutely" has to have the data in stable storage to be "correct".
This is not really true, and the overhead of this can be huge.  File
systems "know" this behavior and instead of looking for a hard sync,
they use "barriers".  The idea of a barrier is that the drive is
allowed to buffer writes, just not re-order them so that an IO
crosses a "barrier".

Testing of SSDs for this is looking for "serialization errors".  If
you pull power from an SSD and then go look at the blocks that made it
to the media after the reboot, drives can work in one of three ways.
If absolutely every ACKd block is on the drive, then sync works and
barriers are not relevant.  If the writes stop and no "newer" write
made it to the drive when an "older" one did not, then the drive is
still OK with barriers.  If "newer" writes made it to the media but
older writes did not, then this is a serialization error and you have
spaghetti.  SSDs with power fail serialization errors are "bad".  Then
again, it is important to understand the system-level implications of
how the error will impact your stack.
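
The three outcomes described above can be sketched as a small classifier. This is only an illustration of the idea, assuming each write carries a sequence number in ACK order and that you can tell after the power cut which of them reached the media. The function name and return labels are made up for this example.

```python
def classify_power_fail(acked: list[int], on_media: set[int]) -> str:
    """Classify a power-fail test result.

    acked:    write sequence numbers, in the order they were ACKed.
    on_media: sequence numbers found on the drive after power returns.
    """
    survived = [seq in on_media for seq in acked]

    # Every ACKed write is present: sync semantics held.
    if all(survived):
        return "sync-safe"

    # Once one write is missing, no later write may be present.
    first_missing = survived.index(False)
    if not any(survived[first_missing:]):
        return "barrier-safe"

    # A newer write survived while an older one did not.
    return "serialization error"


print(classify_power_fail([1, 2, 3, 4], {1, 2, 3, 4}))
print(classify_power_fail([1, 2, 3, 4], {1, 2}))
print(classify_power_fail([1, 2, 3, 4], {1, 2, 4}))
```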

In the case of RAID-5 in a "traditional" Linux deployment, and
especially with DRBD protecting you on another node, you are probably
fine without having every last ACK "perfect".  After all, if you power
fail the primary node, the secondary will become the primary, and any
"tail writes" that are missing will get re-sync'd by DRBD's hash
checks.  And because the amount of data being re-synced is small, it
will happen very quickly and you might not even notice it.

Back to performance, you should also consider what your array is doing
to you.  You are running an 8 drive raid-5 array.  This will limit
performance even more because every write becomes 2 sync writes, plus
6 reads.  With q=1 latencies, if you run this test on the array with
"good" drives, you should probably get about 15K IOPS max, but it
might be a bit worse as the read and write latencies add for each OP.
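
The "2 sync writes, plus 6 reads" figure appears to match what is usually called a reconstruct-write on an 8 drive RAID-5: read the other data chunks in the stripe, then write the new data chunk and the new parity. The textbook alternative is a read-modify-write (read old data and old parity, then write both). The sketch below only counts I/Os for a single-chunk update under those two textbook strategies. It is not a model of what md actually chooses, and the function names are illustrative.

```python
def reconstruct_write_ios(n_drives: int) -> tuple[int, int]:
    """(reads, writes) when parity is rebuilt from the other data chunks."""
    data_chunks = n_drives - 1          # one chunk per stripe holds parity
    return data_chunks - 1, 2           # read the rest, write data + parity


def read_modify_write_ios(n_drives: int) -> tuple[int, int]:
    """(reads, writes) when parity is updated from old data and old parity."""
    return 2, 2                         # independent of the drive count


for n in (5, 8):
    print(n, reconstruct_write_ios(n), read_modify_write_ios(n))
```

Either way the two writes have to complete before the update is stable, which is why single-drive sync write latency carries through to the array.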

I tried your test on our "in house" "server-side FTL" mapping layer on
8 drives raid-5.  This is an E5-1650 v3 w/ an LSI 3008 SAS controller
and 8 Samsung 256GB 850 Pro SSDs.  The array is "new" so it will slow
down somewhat as it fills.  439K IOPS is actually quite a bit under
the array's bandwidth, but at q=1, you end up benchmarking the
benchmark program.  (at q=10, the array saturates the drives' linear
performance at about 900K IOPS or 3518 MB/sec).

root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0
--direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60
--time_based --group_reporting --name=ess-raid5 --numjobs=1
ess-raid5: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/1714MB/0KB /s] [0/439K/0
iops] [eta 00m:00s]
ess-raid5: (groupid=0, jobs=1): err= 0: pid=29544: Thu Dec 29 10:36:13 2016
  write: io=102653MB, bw=1710.9MB/s, iops=437980, runt= 60001msec
    clat (usec): min=1, max=155, avg= 2.06, stdev= 0.49
     lat (usec): min=1, max=155, avg= 2.10, stdev= 0.49
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    1], 10.00th=[    2], 20.00th=[    2],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    2],
     | 70.00th=[    2], 80.00th=[    2], 90.00th=[    3], 95.00th=[    3],
     | 99.00th=[    3], 99.50th=[    3], 99.90th=[    7], 99.95th=[    8],
     | 99.99th=[   11]
    bw (MB  /s): min= 1330, max= 1755, per=100.00%, avg=1710.84, stdev=39.12
    lat (usec) : 2=5.17%, 4=94.51%, 10=0.31%, 20=0.02%, 50=0.01%
    lat (usec) : 250=0.01%
  cpu          : usr=14.08%, sys=85.92%, ctx=47, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=26279247/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s,
maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
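
As a sanity check, the headline numbers in the fio output above are consistent with each other. A minimal Python sketch, using only values printed in that output and assuming fio's "MB" here means 1024 KB:

```python
io_mb = 102653            # "io=102653MB"
runtime_s = 60.001        # "runt= 60001msec"
issued_writes = 26279247  # "issued: total=r=0/w=26279247"
block_kb = 4              # "bs=4K"

bandwidth_mb_s = io_mb / runtime_s
iops = issued_writes / runtime_s
bandwidth_from_iops = iops * block_kb / 1024

print(f"bandwidth: {bandwidth_mb_s:.1f} MB/s")
print(f"iops: {iops:.0f}")
print(f"bandwidth from iops: {bandwidth_from_iops:.1f} MB/s")
```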

On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> Apologies for my prematurely sent email (if it gets through), this one is
> complete...
>
> Hi all,
>
> I've spent a number of years trying to build up a nice RAID array for my
> SAN, but I seem to be slowly solving one bottle neck only to find another
> one. Right now, I've identified the underlying SSD's as being a major factor
> in that performance issue.
>
> I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this
> performed really well.
> I added 1 x 480GB 530s SSD
> I added 2 x 480GB 530s SSD
>
> I now found out that performance of a 520s SSD is around 180 times faster
> than a 530s SSD. I had to run many tests, but eventually I found the right
> things to test for (which matched my real life results), and the numbers
> were nothing short of crazy.
> Running each test 5 times and average the results...
> 520s: 70MB/s
> 530s: 0.4MB/s
>
> OK, so before I could remove and test the 520s, I removed/tested one of the
> 530s and saw the horrible performance, so I bought and tested a 540s and
> found:
> 540s: 6.7MB/s
> So, around 20 times better than the 530, so I replaced all the drives with
> the 540, but I still have worse performance than the original 5 x 520s
> array.
>
> Working with Intel, they swapped a 530s drive for a DC3510, and I then found
> the DC3510 was awesome:
> DC3510: 99MB/s
> Except, a few weeks back when I placed the order, I was told there is no
> longer any stock of this drive, (I wanted 16 x 800GB model), and that the
> replacement model is the DC3520. So I figure I won't just blindly buy the
> DC3520 assuming it's performance will be similar to the previous model, so I
> buy 4 x 480GB DC3520 and start testing.
> DC3520: 37MB/s
>
> So, 1/3rd of a DC3510, but still better than the current live 540s drives,
> but also still half the original 520s drives.
>
> Summary:
> 520s:   70217kB/s
> 530s:     391kB/s
> 540s:    6712kB/s
> 330s:      24kB/s
> DC3510: 99313kB/s
> DC3520: 37051kB/s
> WD2TBCB:  475kB/s
>
> * For comparison, I had a older Western Digital Black 2TB spare, and ran the
> same test on it. Got a better result than some of the SSD's which was really
> surprising, but it's certainly not an option.
> FYI, the test I'm running is this:
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1
> --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1
> --numjobs=1
> All drives were tested on the same machine/SATA3 port (basic intel desktop
> motherboard), with nothing on the drive (no fs, no partition, nothing trying
> to access it, etc..).
> In reality, I tested iodepth from 1..10, but in my use case, the iodepth=1
> matches is the relevant number. At higher iodepth, we see performance on all
> the drives improve, if interested, I can provide a full set of my
> results/analysis.
>
> So, my actual question... Can you suggest or have you tested any Intel (or
> other brand) SSD which has good performance (similar to the DC3510 or the
> 520s)? (I can't buy and test every single variant out there, my budget
> doesn't go anywhere close to that).
> It needs to be SATA, since I don't have enough PCIe slots to get the needed
> capacity (nor enough budget). I need around 8 x drives with around 6TB
> capacity in RAID5.
>
> FYI, my storage stack is like this:
> 8 x SSD's
> mdadm - RAID5
> LVM
> DRBD
> iSCSI
>
> From my understanding, it is DRBD that makes everything a iodepth=1 issue.
> It is possible to reach iodepth=2 if I have 2 x VM's both doing a lot of IO
> at the same time, but it usually a single VM performance that is too
> limited.
>
> Regards,
> Adam

-- 
Doug Dumitru
EasyCo LLC

^ permalink raw reply

* Re: Intel SSD or other brands
From: Peter Grandi @ 2016-12-29 18:37 UTC (permalink / raw)
  To: Linux RAID
In-Reply-To: <22629.18653.431222.163351@tree.ty.sabi.co.uk>

>> [ ... ] sync=1 really differentiates drives and you really
>> find which drives are better. [ ... ]

> It is not necessarily "better" in a strict sense: flash SSD
> devices with supercapacitor-backed persistent caches [ ... ]

http://www.storagereview.com/images/Samsung-SSD-SM825-PCB-Bottom.jpg
http://www.storagereview.com/samsung_ssd_sm825_enterprise_ssd_review

http://www.theregister.co.uk/2014/09/24/storage_supercapacitors/
http://us.apacer.com/business/technology/CorePower_Technology
http://www.badcaps.net/forum/showthread.php?t=34417

^ permalink raw reply

* Re: Intel SSD or other brands
From: Peter Grandi @ 2016-12-29 17:46 UTC (permalink / raw)
  To: Linux RAID
In-Reply-To: <8c0e48b3-e224-1e0e-13fe-987a58411e06@websitemanagers.com.au>

>> Well "performance" can be roughly the same, even if "speed" can
>> be very different. http://www.sabi.co.uk/blog/15-two.html#151023

> [ ... ] what you mean to say here, [ ... ]

Some people may say that 'eatmydata $COMMAND' can improve a lot
the "performance" of running '$COMMAND'. More properly it can
improve its speed, but its performance arguably goes to zero or
more precisely becomes insignificant, in most cases. That's a
pretty huge difference.

^ permalink raw reply

* Re: Intel SSD or other brands
From: Peter Grandi @ 2016-12-29 17:33 UTC (permalink / raw)
  To: Linux RAID
In-Reply-To: <CAANLjFo4tP9Xdv8JX5waNOu_34zWDKsQqOru3fpt2wOAJuZYcQ@mail.gmail.com>

> [ ... ] sync=1 really differentiates drives and you really
> find which drives are better. [ ... ]

It is not necessarily "better" in a strict sense: flash SSD
devices with supercapacitor-backed persistent caches can be much
faster on 'fsync' heavy workloads, but also cost a lot more
(probably mostly because of market segmentation). This matters
especially on a RAID5 set with lots of read-modify-write.

It is a different performance envelope, not necessarily a
"better" one. If one does not need small-write speed then
cheaper drives are more appropriate.

However devices which don't have persistent caches and still
have high 'sync=1'/'direct=1' speed because they don't implement
'fsync' synchronously are definitely worse, in the sense of
having arguably no performance at all.

Some manufacturers think that using an SLC cache helps without a
persistent-ed RAM cache, but the persistent-ed RAM seems a lot
better to me.

^ permalink raw reply

* Re: Intel SSD or other brands
From: Robert LeBlanc @ 2016-12-29 16:56 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
In-Reply-To: <0329d841-984b-fa25-0bf2-0aba4d55b6de@websitemanagers.com.au>

This is a similar workload to Ceph and you may find more information
from their mailing lists. When I was working with Ceph about a year
ago, we tested a bunch of SSDs and found that sync=1 really
differentiates drives and you really find which drives are better. In
our testing, we found that the 35xx, 36xx, and 37xx drives handled the
workloads the best. The 3x00 drives were close to EOL, so we focused
on the 3x10 drives. I don't have the data anymore, but the 3610 had
the best performance, the 3710 had the best data integrity in the case
of power failure, and the 3510 had the best price. The 3510 had
~0.1 drive writes per day, the 3610 had ~1 DWPD and the 3710 had ~3
DWPD. Due to the fault tolerance of Ceph, we felt comfortable with the
3610s. In our testing, we exceeded the performance numbers listed for
the drives on their data sheets when running up to 8 jobs even with
sync=1, which no other manufacturer did. For Ceph, we could put multiple
OSDs on a disk and take advantage of this performance gain. You may be
able to do something similar by partitioning your RAID 5 and putting
multiple DRBDs on it.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Dec 28, 2016 at 7:14 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> Apologies for my prematurely sent email (if it gets through), this one is
> complete...
>
> Hi all,
>
> I've spent a number of years trying to build up a nice RAID array for my
> SAN, but I seem to be slowly solving one bottle neck only to find another
> one. Right now, I've identified the underlying SSD's as being a major factor
> in that performance issue.
>
> I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this
> performed really well.
> I added 1 x 480GB 530s SSD
> I added 2 x 480GB 530s SSD
>
> I now found out that performance of a 520s SSD is around 180 times faster
> than a 530s SSD. I had to run many tests, but eventually I found the right
> things to test for (which matched my real life results), and the numbers
> were nothing short of crazy.
> Running each test 5 times and average the results...
> 520s: 70MB/s
> 530s: 0.4MB/s
>
> OK, so before I could remove and test the 520s, I removed/tested one of the
> 530s and saw the horrible performance, so I bought and tested a 540s and
> found:
> 540s: 6.7MB/s
> So, around 20 times better than the 530, so I replaced all the drives with
> the 540, but I still have worse performance than the original 5 x 520s
> array.
>
> Working with Intel, they swapped a 530s drive for a DC3510, and I then found
> the DC3510 was awesome:
> DC3510: 99MB/s
> Except, a few weeks back when I placed the order, I was told there is no
> longer any stock of this drive, (I wanted 16 x 800GB model), and that the
> replacement model is the DC3520. So I figure I won't just blindly buy the
> DC3520 assuming it's performance will be similar to the previous model, so I
> buy 4 x 480GB DC3520 and start testing.
> DC3520: 37MB/s
>
> So, 1/3rd of a DC3510, but still better than the current live 540s drives,
> but also still half the original 520s drives.
>
> Summary:
> 520s:   70217kB/s
> 530s:     391kB/s
> 540s:    6712kB/s
> 330s:      24kB/s
> DC3510: 99313kB/s
> DC3520: 37051kB/s
> WD2TBCB:  475kB/s
>
> * For comparison, I had a older Western Digital Black 2TB spare, and ran the
> same test on it. Got a better result than some of the SSD's which was really
> surprising, but it's certainly not an option.
> FYI, the test I'm running is this:
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1
> --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1
> --numjobs=1
> All drives were tested on the same machine/SATA3 port (basic intel desktop
> motherboard), with nothing on the drive (no fs, no partition, nothing trying
> to access it, etc..).
> In reality, I tested iodepth from 1..10, but in my use case, the iodepth=1
> matches is the relevant number. At higher iodepth, we see performance on all
> the drives improve, if interested, I can provide a full set of my
> results/analysis.
>
> So, my actual question... Can you suggest or have you tested any Intel (or
> other brand) SSD which has good performance (similar to the DC3510 or the
> 520s)? (I can't buy and test every single variant out there, my budget
> doesn't go anywhere close to that).
> It needs to be SATA, since I don't have enough PCIe slots to get the needed
> capacity (nor enough budget). I need around 8 x drives with around 6TB
> capacity in RAID5.
>
> FYI, my storage stack is like this:
> 8 x SSD's
> mdadm - RAID5
> LVM
> DRBD
> iSCSI
>
> From my understanding, it is DRBD that makes everything a iodepth=1 issue.
> It is possible to reach iodepth=2 if I have 2 x VM's both doing a lot of IO
> at the same time, but it usually a single VM performance that is too
> limited.
>
> Regards,
> Adam

^ permalink raw reply

* Re: Intel SSD or other brands
From: Adam Goryachev @ 2016-12-29 14:35 UTC (permalink / raw)
  To: Peter Grandi, Linux RAID
In-Reply-To: <22628.62979.46926.463323@tree.ty.sabi.co.uk>



On 29/12/16 22:39, Peter Grandi wrote:
>> [ ... ] I now found out that performance of a 520s SSD is
>> around 180 times faster than a 530s SSD. [ ... ]
> Well "performance" can be roughly the same, even if "speed" can
> be very different.
>
> http://www.sabi.co.uk/blog/15-two.html#151023
I'm not entirely sure what you mean to say here. I have a reasonably 
well defined real life workload (i.e., single threaded, small random 
writes)... I am measuring the same statistics across multiple devices 
and comparing those numbers.
In addition, replacing the devices with others that showed an 
improvement (measured during testing) in the real life system showed an 
equivalent improvement (reduction) in end user complaints. So I feel 
reasonably sure that I am "on the right track"....

Am I overlooking something else (very possible)?
>> 520s: 70MB/s
>> 530s: 0.4MB/s
> ....
>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
>>   --iodepth=1 --runtime=60 --time_based --group_reporting
>>   --name=IntelDC3510_4kj1 --numjobs=1
> Arguably the 520s actually have no performance.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

I have seen this page, and all I can suggest is that the 480GB 520s 
performs very differently from the 60GB model. I see 70MB/s which is 
significantly different to the listed 9MB/s on that page.
This page matches my results (comparatively) that the 520 performs much 
better than the 535, though I don't have easy access to a 535 in order 
to run my destructive tests.... but I did run similar tests on a 535, 
and I ran the more thorough tests on the 530 and 540.

Regards,
Adam


^ permalink raw reply

* Re: Intel SSD or other brands
From: Peter Grandi @ 2016-12-29 11:39 UTC (permalink / raw)
  To: Linux RAID
In-Reply-To: <0329d841-984b-fa25-0bf2-0aba4d55b6de@websitemanagers.com.au>

> [ ... ] I now found out that performance of a 520s SSD is
> around 180 times faster than a 530s SSD. [ ... ]

Well "performance" can be roughly the same, even if "speed" can
be very different.

http://www.sabi.co.uk/blog/15-two.html#151023

> 520s: 70MB/s
> 530s: 0.4MB/s
....
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k 
>  --iodepth=1 --runtime=60 --time_based --group_reporting 
>  --name=IntelDC3510_4kj1 --numjobs=1

Arguably the 520s actually have no performance.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

^ permalink raw reply

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Binoy Jayan @ 2016-12-29  9:23 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Milan Broz, Oded, Ofir, David S. Miller, linux-crypto, Mark Brown,
	Arnd Bergmann, Linux kernel mailing list, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <20161223075114.GA3580@gondor.apana.org.au>

Hi Herbert,

Sorry for the delayed response, I was busy with testing dm-crypt
with bonnie++ for regressions. I tried to find some alternative
way to keep the IV algorithms' registration in the dm-crypt.
Also there were some changes done in dm-crypt keys structure too
recently.

c538f6e dm crypt: add ability to use keys from the kernel key retention service

On Thu, Dec 22, 2016 at 04:25:12PM +0530, Binoy Jayan wrote:
>
> > It doesn't have to live outside of dm-crypt.  You can register
> > these IV generators from there if you really want.
>
> Sorry, but I didn't understand this part.

What I mean is that moving the IV generators into the crypto API
does not mean the dm-crypt team giving up control over them.  You
could continue to keep them within the dm-crypt code base and
still register them through the normal crypto API mechanism

When we keep these in dm-crypt and more than one key is used
(actually more than one part of the original key), more than one
cipher instance is created, one for each unique part of the key.
Since the crypto requests are modelled to go through the template
ciphers in the order:

"essiv -> cbc -> aes"

a particular cipher instance of the IV (essiv in this example) is
responsible for encrypting an entire bigger block. If this bigger block
is later to be split into 512-byte blocks, each then encrypted using
another cipher instance chosen by the following formula:

key_index = sector & (key_count - 1)

this is not possible, as the cipher instances do not have access to
each other. So the number of keys used is crucial when
performing encryption.
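
To illustrate the formula, here is a minimal Python sketch of how consecutive 512-byte sectors would rotate through the keys. It assumes key_count is a power of two, which is what makes the bit mask behave like a modulo; the function name and the choice of 4 keys are only for illustration and are not taken from the dm-crypt source.

```python
def key_index(sector: int, key_count: int) -> int:
    """Select the key for a sector using the mask from the formula above."""
    # Assumption: key_count is a power of two, otherwise the mask is not
    # equivalent to sector % key_count.
    if key_count <= 0 or key_count & (key_count - 1):
        raise ValueError("key_count must be a power of two")
    return sector & (key_count - 1)


# Eight consecutive sectors of one "bigger block", with 4 keys.
indices = [key_index(sector, 4) for sector in range(8)]
print(indices)
```

With more than one key, neighbouring sectors inside the same bigger block map to different keys, which is why a single cipher instance cannot process the whole block and stay compatible.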

If there were only a single key, it would not be a problem.
But if there is more than one key, then encrypting a bigger block
with a single key would cause backward incompatibility.
I was wondering if this is acceptable.

bigger block: What I mean by bigger block here is the set of 512-byte
blocks that dm-crypt can be optimized to process at once.

Thanks,
Binoy

^ permalink raw reply

* Intel SSD or other brands
From: Adam Goryachev @ 2016-12-29  2:14 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Apologies for my prematurely sent email (if it gets through), this one 
is complete...

Hi all,

I've spent a number of years trying to build up a nice RAID array for my 
SAN, but I seem to be slowly solving one bottleneck only to find 
another one. Right now, I've identified the underlying SSDs as being a 
major factor in that performance issue.

I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this 
performed really well.
I added 1 x 480GB 530s SSD
I added 2 x 480GB 530s SSD

I now found out that performance of a 520s SSD is around 180 times 
faster than a 530s SSD. I had to run many tests, but eventually I found 
the right things to test for (which matched my real life results), and 
the numbers were nothing short of crazy.
Running each test 5 times and averaging the results...
520s: 70MB/s
530s: 0.4MB/s

OK, so before I could remove and test the 520s, I removed/tested one of 
the 530s and saw the horrible performance, so I bought and tested a 540s 
and found:
540s: 6.7MB/s
So, around 17 times better than the 530s, so I replaced all the drives
with the 540s, but I still have worse performance than the original 5 x
520s array.

Working with Intel, they swapped a 530s drive for a DC3510, and I then 
found the DC3510 was awesome:
DC3510: 99MB/s
Except, a few weeks back when I placed the order, I was told there is no
longer any stock of this drive (I wanted 16 x the 800GB model), and that
the replacement model is the DC3520. So I figured I wouldn't just blindly
buy the DC3520 assuming its performance would be similar to the previous
model, so I bought 4 x 480GB DC3520 and started testing.
DC3520: 37MB/s

So, about 1/3 of a DC3510: better than the current live 540s
drives, but still only half the original 520s drives.

Summary:
520s:   70217kB/s
530s:     391kB/s
540s:    6712kB/s
330s:      24kB/s
DC3510: 99313kB/s
DC3520: 37051kB/s
WD2TBCB*: 475kB/s

* For comparison, I had an older Western Digital Black 2TB spare, and ran
the same test on it. It got a better result than some of the SSDs, which
was really surprising, but it's certainly not an option.
FYI, the test I'm running is this:
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=IntelDC3510_4kj1 --numjobs=1
All drives were tested on the same machine/SATA3 port (basic intel 
desktop motherboard), with nothing on the drive (no fs, no partition, 
nothing trying to access it, etc..).
In reality, I tested iodepth from 1..10, but in my use case
iodepth=1 is the relevant number. At higher iodepth, performance
improves on all the drives; if anyone is interested, I can provide a
full set of my results/analysis.
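
As a rough cross-check of the summary above, the throughput figures can be
turned into writes per second and per-write latency. This sketch assumes
that fio's kB is 1024 bytes and that every write is exactly one 4096-byte
block (--bs=4k); the function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: kb_per_s counts 1024-byte units and each I/O is block_kb
 * units long (4 for the --bs=4k runs above). */
static inline double iops_from_kbps(double kb_per_s, double block_kb)
{
	assert(block_kb > 0.0);
	return kb_per_s / block_kb;
}

/* With iodepth=1 only one write is in flight at a time, so the mean
 * per-write latency in microseconds is roughly 1e6 / IOPS. */
static inline double latency_us_at_qd1(double iops)
{
	assert(iops > 0.0);
	return 1e6 / iops;
}

/* How many times faster drive A is than drive B. */
static inline double speed_ratio(double a_kbps, double b_kbps)
{
	assert(b_kbps > 0.0);
	return a_kbps / b_kbps;
}
```

Under those assumptions the 520s figure works out to roughly 17,500
writes/s (about 57 microseconds per write), and the 530s figure to roughly
98 writes/s (about 10 ms per write). The same arithmetic gives about 180x
for 520s vs 530s and about 17x for 540s vs 530s.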

So, my actual question... Can you suggest or have you tested any Intel 
(or other brand) SSD which has good performance (similar to the DC3510 
or the 520s)? (I can't buy and test every single variant out there, my 
budget doesn't go anywhere close to that).
It needs to be SATA, since I don't have enough PCIe slots to get the 
needed capacity (nor enough budget). I need around 8 x drives with 
around 6TB capacity in RAID5.

FYI, my storage stack is like this:
8 x SSD's
mdadm - RAID5
LVM
DRBD
iSCSI

From my understanding, it is DRBD that makes everything an iodepth=1
issue. It is possible to reach iodepth=2 if I have 2 x VMs both doing a
lot of IO at the same time, but it is usually single-VM performance that
is too limited.

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au

* Intel SSD or other brands
From: Adam Goryachev @ 2016-12-29  1:52 UTC (permalink / raw)
  To: linux-raid

Hi all,

I've spent a number of years trying to build up a nice RAID array for my
SAN, but I seem to be slowly solving one bottleneck only to find
another one. Right now, I've identified the underlying SSDs as being a
major factor in that performance issue.

I started with 5 x 480GB Intel 520s SSDs in a RAID5 array, and this
performed really well.
I added 1 x 480GB 530s SSD
I added 2 x 480GB 530s SSD

I have now found out that a 520s SSD is around 180 times
faster than a 530s SSD.

-- 
Adam Goryachev
Website Managers
P: +61 2 8304 0000                    adam@websitemanagers.com.au
F: +61 2 8304 0001                     www.websitemanagers.com.au

* [PATCH 2/2] md/r5cache: enable chunk_aligned_read with write back cache
From: Song Liu @ 2016-12-29  1:18 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen
In-Reply-To: <20161229011802.692478-1-songliubraving@fb.com>

Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.

For chunk aligned read, we track stripes in write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB pages,
so this chunk contains 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.

For each big_stripe, struct big_stripe_info tracks how many stripes of
this big_stripe are in the write back cache. These data are tracked
in a radix tree (big_stripe_tree). big_stripe_index() is used to
calculate keys for the radix tree.

chunk_aligned_read() calls r5c_big_stripe_cached() to look up
big_stripe of each chunk in the tree. If this big_stripe is in the
tree, chunk_aligned_read() aborts. This lookup is protected by
rcu_read_lock().

It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding a new flag, we reuse existing flags:
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
two flags is set, the stripe is counted in big_stripe_tree. This
requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write().

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 146 ++++++++++++++++++++++++++++++++++++++++++++---
 drivers/md/raid5.c       |  19 +++---
 drivers/md/raid5.h       |   1 +
 3 files changed, 152 insertions(+), 14 deletions(-)
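
The key calculation and per-big_stripe counting described in the commit
message can be sketched in userspace as follows. This is only an
illustration: struct geom, struct big_stripe_table and the bs_* helpers are
hypothetical, a small fixed array stands in for the radix tree, and there is
no locking or RCU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the r5conf fields that the patch reads. */
struct geom {
	uint32_t chunk_sectors;	/* sectors per chunk on one device */
	uint32_t raid_disks;
	uint32_t max_degraded;	/* 1 for RAID5, 2 for RAID6 */
};

/* Same arithmetic as big_stripe_index() in the patch: the quotient is the key. */
static inline uint64_t big_stripe_key(const struct geom *g, uint64_t sect)
{
	uint64_t span = (uint64_t)g->chunk_sectors *
			(g->raid_disks - g->max_degraded);

	assert(span != 0);
	return sect / span;
}

#define SLOTS 16

struct big_stripe_table {
	uint64_t key[SLOTS];
	unsigned int count[SLOTS];	/* 0 means the slot is free */
};

/* Count one more cached stripe under 'key'. Returns -1 if the table is
 * full; the patch writes the stripe out instead of caching it when its
 * allocation fails. */
static inline int bs_get(struct big_stripe_table *t, uint64_t key)
{
	int free_slot = -1;

	for (int i = 0; i < SLOTS; i++) {
		if (t->count[i] && t->key[i] == key) {
			t->count[i]++;
			return 0;
		}
		if (!t->count[i] && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0)
		return -1;
	t->key[free_slot] = key;
	t->count[free_slot] = 1;
	return 0;
}

/* Drop one cached stripe; the entry disappears when its count reaches 0. */
static inline void bs_put(struct big_stripe_table *t, uint64_t key)
{
	for (int i = 0; i < SLOTS; i++) {
		if (t->count[i] && t->key[i] == key) {
			t->count[i]--;
			return;
		}
	}
}

/* Non-zero if any stripe of this big_stripe is still counted as cached. */
static inline int bs_cached(const struct big_stripe_table *t, uint64_t key)
{
	for (int i = 0; i < SLOTS; i++)
		if (t->count[i] && t->key[i] == key)
			return 1;
	return 0;
}
```

For example, with chunk_sectors = 512 (a 256kB chunk) on a 5-disk RAID5 the
divisor is 512 * 4 = 2048, so sectors 0..2047 share key 0 and sector 2048
starts key 1. A chunk aligned read would be skipped for as long as
bs_cached() reports the key as present.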

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 817b294..268dcd2 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -162,9 +162,60 @@ struct r5l_log {
 
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
+
+	/* to for chunk_aligned_read in writeback mode, details below */
+	spinlock_t tree_lock;
+	struct radix_tree_root big_stripe_tree;
+	struct kmem_cache *bsi_kc;	/* kmem_cache for big_stripe_info */
+	mempool_t *big_stripe_info_pool;
 };
 
 /*
+ * Enable chunk_aligned_read() with write back cache.
+ *
+ * Each chunk may contain more than one stripe (for example, a 256kB
+ * chunk contains 64 4kB-page, so this chunk contain 64 stripes). For
+ * chunk_aligned_read, these stripes are grouped into one "big_stripe".
+ * For each big_stripe, struct big_stripe_info tracks how many stripes of
+ * this big_stripe are in the write back cache. These data are tracked
+ * in a radix tree (big_stripe_tree). big_stripe_index() is used to
+ * calculate keys for the radix tree.
+ *
+ * chunk_aligned_read() calls r5c_big_stripe_cached() to look up
+ * big_stripe of each chunk in the tree. If this big_stripe is in the
+ * tree, chunk_aligned_read() aborts. This look up is protected by
+ * rcu_read_lock().
+ *
+ * It is necessary to remember whether a stripe is counted in
+ * big_stripe_tree. Instead of adding new flag, we reuses existing flags:
+ * STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
+ * two flags are set, the stripe is counted in big_stripe_tree. This
+ * requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
+ * r5c_try_caching_write().
+ */
+struct big_stripe_info {
+	atomic_t count;
+#ifdef CONFIG_DEBUG_VM
+	void *pad;  /* suppress size check error in kmem_cache_sanity_check */
+#endif
+};
+
+/*
+ * calculate key for big_stripe_tree
+ *
+ * sect: align_bi->bi_iter.bi_sector or sh->sector
+ */
+static inline sector_t big_stripe_index(struct r5conf *conf,
+					sector_t sect)
+{
+	sector_t offset;
+
+	offset = sector_div(sect, conf->chunk_sectors *
+			    (conf->raid_disks - conf->max_degraded));
+	return sect;
+}
+
+/*
  * an IO range starts from a meta data block and end at the next meta data
  * block. The io unit's the meta data block tracks data/parity followed it. io
  * unit is written to log disk with normal write, as we always flush log disk
@@ -2293,6 +2344,8 @@ int r5c_try_caching_write(struct r5conf *conf,
 	int i;
 	struct r5dev *dev;
 	int to_cache = 0;
+	struct big_stripe_info *bsinfo;
+	sector_t bs_index;
 
 	BUG_ON(!r5c_is_writeback(log));
 
@@ -2327,6 +2380,37 @@ int r5c_try_caching_write(struct r5conf *conf,
 		}
 	}
 
+	/* if the stripe is not counted in big_stripe_tree, add it now */
+	if (!test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state) &&
+	    !test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		bs_index = big_stripe_index(conf, sh->sector);
+		spin_lock(&log->tree_lock);
+		bsinfo = radix_tree_lookup(&log->big_stripe_tree, bs_index);
+		if (bsinfo)
+			atomic_inc(&bsinfo->count);
+		else {
+			bsinfo = mempool_alloc(log->big_stripe_info_pool, GFP_ATOMIC);
+			if (bsinfo) {
+				atomic_set(&bsinfo->count, 1);
+				radix_tree_insert(&log->big_stripe_tree, bs_index, bsinfo);
+			}
+		}
+		spin_unlock(&log->tree_lock);
+		if (!bsinfo) {
+			/* in case we cannot allocate memory for the tree, do
+			 * not cache this stripe
+			 */
+			r5c_make_stripe_write_out(sh);
+			return -EAGAIN;
+		}
+
+		/* set STRIPE_R5C_PARTIAL_STRIPE, this shows the stripe is
+		 * counted in the radix tree
+		 */
+		set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
+		atomic_dec(&conf->r5c_cached_partial_stripes);
+	}
+
 	for (i = disks; i--; ) {
 		dev = &sh->dev[i];
 		if (dev->towrite) {
@@ -2401,17 +2485,19 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 				 struct stripe_head *sh,
 				 struct stripe_head_state *s)
 {
+	struct r5l_log *log = conf->log;
 	int i;
 	int do_wakeup = 0;
+	struct big_stripe_info *bsinfo;
+	sector_t bs_index;
 
-	if (!conf->log ||
-	    !test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
+	if (!log || !test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
 		return;
 
 	WARN_ON(test_bit(STRIPE_R5C_CACHING, &sh->state));
 	clear_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
 
-	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
 		return;
 
 	for (i = sh->disks; i--; ) {
@@ -2433,13 +2519,26 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 	if (do_wakeup)
 		wake_up(&conf->wait_for_overlap);
 
-	spin_lock_irq(&conf->log->stripe_in_journal_lock);
+	spin_lock_irq(&log->stripe_in_journal_lock);
 	list_del_init(&sh->r5c);
-	spin_unlock_irq(&conf->log->stripe_in_journal_lock);
+	spin_unlock_irq(&log->stripe_in_journal_lock);
 	sh->log_start = MaxSector;
 
-	atomic_dec(&conf->log->stripe_in_journal_count);
-	r5c_update_log_state(conf->log);
+	atomic_dec(&log->stripe_in_journal_count);
+	r5c_update_log_state(log);
+
+	/* stop counting this stripe in big_stripe_tree */
+	if (test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state) ||
+	    test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		bs_index = big_stripe_index(conf, sh->sector);
+		bsinfo = radix_tree_lookup(&log->big_stripe_tree, bs_index);
+
+		if (atomic_dec_and_lock(&bsinfo->count, &log->tree_lock)) {
+			radix_tree_delete(&log->big_stripe_tree, bs_index);
+			spin_unlock(&log->tree_lock);
+			mempool_free(bsinfo, log->big_stripe_info_pool);
+		}
+	}
 
 	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
 		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
@@ -2509,6 +2608,22 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	return 0;
 }
 
+/* check whether this big stripe is in write back cache. */
+bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect)
+{
+	struct r5l_log *log = conf->log;
+	sector_t bs_index;
+	struct big_stripe_info *bsinfo;
+
+	if (!log)
+		return false;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	bs_index = big_stripe_index(conf, sect);
+	bsinfo = radix_tree_lookup(&log->big_stripe_tree, bs_index);
+	return bsinfo != NULL;
+}
+
 static int r5l_load_log(struct r5l_log *log)
 {
 	struct md_rdev *rdev = log->rdev;
@@ -2642,6 +2757,17 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->meta_pool)
 		goto out_mempool;
 
+	log->bsi_kc = KMEM_CACHE(big_stripe_info, 0);
+	if (!log->bsi_kc)
+		goto bsi_kc;
+
+	log->big_stripe_info_pool = mempool_create_slab_pool(4, log->bsi_kc);
+	if (!log->big_stripe_info_pool)
+		goto big_stripe_info_pool;
+
+	spin_lock_init(&log->tree_lock);
+	INIT_RADIX_TREE(&log->big_stripe_tree, GFP_ATOMIC);
+
 	log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
@@ -2674,6 +2800,10 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	rcu_assign_pointer(conf->log, NULL);
 	md_unregister_thread(&log->reclaim_thread);
 reclaim_thread:
+	mempool_destroy(log->big_stripe_info_pool);
+big_stripe_info_pool:
+	kmem_cache_destroy(log->bsi_kc);
+bsi_kc:
 	mempool_destroy(log->meta_pool);
 out_mempool:
 	bioset_free(log->bs);
@@ -2689,6 +2819,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 void r5l_exit_log(struct r5l_log *log)
 {
 	md_unregister_thread(&log->reclaim_thread);
+	mempool_destroy(log->big_stripe_info_pool);
+	kmem_cache_destroy(log->bsi_kc);
 	mempool_destroy(log->meta_pool);
 	bioset_free(log->bs);
 	mempool_destroy(log->io_pool);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6d5391f..7608acc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -285,13 +285,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 						atomic_dec(&conf->r5c_cached_partial_stripes);
 					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
 					r5c_check_cached_full_stripe(conf);
-				} else {
-					/* partial stripe */
-					if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
-							      &sh->state))
-						atomic_inc(&conf->r5c_cached_partial_stripes);
+				} else
+					/* STRIPE_R5C_PARTIAL_STRIPE is set in
+					 * r5c_try_caching_write(). No need to
+					 * set it again.
+					 */
 					list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
-				}
 			}
 		}
 	}
@@ -5020,6 +5019,13 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 		      rdev->recovery_offset >= end_sector)))
 			rdev = NULL;
 	}
+
+	if (r5c_big_stripe_cached(conf, align_bi->bi_iter.bi_sector)) {
+		rcu_read_unlock();
+		bio_put(align_bi);
+		return 0;
+	}
+
 	if (rdev) {
 		sector_t first_bad;
 		int bad_sectors;
@@ -5376,7 +5382,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	 * data on failed drives.
 	 */
 	if (rw == READ && mddev->degraded == 0 &&
-	    !r5c_is_writeback(conf->log) &&
 	    mddev->reshape_position == MaxSector) {
 		bi = chunk_aligned_read(mddev, bi);
 		if (!bi)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6cc8d4c..efc9922 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -790,4 +790,5 @@ extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
+extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 #endif
-- 
2.9.3


* [PATCH 1/2] md/r5cache: move clear_bit STRIPE_R5C_PARTIAL/FULL_STRIPE
From: Song Liu @ 2016-12-29  1:18 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

This patch moves clear_bit of STRIPE_R5C_PARTIAL_STRIPE and
STRIPE_R5C_FULL_STRIPE to r5c_finish_stripe_write_out(). This
prepares for the next patch, which enables chunk_aligned_read with
write back cache.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index ac98a82..817b294 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -410,16 +410,6 @@ void r5c_make_stripe_write_out(struct stripe_head *sh)
 
 	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 		atomic_inc(&conf->preread_active_stripes);
-
-	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
-		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
-		atomic_dec(&conf->r5c_cached_partial_stripes);
-	}
-
-	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
-		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
-		atomic_dec(&conf->r5c_cached_full_stripes);
-	}
 }
 
 static void r5c_handle_data_cached(struct stripe_head *sh)
@@ -2447,8 +2437,19 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 	list_del_init(&sh->r5c);
 	spin_unlock_irq(&conf->log->stripe_in_journal_lock);
 	sh->log_start = MaxSector;
+
 	atomic_dec(&conf->log->stripe_in_journal_count);
 	r5c_update_log_state(conf->log);
+
+	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_partial_stripes);
+	}
+
+	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_full_stripes);
+	}
 }
 
 int
-- 
2.9.3


* Re: SAS disk from RAID card (no RAID mode) problems
From: John Stoffel @ 2016-12-28 21:28 UTC (permalink / raw)
  To: Joe Landman; +Cc: IW News, linux-raid
In-Reply-To: <97a5cd89-d3b7-e3e4-7b02-84e97efa7dbd@gmail.com>

>>>>> "Joe" == Joe Landman <joe.landman@gmail.com> writes:

Joe> On 12/27/2016 11:57 AM, IW News wrote:
>> On 27/12/16 17:12, Thomas Fjellstrom wrote:
>>> On Tuesday, December 27, 2016 8:53:07 AM MST IW News wrote:
>>>> On 26/12/16 22:25, Thomas Fjellstrom wrote:
>>>>> On Friday, December 23, 2016 9:01:40 AM MST IW News wrote:
>>>>> Hello,

Joe> [...]

>>>>> If thats the mvsas driver, god help you. I had nothing but 
>>>>> troubles with
>>>>> it
>>>>> for a good year or two or possibly more and it was never fixed or 
>>>>> really
>>>>> acknowledged, so I sold that card (a supermicro aoc-saslp-mv8) and 
>>>>> picked
>>>>> up a IBM m1015 (LSI 9211-8i a-like) and haven't looked back.
>>>>> 
>>>>> Basically it would reset a lot, and lock up. AND I believe it 
>>>>> caused a lot
>>>>> of silent corruption as well.
>>>> Hi,
>>>> 
>>>> So there is no support for the mvsas driver?
>>>> 
>>>> The controller is integrated in the motherboard. I could by a PCIe one
>>>> but, what's the deal them?
>>>> 
>>>> Thank you.
>>>> 
>>>> Iñigo.
>>> I don't know that there isn't support, there has been commits to that 
>>> driver
>>> pretty regularly even when i was having problems. I just found there 
>>> was no
>>> support for my particular chipset. I hope you fare better than I did. 
>>> Contact
>>> linux-scsi as was suggested, maybe they can help.
>>> 
>> I already contacted. No answer.
>> 
>> Maybe there is no mvsas developer active there now.

Joe> mvsas support under Linux is terrible.  I know you probably don't
Joe> want to hear this, but get another card.  Someone recommended an
Joe> LSI9211 card.  Past experience with those have been spotty.  They
Joe> are cheap, but I'd recommend a 9207-8i.  Costs a little more, but
Joe> generally works very well.

I bought one and it's been running rock solid in my home system for
months and months.  Got it off eBay for around $120 or so; I forget the
exact price.

Joe> Again, if you are using an mvsas based card, you should expect
Joe> problems and data corruption.  If you use the 9207, it should
Joe> work nicely.

I concur, LSI makes some solid gear.  Get the ones without the RAID
ROM too if you can, or flash it and just use it as a JBOD controller.

Mine is:

02:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

* [PATCH] imsm: enable bad block support for imsm metadata
From: Tomasz Majchrzak @ 2016-12-28  8:38 UTC (permalink / raw)
  To: linux-raid; +Cc: Jes.Sorensen, Tomasz Majchrzak

Enable bad block support for imsm metadata as commit e522751d605d
("seq_file: reset iterator to first record for zero offset") has been
accepted in the upstream kernel. Prior to that patch, mdmon was not able
to read the bad blocks sysfs file.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
---
 super-intel.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/super-intel.c b/super-intel.c
index df95957..29b2163 100644
--- a/super-intel.c
+++ b/super-intel.c
@@ -3209,7 +3209,7 @@ static void getinfo_super_imsm_volume(struct supertype *st, struct mdinfo *info,
 							info->array.chunk_size,
 							super->sector_size,
 							info->component_size);
-	info->bb.supported = 0;
+	info->bb.supported = 1;
 
 	memset(info->uuid, 0, sizeof(info->uuid));
 	info->recovery_start = MaxSector;
@@ -3376,7 +3376,7 @@ static void getinfo_super_imsm(struct supertype *st, struct mdinfo *info, char *
 	info->name[0] = 0;
 	info->recovery_start = MaxSector;
 	info->recovery_blocked = imsm_reshape_blocks_arrays_changes(st->sb);
-	info->bb.supported = 0;
+	info->bb.supported = 1;
 
 	/* do we have the all the insync disks that we expect? */
 	mpb = super->anchor;
@@ -7380,7 +7380,7 @@ static struct mdinfo *container_content_imsm(struct supertype *st, char *subarra
 				info_d->component_size = blocks_per_member(map);
 			}
 
-			info_d->bb.supported = 0;
+			info_d->bb.supported = 1;
 			get_volume_badblocks(super->bbm_log, ord_to_idx(ord),
 					     info_d->data_offset,
 					     info_d->component_size,
@@ -8410,7 +8410,7 @@ static struct mdinfo *imsm_activate_spare(struct active_array *a,
 		di->data_offset = pba_of_lba0(map);
 		di->component_size = a->info.component_size;
 		di->container_member = inst;
-		di->bb.supported = 0;
+		di->bb.supported = 1;
 		super->random = random32();
 		di->next = rv;
 		rv = di;
-- 
1.8.3.1


* Re: SAS disk from RAID card (no RAID mode) problems
From: Thomas Fjellstrom @ 2016-12-28  1:06 UTC (permalink / raw)
  To: Joe Landman; +Cc: IW News, linux-raid
In-Reply-To: <97a5cd89-d3b7-e3e4-7b02-84e97efa7dbd@gmail.com>

On Tuesday, December 27, 2016 12:09:52 PM MST Joe Landman wrote:
> On 12/27/2016 11:57 AM, IW News wrote:
> > On 27/12/16 17:12, Thomas Fjellstrom wrote:
> >> On Tuesday, December 27, 2016 8:53:07 AM MST IW News wrote:
> >>> On 26/12/16 22:25, Thomas Fjellstrom wrote:
> >>>> On Friday, December 23, 2016 9:01:40 AM MST IW News wrote:
> >>>>> Hello,
> 
> [...]
> 
> >>>>> If thats the mvsas driver, god help you. I had nothing but
> >>>>> troubles with
> >>>> 
> >>>> it
> >>>> for a good year or two or possibly more and it was never fixed or
> >>>> really
> >>>> acknowledged, so I sold that card (a supermicro aoc-saslp-mv8) and
> >>>> picked
> >>>> up a IBM m1015 (LSI 9211-8i a-like) and haven't looked back.
> >>>> 
> >>>> Basically it would reset a lot, and lock up. AND I believe it
> >>>> caused a lot
> >>>> of silent corruption as well.
> >>> 
> >>> Hi,
> >>> 
> >>> So there is no support for the mvsas driver?
> >>> 
> >>> The controller is integrated in the motherboard. I could by a PCIe one
> >>> but, what's the deal them?
> >>> 
> >>> Thank you.
> >>> 
> >>> Iñigo.
> >> 
> >> I don't know that there isn't support, there has been commits to that
> >> driver
> >> pretty regularly even when i was having problems. I just found there
> >> was no
> >> support for my particular chipset. I hope you fare better than I did.
> >> Contact
> >> linux-scsi as was suggested, maybe they can help.
> > 
> > I already contacted. No answer.
> > 
> > Maybe there is no mvsas developer active there now.
> 
> mvsas support under Linux is terrible.   I know you probably don't want
> to hear this, but get another card.   Someone recommended an LSI9211
> card.  Past experience with those have been spotty.  They are cheap, but
> I'd recommend a 9207-8i.  Costs a little more, but generally works very
> well.
> 
> Again, if you are using an mvsas based card, you should expect problems
> and data corruption.  If you use the 9207, it should work nicely.
> 
> The windows world has all sorts of workarounds for wonky cards/chipsets
> in their drivers.  Generally, if the driver is not actively supported in
> linux and up to date, you are likely going to have problems.  If others
> are reporting problems (google "mvsas problems in linux" if you want to
> see how long people have been having problems with the cards), stay far
> away from it.

I had hoped things had improved :( When I was having problems with my
mvsas-based card, there was someone from Marvell making commits to the code,
and I got some kind of response once or twice, but they stopped replying and
the problems were never fixed.

I however haven't had any real issues with my dirt cheap ($50-100) IBM M1015
cards (I have three). A lot of people have had good results with them.
Technically it's a 9220, which is quite similar to a 9211, but has some slight
differences.

-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

* Re: SAS disk from RAID card (no RAID mode) problems
From: Joe Landman @ 2016-12-27 17:09 UTC (permalink / raw)
  To: IW News, linux-raid
In-Reply-To: <658c4d05-8105-1f02-633d-12bd8d00fb49@imagedworld.com>

On 12/27/2016 11:57 AM, IW News wrote:
> On 27/12/16 17:12, Thomas Fjellstrom wrote:
>> On Tuesday, December 27, 2016 8:53:07 AM MST IW News wrote:
>>> On 26/12/16 22:25, Thomas Fjellstrom wrote:
>>>> On Friday, December 23, 2016 9:01:40 AM MST IW News wrote:
>>>>> Hello,

[...]

>>>>> If thats the mvsas driver, god help you. I had nothing but 
>>>>> troubles with
>>>> it
>>>> for a good year or two or possibly more and it was never fixed or 
>>>> really
>>>> acknowledged, so I sold that card (a supermicro aoc-saslp-mv8) and 
>>>> picked
>>>> up a IBM m1015 (LSI 9211-8i a-like) and haven't looked back.
>>>>
>>>> Basically it would reset a lot, and lock up. AND I believe it 
>>>> caused a lot
>>>> of silent corruption as well.
>>> Hi,
>>>
>>> So there is no support for the mvsas driver?
>>>
>>> The controller is integrated in the motherboard. I could by a PCIe one
>>> but, what's the deal them?
>>>
>>> Thank you.
>>>
>>> Iñigo.
>> I don't know that there isn't support, there has been commits to that 
>> driver
>> pretty regularly even when i was having problems. I just found there 
>> was no
>> support for my particular chipset. I hope you fare better than I did. 
>> Contact
>> linux-scsi as was suggested, maybe they can help.
>>
> I already contacted. No answer.
>
> Maybe there is no mvsas developer active there now.

mvsas support under Linux is terrible.   I know you probably don't want
to hear this, but get another card.   Someone recommended an LSI9211
card.  Past experience with those has been spotty.  They are cheap, but
I'd recommend a 9207-8i.  Costs a little more, but generally works very
well.

Again, if you are using an mvsas based card, you should expect problems 
and data corruption.  If you use the 9207, it should work nicely.

The Windows world has all sorts of workarounds for wonky cards/chipsets
in their drivers.  Generally, if the driver is not actively supported in
Linux and up to date, you are likely going to have problems.  If others
are reporting problems (google "mvsas problems in linux" if you want to
see how long people have been having problems with the cards), stay far
away from it.

-- 
Joe Landman
e: joe.landman@gmail.com
t: @hpcjoe
c: +1 734 612 4615
w: https://scalability.org

* Re: SAS disk from RAID card (no RAID mode) problems
From: IW News @ 2016-12-27 16:57 UTC (permalink / raw)
  To: linux-raid
In-Reply-To: <2822438.YHxX6ucKrv@natasha>

On 27/12/16 17:12, Thomas Fjellstrom wrote:
> On Tuesday, December 27, 2016 8:53:07 AM MST IW News wrote:
>> On 26/12/16 22:25, Thomas Fjellstrom wrote:
>>> On Friday, December 23, 2016 9:01:40 AM MST IW News wrote:
>>>> Hello,
>>>>
>>>> First message here.
>>>>
>>>> After looking for a solution without any luck I have found this list. I
>>>> hope someone can help me with this.
>>>>
>>>> I have an ASUS P6T Deluxe with a MARVELL 88SE63xx SAS RAID controller.
>>>> There are to identical 400GB SAS SSD drives attached to it. One of them
>>>> has a Windows 10 installation, the other one Linux.
>>>> Grub is installed on the second disk.
>>>>
>>>> Windows works as expected, but I have problems with the Linux
>>>> installation: the desktop environment freezes for some second once in a
>>>> while. This occurs with Mint Cinnamon, OpenSuSe KDE, Ubuntu and Manjaro
>>>> KDE. All of them are current installations. I'm now working in up to
>>>> date Manjaro KDE (kernel 4.9.0).
>>>> When the temporary freezes occur the mouse pointer moves, some windows
>>>> are updated correctly, other do not and DE stops working.
>>>> When this happens always I have a system log like this:
>>> If thats the mvsas driver, god help you. I had nothing but troubles with
>>> it
>>> for a good year or two or possibly more and it was never fixed or really
>>> acknowledged, so I sold that card (a supermicro aoc-saslp-mv8) and picked
>>> up a IBM m1015 (LSI 9211-8i a-like) and haven't looked back.
>>>
>>> Basically it would reset a lot, and lock up. AND I believe it caused a lot
>>> of silent corruption as well.
>> Hi,
>>
>> So there is no support for the mvsas driver?
>>
>> The controller is integrated in the motherboard. I could by a PCIe one
>> but, what's the deal them?
>>
>> Thank you.
>>
>> Iñigo.
> I don't know that there isn't support, there has been commits to that driver
> pretty regularly even when i was having problems. I just found there was no
> support for my particular chipset. I hope you fare better than I did. Contact
> linux-scsi as was suggested, maybe they can help.
>
I already contacted them. No answer.

Maybe there is no mvsas developer active there now.


* [PATCH v1 38/54] md/raid1.c: convert to bio_for_each_segment_all_sp()
From: Ming Lei @ 2016-12-27 16:04 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Ming Lei, Shaohua Li,
	open list:SOFTWARE RAID Multiple Disks SUPPORT
In-Reply-To: <1482854706-14128-1-git-send-email-tom.leiming@gmail.com>

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 drivers/md/raid1.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a1f3fbed9100..4818f40e7ce9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -988,12 +988,13 @@ static void alloc_behind_pages(struct bio *bio, struct r1bio *r1_bio)
 {
 	int i;
 	struct bio_vec *bvec;
+	struct bvec_iter_all bia;
 	struct bio_vec *bvecs = kzalloc(bio->bi_vcnt * sizeof(struct bio_vec),
 					GFP_NOIO);
 	if (unlikely(!bvecs))
 		return;
 
-	bio_for_each_segment_all(bvec, bio, i) {
+	bio_for_each_segment_all_sp(bvec, bio, i, bia) {
 		bvecs[i] = *bvec;
 		bvecs[i].bv_page = alloc_page(GFP_NOIO);
 		if (unlikely(!bvecs[i].bv_page))
-- 
2.7.4

