Subject: Re: [PATCH 1/3] Make /dev/urandom scalable
From: Austin S Hemmelgarn
To: Andi Kleen, tytso@mit.edu
Cc: linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com, herbert@gondor.apana.org.au, Andi Kleen
Date: Wed, 23 Sep 2015 15:40:58 -0400

On 2015-09-22 19:16, Andi Kleen wrote:
> From: Andi Kleen
>
> We had a case where a 4-socket system spent >80% of its total CPU time
> contending on the global urandom nonblocking pool spinlock. While the
> application could probably have used its own PRNG, it may have valid
> reasons to use the best possible key for different session keys.
>
> The application still ran acceptably on 2S, but just fell over the
> locking cliff on 4S.
>
> Implementation
> ==============
>
> The non-blocking pool is widely used these days, from every execve()
> (to set up AT_RANDOM for ld.so randomization), to getrandom(2), to
> frequent /dev/urandom users in user space. Clearly, having such a
> popular resource under a global lock is a bad thing.
>
> This patch changes the random driver to use distributed per-NUMA-node
> nonblocking pools. The basic structure is not changed: entropy is
> first fed into the input pool and later distributed from there
> round-robin into the blocking and non-blocking pools. This patch
> extends this to use a dedicated non-blocking pool for each node, and
> distributes evenly from the input pool into these per-node pools, in
> addition to the blocking pool.
>
> Then every urandom/getrandom user fetches data from its node-local
> pool. At boot time, when users may still be waiting for non-blocking
> pool initialization, we use the node 0 non-blocking pool, to avoid
> the need for different wake-up queues.
>
> For single-node systems (like the vast majority of non-server
> systems) nothing changes. There is still only a single non-blocking
> pool.
>
> The different per-node pools also start from different initial states
> and diverge more and more over time, as they are fed different input
> data. So "replay" attacks become difficult after some time.

I really like this idea, as it both makes getting random numbers on
busy servers faster, and makes replay attacks more difficult.
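To make the node-local fetch described above concrete, here is a
minimal sketch of what pool selection could look like. This is
illustrative only, not code from the patch: nonblocking_node_pool and
get_nonblocking_pool() are hypothetical names, while nonblocking_pool
and numa_node_id() are existing kernel symbols.

	/* Illustrative fragment in the style of drivers/char/random.c. */

	/* Hypothetical per-node pool array; NULL on single-node systems. */
	static struct entropy_store **nonblocking_node_pool;

	static struct entropy_store *get_nonblocking_pool(void)
	{
		/*
		 * Once the per-node pools exist and the non-blocking pool
		 * is initialized, hand each caller its node-local pool.
		 */
		if (nonblocking_node_pool && nonblocking_pool.initialized)
			return nonblocking_node_pool[numa_node_id()];

		/*
		 * Early in boot (or on single-node systems) fall back to
		 * the node 0 pool, so readers waiting for initialization
		 * all block on a single wake-up queue.
		 */
		return &nonblocking_pool;
	}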
> Without hardware random number seed support, the initial states
> (until enough real entropy is collected) are not very random, but
> that's no worse than before.
>
> Since we still have a global input pool, there are no problems with
> load-balancing entropy data between nodes. Any node that never runs
> any interrupts would still get the same amount of entropy as other
> nodes.
>
> Entropy is fed preferentially to the nodes that need it more, using
> the existing 75% threshold.
>
> For saving/restoring /dev/urandom, there is currently no mechanism to
> access the non-local node pools (short of setting task affinity).
> This implies that the standard init/exit random save/restore scripts
> would currently only save node 0. On restore, all pools are updated,
> so the entropy of the other nodes is lost over reboot. That seems
> acceptable to me for now (fixing this would need a new, separate
> save/restore interface).

I agree that this is acceptable; it wouldn't be hard for someone who
wants this to modify the script to set its own task affinity and loop
through the nodes (although that might get confusing with
hot-plugged/hot-removed nodes).

> Scalability
> ===========
>
> I tested the patch with a simple will-it-scale test banging on
> get_random() in parallel on more and more CPUs. Of course that is not
> a realistic scenario, as real programs should do some work between
> getting random numbers, but it's a worst case for random scalability.
>
> On a 4S Xeon v3 system _without_ the patchkit, the benchmark maxes
> out when using all the threads of one node. After that it quickly
> settles to about half the throughput of one node with 2-4 nodes.
>
> (all throughput factors, bigger is better)
>
> Without the patchkit:
>
> 1 node:  1x
> 2 nodes: 0.75x
> 3 nodes: 0.55x
> 4 nodes: 0.42x
>
> With the patchkit applied:
>
> 1 node:  1x
> 2 nodes: 2x
> 3 nodes: 3.4x
> 4 nodes: 6x
>
> So it's not quite linear scalability, but 6x maximum throughput is
> already a lot better.
>
> A node can still have a large number of CPUs: on my test system, 36
> logical threads (18C * 2T). In principle it may make sense to split
> it up further; per logical CPU would clearly be overkill, and would
> also add more pressure on the input pool. For now, per node seems
> like an acceptable compromise.

I'd almost say that making the partitioning level configurable at
build time might be useful. I can see possible value in being able to
at least partition down to physical cores (so, shared between the
HyperThreads on Intel processors, and between cores in the same
compute module on AMD processors), as that could potentially help
people running large numbers of simulations in parallel.

Personally, I'm the type who would be willing to take the performance
hit of doing it per logical CPU just because it would make replay
attacks more difficult, but I'm probably part of a very small minority
there.

> /dev/random still uses a single global lock. For now that seems
> acceptable, as it normally cannot be used for really high-volume
> accesses anyway.
>
> The input pool also still uses a global lock. The existing per-CPU
> fast pool and "give up when busy" mechanism seem to scale well
> enough, even on larger systems.
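For anyone unfamiliar with the "give up when busy" mechanism referred
to above: roughly, interrupt handlers mix into a lock-free per-CPU
fast pool and only take the shared input pool lock opportunistically.
A simplified sketch of that path, not the exact code from
drivers/char/random.c:

	void add_interrupt_randomness(int irq, int irq_flags)
	{
		struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness);

		/* Lock-free mixing into this CPU's fast pool. */
		fast_mix(fast_pool);

		/*
		 * Only feed the global input pool when its lock is
		 * uncontended; otherwise give up and try again on a
		 * later interrupt, so handlers never spin here.
		 */
		if (!spin_trylock(&input_pool.lock))
			return;

		__mix_pool_bytes(&input_pool, &fast_pool->pool,
				 sizeof(fast_pool->pool));
		spin_unlock(&input_pool.lock);
	}

This is why the global input pool lock scales acceptably even though
it is shared: contention is shed rather than waited on.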