Argon2id Parallelism explanation

I’m having trouble understanding the meaning of the Parallelism setting for Argon2id.

Does increasing parallelism make it harder or easier to brute force my password (assuming all other settings are kept the same)?

My understanding is that increasing iterations and memory makes it harder to brute force my password. Does parallelism also work like this, where increasing it (e.g., from 2 to 4) makes the password harder to brute force?


Here is a good community posting about the different settings for Argon2, including some comments from the author of the pull request that helped bring the feature to Bitwarden.

OWASP recommends a parallelism of 1. But elsewhere, I see recommendations to set parallelism based on how many CPU cores/threads you have.

Can anyone confirm that increasing parallelism above 1 is making things more secure? Are there any scenarios where it is not a good idea to increase above 1?

My layman’s understanding is that GPUs have many cores but limited memory that is shared by all cores. Therefore, if you have set the Argon2 memory cost to be sufficiently large, then a GPU-based attack will not be able to take full advantage of the parallelism. On the other hand, a multithreaded CPU will be able to take advantage of the parallelism (except in the current WebAssembly implementation), so that increasing the number of lanes allows you to increase the memory requirement while keeping the unlock time fixed.

Basically, increasing parallelism will reduce the impact on unlock/login time of the memory and iterations costs, allowing you to increase these costs for an attacker. Conversely, increasing parallelism while keeping memory and number of iterations fixed will not increase your security.

Say I have parallelism set to 1 and then raise memory and iterations so there is a noticeable delay. If I then raise parallelism to 2 and the delay goes away, doesn’t this mean it is making it easier for an attacker to brute force?

Again, my understanding of this is not 100%, so I may have some of the details incorrect, but this is how I think about it:

Let’s consider an RTX 4090 GPU, which has 16,384 cores and 24 GB of memory shared between all cores. No matter what KDF settings you have, the attacker can divide and conquer the brute-force search space to speed up the search (e.g., letting one core test all passwords that start with A, while the next core tests all passwords that start with B, etc.). With PBKDF2 as the hashing algorithm, each GPU core can claim up to 1,430 KiB (over 1 MiB) of memory to itself (since 16,384 × 1,430 KiB ≈ 24 GB), which is more than enough to run the PBKDF2 algorithm. Thus, all 16,384 GPU cores can be working on the brute-force attack simultaneously, greatly reducing the time required to crack the password.
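A quick back-of-the-envelope check of that arithmetic (the core and memory figures are the assumed RTX 4090 numbers from above; GB is taken as decimal):

```python
# Assumed figures from the text: 16,384 cores, 24 GB (decimal) of memory.
GPU_CORES = 16_384
GPU_MEMORY_BYTES = 24 * 10**9

# Memory each core can claim if the board's memory is split evenly.
# PBKDF2 needs only a tiny working state, so this is far more than enough.
per_core_kib = GPU_MEMORY_BYTES // GPU_CORES // 1024
print(per_core_kib)  # 1430 KiB per core, i.e. over 1 MiB
```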

However, if you use Argon2id, then the available memory on the GPU becomes a rate-limiting factor. For example, if you set the Argon2id memory cost to M=256 MiB and parallelism to L=1, then the available 24 GB of memory can only be used by 89 cores working simultaneously (because 89×256 MiB = 23.9 GB, but 90×256 MiB > 24 GB).

Let’s suppose that you can get a similar unlock speed on your devices by doubling both the memory and parallelism (making M=512 MiB and L=2). How does this affect the attacker? Now the 16,384 GPU cores can work in pairs to take advantage of the parallelization of the Argon2id algorithm, and the greatest number of such pairs that can work simultaneously is 16,384/2 = 8,192 pairs. However, the available memory can now be segmented into at most 44 independent 512-MiB chunks, so at most, 44 core-pairs (88 cores) can be working simultaneously.

Now, remember that increasing M also increased the unlock time (i.e., the time required to execute the Argon2id algorithm) — e.g., above we assumed that the time for one core-pair to compute one hash when M=512 MiB and L=2 was similar to the time for a single core to compute one hash when M=256 MiB and L=1. Thus, the case with the higher parallelism (M=512 MiB and L=2) should result in a cracking rate that is about half the rate achieved with the lower parallelism (M=256 MiB and L=1). TL;DR: The bottom line is that increasing parallelism (if and only if accompanied by an increase in memory cost) will slow down a GPU-based attack without increasing the user’s unlock/login time.
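The memory-bound argument in the last three paragraphs can be sketched as a toy model (not a benchmark; the GPU figures are the assumed ones from above, and the per-hash time is idealized):

```python
# Toy model of the memory-bound GPU attack described above.
GPU_MEMORY_BYTES = 24 * 10**9  # assumed 24 GB (decimal)
MIB = 2**20

def simultaneous_instances(memory_cost_mib: int) -> int:
    """Independent Argon2 computations that fit in GPU memory at once."""
    return GPU_MEMORY_BYTES // (memory_cost_mib * MIB)

low = simultaneous_instances(256)   # M=256 MiB, L=1
high = simultaneous_instances(512)  # M=512 MiB, L=2
print(low, high)  # 89 44

# Assuming the per-hash wall time is about the same in both settings
# (the doubled memory is offset by the doubled lane count), the
# cracking rate with M=512 MiB is roughly half that with M=256 MiB:
print(high / low)  # ~0.49
```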

In the above analysis, I have not taken into account possible rate-limiting effects of the data transfer speeds through the memory bus. My understanding is that in some cases, bus speed could be the dominant factor (as opposed to memory size being the dominant factor, which is what I assumed above), but this is beyond what I am able to analyze with my current level of knowledge.


On Wikipedia you can find the Argon2 algorithm in pseudocode:

To my understanding, Argon2 doesn’t use parallelism to increase speed (i.e. by using more CPU cores to do the same work), but to increase complexity even more. It uses the available CPU cores to do even more work in the same time. You can see this in the calculations done for the “rows” (or “lanes”) which depend on the parallelism parameter. Argon2 puts these rows together in the end, forming the final hash.
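To make that structure concrete, here is a drastically simplified toy sketch of the lane idea (this is NOT real Argon2 — the actual algorithm in the Wikipedia pseudocode fills a memory matrix with a BLAKE2b-based compression function and a specific block-indexing scheme; `toy_lane_kdf` is invented purely for illustration). The structural point is only that each of the p lanes does a full column of work and the lanes are combined at the end, so more lanes means more total work, not a shortcut:

```python
import hashlib
from functools import reduce

def toy_lane_kdf(password: bytes, salt: bytes,
                 parallelism: int, iterations: int) -> bytes:
    """Toy illustration only: p independent 'lanes', combined at the end."""
    lane_results = []
    for lane in range(parallelism):      # lanes could run on separate threads
        block = password + salt + lane.to_bytes(4, "little")
        for _ in range(iterations):      # every lane does the full loop
            block = hashlib.blake2b(block).digest()
        lane_results.append(block)
    # Like Argon2, combine the final block of every lane by XOR,
    # then hash once more to produce the output.
    combined = bytes(reduce(lambda a, b: a ^ b, col)
                     for col in zip(*lane_results))
    return hashlib.blake2b(combined).digest()

# Different parallelism values produce different hashes:
h1 = toy_lane_kdf(b"hunter2", b"salt", parallelism=1, iterations=1000)
h2 = toy_lane_kdf(b"hunter2", b"salt", parallelism=2, iterations=1000)
print(h1 != h2)  # True
```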

This means that it won’t be any faster for you, but as long as your CPU has enough resources, it won’t be slower than a single-threaded hashing function, either.

On the other hand, an attacker either needs the same number of parallel threads for their calculations or faces a huge increase in the time needed.

(Disclaimer: I’m no crypto expert, either. This is my understanding based on the algorithm description.)

That pseudo-code will have to wait until I am able to concentrate better.

In the meantime, I did some experiments on the WASM Argon2 demo page:

https://antelle.net/argon2-browser/

In my tests (on a Chrome browser in Windows 10), the parallelism setting did not appear to affect the time required to compute the hash.

  • On the one hand, this supports your understanding that parallelism only makes the algorithm do more work in the same time.

  • On the other hand, @Quexten has previously reported that the WASM implementation of Argon2id does not support parallelism (and I believe he has implied that this results in longer unlock times). Thus, I had assumed that if parallelism was set to, say, L=4, the WASM implementation would require 4 times as much time, because the lanes could not be computed in parallel (and would therefore have to be computed in series). The observation that the elapsed time is similar when L is increased suggests that the WASM implementation just ignores the parallelism setting and always uses L=1 (in which case we don’t know whether your interpretation is correct, because we don’t know whether increasing L reduces the elapsed time).

Argon2id does not support parallelism (and I believe he has implied that this results in longer unlock times).

To be clear, setting parallelism to a value other than p=1 is supported. It just won’t make use of the available computational resources and memory bandwidth, because it only runs on a single thread. And yes, it is about 4x (not quite, but somewhere in that region) slower.

This is due to how WebAssembly handles multithreading. The currently compiled version does not support threads and runs the code sequentially, even with parallelism set. Technically, multithreading is possible: Emscripten supports it (pthreads) by creating web workers in JavaScript and running individual WASM code on each web worker. I have a branch implementing this for both the demo site and for Bitwarden. Compare with the multi-threaded version here:

https://argon2-browser.quexten.com/

You should see much faster (nearly native) unlock times on this test site, with high parallelism.

The problem is that this has quite a few edge cases that need to be handled, and proper fallback is also not easy. It would require a fair amount of work to integrate smoothly into Bitwarden, so there is no pull request to add this yet. (Also, my pull requests with other, simpler Argon2 optimizations are still being reviewed.)

The observation of similar time elapsed when L is increased suggests that the WASM implementation just ignores the parallelism setting and always uses L=1

No. The hashes differ when setting different parallelism values. If the WebAssembly implementation were to always assume parallelism=1, then it would be incompatible with the native (mobile) implementations. It simply does not enable the -pthreads flag / enables -DARGON2_NO_THREADS during compilation. The argon2 C library supports compiling without pthreads and then runs the computation in sequence.

Increasing parallelism decreases the time to hash, but the CPU (and memory bandwidth) are more heavily utilized during that time. You can verify this by running the argon2 command-line tool:

❯ time echo -n "hashThis" | argon2 saltItWithSalt -l 32 -p 8 -t 30 -m 19
Type:		Argon2i
Iterations:	30
Memory:		524288 KiB
Parallelism:	8
Hash:		704de82b1f7bf11b5cf152c1463ebf0e019e1d920cfac2dadf38f379d2812cb9
Encoded:	$argon2i$v=19$m=524288,t=30,p=8$c2FsdEl0V2l0aFNhbHQ$cE3oKx978Rtc8VLBRj6/DgGeHZIM+sLa3zjzedKBLLk
15.555 seconds
Verification ok
echo -n "hashThis"  0.00s user 0.00s system 32% cpu 0.002 total
argon2 saltItWithSalt -l 32 -p 8 -t 30 -m 19  30.47s user 0.62s system 699% cpu 4.447 total

~ took 4s 
❯ time echo -n "hashThis" | argon2 saltItWithSalt -l 32 -p 1 -t 30 -m 19
Type:		Argon2i
Iterations:	30
Memory:		524288 KiB
Parallelism:	1
Hash:		ff880adfa82bb7b88184160a41754707cd8a60b5292f724acc93c9e5ff44e353
Encoded:	$argon2i$v=19$m=524288,t=30,p=1$c2FsdEl0V2l0aFNhbHQ$/4gK36grt7iBhBYKQXVHB82KYLUpL3JKzJPJ5f9E41M
10.478 seconds
Verification ok
echo -n "hashThis"  0.00s user 0.00s system 50% cpu 0.001 total
argon2 saltItWithSalt -l 32 -p 1 -t 30 -m 19  20.21s user 0.29s system 99% cpu 20.678 total


Also, as to whether increasing parallelism/lanes actually decreases or increases security: as long as lanes are implemented with actual multithreading, and you raise your iterations to hit the same target unlock time, it is a security improvement. With the current implementation, it really depends on the attack method.

From a plain brute-force perspective (no specialized attacks):

Consider that the total amount of work to be done for calculating one hash is still the same. This means that, as long as an attacker is not limited by memory (and for CPU attacks, memory per core is cheap), they are always bound by the CPU.

Given a system with, let’s say, 128 GiB of RAM: if you have the maximum amount of memory selected in your KDF config, 1 GiB, the attacker can run (roughly) 128 password candidates in parallel at any time. If the amount of work per password is the same, then it does not matter whether the attacker runs 128 password candidates in sequence or 128 candidates in parallel on different threads (slightly simplified). The hashrate is the same.
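That counting argument can be sketched as follows (assumed figures from the text: 128 GiB of attacker RAM, 1 GiB memory cost; per-candidate work modeled as iterations × memory, independent of lane count):

```python
GIB = 2**30

# Candidates the attacker can hold in memory at once:
attacker_ram = 128 * GIB
memory_cost = 1 * GIB
in_flight = attacker_ram // memory_cost
print(in_flight)  # 128

# Total work per hash ~ iterations * memory, regardless of p: the lane
# count only redistributes the work across threads, it does not reduce it.
def total_work(iterations: int, memory: int, p: int) -> int:
    return iterations * memory  # p does not appear

# So 128 candidates with p=1 each, or 32 candidates using 4 threads each,
# give the same aggregate CPU-bound hashrate:
print(total_work(3, 64, 1) == total_work(3, 64, 4))  # True
```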

I do not have such high-spec hardware available, but on a Ryzen 3400G, with the default config (t=3, p=4, m=64 MiB) versus (t=3, p=1, m=64 MiB), the hashrate in John the Ripper was nearly the same (20/s vs. 19/s).

If we were to assume a GPU/ASIC-based attack, then this might be different, since there memory per core is more expensive.

Either way, it does not really matter too much. Even with PBKDF2 at 600K iterations, most master passwords with any significant complexity were not financially feasible to crack in any realistic timeframe. With Argon2 at default parameters or above, the realistic threats are credential stuffing from another leak with password re-use, or a colossally bad master password.


@Quexten Thank you for the additional information and clarification.

If you have the time and inclination, would you mind reading my example in the comment above and let me know where my mistakes are?

If you have the time and inclination, would you mind reading my example in the comment above and let me know where my mistakes are?

I mostly agree with what you wrote, however:

Now, remember that increasing M also increased the unlock time (i.e., the time required to execute the Argon2id algorithm) — e.g., above we assumed that the time for one core-pair to compute one hash when M=512 MiB and L=2 was similar to the time for a single core to compute one hash when M=256 MiB and L=1.

The assumption is actually correct when parallelism is implemented using threads: p=2 with m=512 MiB and p=1 with m=256 MiB take the same amount of time:

❯ echo -n "0000" | argon2 saltsalt -l 32 -p 1 -e -t 10 -m 18
$argon2i$v=19$m=262144,t=10,p=1$c2FsdHNhbHQ$wWm2fKPbC8fLn9t2QH1XyUZ2uMypRgqdR7MkthDcZ2g

~ took 2s 
❯ echo -n "0000" | argon2 saltsalt -l 32 -p 2 -e -t 10 -m 19
$argon2i$v=19$m=524288,t=10,p=2$c2FsdHNhbHQ$2IujJ/yxQqkKleR/suIDN6++B+y6EKPyTc1g3Q+qW5Q

~ took 2s 
❯ echo -n "0000" | argon2 saltsalt -l 32 -p 1 -e -t 10 -m 19
$argon2i$v=19$m=524288,t=10,p=1$c2FsdHNhbHQ$4Fe0JC5scpxXxQCHdcCtcI0p2d2SZdKFOc5d7tijvtw

~ took 5s 

(the -m flag is the base-2 logarithm of the memory cost in KiB: -m 19 means 2^19 KiB = 512 MiB, and -m 18 means 2^18 KiB = 256 MiB)

Still, in your example, the attacker (using a GPU) has many more cores per unit of memory. In that case, higher parallelism (given the same memory and iteration count) does make cracking on a GPU faster compared to lower parallelism, because the otherwise-idle compute resources can be put to use. That’s where increasing iterations/memory helps.

The bottom line is that increasing parallelism (if and only if accompanied by an increase in memory cost) will slow down a GPU-based attack without increasing the user’s unlock/login time.

By the way, increasing iterations also slows down the attacker in your scenario. If you double the parallelism, but double the iterations, then cracking speed stays the same. So at least it does not make cracking faster in this case. But I agree that increasing memory first is the better option.
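Under the same toy memory-bound GPU model as earlier (assumed figures; per-hash time idealized as iterations × memory ÷ threads used), that trade works out like this:

```python
GPU_MEMORY_BYTES = 24 * 10**9  # assumed 24 GB (decimal)
GPU_CORES = 16_384
MIB = 2**20

def relative_crack_rate(memory_mib: int, iterations: int, p: int) -> float:
    """Idealized hashes-per-unit-time for the whole GPU (toy model)."""
    instances = min(GPU_MEMORY_BYTES // (memory_mib * MIB),
                    GPU_CORES // p)
    time_per_hash = iterations * memory_mib / p  # p threads split the work
    return instances / time_per_hash

base = relative_crack_rate(256, 10, p=1)
doubled = relative_crack_rate(256, 20, p=2)  # 2x parallelism, 2x iterations
print(base == doubled)  # True: cracking speed stays the same
```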

Thank you very much for your feedback!

Yes, that is why I stressed that increasing parallelism only helps if simultaneously increasing memory cost to keep elapsed time constant (on a CPU).

With regards to increasing iterations, doesn’t this always have the same effect on both the user and the attacker? It seems to me that increasing memory cost is the best strategy, and that increased iterations are only needed if one has not reached the maximum tolerable unlock time after maximizing both the memory cost and the parallelism.

Yes, I agree. My point was that increasing parallelism + iterations does not make the attack easier on a GPU (but it also does not improve unlock speed, and it actually makes the unlock time worse with the current WebAssembly implementation). I was not trying to say that you should do it when you could increase memory instead. If you have already maxed out your memory configuration, though, and still want to make cracking slower (and can bear the unlock time), it is an option. And while increasing parallelism + iterations does not make a GPU attack slower, it does make a CPU attack slower where the ratio of memory to CPU cores is different, while keeping the user’s unlock time the same (again, assuming a multi-threaded implementation).

Still, again, I agree that increasing memory should be the first change (assuming you don’t use iOS or are aware of the limitations during unlock).

Considering the current WebAssembly implementation isn’t actually multi-threaded, though, it does not make much sense to increase the parallelism, as it will only decrease unlock times on your mobile devices, not on web/desktop (where it actually increases them).
