Argon2id settings

Hi @grb, the advice that reducing p will increase security is simply unsupported in the relevant documents, directly or by implication. :slight_smile:
I shall explain the detail on which I base this claim, and am happy to await refutation, provided it is based directly on the description and discussion in the introductory 2015 Argon2 paper by Biryukov et al. and the 2021 RFC by the same authors with S. Josefsson.

Your description of a memory-limited cracking unit simply supports that Argon2 is designed to be memory-hard, not processor-soft. To quote the authors encapsulating p, m and t:

The authors provide a table illustrating, as threads increase, the inverse relationship between the reduction in CPU cycles per byte and the rise in memory transfer cycles. A key to Argon2 is that it is a serial process in which pre-computation is essentially negated. As they said above, parallelism does not reduce cracking time in proportion to the cores available, which is one reason Argon2 is so strong: it is memory-bound rather than compute-bound.

To clarify the point that p is not a security factor, I refer to the authors’ own presentation as follows.

On Scalability, they advise that:

If reducing p could significantly improve security (double it, as your calculations on a 4090 imply), why would they not even hint at this possibility under Scalability?

Their comment in the same section on Parallelism is:

Allowing that those experiments reflected 2015 technology in terms of exhaustion, why is there no hint that higher p is bad, or that massive p might be some sort of security disaster? Argon2 and the RFC are not old.

To summarise in their own words under the heading GPU/FPGA/ASIC-unfriendly:

Moving on to the specific recommendations in RFC 9106, we see not defaults or rule-of-thumb statements but three explicit recommendations (the first two are repeated under Security Considerations later):

  • The first, “uniformly safe”, is p=4, m=2 GiB, t=1.
  • The second, if you are memory-constrained for that, is p=4, m=64 MiB, t=3.
  • The third is a universal model covered in items 3-11, which specify p=4, m = maximum affordable memory, t = maximum affordable running time given m (see the sketch after this list).
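For concreteness, here is a minimal sketch (mine, not from the RFC) of what those first two recommended option sets look like when fed to the reference C implementation’s API. The password, salt and output length are placeholder values I chose; only the t/m/p numbers come from RFC 9106:

```c
/* Sketch: RFC 9106's two recommended parameter sets, passed to the
   reference implementation's argon2id_hash_raw(). m_cost is in KiB,
   so 1u << 21 KiB = 2 GiB and 1u << 16 KiB = 64 MiB.
   Build: cc rfc_options.c -largon2 */
#include <stdio.h>
#include <argon2.h>

int main(void) {
    const char pwd[] = "placeholder-password";
    const char salt[] = "16-byte-salt-ok!";   /* placeholder salt */
    unsigned char hash[32];

    /* First recommended option ("uniformly safe"): t=1, p=4, m=2 GiB */
    int rc = argon2id_hash_raw(1, 1u << 21, 4, pwd, sizeof pwd - 1,
                               salt, sizeof salt - 1, hash, sizeof hash);
    printf("option 1: %s\n", argon2_error_message(rc));

    /* Second recommended option (memory-constrained): t=3, p=4, m=64 MiB */
    rc = argon2id_hash_raw(3, 1u << 16, 4, pwd, sizeof pwd - 1,
                           salt, sizeof salt - 1, hash, sizeof hash);
    printf("option 2: %s\n", argon2_error_message(rc));
    return 0;
}
```

Note that both options pin p at 4 and vary only the m/t trade-off, which is rather the point.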

All earlier references to p are couched as 2 × cores. If there were a notional doubling of security by setting p=2, or a quadrupling with p=1, compared with 4, then I think it would have been mentioned in recommendation 2 or 3, rather than fussing over m and t with zero mention of p.

The Security Considerations section at the end of the RFC, before repeating recommendations 1 and 2 from above, has zero discussion of, or recommendation to, reduce (or increase) p. If p is a relevant security factor, why not?

The ineluctable conclusion is: because it isn’t. They said as much at the beginning.

The best advice remains to set p = 2 × cores (default 4), maximise m, and set t to suit. Other advice is unsupported by the discussion and recommendations of the Argon2 / RFC authors.

I found performance benchmarks for Argon2 on GitHub, linked from the original paper, as follows (m in MiB):

  • For m=1, t=1, changing p from 1 to 4 increased execution time by 17%.
  • For m=4096, t=1, changing p from 2 to 4 increased execution time by 49%.

Far from speeding things up, higher p apparently slows it down, contrary to my own earlier belief. This appears to be because increasing p adds lanes, which adds [slow] memory-access work. I alluded to this in my earlier post.

m and t remain far more important anyway.

If my desktop has 6 cores and my phone has 8, should I set the parallelism to 16?

Edit: In my own test, increasing the parallelism from 1 to 6 reduces the execution time in the Android app by a lot. I didn’t calculate the percentage, but it goes down from 16 sec to 9 sec (with 1 GiB and 6 iterations).

@Mulled7768 Thank you for taking the time to dive deeper into the sources and to elevate our discussion of Argon2id. Unfortunately, I currently do not have the time to read the sources you cited and form my own opinion of what conclusions they support (although it would have helped if you had provided direct links).

I believe that the ultimate conclusion may be that we are both right under different scenarios.

That being said, for now, none of the passages that you quoted in your first comment are convincing to me — I can interpret each of them as being consistent with the arguments I have been making earlier in the thread.

On the other hand, in my comment from last year, where I first described my take on analyzing the Argon2id costs, I conceded first of all that “my understanding of this is not 100%” (which is still true), and importantly provided the following disclaimer (which also applies to everything I have said so far in the present thread):

 

Based on the reference to “memory cycles” in your first comment and the GitHub benchmarking data in your follow-up comment, I believe that the scenarios in which your point of view (that increasing parallelism results in unchanged or even improved cracking resistance) will hold, and my assumption (that parallelism reduces cracking resistance) will be invalid, are precisely those conditions in which the hash calculation speed is limited by bus transfer speeds.

Hopefully you agree that there exist situations in which bus speeds are not rate-limiting, as demonstrated by the fact that unlock time in a mobile app can be reduced by increasing parallelism. The challenge is to determine under what conditions the bus bandwidth becomes a greater constraint than memory availability.
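One crude way to probe for that crossover, sketched below under my own assumptions (t=3, m=64 MiB, p=1, placeholder password/salt, POSIX fork), is to run k independent single-lane hashes concurrently: if aggregate hashes-per-second stops growing while idle cores remain, memory bandwidth rather than core count has become the limit:

```c
/* Sketch: run k independent Argon2id computations in parallel
   processes and time the batch. Build: cc probe.c -largon2 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>
#include <argon2.h>

static void one_hash(void) {
    const char pwd[] = "password", salt[] = "somesalt-somesalt";
    unsigned char hash[32];
    /* t=3 passes, m = 1u << 16 KiB = 64 MiB, single lane */
    argon2id_hash_raw(3, 1u << 16, 1, pwd, sizeof pwd - 1,
                      salt, sizeof salt - 1, hash, sizeof hash);
}

int main(void) {
    for (int k = 1; k <= 8; k *= 2) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < k; i++)
            if (fork() == 0) { one_hash(); _exit(0); }
        while (wait(NULL) > 0) {}   /* reap all children */
        clock_gettime(CLOCK_MONOTONIC, &b);
        double s = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        printf("k=%d concurrent hashes: %.2f s, %.2f hashes/s\n", k, s, k / s);
    }
    return 0;
}
```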


So raising my p value from 1 to 5 will slow down my vault access using the browser extension and desktop app, and increase cracking resistance?

KeePassXC has a default p value of 2.

A greater concern I have is that some day the browser extension will be compromised to steal my master password. This happened to MEGA some years ago: Hacked MEGA Chrome Extension was Used to Steal Cryptocurrency

It only needs to happen once for the whole vault to get into the wrong hands.

Other supply-chain attack surfaces are the desktop app and the web vault.

That said, I like to understand things as best I can and do my part to harden my vault as well as I can.

I forgot to mention that KPXC by default uses Argon2d instead of Argon2id. Argon2d is stronger than Argon2id if you aren’t concerned about side-channel attacks.

@grb I must say, I understand very little of the technical side of this whole thread, but when it comes to configuring a system, it seems to me that we get into an area where we have to guess an attacker’s system, i.e. the exact configuration and circumstances of that possible system (I think we cannot look only at the “speed” of our own system at that point, right?)…

And to me that sounds, in simple terms, like neither very high nor very low values might be “(definitively) good”, since either can increase or decrease the cracking time depending on exact circumstances that cannot be foreseen… So maybe the middle ground of p = 4 might not be a bad idea after all?

:thinking:

I doubt that, when unlocking your vault on your own device, you are facing the type of transfer-speed limitations that could cause your access time to increase with increasing parallelism.

Whether the net effect on the attacker is positive, negative, or neutral is a bit unclear. What is clear is that memory is the most important parameter to optimize.


For me personally, since I predominantly use the browser extension (which does not support multithreading), I am leaving parallelism at 1. That is because increasing parallelism in my case will not allow me to increase my memory setting (so it does not benefit me), and it could significantly help an attacker (or at best, as we learned above, perhaps slow them down by an insignificant amount).

@grb What settings exactly are you using? You said that you have parallelism at 1, but what are the others?

Does it matter? My settings are optimized for my own hardware (and my own tolerance for unlock/login delay), so that information is not applicable to anybody else’s situation. I use the default number of iterations (3), and my memory setting is higher than the default, but lower than the maximum.


Ok! I asked what your settings are just out of curiosity. I don’t want to copy your settings or anything else!

As they said above, parallelism does not reduce cracking time in proportion to the cores available, which is one reason Argon2 is so strong: it is memory-bound rather than compute-bound.

Just to be clear, parallelism DOES decrease the time to calculate a single hash. It does not change the total work performed, but spreads it over multiple cores, so given one CPU, the number of hashes calculated per second (when password cracking) is the same no matter the degree of parallelism. But unlock time does get lowered with higher parallelism. If an attacker were to crack not in parallel but serially (for instance because the memory requirements are that high), then higher parallelism would help with cracking, but it would similarly shorten the unlock time (just like iterations in PBKDF2/Argon2). Memory is better here, because it affects cracking more negatively than it affects unlock time. Thus memory is always recommended to be chosen first.
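A toy calculation may make the compute-bound case concrete. Suppose (hypothetically) one hash costs T = 1 second of single-threaded work and the attacker has 16 cores with enough memory to keep them all busy; then cracking throughput works out the same at every p:

```c
/* Toy model, not a benchmark: with plentiful memory, attacker
   throughput is cores/T regardless of the parallelism setting p. */
#include <stdio.h>

int main(void) {
    const double T = 1.0;  /* assumed single-thread seconds per hash */
    const int cores = 16;  /* assumed attacker cores, memory for all */
    for (int p = 1; p <= 16; p *= 2) {
        double wall = T / p;                   /* wall time, one hash */
        double in_flight = (double)cores / p;  /* hashes run at once */
        printf("p=%2d: one hash in %.3f s, %.1f hashes/s\n",
               p, wall, in_flight / wall);
    }
    return 0;
}
```

Every line prints 16.0 hashes/s: p only changes how the same total work is sliced, exactly as described above.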

The interactions are somewhat complex, because at some point memory bandwidth also poses a bottleneck, but the speedup is linear in the number of cores used, which can easily be verified by running argon2 in the CLI.
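For anyone who wants to replicate that check without hunting for the CLI binary, a minimal timing harness against the reference library might look like the following; the m=64 MiB, t=3 choices and the password/salt are my assumptions, not anything canonical:

```c
/* Sketch: time one Argon2id computation at several lane counts.
   Build against the reference library: cc bench.c -largon2 */
#include <stdio.h>
#include <time.h>
#include <argon2.h>

int main(void) {
    const char pwd[] = "password";            /* placeholder secret */
    const char salt[] = "somesalt-somesalt";  /* placeholder salt */
    unsigned char hash[32];
    const unsigned lanes[] = {1, 2, 4, 8};

    for (size_t i = 0; i < sizeof lanes / sizeof *lanes; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        /* t=3 passes, m = 1u << 16 KiB = 64 MiB */
        argon2id_hash_raw(3, 1u << 16, lanes[i], pwd, sizeof pwd - 1,
                          salt, sizeof salt - 1, hash, sizeof hash);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("p=%u: %.0f ms\n", lanes[i],
               (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6);
    }
    return 0;
}
```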

Thank you for your response @grb. I have been wrestling with the parameters for Argon2 for a while, so it is no surprise we have made mistakes in the analysis. The important step we have both now taken is to understand that a key part of Argon2 is not just the amount of memory it demands but how it uses it. Adding lanes adds work and random memory access in a way that is essentially unpredictable to the processor’s pre-fetch/pre-compute features. Thus cycle time goes down but memory access time goes up, inversely. I found a table for this in the papers, and it is borne out in tests.

Forgetting all of this theoretical discussion, people want evidence, so I installed Argon2 on my computer and ran tests. My context is an Intel i9 at 2.3 GHz with 8 cores and 32 GB of memory, running on the command line. I am using Darwin (BSD if you like), so none of the Apple interface fairy floss can interfere.

I tested t = [3 6 12] across variations of p and m. The results were highly consistent, so I dropped that from later tests. As noted in the RFC (I think), execution time rises linearly with increases in t. I make the factor about 80% of the multiple of t, or a little over that.

I tested each combination of p = [1 2 4 8] with each of m = [16 64 512], with t = 3 in each case. Again the results were consistent in a scatterplot. I need more time to read and interpret the fit, so this is a tentative summary based on percentage changes.

Like increasing t, doubling m appears to add over 80% to processing time, hence the trade-off between them. From the documents I may have under-estimated the effect of m there.

Increasing p has a modest effect, but it is not uniform. For example, going from 1 to 2 or from 2 to 4 added only about 10-12% to the time, but hitting 8 added 47%. I will be doing further testing, just from curiosity, as it will not alter the main points below.

The key points are: increase m, then increase t to the extent response time remains affordable for you.

Use basic advice for p (more in a following post) because it has little effect compared with m and t. While it appears to make a more significant difference once you get to 8+, that is still less important than m or t, and for all of these we are constrained by our least capable device. Remember, your browser is single-threaded anyway. Personally I use the app for anything other than logging in to a site in the browser.

By way of analogy, increasing p is like adding more words to your source list for passphrases, when you would be much better off simply adding a word to your phrase (i.e. more m or t).
The takeaway is that adding to p will increase rather than reduce security, but not critically so.

I need a coffee. :slight_smile:

Now I am puzzled. That speedup is what I previously predicted, now understood to be counteracted by the increase in memory cycles. However, these are times in milliseconds for p = [1 2 4 8 16] on an 8-core machine (16 threads), with m=64 and t=3:

p=1: 175 ms
p=2: 196 ms
p=4: 255 ms
p=8: 300 ms
p=16: 479 ms

How do we understand this?

I promised additional comment about constraints.

Apple is not keen to publish the RAM in its iOS devices, but you can find it on the internet. Knowing the RAM tells you how aggressive you might be, though speed will eventually be an issue. Looking back at the RFC recommendations, the first is for m = 2 GiB with t=1 and the second for 64 MiB with t=3. I personally do not see those as equivalent, so I use more m than 64 and more t than 3.

Argon2 was designed with Intel architecture in mind, hence the “cores times two” standard recommendation, defaulting to 4 to save looking it up. However, Apple M chips and ARM chips are different in two ways, according to my reading. Firstly, to manage power consumption, Intel chips throttle performance (how aggressively is tuneable), whereas ARM-style chips have performance and efficiency cores and assign work according to need. Secondly, those cores are single-threaded and, at least as claimed for Apple M chips, a given job will not be spread between p-cores and e-cores. Therefore, on iOS and most likely on Android devices, or on M-series Macs, count the performance cores and use that number.
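If it helps anyone on an M-series Mac, a small sketch along these lines should report the performance-core count; I am assuming macOS 12 or later, where (to my knowledge) the hw.perflevel0.physicalcpu sysctl key exposes the P-cores, and Intel Macs will simply not have the key:

```c
/* macOS-only sketch: read the Apple Silicon performance-core count
   via sysctlbyname() and use it as a candidate value for p. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void) {
    int pcores = 0;
    size_t len = sizeof pcores;
    if (sysctlbyname("hw.perflevel0.physicalcpu", &pcores, &len, NULL, 0) == 0)
        printf("performance cores: %d\n", pcores);
    else
        perror("sysctlbyname");  /* key absent on Intel / older macOS */
    return 0;
}
```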

Or use 4. Or 1. It’s m and t that matter more.

What do you mean you “installed Argon2”? Which implementation did you install?

I am surprised that for hardware like yours, you would not see a speed-up with increased parallelism — unless the Argon2 implementation you are using doesn’t support multithreading.

The reference C implementation on GitHub

I’m surprised too, though if it is supported by test replication then we may have the explanation at hand: additional lanes → more work + more memory references. I am interested to see.

Edit to add: The following table is from the 2015 Argon2 paper. The authors note this is an imperfect analysis, but apparently a clear enough indication. CPU performance rises, countered by increased and not wholly efficient memory use. I guess that is why they call it memory-hard: use as well as size.


Their processor is 4-core, with m=1 GiB. Note that 2d and 2i are separated here.

Editing again to save another post:
The figures from the 2015 paper and my results are consistent with those published under Benchmarks on the reference page on GitHub. Theirs has an apparent anomaly in that p=2 was faster than p=1, but otherwise shows the same pattern.

Is m 64 MiB? (argon2 usually deals in KiB.) In that case I would guess other overhead. I just checked and it is not perfectly linear, but I get (for m=10^6 KiB, t=1):
p=1: 1300 ms
p=2: 690 ms (~half)
p=4: 500 ms
on the WebAssembly multithreaded version (I don’t have access to the CLI at the moment). At p=4 it is most likely both memory overhead and the overhead of the non-parallelizable work, so those bits don’t scale linearly. My point was that the total amount of work stays the same; it just gets distributed across more cores, so for serial cracking higher parallelism helps, while for parallel cracking (when more memory is available) it does not.

I agree that this complicates things on other architectures. To be honest, since adding Argon2 support to Bitwarden I have changed my personal opinion about exposing the settings to the user, since this is too confusing to users and does not necessarily guarantee optimal protection. Hiding these settings for the account (choosing sane but high defaults), and having local unlock not go by the account KDF but by settings calculated automatically for the device (memory, parallelism [if the device supports it] and iterations configured so it unlocks in 1 second) would be a better compromise.
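To illustrate that auto-calibration idea (a sketch of my own, nothing like Bitwarden’s actual code), a device could simply double the memory cost until a single computation takes about one second, with t=3, p=4 and the password/salt assumed for the test:

```c
/* Sketch of device-side KDF calibration: grow m until one Argon2id
   computation takes ~1 s, capped at 2 GiB. Build: cc calib.c -largon2 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <argon2.h>

static double hash_ms(uint32_t m_kib) {
    const char pwd[] = "password", salt[] = "somesalt-somesalt";
    unsigned char hash[32];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    argon2id_hash_raw(3, m_kib, 4, pwd, sizeof pwd - 1,
                      salt, sizeof salt - 1, hash, sizeof hash);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void) {
    uint32_t m = 1u << 16;              /* start at 64 MiB */
    double ms = hash_ms(m);
    while (ms < 1000.0 && m < (1u << 21)) {
        m <<= 1;                        /* double the memory cost */
        ms = hash_ms(m);
    }
    printf("calibrated m = %u KiB (%.0f ms per unlock)\n", m, ms);
    return 0;
}
```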

CPU performance rises, countered by increased and not wholly efficient memory use. I guess that is why they call it memory-hard: use as well as size.

I do wonder how this is affected by new CPUs that have an L3 cache fitting all of the KDF memory, instead of having to rely on slow RAM. Even modern gaming CPUs can nearly fit 64 MB into L3 cache alone.


Yes. I am using m = 16, which is 2^16 KiB according to the help text.

Trying to reconcile results, I tried t=1 (all my earlier tests were t=3 or more) and also separately tested 2i, 2d and 2id in case that mattered. Results were unfortunately :slight_smile: consistent with before. The three forms of Argon2 were close enough that I will give the averages. For p = [1 2 4 8], the results in seconds were:
p=1: 0.081 s
p=2: 0.094 s
p=4: 0.106 s
p=8: 0.167 s
A sequence of four identical tests produced results within 6% of the mean from run to run.

While on the one hand I would like to know what is happening here, on the other I am more than satisfied that my fairly standard settings provide full security. I have pretty consistent opening times across devices.