Again, my understanding of this is not 100%, so I may have some of the details incorrect, but this is how I think about it:

Let’s consider an RTX 4090 GPU, which has 16,384 cores and 24 GB of memory shared between all cores. No matter what KDF settings you have, the attacker can divide and conquer the brute-force search space to speed up the search (e.g., letting one core test all passwords that start with `A`, the next core test all passwords that start with `B`, etc.). With PBKDF2 as the hashing algorithm, each GPU core can claim up to 1,430 KiB (over 1 MiB) of memory to itself (since 16,384 × 1,430 KiB ≈ 24 GB), which is more than enough to run the PBKDF2 algorithm. Thus, all 16,384 GPU cores can work on the brute-force attack simultaneously, greatly reducing the time required to crack the password.
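As a quick sanity check on that per-core figure, here is a back-of-the-envelope sketch (the constants are the assumed RTX 4090 specs from above, using the decimal gigabytes that GPU spec sheets use):

```python
# Assumed RTX 4090 figures: 24 GB (decimal) of memory, 16,384 cores.
TOTAL_MEM_BYTES = 24 * 10**9
CORES = 16_384

# Memory each core can claim if the 24 GB is split evenly.
per_core_kib = TOTAL_MEM_BYTES / CORES / 1024
print(f"{per_core_kib:.1f} KiB per core")  # just over 1,430 KiB
```

Since PBKDF2 needs only a tiny working state (on the order of the hash function's block size), this per-core budget is far more than enough.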

However, if you use Argon2id, then the available memory on the GPU becomes a rate-limiting factor. For example, if you set the Argon2id memory cost to *M*=256 MiB and parallelism to *L*=1, then the available 24 GB of memory can only be used by **89** cores working simultaneously (because 89×256 MiB = 23.9 GB, but 90×256 MiB > 24 GB).
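The memory-limited core count can be sketched the same way (again using the assumed 24 GB figure):

```python
# How many independent Argon2id instances fit in GPU memory
# when the memory cost is M = 256 MiB (L = 1)?
TOTAL_MEM_BYTES = 24 * 10**9      # assumed 24 GB card
M_BYTES = 256 * 2**20             # Argon2id memory cost: 256 MiB

max_instances = TOTAL_MEM_BYTES // M_BYTES
print(max_instances)  # 89
```

So even though 16,384 cores are physically present, only 89 of them can be running Argon2id at any moment.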

Let’s suppose that you can get a similar unlock speed on your devices by doubling *both* the memory and parallelism (making *M*=512 MiB and *L*=2). How does this affect the attacker? Now the 16,384 GPU cores can work in pairs to take advantage of the parallelization of the Argon2id algorithm, and the greatest number of such pairs that could work simultaneously is 16,384/2 = 8,192. However, the available memory can now be segmented into at most 44 independent 512-MiB chunks, so at most 44 core-pairs (88 cores) can work simultaneously.
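The same arithmetic as before, extended to the paired case, shows that memory (not core count) remains the binding constraint:

```python
# Concurrent Argon2id core-pairs with M = 512 MiB and L = 2
# (assumed 24 GB card with 16,384 cores, as above).
TOTAL_MEM_BYTES = 24 * 10**9
CORES = 16_384
M_BYTES = 512 * 2**20   # memory cost M = 512 MiB
L = 2                   # parallelism (one pair of cores per hash)

pairs_by_cores = CORES // L                    # 8,192 pairs if cores were the limit
pairs_by_memory = TOTAL_MEM_BYTES // M_BYTES   # 44 pairs fit in memory
concurrent_pairs = min(pairs_by_cores, pairs_by_memory)
print(concurrent_pairs)  # 44
```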

Now, remember that doubling *M* on its own would also double the unlock time (i.e., the time required to execute the Argon2id algorithm once), but doubling *L* compensates for this on the user’s device — above we assumed that the time for one core-pair to compute one hash with *M*=512 MiB and *L*=2 is similar to the time for a single core to compute one hash with *M*=256 MiB and *L*=1. Under that assumption, the attacker completes at most 44 hashes per unit time in the higher-parallelism case versus 89 in the lower-parallelism case, so the higher-parallelism settings (*M*=512 MiB, *L*=2) should result in a cracking rate that is about half the rate achieved with the lower-parallelism settings (*M*=256 MiB, *L*=1). **TL;DR:** The bottom line is that increasing parallelism (if and only if accompanied by a proportional increase in memory cost) will slow down a GPU-based attack without increasing the user’s unlock/login time.

In the above analysis, I have not taken into account possible rate-limiting effects of memory-bus bandwidth. My understanding is that in *some* cases, bus speed could be the dominant bottleneck (as opposed to memory size, which is what I assumed above), but this is beyond what I am able to analyze with my current level of knowledge.