Argon2id does not support parallelism (and I believe he has implied that this results in longer unlock times).
To be clear, setting parallelism to a value other than p=1 is supported. It just will not make use of the available computational resources and memory bandwidth, because it only runs on a single thread. And yes, it is roughly 4x slower (not exactly, but somewhere in that region).
You should see much faster (nearly native) unlock times on this test site, with high parallelism.
The problem is that this has quite a few edge cases that need to be handled, and a proper fallback is also not easy. It would require a fair amount of work to integrate smoothly into Bitwarden, which is why there is no pull request to add this yet. (Also, my pull requests with other, simpler Argon2 optimizations are still being reviewed.)
The observation of similar time elapsed when L is increased suggests that the WASM implementation just ignores the parallelism setting and always uses L=1
No. The hashes differ when setting different parallelism. If the WebAssembly implementation were to always assume parallelism=1, it would be incompatible with the native (mobile) implementations. The build simply does not enable the -pthread flag and instead defines -DARGON2_NO_THREADS during compilation. The argon2 C library supports compiling without pthreads, in which case it computes the lanes sequentially.
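A quick way to confirm that the parallelism parameter is mixed into the output (and so cannot be silently ignored without breaking compatibility) is to hash the same input with different p values. This sketch assumes the third-party argon2-cffi Python package; the password and salt are just placeholder values:

```python
# Sketch using the argon2-cffi package (an assumption, not part of Bitwarden):
# identical inputs with different parallelism settings produce different
# digests, so an implementation that forced p=1 would be incompatible.
from argon2.low_level import hash_secret_raw, Type

password = b"hashThis"
salt = b"saltItWithSalt"

def digest(parallelism: int) -> bytes:
    return hash_secret_raw(
        secret=password,
        salt=salt,
        time_cost=3,
        memory_cost=8192,   # 8 MiB, in KiB (kept small for the demo)
        parallelism=parallelism,
        hash_len=32,
        type=Type.ID,       # Argon2id
    )

assert digest(1) != digest(4)  # p changes the hash, not just the speed
```

Whether the lanes are computed on real threads or in sequence only affects wall-clock time, never the digest itself.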
Increasing parallelism decreases the time to hash, but the CPU (and memory bandwidth) are more heavily utilized during that time. You can verify this by running the argon2 command-line tool:
❯ time echo -n "hashThis" | argon2 saltItWithSalt -l 32 -p 8 -t 30 -m 19
Memory: 524288 KiB
echo -n "hashThis" 0.00s user 0.00s system 32% cpu 0.002 total
argon2 saltItWithSalt -l 32 -p 8 -t 30 -m 19 30.47s user 0.62s system 699% cpu 4.447 total
~ took 4s
❯ time echo -n "hashThis" | argon2 saltItWithSalt -l 32 -p 1 -t 30 -m 19
Memory: 524288 KiB
echo -n "hashThis" 0.00s user 0.00s system 50% cpu 0.001 total
argon2 saltItWithSalt -l 32 -p 1 -t 30 -m 19 20.21s user 0.29s system 99% cpu 20.678 total
Also, as to whether increasing parallelism / lanes actually decreases or increases security: as long as lanes are implemented with actual multithreading, and you raise your iterations to hit the same target unlock time, it is a security improvement. With the current implementation, it really depends on the attack method.
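The retuning argument can be sketched with toy arithmetic (the numbers are illustrative assumptions, not benchmarks): with p real threads the user's wall-clock time per iteration drops by roughly p, so they can raise t by p at the same unlock time, while a CPU-bound attacker still pays for every lane of every iteration:

```python
# Toy model: Argon2's work per hash is proportional to t * m, regardless of p.
p = 4                 # lanes, backed by p real threads on the user's device
m = 64 * 1024         # memory cost in KiB, held fixed
t_old = 3             # iterations tuned for p=1
t_new = t_old * p     # retuned iterations once real threads speed up unlock

work_old = t_old * m  # total work per hash, before
work_new = t_new * m  # total work per hash, after

# User wall-clock time: the work is split across real threads.
assert work_old / 1 == work_new / p   # unlock time is unchanged
# Attacker cost: CPU-bound, so they pay the full work per guess.
assert work_new == 4 * work_old       # 4x more work per password candidate
```

This is why the improvement hinges on the lanes actually running on multiple threads; without that, raising t just makes unlocking slower for the user too.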
From a plain brute-force perspective (no specialized attacks):
Consider that the total amount of work to be done for calculating one hash is still the same. This means that, as long as an attacker is not limited by memory (and for CPU attacks, memory per core is cheap), they are always bound by the CPU.
Take a system with, say, 128 GiB of RAM. If you have the maximum amount of memory selected in your KDF config, 1 GiB, the attacker can run (roughly) 128 password candidates at any time. If the amount of work per password is the same, then it does not matter whether the attacker runs 128 password candidates in sequence, or 128 candidates in parallel on different threads (slightly simplified). The hashrate is the same.
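The arithmetic behind this can be spelled out with the same illustrative numbers (128 GiB of attacker RAM, 1 GiB per hash, a hypothetical one second of single-threaded compute per hash, and enough cores to keep every memory slot busy):

```python
# Toy arithmetic for the brute-force argument; all numbers are
# illustrative assumptions, not measurements.
total_ram_gib = 128
memory_per_hash_gib = 1
single_thread_seconds_per_hash = 1.0
cores = 128  # assume one core per RAM slot

slots = total_ram_gib // memory_per_hash_gib  # 128 candidates fit in RAM

# p=1: every candidate gets one core, 128 candidates run at once.
rate_p1 = slots / single_thread_seconds_per_hash

# p=4 with real threads: each candidate finishes ~4x faster but occupies
# 4 cores, so only 32 candidates run at once. The product is unchanged.
p = 4
rate_p4 = (cores // p) * (p / single_thread_seconds_per_hash)

assert rate_p1 == rate_p4 == 128.0  # hashes per second either way
```

In other words, as long as the attacker saturates their CPU, shuffling candidates between "many in sequence" and "fewer, each multithreaded" does not change the total throughput.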
I do not have such high-spec hardware available, but on a Ryzen 3400G, with the default config (t=3, p=4, m=64 MiB) versus (t=3, p=1, m=64 MiB), the hashrate in John the Ripper was nearly the same (20/s vs. 19/s).
If we were to assume a GPU/ASIC-based attack, this might be different, since there memory per core is more expensive.
Either way, it does not really matter too much. Even with PBKDF2 at 600K iterations, most master passwords with any significant complexity were not financially feasible to crack in any realistic timeframe. With Argon2 at default parameters or above, a compromise means either credential stuffing from another leak with password re-use, or a colossally bad master password.