I found performance benchmarks for Argon2 on GitHub, linked from the original paper. The relevant results (with m in MiB) are as follows:
- For m=1, t=1, changing p from 1 to 4 increased execution time by 17%.
- For m=4096, t=1, changing p from 2 to 4 increased execution time by 49%.
Far from speeding things up, a higher p apparently slows hashing down, contrary to my own earlier belief. This appears to be because increasing p adds lanes, which in turn adds (slow) memory-access work. I alluded to this in my earlier post.
m and t remain far more important anyway.
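
If you want to check the effect of p on your own hardware rather than rely on the published figures, a rough timing harness along these lines might do. This is only a sketch: `median_seconds` and `percent_increase` are names I made up, and the commented-out usage assumes the argon2-cffi package is installed (its `memory_cost` is given in KiB, not MiB).

```python
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Call fn `repeats` times and return the middle wall-clock sample in seconds.

    For an even `repeats` this is the upper of the two middle samples.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def percent_increase(baseline: float, changed: float) -> float:
    """Relative change from baseline to changed, in percent."""
    return (changed - baseline) / baseline * 100.0


# Hypothetical usage, assuming argon2-cffi is installed:
#
#   from argon2.low_level import Type, hash_secret_raw
#
#   def run(p: int) -> bytes:
#       return hash_secret_raw(
#           secret=b"password", salt=b"0123456789abcdef",
#           time_cost=1, memory_cost=1024,  # 1024 KiB = 1 MiB
#           parallelism=p, hash_len=32, type=Type.ID,
#       )
#
#   t1 = median_seconds(lambda: run(1))
#   t4 = median_seconds(lambda: run(4))
#   print(f"p=1 -> p=4: {percent_increase(t1, t4):+.1f}%")
```

Results will depend heavily on the machine, the library build and the number of cores available, so I would not expect them to match the percentages above exactly.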