Kyle, your parameters look good. The poor scaling must be due to the combination of a small local problem size and a sub-optimal interconnect. A 100 Gbit interconnect sounds like a lot, but throughput matters less than latency here: solving the problem in parallel requires exchanging many small messages, and each of them should arrive with minimal latency. For this, InfiniBand is much better than Gbit Ethernet. As a result, you might only see good scaling at larger local problem sizes (e.g. 100K - 200K) than what is reported in the literature (20K - 50K).
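To make the latency argument concrete, here is a rough sketch using the simple "latency plus size over bandwidth" cost model for a single message. The function names, the 1 KiB message size and the two latency values are my own illustrative assumptions, not measurements from your cluster.

```python
# Simple per-message cost model: time = latency + size / bandwidth.
# All numbers below are illustrative assumptions, not measurements.

def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to deliver one message, in seconds."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the transfer time that is pure latency (between 0 and 1)."""
    return latency_s / transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)


if __name__ == "__main__":
    bandwidth = 100e9 / 8      # 100 Gbit/s expressed in bytes per second
    message = 1024             # assumed small message: 1 KiB

    # Hypothetical latencies for a higher- and a lower-latency network.
    for label, latency in [("high latency (assumed 50 us)", 50e-6),
                           ("low latency (assumed 1 us)", 1e-6)]:
        t = transfer_time(message, latency, bandwidth)
        share = latency_fraction(message, latency, bandwidth)
        print(f"{label}: {t * 1e6:.2f} us per message, "
              f"{share:.1%} of it is latency")
```

For messages this small the bandwidth term is almost negligible, so lowering the latency shortens the exchange far more than adding bandwidth would.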
Please report the local problem sizes you have. You can see them using the "-output_level 5" option, which should print a small table with the ranks and their local numbers of elements and indices. The distribution should be pretty even, so you only need to post one set of values. The index multiplicities are worth reporting as well.
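If you want to check quickly how even the distribution is, a small sketch like the following would do. It assumes you have copied the per-rank counts from the table into a list by hand; the function name and the example numbers are hypothetical and do not come from any real output.

```python
# Load-balance check for per-rank local problem sizes.
# The input list is assumed to be copied by hand from the printed table.

def imbalance(local_sizes):
    """Ratio of the largest local size to the mean; 1.0 means perfectly even."""
    mean = sum(local_sizes) / len(local_sizes)
    return max(local_sizes) / mean


if __name__ == "__main__":
    # Hypothetical counts for four ranks, for illustration only.
    sizes = [25100, 24900, 25050, 24950]
    print(f"imbalance factor: {imbalance(sizes):.3f}")
```

A factor close to 1.0 means the ranks hold about the same amount of work, in which case one representative set of values is enough.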
Best, Aurel