Notes on my project:
More units per dense layer seem to correspond to better 'memorization' of the data?
One-hot encoding the moves was not the best idea.
Embedding layers help manage the vast number of possible moves.
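A minimal sketch of what I mean, assuming TensorFlow/Keras (the vocab size, sequence length and layer widths below are made up): feed integer move indices into an Embedding layer instead of one-hot vectors.

import numpy as np
import tensorflow as tf

NUM_MOVES = 4096   # hypothetical size of the move vocabulary
SEQ_LEN = 32       # hypothetical number of moves per training example

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN,), dtype="int32"),
    # each move index becomes a dense 64-dim vector instead of a huge sparse one-hot vector
    tf.keras.layers.Embedding(input_dim=NUM_MOVES + 1, output_dim=64),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_MOVES + 1, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# toy data: plain integer indices, no one-hot anywhere
x = np.random.randint(1, NUM_MOVES + 1, size=(512, SEQ_LEN)).astype("int32")
y = np.random.randint(1, NUM_MOVES + 1, size=(512,))
model.fit(x, y, epochs=1, batch_size=64, verbose=0)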
More units at the input/output layers seem to help accuracy.
Reducing the batch size has helped with the error/loss/accuracy very suddenly going to NaN.
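Related sketch (same Keras assumption; the model and data here are just placeholders): a smaller batch_size in fit(), plus the built-in TerminateOnNaN callback so training stops the moment the loss goes NaN instead of grinding on.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 10).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model.fit(x, y,
          batch_size=32,    # smaller than before
          epochs=5,
          callbacks=[tf.keras.callbacks.TerminateOnNaN()],  # bail out on NaN loss
          verbose=0)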
Your data processing may be flawed.
I forgot to add 1 to the indices in a crucial dictionary, so 0s crept into my data sporadically, and those sporadic 0s helped cause gradient explosions, i.e. the loss and accuracy blowing up to NaN.
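The fix, roughly (the move strings and names below are made up): build the dictionary starting at 1 so index 0 never collides with a real move; 0 can then be reserved for padding if needed.

# start the mapping at 1 so 0 is never a real move
all_moves = ["e2e4", "e7e5", "g1f3"]  # stand-in for the real move list
move_to_idx = {move: i + 1 for i, move in enumerate(all_moves)}  # the "+ 1" I forgot
idx_to_move = {i: move for move, i in move_to_idx.items()}

assert 0 not in idx_to_move  # 0 stays free for padding, e.g. Embedding(..., mask_zero=True)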
Inexplicably, reducing the units of a few dense layers from 256 to 128 actually halved my performance?!
Maybe it is useful to keep most layers the same dimensions? Maybe 256 is a special GPU-friendly number?
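If the halved 'performance' here means step time rather than accuracy, a rough timing check like this one (framework, shapes and sizes all assumed) could show whether width 256 actually runs faster on the GPU than 128:

import time
import numpy as np
import tensorflow as tf

def time_width(units, batches=50):
    # throwaway model: two Dense layers of the width under test
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(64,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1000, activation="softmax"),  # made-up output size
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    x = np.random.rand(256 * batches, 64).astype("float32")
    y = np.random.randint(0, 1000, size=(256 * batches,))
    start = time.perf_counter()
    model.fit(x, y, batch_size=256, epochs=1, verbose=0)
    return time.perf_counter() - start

for units in (128, 256):
    print(f"width {units}: {time_width(units):.2f}s per epoch")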