1. Update Spread, ReduceSumAll, Multiply, Unsqueeze, Softmax, and CrossEntropy to support the float16 data type.
2. Modify the fnnlm implementation to support float16 computation; some bugs remain and the loss is currently NaN.