Could you elaborate a little more and maybe propose a solution to the problem you raised?
(2020/02/10)
I was able to finish this implementation by completing the stop token prediction and removing the concatenation of the inputs and outputs of the multi-head attention.
However, the alignments of this implementation are less diagonal, so it cannot generate proper alignments for FastSpeech.
As a result, I failed to train FastSpeech with this implementation :(
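For reference, here is a minimal PyTorch sketch of the two self-attention variants being discussed, assuming that "concatenation of inputs and outputs" means concatenating the block's input with the attention output along the feature dimension and projecting it back, in place of the usual residual connection. The class and parameter names are only illustrative, not the ones used in this repo:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Self-attention block with either the standard residual connection
    (original Transformer-TTS) or the concat-then-project variant
    discussed above. Names and sizes are illustrative."""

    def __init__(self, d_model=256, n_heads=4, use_concat=False):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.use_concat = use_concat
        if use_concat:
            # concat variant: [input ; attention output] -> linear back to d_model
            self.proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        out, attn_weights = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        if self.use_concat:
            # earlier implementation: concatenate input and attention output
            out = self.proj(torch.cat([x, out], dim=-1))
        else:
            # corrected implementation: plain residual connection
            out = x + out
        return self.norm(out), attn_weights
```

Removing the concatenation (i.e. `use_concat=False`) gives the standard residual form from the original paper, which is what the corrected implementation uses.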
According to the authors of FastSpeech, it is important to use proper alignments for training.
When I first implemented Transformer-TTS, I failed to implement it exactly as in the paper, so I finished it by concatenating the input and output of the multi-head self-attention.
I assume that thanks to this concatenation the encoder-decoder alignments were more diagonal, and I could use around 6,000 of the 13,100 data instances.
However, when I corrected my implementation and reproduced the original Transformer-TTS nearly exactly, I could only use about 1,000 data instances for FastSpeech training, so the audio quality is much worse than before.
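To make the data-selection step concrete, here is a hypothetical sketch of filtering utterances by how diagonal their encoder-decoder alignment is, using a focus-rate-style score (the mean of the per-decoder-step maximum attention weight, in the spirit of the FastSpeech paper). The function names and the 0.7 threshold are my assumptions, not what this repo actually uses:

```python
import numpy as np

def focus_rate(attn):
    """Focus rate of an encoder-decoder attention matrix of shape (T_dec, T_enc):
    the mean of the maximum attention weight per decoder step.
    Values close to 1 indicate a sharp, roughly diagonal alignment."""
    return float(attn.max(axis=1).mean())

def select_usable_utterances(alignments, threshold=0.7):
    """Keep only utterances whose alignment looks sharp enough to extract
    reliable phoneme durations for FastSpeech training.
    `alignments` maps utterance id -> attention matrix; the threshold is illustrative."""
    return [utt_id for utt_id, attn in alignments.items()
            if focus_rate(attn) >= threshold]

# usage sketch: a sharp diagonal alignment passes, a diffuse one is rejected
diag = np.eye(5) * 0.9 + 0.02          # near-diagonal attention, rows sum to 1
flat = np.full((5, 5), 0.2)            # uniform attention
print(select_usable_utterances({"utt_diag": diag, "utt_flat": flat}))  # ['utt_diag']
```

Under a scheme like this, a less diagonal teacher model would push most utterances below the threshold, which would explain the drop from ~6,000 to ~1,000 usable instances.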