Could you elaborate a little more and maybe propose a solution to the problem you raised?
(2020/02/10)
I was able to finish this implementation by completing the stop token prediction and removing the concatenation of the inputs and outputs of the multi-head attention.
However, the alignments of this implementation are less diagonal, so it cannot generate proper alignments for FastSpeech.
As a result, I failed to train FastSpeech with this implementation :(
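For reference, here is a minimal PyTorch sketch of the two self-attention variants being discussed, assuming that "concatenation of inputs and outputs" means concatenating the block's input with the attention output along the feature dimension and projecting it back, in place of the usual residual connection. The class and parameter names are only illustrative, not the ones used in this repo:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Self-attention block with either the standard residual connection
    (original Transformer-TTS) or the concat-then-project variant
    discussed above. Names and sizes are illustrative."""

    def __init__(self, d_model=256, n_heads=4, use_concat=False):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.use_concat = use_concat
        if use_concat:
            # concat variant: [input ; attention output] -> linear back to d_model
            self.proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):
        out, attn_weights = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        if self.use_concat:
            # earlier implementation: concatenate input and attention output
            out = self.proj(torch.cat([x, out], dim=-1))
        else:
            # corrected implementation: plain residual connection
            out = x + out
        return self.norm(out), attn_weights
```

Removing the concatenation (i.e. `use_concat=False`) gives the standard residual form from the original paper, which is what the corrected implementation uses.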
According to the authors of FastSpeech, it is important to use proper alignments for training.
When I first implemented Transformer-TTS, I failed to implement it exactly as in the paper, so I finished it by concatenating the input and output of the multi-head self-attention.
I assume that thanks to this concatenation the encoder-decoder alignments were more diagonal, and I could use around 6,000 of the 13,100 data instances.
However, when I corrected my implementation and reproduced the original Transformer-TTS nearly exactly, I could only use about 1,000 data instances for FastSpeech training, so the audio quality is much worse than before.
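To make the data-selection step concrete, here is a hypothetical sketch of filtering utterances by how diagonal their encoder-decoder alignment is, using a focus-rate-style score (the mean of the per-decoder-step maximum attention weight, in the spirit of the FastSpeech paper). The function names and the 0.7 threshold are my assumptions, not what this repo actually uses:

```python
import numpy as np

def focus_rate(attn):
    """Focus rate of an encoder-decoder attention matrix of shape (T_dec, T_enc):
    the mean of the maximum attention weight per decoder step.
    Values close to 1 indicate a sharp, roughly diagonal alignment."""
    return float(attn.max(axis=1).mean())

def select_usable_utterances(alignments, threshold=0.7):
    """Keep only utterances whose alignment looks sharp enough to extract
    reliable phoneme durations for FastSpeech training.
    `alignments` maps utterance id -> attention matrix; the threshold is illustrative."""
    return [utt_id for utt_id, attn in alignments.items()
            if focus_rate(attn) >= threshold]

# usage sketch: a sharp diagonal alignment passes, a diffuse one is rejected
diag = np.eye(5) * 0.9 + 0.02          # near-diagonal attention, rows sum to 1
flat = np.full((5, 5), 0.2)            # uniform attention
print(select_usable_utterances({"utt_diag": diag, "utt_flat": flat}))  # ['utt_diag']
```

Under a scheme like this, a less diagonal teacher model would push most utterances below the threshold, which would explain the drop from ~6,000 to ~1,000 usable instances.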