Abstract
Emotional Voice Conversion aims to change the emotion of an utterance to a given target emotion while preserving its non-emotional components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to train our network effectively. In Stage I, we use inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion and content. In Stage II, we regularize the conversion with a multi-view consistency mechanism, which helps transfer fine-grained emotion while maintaining speech content. Extensive experiments show that AINN outperforms state-of-the-art methods on both objective and subjective metrics. Our code will be made available.
Model Architecture
Emotional Voice Conversion
Neutral to Sad
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Neutral to Surprise
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Happy to Surprise
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Happy to Angry
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Angry to Surprise
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Sad to Happy
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Sad to Neutral
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Sad to Surprise
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Surprise to Sad
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Surprise to Neutral
Source speech | Target speech | Converted speech | |
---|---|---|---|
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
| | | baseline | Stage I (w/o EC) |
| | | Stage I | AINN (full) |
Emotional Strength Control
Neutral to Angry
Source speech | Reference speech | Converted speech |
---|---|---|
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |
Neutral to Surprise
Source speech | Reference speech | Converted speech |
---|---|---|
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |
Neutral to Sad
Source speech | Reference speech | Converted speech |
---|---|---|
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |
Neutral to Happy
Source speech | Reference speech | Converted speech |
---|---|---|
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |
| | Reference speech A (weak) | Converted speech A |
| | Reference speech B (strong) | Converted speech B |