Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

Abstract

Emotional Voice Conversion aims to change the emotion of an utterance to a given target emotion while preserving its non-emotional components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to train our network effectively. Stage I uses inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion from content. In Stage II, we regularize the conversion with a multi-view consistency mechanism, which helps transfer fine-grained emotion while maintaining speech content. Extensive experiments show that AINN outperforms state-of-the-art methods on both objective and subjective metrics. Our code will be made available.
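To give a rough feel for what inter-speech contrastive learning over emotion embeddings can look like, the sketch below computes a generic InfoNCE-style loss in plain Python. It is an illustration only: the abstract does not specify the loss AINN uses, and the function names (`cosine`, `info_nce`), the temperature value, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in, not the paper's loss).

    Returns -log( exp(s_pos / t) / sum_k exp(s_k / t) ), where the sum runs
    over the positive and all negatives and s is cosine similarity.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy emotion embeddings (assumed values): `close` plays the role of an
# utterance with a similar emotional rendering, `far` holds dissimilar ones.
anchor = [1.0, 0.0]
close = [0.9, 0.1]
far = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, close, far)
loss_misaligned = info_nce(anchor, far[0], [close, far[1]])
print(loss_aligned < loss_misaligned)
```

The loss is small when the anchor is closest to its positive and grows when a negative is closer, which is the property a contrastive objective relies on to pull similar emotional instances together.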

Model Architecture

Emotional Voice Conversion

Neutral to Sad

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Neutral to Surprise

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Happy to Surprise

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Happy to Angry

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Angry to Surprise

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Sad to Happy

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Sad to Neutral

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Sad to Surprise

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Surprise to Sad

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Surprise to Neutral

Source speech	Target speech	Converted speech
		baseline	Stage I (w/o EC)
		Stage I	AINN (full)


		baseline	Stage I (w/o EC)
		Stage I	AINN (full)

Emotional Strength Control

Neutral to Angry

Source speech	Reference speech	Converted speech
	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B

	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B

Neutral to Surprise

Source speech	Reference speech	Converted speech
	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B

Neutral to Sad

Source speech	Reference speech	Converted speech
	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B

	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B

Neutral to Happy

Source speech	Reference speech	Converted speech
	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B

	Reference speech A (weak)	Converted speech A
	Reference speech B (strong)	Converted speech B