WaveMAE: A Masked Autoencoder for Audio Representation Learning

Overview

WaveMAE is a research project that aims to develop a general-purpose audio representation model. It uses a masked autoencoder (MAE) architecture to learn a semantically rich latent space from raw audio waveforms. This latent space is intended to be descriptive enough for downstream tasks, particularly generative modeling.

The core of the project involves training an autoencoder to reconstruct masked portions of an audio signal's STFT representation. To enrich the learned latent space, the model uses auxiliary decoders aligned with pre-trained models, including:

Wav2Vec2-BERT: For general audio features.
RMVPE: For pitch information.
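The training objective described above can be sketched in a few lines of plain Python. This is only an illustration of the general masked-reconstruction idea, not the project's actual code: the function names, the 0.75 mask ratio, and the auxiliary loss weights are all assumptions chosen for the example, and a real implementation would operate on PyTorch tensors of STFT patches rather than lists.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a list of booleans; True marks a patch hidden from the encoder.

    Hypothetical helper: the mask ratio is an assumed hyperparameter.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(target, prediction, mask):
    """Mean squared error averaged over the masked patches only."""
    errors = [
        sum((t - p) ** 2 for t, p in zip(target_patch, pred_patch)) / len(target_patch)
        for target_patch, pred_patch, is_masked in zip(target, prediction, mask)
        if is_masked
    ]
    return sum(errors) / len(errors) if errors else 0.0


def total_loss(reconstruction_loss, aux_losses, aux_weights):
    """Reconstruction loss plus a weighted sum of auxiliary alignment losses.

    The keys (e.g. "w2v_bert", "rmvpe") and weights are illustrative assumptions.
    """
    return reconstruction_loss + sum(
        aux_weights[name] * value for name, value in aux_losses.items()
    )


if __name__ == "__main__":
    mask = random_patch_mask(num_patches=16, mask_ratio=0.75, seed=0)
    print(sum(mask), "of", len(mask), "patches masked")
```

Computing the reconstruction error only over masked patches is the usual MAE convention; whether WaveMAE also scores visible patches, and how it weights the Wav2Vec2-BERT and RMVPE terms, is not stated here.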

The project is structured into several phases, from initial scaffolding and data pipeline implementation to systematic experiments and model analysis.

Project Structure

AIDocs/: Contains the project plan and technical specifications.
conf/: Hydra configuration files for the model, training, and experiments.
data/: Data loading and preprocessing scripts.
models/: PyTorch model definitions for the WaveMAE architecture.
scripts/: Training, evaluation, and utility scripts.
pretrained_models/: Contains the pre-trained RMVPE model and its source code.

Setup

Create and activate the virtual environment:
```
bash setup.sh
source .venv/bin/activate
```
Run the training script:
```
python scripts/train.py
```
