Uplift AI
Provides raw audio data for pretraining foundation models in low-resource languages.

About Uplift AI
Uplift AI: Raw audio data for pretraining foundation models
Uplift AI, founded by former engineers from Apple Siri, Amazon Alexa, and AWS Bedrock, is creating a large-scale audio dataset for pretraining foundation models. The company's goal is to unlock new model capabilities across hundreds of low-resource languages. Its mission is to accelerate the transition to voice interfaces, especially for the 13% of the global population who cannot read, giving them access to digital knowledge and services. Uplift AI focuses exclusively on low-resource languages and is backed by investors including Y Combinator, IVC, and RTP.
Dataset Characteristics
The dataset is designed with the following principles:
Massive scale, without skew: The dataset is built for massive scale but avoids extreme domain imbalances to help base models become better generalists.
Real conversations, real settings: It features audio of real people having conversations in real settings, moving away from 'produced' content found in movies, podcasts, and news.
Diverse acoustic environments: The data includes realistic and varied acoustic environments, such as a person driving a tractor, cooking, in public transit, operating a forklift, hiking, or in a plane.
Globally diverse, locally diverse: The dataset aims for balanced representation across geography, occupations, age, income, accent, and language.
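As an illustration of what "without skew" and "balanced representation" could mean in practice, the sketch below scores how evenly audio hours are spread across one metadata field, using normalized Shannon entropy (1.0 means perfectly even, values near 0 mean one category dominates). This is not Uplift AI's tooling or manifest format: the record layout and the field names "environment" and "duration_s" are hypothetical, chosen only for the example.

```python
import math
from collections import defaultdict


def balance_score(records, field, duration_field="duration_s"):
    """Return normalized Shannon entropy of audio duration across `field`.

    `records` is a list of dicts; the field names are hypothetical
    placeholders, not a documented Uplift AI schema.
    """
    # Sum audio duration per category (e.g. per acoustic environment).
    seconds = defaultdict(float)
    for record in records:
        seconds[record[field]] += record[duration_field]

    total = sum(seconds.values())
    if len(seconds) < 2 or total <= 0:
        return 0.0  # a single category (or no audio) cannot be balanced

    # Shannon entropy of the duration shares, skipping empty categories.
    shares = [s / total for s in seconds.values() if s > 0]
    entropy = -sum(p * math.log(p) for p in shares)

    # Divide by the maximum possible entropy so the score lies in [0, 1].
    return entropy / math.log(len(seconds))


# Illustrative records only; these are not real dataset entries.
even = [
    {"environment": "kitchen", "duration_s": 60.0},
    {"environment": "public_transit", "duration_s": 60.0},
]
skewed = [
    {"environment": "kitchen", "duration_s": 90.0},
    {"environment": "public_transit", "duration_s": 10.0},
]
print(balance_score(even, "environment"))
print(balance_score(skewed, "environment"))
```

The same function could be applied to any other categorical field, such as language, accent, or age band, to compare balance across the dimensions listed above.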
Use Cases
The primary use case for the data is training state-of-the-art (SOTA) voice models and pretraining foundation models for low-resource languages. Uplift AI also partners with research teams to design pretraining data distributions for any use case.
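To make "designing a pretraining data distribution" concrete, here is a minimal sketch of one common approach from multilingual pretraining: exponent-smoothed (temperature-style) sampling, where each language's share of the data is raised to a power alpha between 0 and 1 so that smaller languages are sampled more often than their raw share. This is an assumption about how a team might proceed, not a description of Uplift AI's method; the function name, language labels, and hour counts are hypothetical.

```python
def sampling_weights(hours_by_language, alpha=0.5):
    """Return per-language sampling probabilities proportional to share**alpha.

    alpha=1.0 reproduces the raw distribution; smaller alpha flattens it,
    upsampling languages with fewer hours. Illustrative sketch only.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")

    total = sum(hours_by_language.values())
    if total <= 0:
        raise ValueError("total hours must be positive")

    # Raise each language's share of the total to the power alpha.
    scaled = {
        lang: (hours / total) ** alpha
        for lang, hours in hours_by_language.items()
    }

    # Renormalize so the weights sum to 1.
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Hypothetical hour counts for two placeholder languages.
hours = {"language_a": 900.0, "language_b": 100.0}
print(sampling_weights(hours, alpha=1.0))
print(sampling_weights(hours, alpha=0.5))
```

With these placeholder numbers, alpha=0.5 raises the smaller language's sampling share from 10% to 25%; the right value of alpha depends on the model and the target use case.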
Getting Started
Users can acquire data through a four-step process:
Step 1: Request Sample: Set up a call to discuss your use case and receive relevant data samples.
Step 2: Purchase Access: Enter into a data license agreement for the dataset and use cases your team needs.
Step 3: Receive Data: For existing data, access is granted to your team within two to four days.
Step 4: Experiment with Us: Partner with the Uplift AI team to design pretraining data distributions for your specific use case.
To begin, visit the website: https://upliftai.org