
    Research Data Engineer

    Smallest

    Engineering
    Full-time
    On-site
    Bengaluru

    Posted on 4/8/2026

    Job Description

    Research Data Engineer (India) — Smallest.ai

    About the Role

    This is not a typical data engineering role. You won’t be building dashboards. You won’t be maintaining pipelines no one touches.

    You will take messy, noisy, real-world data and turn it into something models can learn from. Think of it as running a gold mine: you take dust and turn it into gold.

    We work on speech, language, and real-time systems across 50+ languages.
    The difference between a good model and a great one is almost always data quality and data systems. That’s where you come in.

    What You’ll Work On

    • Data Pipelines (Real-time + Batch)

      • Build high-throughput pipelines for audio, text, and multimodal data

      • Streaming + offline processing at scale

    • Data Quality & Curation

      • Cleaning, filtering, deduplication, normalization (numbers, emails, code-mix, etc.)

      • Designing heuristics + ML-based data filtering systems

    • Multilingual Data Systems

      • Handling 50+ languages, accents, and code-mixed inputs

      • Language-aware normalization and segmentation

    • Training Data Engine

      • Build pipelines that continuously generate better training data from production

      • Active learning loops, data selection, sampling strategies

    • Evaluation & Benchmarking Pipelines

      • Create scalable eval datasets across languages and domains

      • Automate quality tracking for ASR, TTS, and conversational systems

    • Data Infra for Research

      • Work closely with the research team to unblock experiments fast

      • Build systems that reduce iteration time from weeks to hours

    What This Role Is NOT

    • Not a dashboard/reporting role

    • Not a “move data from A to B” role

    • Not a maintenance-heavy legacy pipeline role

    What We’re Looking For

    • Strong fundamentals in data structures, systems, and pipelines

    • Experience with large-scale data processing (audio/text preferred)

    • Comfortable with messy, unstructured, real-world data

    • Strong coding skills (Python required; systems experience is a plus)

    • Understanding of ML/data pipelines (training, eval, data curation)

    Bonus (Not Mandatory)

    • Experience with speech/audio data (ASR/TTS)

    • Familiarity with multilingual datasets

    • Experience with streaming systems (Kafka, etc.)

    • Exposure to data-centric AI / data quality frameworks

    How We Work

    • Speed over perfection

    • Production over papers

    • Systems that scale, not scripts that barely work

    • Tight loop between data → model → eval → improvement

    Who This Is For

    • You enjoy working with raw, chaotic data

    • You care about data quality more than tooling hype

    • You like building systems that directly impact model performance

    • You get excited by turning unusable data into competitive advantage

    Why Join Us

    We’re building real-time, multilingual voice AI systems.

    At this level, models are only as good as the data behind them.

    If you want to work on the layer that actually moves the needle, this is it.