Karya

Data Curation Intern

Posted 15 Days Ago
In-Office
Bengaluru, Bengaluru Urban, Karnataka, IND
Internship
The Data Curation Intern will curate datasets for AI/ML model training, focusing on cleaning text data and preparing speech data for machine learning applications.

About Karya:

Why was Karya on the cover of TIME magazine, highlighted by Satya Nadella, and invited to present its work to Sundar Pichai one on one?
In part, because Karya is on a mission to provide AI-enabled earning and learning opportunities to communities with high talent but low access to opportunities. Karya achieves this while also delivering high-quality, timely, and price-competitive data to its clients.
Karya builds high-quality datasets for large companies like Google and Microsoft, while providing ethical work opportunities and fair wages to its workforce.
Karya’s workers earn nearly 20 times the Indian minimum wage. Through our one-of-a-kind digital work platform, we have delivered over 40 million digital tasks and positively impacted over 100 thousand workers. In the coming years, our goal is to rapidly scale our impact by bringing economic opportunities to millions of underserved users in India. With a rapidly growing global presence, we are also looking to expand our client base in the Indian market by partnering with leading Indian enterprises.

About the Role

We are looking for a detail-oriented and curious Data Curation Intern to help build high-quality datasets for training AI/ML models, with a specific focus on Indian-language and multilingual data. You will work with large open-source datasets (e.g., Sangraha by AI4Bharat) that require significant cleaning, structuring, and enrichment before they can be used effectively in model training pipelines.

This is a hands-on, high-impact role at the intersection of data engineering, linguistics, and AI. You will start with text data pipelines and progressively move toward preparing data for read-speech and voice model training.

What You'll Do

Phase 1: Text Data Curation
Audit and profile open-source datasets (Sangraha, Common Crawl, IndicCorp, etc.) to assess quality, coverage, and noise levels
Design and implement data cleaning pipelines: deduplication, script normalisation, encoding fixes, noise removal, sentence boundary detection
Create and apply metadata tagging schemas that label text by domain (news, legal, literature, health, etc.), subdomain, language, register, and quality tier
Build validation checklists and quality scorecards to benchmark dataset readiness for model training
Document data provenance, licensing, and processing steps for reproducibility
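
For candidates wondering what the cleaning work above might look like in practice, here is a minimal sketch covering Unicode normalisation, noise removal, and exact deduplication. It is an illustration only, not Karya's actual pipeline: the function names `normalise` and `deduplicate` are hypothetical, and real pipelines would typically add near-duplicate detection and language-specific rules.

```python
import hashlib
import re
import unicodedata


def normalise(text: str) -> str:
    """NFC-normalise, collapse whitespace, and drop control characters.

    Format characters such as ZWJ/ZWNJ (category Cf) are kept on purpose,
    because they can be meaningful in Indic scripts.
    """
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def deduplicate(lines):
    """Return cleaned lines in original order, dropping empties and exact repeats."""
    seen = set()
    unique = []
    for line in lines:
        cleaned = normalise(line)
        if not cleaned:
            continue
        # Hash the cleaned text so the "seen" set stays small on large corpora.
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    sample = ["नमस्ते  दुनिया", "नमस्ते दुनिया ", "", "hello\x00 world"]
    for line in deduplicate(sample):
        print(line)
```

Normalising before hashing is the key design choice here: two lines that differ only in spacing or Unicode composition are treated as the same sentence.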

Phase 2: Speech & Voice Data Preparation
Curate high-quality, phonetically diverse text passages suitable for read-speech recording
Ensure text selection covers domain, prosodic, and phonemic variety required for TTS/ASR model training
Assist in defining metadata standards for audio datasets (speaker demographics, recording conditions, transcription format)
Support the pipeline transition from text corpus to aligned speech dataset
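
One way to think about selecting passages for variety is as a greedy coverage problem. The sketch below is a simplified illustration under stated assumptions: it uses individual letters and combining marks as a rough stand-in for phonemes (a real workflow would more likely use a grapheme-to-phoneme tool and a proper phoneme inventory), and the names `coverage_units` and `select_passages` are hypothetical.

```python
import unicodedata


def coverage_units(text: str) -> set:
    """Letters and combining marks in the text (a crude proxy for phonemes)."""
    return {ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")}


def select_passages(passages, budget: int):
    """Greedily pick up to `budget` passages, each adding the most unseen units."""
    covered = set()
    chosen = []
    remaining = list(passages)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda p: len(coverage_units(p) - covered))
        gain = coverage_units(best) - covered
        if not gain:
            break  # No remaining passage adds anything new.
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    candidates = ["aaa", "abc", "cd", "ab"]
    print(select_passages(candidates, budget=2))
```

The greedy rule stops early once no candidate contributes a new unit, so the recording budget is not spent on redundant passages.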

What We're Looking For

Must Have
Strong attention to detail — you notice inconsistencies others miss
Comfort with Python for data processing (pandas, regex, basic NLP libraries like spaCy or NLTK)
Familiarity with text data formats: CSV, JSONL, Parquet, plain text corpora
Curiosity about AI/ML, language technology, or computational linguistics
Ability to work independently, document work clearly, and communicate blockers early
Good to Have
Prior exposure to NLP datasets or open-source language resources (IndicNLP, AI4Bharat, Hugging Face datasets)
Knowledge of one or more Indian languages beyond English
Experience with data versioning tools (DVC, Git-LFS) or dataset platforms (Hugging Face Hub)
Basic understanding of how language models or speech models are trained

Why This Role

Work directly on real data pipelines that feed AI model training — not toy projects
Gain hands-on experience with large-scale multilingual and Indic language datasets
Build skills that are in high demand across AI labs, speech companies, and NLP startups
Clear progression path: text → read speech → voice data, with increasing responsibility
Mentorship from people who have built data and AI systems at scale

Karya celebrates diversity and is an equal opportunity employer. All applicants will be considered without regard to race, religion, gender identity, sexual orientation, disability, or any other protected status.

Karya Bengaluru, Karnataka, IND Office

Bengaluru, India
