ISSAI Research Internship: NLP & Cultural AI

As a Research Intern at the Institute of Smart Systems and Artificial Intelligence (ISSAI) at Nazarbayev University, I worked on several projects at the intersection of Natural Language Processing (NLP) and Cultural AI, focusing on low-resource languages like Kazakh.
KazCulture Dataset & Cultural LLM Benchmark
My primary contribution was to a benchmark for evaluating how well LLMs capture Kazakh cultural knowledge.
- Dataset Creation: Contributed to the creation of 16,137 Kazakh cultural knowledge triplets (subject-relation-object), covering various aspects of Kazakh history, literature, and traditions.
- Quality Assurance: Performed rigorous annotation QA and inter-annotator agreement checks (Cohen's κ) to ensure high-quality data.
- Publication: Co-authored the manuscript "Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture" (submitted to IEEE Access; Manuscript ID: Access-2025-53012).
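For readers unfamiliar with the agreement check mentioned above, here is a minimal sketch of Cohen's kappa for two annotators. The label values (`"ok"`, `"bad"`) and the function name are illustrative assumptions, not the project's actual annotation scheme or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: was each triplet judged correct ("ok") or not ("bad")?
annotator_1 = ["ok", "ok", "bad", "bad"]
annotator_2 = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Unlike raw percent agreement, kappa discounts matches that two annotators would produce by chance, which matters when one label dominates the data.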
Multilingual Conversational AI Agent
I worked on the development and deployment of a multilingual conversational AI agent capable of interacting in Kazakh, Russian, and English.
- Interface Prototyping: Prototyped a Telegram bot interface to make the model accessible to users.
- HPC Deployment: Supported internal deployment and testing of KazLLM on the university's High-Performance Computing (HPC) cluster. This involved managing SLURM job submissions, monitoring resources, and log-based debugging to ensure stable operation.
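As a rough illustration of what a SLURM job submission involves, the sketch below builds a batch script as a string. The job name, the `serve.py` command, the GPU count, and the time limit are all hypothetical placeholders; the cluster's real partitions and resource settings are not described here.

```python
def build_sbatch_script(job_name, command, gpus=1, time_limit="02:00:00", log_dir="logs"):
    """Return the text of a SLURM batch script (all values are illustrative)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",              # request GPUs as generic resources
        f"#SBATCH --time={time_limit}",            # wall-clock limit, HH:MM:SS
        f"#SBATCH --output={log_dir}/%x-%j.out",   # %x = job name, %j = job ID
        "",
        "set -euo pipefail",                       # fail fast so errors show up in the log
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage: write the script to a file, then submit it with `sbatch`.
script = build_sbatch_script("kazllm-test", "python serve.py")
```

Writing one log file per job ID keeps log-based debugging manageable, since each run's output can be traced back to a specific submission.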
GPT-2 Reproduction & Training
To deepen my understanding of LLM architectures, I worked on training decoder-only Transformers.
- Pipeline Reproduction: Reproduced a GPT-2-style training pipeline based on Andrej Karpathy's tutorial.
- Custom Corpus: Adapted the pipeline to train on a custom Kazakh corpus.
- Data Engineering: Implemented a data-cleaning pipeline with deduplication and length/quality filters.
- Debugging: Diagnosed and fixed a critical preprocessing inconsistency that was affecting training convergence.
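The cleaning steps listed above can be sketched roughly as follows. This is a simplified illustration, not the pipeline I actually used: the thresholds, the letter-ratio quality heuristic, and the function names are assumptions chosen for the example.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so near-identical copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs, min_chars=20, max_chars=10_000, min_alpha_ratio=0.6):
    """Filter by length and a crude quality heuristic, then drop exact duplicates.

    All thresholds are illustrative defaults, not tuned values.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if not norm:
            continue
        # Length filter: drop fragments and runaway documents.
        if not (min_chars <= len(norm) <= max_chars):
            continue
        # Quality filter: require that most characters are letters or spaces.
        alpha = sum(ch.isalpha() or ch.isspace() for ch in norm)
        if alpha / len(norm) < min_alpha_ratio:
            continue
        # Deduplication on a hash of the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc.strip())
    return kept


docs = [
    "Абай Құнанбайұлы қазақ ақыны.",
    "абай  құнанбайұлы қазақ ақыны.",    # duplicate after normalization
    "қысқа",                             # too short
    "1234567890 1234567890 1234567890",  # mostly digits
]
cleaned = clean_corpus(docs, min_chars=10)
```

Applying the same `normalize` step at every stage is one way to avoid the kind of preprocessing inconsistency that can quietly hurt convergence.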
Kazakh Speech Corpus Collection
I spearheaded the collection of a speech corpus to support Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) research for the Kazakh language.
- Protocol Design: Designed the data collection protocol to ensure diverse and high-quality audio samples.
- Coordination: Recruited participants and coordinated the collection process, performing basic QA on the recordings.
- Usability Study: Assisted in a usability study with 38 participants and summarized recurring ASR error patterns to guide future improvements.
Technical Skills Applied
- NLP & ML: PyTorch, Transformers, LLM Evaluation
- Data Science: Dataset Curation, Annotation QA, Cohen's Kappa
- Infrastructure: Linux, HPC, SLURM, Docker
- Development: Python, Telegram API