ISSAI Research Internship: NLP & Cultural AI

May 1, 2024

IEEE Access Submission

As a Research Intern at the Institute of Smart Systems and Artificial Intelligence (ISSAI) at Nazarbayev University – Kazakhstan's leading AI research lab with a 10% intern acceptance rate – I worked on several projects at the intersection of Natural Language Processing (NLP) and Cultural AI, focusing on low-resource languages like Kazakh.

KazCulture Dataset & Cultural LLM Benchmark

My primary contribution was to the development of a cultural LLM evaluation benchmark for Kazakh culture – the first replicable cultural AI benchmark for the Kazakh language.

  • Dataset Creation: Contributed to the creation of 16,137 Kazakh cultural knowledge triplets (subject-relation-object), covering various aspects of Kazakh history, literature, and traditions.
  • Comparative Analysis: Led comparative analysis of Saudi & Korean cultural AI benchmarks to inform our methodology.
  • Data Curation: Curated Q&A pairs from 11 Kazakh texts covering traditions, cuisine, games, and folklore.
  • Quality Assurance: Performed rigorous annotation QA and inter-annotator agreement checks (Cohen's κ) to ensure high-quality data.
  • Publication: Co-authored the manuscript "Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture" (submitted to IEEE Access; Manuscript ID: Access-2025-53012).
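The inter-annotator agreement checks mentioned above rely on Cohen's κ, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the computation for two annotators (the labels and the "ok"/"bad" scheme are illustrative, not from the actual dataset):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each annotator's label
    frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative example: two annotators judging triplet quality.
ann1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
ann2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.667
```

Values above ~0.6 are conventionally read as substantial agreement, which is why κ rather than raw percent agreement is the standard QA check for annotation projects.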

Multilingual Conversational AI Agent

I worked on the development and deployment of a multilingual conversational AI agent capable of interacting in Kazakh, Russian, and English.

  • Interface Prototyping: Prototyped a Telegram bot interface to make the model accessible to users.
  • HPC Deployment: Supported internal deployment and testing of KazLLM on the university's High-Performance Computing (HPC) cluster. This involved managing SLURM job submissions, monitoring resources, and log-based debugging to ensure stable operation.
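A trilingual bot needs some way to route each incoming message to the right language before responding. The actual routing logic is not described in the source; a crude script-based heuristic, assuming detection by Kazakh-specific Cyrillic letters, might look like:

```python
# Letters unique to the Kazakh Cyrillic alphabet (absent from Russian).
KAZAKH_LETTERS = set("әғқңөұүһіӘҒҚҢӨҰҮҺІ")

def detect_language(text):
    """Heuristic routing: Kazakh-specific letters first, then generic
    Cyrillic (treated as Russian), otherwise default to English."""
    if any(ch in KAZAKH_LETTERS for ch in text):
        return "kk"
    if any("\u0400" <= ch <= "\u04FF" for ch in text):  # Cyrillic block
        return "ru"
    return "en"

print(detect_language("Сәлем! Қалайсың?"))   # kk
print(detect_language("Привет, как дела?"))  # ru
print(detect_language("Hello there"))        # en
```

In a Telegram bot, a function like this would sit between the update handler and the model call, selecting the prompt language for the LLM; a production system would use a proper language-identification model instead of a character heuristic.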

GPT-2 Reproduction & Training

To deepen my understanding of LLM architectures, I worked on training decoder-only Transformers on high-performance computing infrastructure.

  • HPC Training: Trained GPT-2-style models on the 72-GPU NU HPC supercomputer (DGX A100 environment), managing SLURM job submissions, monitoring, and log-based debugging.
  • Pipeline Reproduction: Reproduced a GPT-2-style training pipeline based on Andrej Karpathy's tutorial.
  • Custom Corpus: Adapted the pipeline to train on a custom Kazakh corpus.
  • Data Engineering: Implemented comprehensive data cleaning pipelines, including deduplication and length/quality filters.
  • Critical Bug Discovery: Diagnosed and fixed a preprocessing inconsistency that senior researchers had missed, which was affecting training convergence – demonstrating attention to detail in large-scale ML pipelines.
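The deduplication and length-filtering steps above can be sketched as a single pass over raw corpus lines. The thresholds and whitespace normalisation here are illustrative assumptions, not the pipeline's actual settings:

```python
import hashlib
import re

def clean_corpus(lines, min_chars=20, max_chars=2000):
    """Deduplicate and length-filter raw corpus lines.

    - Normalises whitespace before hashing, so lines that differ only
      in spacing collapse to one entry.
    - Drops lines outside a [min_chars, max_chars] window.
    Thresholds are hypothetical placeholders.
    """
    seen, kept = set(), []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # length filter
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalisation
        seen.add(digest)
        kept.append(text)
    return kept

raw = [
    "Қазақ тілі — түркі тілдерінің қыпшақ тобына жатады.",
    "Қазақ тілі — түркі тілдерінің   қыпшақ тобына жатады.",  # spacing variant
    "short",  # below the length floor
]
print(len(clean_corpus(raw)))  # 1
```

Hashing normalised text keeps memory bounded on large corpora; fuzzy near-duplicate detection (e.g. MinHash) would be the next step beyond this exact-match sketch.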

Kazakh Speech Corpus Collection

I spearheaded the collection of a speech corpus to support Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) research for the Kazakh language.

  • Protocol Design: Designed the data collection protocol to ensure diverse and high-quality audio samples.
  • Coordination: Recruited participants and coordinated the collection process, performing basic QA on the recordings.
  • Trilingual Speech Study: Led a 38-person trilingual speech study covering Kazakh, Russian, and English.
  • Error Analysis: Summarized recurring ASR error patterns to guide future improvements in speech recognition systems.
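A standard way to quantify the ASR error patterns mentioned above is word error rate (WER): the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. The source does not state which metric was used; a minimal implementation is:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / ref words,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 ref words and first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substituted word out of three -> WER = 1/3 (transcripts are illustrative).
print(round(wer("сәлем қалайсың бүгін", "сәлем калайсын бүгін"), 3))  # 0.333
```

Grouping errors by type (substitution vs. deletion vs. insertion) and by phoneme class is what turns a single WER number into the recurring error patterns used to guide model improvements.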

Technical Skills Applied

  • NLP & ML: PyTorch, Transformers, LLM Evaluation
  • Data Science: Dataset Curation, Annotation QA, Cohen's Kappa
  • Infrastructure: Linux, HPC, SLURM, Docker
  • Development: Python, Telegram API
Taizhanov Nurbek