ISSAI Research Internship: NLP & Cultural AI

May 1, 2024

IEEE Access Submission

As a Research Intern at the Institute of Smart Systems and Artificial Intelligence (ISSAI) at Nazarbayev University – Kazakhstan's leading AI research lab with a 10% intern acceptance rate – I worked on several projects at the intersection of Natural Language Processing (NLP) and Cultural AI, focusing on low-resource languages like Kazakh.

KazCulture Dataset & Cultural LLM Benchmark

My primary contribution was to the development of a cultural LLM evaluation benchmark for Kazakh culture – the first replicable cultural AI benchmark for the Kazakh language.

  • Dataset Creation: Contributed to the creation of 16,137 Kazakh cultural knowledge triplets (subject-relation-object), covering various aspects of Kazakh history, literature, and traditions.
  • Comparative Analysis: Led comparative analysis of Saudi & Korean cultural AI benchmarks to inform our methodology.
  • Data Curation: Curated Q&A pairs from 11 Kazakh texts covering traditions, cuisine, games, and folklore.
  • Quality Assurance: Performed rigorous annotation QA and inter-annotator agreement checks (Cohen's κ) to ensure high-quality data.
  • Publication: Co-authored the manuscript "Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture" (submitted to IEEE Access; Manuscript ID: Access-2025-53012).
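The inter-annotator agreement checks mentioned above rely on Cohen's κ, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the computation for two annotators (the labels and the "ok"/"bad" scheme are illustrative, not from the actual dataset):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each annotator's label
    frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative example: two annotators judging triplet quality.
ann1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
ann2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.667
```

Values above ~0.6 are conventionally read as substantial agreement, which is why κ rather than raw percent agreement is the standard QA check for annotation projects.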

Multilingual Conversational AI Agent

I worked on the development and deployment of a multilingual conversational AI agent capable of interacting in Kazakh, Russian, and English.

  • Interface Prototyping: Prototyped a Telegram bot interface to make the model accessible to users.
  • HPC Deployment: Supported internal deployment and testing of KazLLM on the university's High-Performance Computing (HPC) cluster. This involved managing SLURM job submissions, monitoring resources, and log-based debugging to ensure stable operation.
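A trilingual bot needs some way to route each incoming message to the right language before responding. The actual routing logic is not described in the source; a crude script-based heuristic, assuming detection by Kazakh-specific Cyrillic letters, might look like:

```python
# Letters unique to the Kazakh Cyrillic alphabet (absent from Russian).
KAZAKH_LETTERS = set("әғқңөұүһіӘҒҚҢӨҰҮҺІ")

def detect_language(text):
    """Heuristic routing: Kazakh-specific letters first, then generic
    Cyrillic (treated as Russian), otherwise default to English."""
    if any(ch in KAZAKH_LETTERS for ch in text):
        return "kk"
    if any("\u0400" <= ch <= "\u04FF" for ch in text):  # Cyrillic block
        return "ru"
    return "en"

print(detect_language("Сәлем! Қалайсың?"))   # kk
print(detect_language("Привет, как дела?"))  # ru
print(detect_language("Hello there"))        # en
```

In a Telegram bot, a function like this would sit between the update handler and the model call, selecting the prompt language for the LLM; a production system would use a proper language-identification model instead of a character heuristic.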

GPT-2 Reproduction & Training

To deepen my understanding of LLM architectures, I worked on training decoder-only Transformers on high-performance computing infrastructure.

  • HPC Training: Trained GPT-2-style models on the 72-GPU NU HPC supercomputer (DGX A100 environment), managing SLURM job submissions, monitoring, and log-based debugging.
  • Pipeline Reproduction: Reproduced a GPT-2-style training pipeline based on Andrej Karpathy's tutorial.
  • Custom Corpus: Adapted the pipeline to train on a custom Kazakh corpus.
  • Data Engineering: Implemented comprehensive data cleaning pipelines, including deduplication and length/quality filters.
  • Critical Bug Discovery: Diagnosed and fixed a preprocessing inconsistency that senior researchers had missed, which was affecting training convergence – demonstrating attention to detail in large-scale ML pipelines.
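The deduplication and length-filtering steps above can be sketched as a single pass over raw corpus lines. The thresholds and whitespace normalisation here are illustrative assumptions, not the pipeline's actual settings:

```python
import hashlib
import re

def clean_corpus(lines, min_chars=20, max_chars=2000):
    """Deduplicate and length-filter raw corpus lines.

    - Normalises whitespace before hashing, so lines that differ only
      in spacing collapse to one entry.
    - Drops lines outside a [min_chars, max_chars] window.
    Thresholds are hypothetical placeholders.
    """
    seen, kept = set(), []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()
        if not (min_chars <= len(text) <= max_chars):
            continue  # length filter
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalisation
        seen.add(digest)
        kept.append(text)
    return kept

raw = [
    "Қазақ тілі — түркі тілдерінің қыпшақ тобына жатады.",
    "Қазақ тілі — түркі тілдерінің   қыпшақ тобына жатады.",  # spacing variant
    "short",  # below the length floor
]
print(len(clean_corpus(raw)))  # 1
```

Hashing normalised text keeps memory bounded on large corpora; fuzzy near-duplicate detection (e.g. MinHash) would be the next step beyond this exact-match sketch.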

Kazakh Speech Corpus Collection

I spearheaded the collection of a speech corpus to support Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) research for the Kazakh language.

  • Protocol Design: Designed the data collection protocol to ensure diverse and high-quality audio samples.
  • Coordination: Recruited participants and coordinated the collection process, performing basic QA on the recordings.
  • Trilingual Speech Study: Led a 38-person trilingual speech study covering Kazakh, Russian, and English.
  • Error Analysis: Summarized recurring ASR error patterns to guide future improvements in speech recognition systems.
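A standard way to quantify the ASR error patterns mentioned above is word error rate (WER): the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length. The source does not state which metric was used; a minimal implementation is:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / ref words,
    computed via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 ref words and first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substituted word out of three -> WER = 1/3 (transcripts are illustrative).
print(round(wer("сәлем қалайсың бүгін", "сәлем калайсын бүгін"), 3))  # 0.333
```

Grouping errors by type (substitution vs. deletion vs. insertion) and by phoneme class is what turns a single WER number into the recurring error patterns used to guide model improvements.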

Technical Skills Applied

  • NLP & ML: PyTorch, Transformers, LLM Evaluation
  • Data Science: Dataset Curation, Annotation QA, Cohen's Kappa
  • Infrastructure: Linux, HPC, SLURM, Docker
  • Development: Python, Telegram API
Taizhanov Nurbek