ISSAI Research Internship: NLP & Cultural AI

May 1, 2024

IEEE Access Submission

As a Research Intern at the Institute of Smart Systems and Artificial Intelligence (ISSAI) at Nazarbayev University, I worked on several projects at the intersection of Natural Language Processing (NLP) and Cultural AI, with a focus on low-resource languages such as Kazakh.

KazCulture Dataset & Cultural LLM Benchmark

My primary contribution was to a benchmark for evaluating how well LLMs capture Kazakh cultural knowledge.

  • Dataset Creation: Contributed to the creation of 16,137 Kazakh cultural knowledge triplets (subject-relation-object), covering various aspects of Kazakh history, literature, and traditions.
  • Quality Assurance: Performed rigorous annotation QA and inter-annotator agreement checks (Cohen's κ) to ensure high-quality data.
  • Publication: Co-authored the manuscript "Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture" (submitted to IEEE Access; Manuscript ID: Access-2025-53012).
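
For readers unfamiliar with the agreement metric mentioned above, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that formula only, not the project's actual QA tooling; the function name and the "ok"/"bad" labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, multiply the annotators' marginal rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical validity judgments on four triplets.
annotator_1 = ["ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "bad", "bad", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, so κ = 0.5.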

Multilingual Conversational AI Agent

I worked on the development and deployment of a multilingual conversational AI agent capable of interacting in Kazakh, Russian, and English.

  • Interface Prototyping: Prototyped a Telegram bot interface to make the model accessible to users.
  • HPC Deployment: Supported internal deployment and testing of KazLLM on the university's High-Performance Computing (HPC) cluster. This involved managing SLURM job submissions, monitoring resources, and log-based debugging to ensure stable operation.
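
As an illustration of what a SLURM job submission involves, the sketch below renders a small batch script as text. It is an assumption-laden example rather than the cluster's real configuration: the helper name, job name, command, GPU count, time limit, and log directory are all hypothetical. Only the `#SBATCH` directives themselves (`--job-name`, `--gres`, `--time`, `--output`) and the `%x`/`%j` filename patterns are standard SLURM features.

```python
def render_sbatch_script(job_name, command, gpus=1, time_limit="01:00:00", log_dir="logs"):
    """Return the text of a SLURM batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",          # request GPUs as a generic resource
        f"#SBATCH --time={time_limit}",        # wall-clock limit, HH:MM:SS
        f"#SBATCH --output={log_dir}/%x-%j.out",  # %x = job name, %j = job ID
        "",
        "set -euo pipefail",  # fail fast so errors show up clearly in the log
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the script name and flag are placeholders.
script = render_sbatch_script("llm-smoke-test", "python run_inference.py --max-new-tokens 64")
```

The resulting text would be saved to a file and submitted with `sbatch`; writing one log file per job ID keeps log-based debugging manageable when several test runs overlap.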

GPT-2 Reproduction & Training

To deepen my understanding of LLM architectures, I worked on training decoder-only Transformers.

  • Pipeline Reproduction: Reproduced a GPT-2-style training pipeline based on Andrej Karpathy's tutorial.
  • Custom Corpus: Adapted the pipeline to train on a custom Kazakh corpus.
  • Data Engineering: Implemented data cleaning pipelines for the corpus, including deduplication and length/quality filters.
  • Debugging: Diagnosed and fixed a critical preprocessing inconsistency that was affecting training convergence.
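
The deduplication and length filtering steps can be sketched roughly as follows. This is a simplified illustration, not the pipeline used in the project: it assumes exact-match deduplication on whitespace-normalized, lowercased text, and the function name and character thresholds are hypothetical.

```python
import hashlib
import re


def clean_corpus(docs, min_chars=20, max_chars=10_000):
    """Drop documents outside the length bounds, then drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        # Collapse runs of whitespace so formatting differences do not matter.
        text = re.sub(r"\s+", " ", doc).strip()

        # Length filter (thresholds are illustrative assumptions).
        if not (min_chars <= len(text) <= max_chars):
            continue

        # Exact deduplication on a case-insensitive hash of the text.
        key = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept


# Toy input: a duplicate differing in case and spacing, a too-short line,
# and a too-long line.
raw = ["Сәлем  әлем", "сәлем әлем", "қысқа", "x" * 50]
cleaned = clean_corpus(raw, min_chars=8, max_chars=40)
```

Only the first document survives here. Applying the same normalization at every stage of preprocessing is also one way to avoid the kind of inconsistency that can quietly hurt training convergence.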

Kazakh Speech Corpus Collection

I spearheaded the collection of a speech corpus to support Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) research for the Kazakh language.

  • Protocol Design: Designed the data collection protocol to ensure diverse and high-quality audio samples.
  • Coordination: Recruited participants and coordinated the collection process, performing basic QA on the recordings.
  • Usability Study: Assisted in a usability study with 38 participants and summarized recurring ASR error patterns to guide future improvements.

Technical Skills Applied

  • NLP & ML: PyTorch, Transformers, LLM Evaluation
  • Data Science: Dataset Curation, Annotation QA, Cohen's Kappa
  • Infrastructure: Linux, HPC, SLURM, Docker
  • Development: Python, Telegram API

Taizhanov Nurbek