Saheed Azeez, a University of Lagos student, created Naijaweb, a 230-million-token dataset sourced from Nairaland to support large language model training, showcasing his technical skills and the potential of Nigeria’s growing AI community despite infrastructural challenges.


Saheed Azeez, a final-year Mechanical Engineering student at the University of Lagos, has created Naijaweb, a dataset of 230 million GPT-2 tokens sourced from Nairaland, Nigeria’s largest online forum. While Azeez describes the process as “easy” with the right knowledge of web scraping and data cleaning, his achievement represents a significant technical feat.
Naijaweb is not a chatbot or AI like ChatGPT; instead, it is a dataset designed to train large language models (LLMs)—the underlying technology of systems like ChatGPT. Compiling such a dataset required Azeez to extract and clean massive amounts of data from Nairaland, a process that involved advanced technical skills and careful planning.
Azeez began his journey in data science in 2019, taking a Python class primarily out of curiosity about machine learning. At the time, he believed machine learning involved teaching robots how to learn—a misconception he quickly corrected. Despite initial setbacks in competitions on platforms like Zindi, he gained essential skills and, by 2022, made his first attempt at web scraping Nairaland.
That first attempt was unsuccessful due to technical challenges. However, by leveraging open-source tools from platforms like Hugging Face, Azeez succeeded in creating Naijaweb this year. “I heard people talking a lot about the value Nairaland holds, so I decided to give web scraping it a shot,” he explained.
Creating Naijaweb involved more than just extracting forum posts. Azeez had to process the text and convert it into tokens that LLMs can understand. Tokens are the basic units of text an LLM reads: whole words or fragments of words, each mapped to a number. For example, the word “CALCULATED” might be broken down into tokens like “CAL,” “CU,” “LA,” and “TED,” with each assigned a specific number.
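To make the “CALCULATED” example concrete, here is a minimal sketch of how text can be split into tokens and turned into numbers. It is a toy illustration only: the vocabulary, the ID numbers, and the function names `tokenize` and `encode` are invented for this example, and the greedy longest-match rule stands in for the more involved byte-pair encoding that GPT-2 actually uses. It is not Azeez’s code.

```python
# Toy vocabulary: the pieces and their ID numbers are invented for illustration.
# A real GPT-2 vocabulary has about 50,000 entries learned from data.
TOY_VOCAB = {
    "CAL": 0, "CU": 1, "LA": 2, "TED": 3,
    "C": 4, "A": 5, "L": 6, "U": 7, "T": 8, "E": 9, "D": 10,
}


def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


def encode(text, vocab):
    """Convert text into the list of numbers a model would actually see."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("CALCULATED", TOY_VOCAB))  # ['CAL', 'CU', 'LA', 'TED']
print(encode("CALCULATED", TOY_VOCAB))    # [0, 1, 2, 3]
```

A dataset’s size in tokens, such as Naijaweb’s 230 million, is simply the total length of these number lists across every document it contains.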
Building the dataset demanded long stretches of computing time and uninterrupted electricity, challenges Azeez met with persistence. “I had to keep my laptop running for days,” he revealed.
Despite his achievements, Azeez faces significant obstacles in advancing his work. Training an LLM using Naijaweb would require powerful GPUs and constant electricity—resources that are expensive and often unavailable in Nigeria. He notes that building an LLM is not a solo effort but requires a team of skilled engineers, many of whom are already contributing to AI development globally.
“There are Nigerians with these skills—some of them have gone abroad for their PhDs,” Azeez said. He also praised local AI communities, such as Data Science Nigeria, which support students and professionals in the field, despite infrastructural challenges like unreliable power and limited access to high-performance computing tools.
Azeez’s interest in technology extends beyond this project. In 2022, he developed Tweet Shot, a bot that allows users to capture screenshots on Twitter. The bot gained significant traction, amassing 170,000 followers before it was sold to an undisclosed buyer.
Currently, Azeez works as a Machine Learning Engineer at HelpMum, a non-profit organization leveraging AI to improve maternal and infant healthcare in Nigeria. Balancing his job with his studies, he continues to explore the potential of AI, driven by a passion for innovation and problem-solving.
While Naijaweb may be a stepping stone, Azeez’s vision reflects the broader aspirations of Nigeria’s burgeoning AI community—one determined to overcome systemic barriers and make its mark in the global AI ecosystem.
SOURCES: ALLSCHOOL, TECH POINT