An Extensive Benchmark of LLMs and NLP Models for Iberian and Ibero-American Languages

The evaluation of Large Language Models (LLMs) presents significant challenges, particularly for languages other than English, where high-quality evaluation data is often scarce. Existing benchmarks and leaderboards are predominantly English-centric, and those that do address other languages often overlook the diversity of language varieties, prioritise fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static in nature.

To address these limitations, a group of researchers, myself included, has published the IberBench paper, which presents a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks in languages spoken across the Iberian Peninsula and Ibero-America.

IberBench is a large multilingual and multi-variety benchmark covering Spanish, Portuguese, Catalan, Basque, Galician, and English, as well as Mexican, Uruguayan, Peruvian, Costa Rican, and Cuban varieties of Spanish. It includes 101 datasets spanning 22 task types, such as sentiment and emotion analysis, toxicity detection, machine-generated text detection, and commonsense reasoning. The datasets come from shared tasks at evaluation campaigns such as IberLEF, IberEval, TASS, and PAN, as well as from more recent general-purpose LLM benchmarks. Notably, IberBench standardizes many of these workshop-sourced datasets, making them easier to find and reuse.
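
To give a sense of how such standardized datasets are typically consumed, here is a minimal sketch using the Hugging Face `datasets` library. It assumes the standardized tasks are distributed through the Hugging Face Hub (the leaderboard itself is hosted on Hugging Face Spaces); the repository identifier below is a hypothetical placeholder, not a confirmed dataset name from the paper.

```python
from datasets import load_dataset

# Hypothetical repository ID used only for illustration; check the
# IberBench organization on the Hugging Face Hub for the real dataset names.
DATASET_ID = "iberbench/sentiment-es"

# Load the test split of one standardized task.
ds = load_dataset(DATASET_ID, split="test")

# Standardized tasks typically expose a shared schema (e.g. a text field
# and a label field), which makes it easy to iterate over many datasets.
print(ds.features)
print(ds[0])
```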

The benchmark distinguishes between fundamental tasks, which evaluate core language proficiency and knowledge, and industry-relevant tasks, which carry direct economic value, such as content moderation or customer insights. Industry-relevant tasks primarily originate from workshop shared tasks, while fundamental tasks mostly come from established LLM benchmarks, highlighting a gap in current evaluation practices.

The benchmark’s architecture consists of four key components: the Leaderboard UI, an organization of NLP specialists, the datasets, and the LLM evaluation framework. The Leaderboard UI, hosted on HuggingFace Spaces, serves as the main interface for users to view rankings, plots, and reports, and to request the evaluation of new models or propose new datasets. The organization reviews these requests based on specific criteria, such as the focus on Iberian languages for datasets and training data composition for models.
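
The evaluation framework itself is described in the paper; the sketch below only illustrates the general pattern such frameworks follow for classification tasks: prompt an instruction-tuned model, map its answer back onto the label set, and score with macro-F1. The model name, prompt template, and toy examples are assumptions for illustration, not the actual IberBench implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import f1_score

# Placeholder model: any small instruction-tuned LLM works for this sketch.
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)

LABELS = ["negative", "neutral", "positive"]

def classify(text: str) -> str:
    """Ask the model for a label and map its free-text answer to the label set."""
    prompt = (
        "Classify the sentiment of the following Spanish text as "
        "negative, neutral or positive.\n"
        f"Text: {text}\nSentiment:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens.
    answer = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).lower()
    # Fall back to "neutral" if the answer matches no known label.
    return next((label for label in LABELS if label in answer), "neutral")

# Tiny toy evaluation set (illustrative only).
examples = [("Me encanta este producto", "positive"),
            ("El servicio fue terrible", "negative")]
preds = [classify(text) for text, _ in examples]
gold = [label for _, label in examples]
print("Macro-F1:", f1_score(gold, preds, average="macro", labels=LABELS))
```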

An evaluation of 23 LLMs, ranging from 100 million to 14 billion parameters, revealed several key insights:

  • LLMs generally perform worse on industry-relevant tasks than on fundamental ones.
  • Galician and Basque present greater challenges than other Iberian languages for the evaluated models.
  • Some tasks, such as lexical borrowing detection, intent classification, and machine-generated text detection, remain largely unsolved, with top-performing LLMs scoring barely above a random guesser (a quick way to estimate that random baseline is sketched after this list).
  • In other tasks, such as sentiment analysis, humor detection, and fake news detection, LLMs perform better than the random baseline but still worse than dedicated systems developed for shared tasks.
  • Models in the 3.1-10 billion parameter range tend to dominate the leaderboard, and model scale matters most for instruction-tuned models, particularly those exceeding 2 billion parameters.
  • European models focused on Iberian languages are competitive primarily when they are instruction-tuned.
  • Performance varies across Spanish varieties: Peruvian, Costa Rican, and Uruguayan Spanish show lower scores and more outliers than Cuban, Mexican, and European Spanish. Multilingual models, including Basque-tuned ones, sometimes outperform Spanish-specific models across Spanish varieties.
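
To put the "barely above a random guesser" comparison in context, a uniform random baseline on a balanced k-class task lands at roughly 1/k macro-F1. The following sketch estimates that baseline empirically for a hypothetical balanced 3-class task; it is illustrative only and does not reproduce the paper's exact baselines.

```python
import random
from sklearn.metrics import f1_score

# Estimate the macro-F1 of a uniform random guesser on a balanced 3-class
# task (e.g. negative / neutral / positive). With k balanced classes this
# tends toward roughly 1/k, which is the bar the "random guesser"
# comparison refers to.
random.seed(0)
labels = [0, 1, 2]
gold = [random.choice(labels) for _ in range(10_000)]
preds = [random.choice(labels) for _ in range(10_000)]
print(f1_score(gold, preds, average="macro"))  # ≈ 0.33
```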

These findings present a challenge for companies looking to fine-tune LLMs for these languages, and they highlight the need to explore specialized NLP solutions tailored to specific tasks, which can offer greater efficiency and scalability.

To delve deeper into the dataset composition, task details, the model architectures evaluated, detailed performance results across tasks and languages, and the evaluation methodology and its limitations, we encourage you to read the full IberBench paper.

Alvaro Romo Herrero

I transform data and processes into actionable insights using data science and natural language processing to cut through complexity and drive smarter decisions. My consulting and research background helps me bridge strategy with technology, turning analysis into real world impact. I thrive on continuous improvement and cross-functional collaboration because data only matters when it drives decisions and change.
