An Extensive Benchmark of LLMs and NLP Models for Iberia and Ibero-American Languages

The evaluation of Large Language Models (LLMs) presents significant challenges, particularly for languages other than English, where high-quality evaluation data is often scarce. Existing benchmarks and leaderboards are predominantly English-centric, and those that do address other languages often overlook the diversity of language varieties, prioritise fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static in nature.

To address these limitations, a group of researchers, including the author, has published IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks in languages spoken across the Iberian Peninsula and Ibero-America.

IberBench is a large multilingual and multi-variety benchmark covering Spanish, Portuguese, Catalan, Basque, Galician, and English, as well as the Mexican, Uruguayan, Peruvian, Costa Rican, and Cuban varieties of Spanish. It comprises 101 datasets spanning 22 task types, such as sentiment and emotion analysis, toxicity detection, machine-generated text detection, and commonsense reasoning. The datasets are drawn from shared tasks at evaluation campaigns such as IberLEF, IberEval, TASS, and PAN, as well as from more recent general-purpose LLM benchmarks. Notably, IberBench standardizes many of these workshop-sourced datasets, making them easier to discover and reuse.

The benchmark differentiates between fundamental tasks, which evaluate core language proficiency and knowledge, and industry-relevant tasks, which have economic significance like content moderation or customer insights. Industry-relevant tasks primarily originate from workshops, while fundamental tasks often come from established LLM benchmarks, highlighting a gap in current evaluation practices.

The benchmark’s architecture consists of four key components: the Leaderboard UI, an organization of NLP specialists, the datasets, and the LLM evaluation framework. The Leaderboard UI, hosted on HuggingFace Spaces, serves as the main interface for users to view rankings, plots, and reports, and to request the evaluation of new models or propose new datasets. The organization reviews these requests based on specific criteria, such as the focus on Iberian languages for datasets and training data composition for models.
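To make the evaluation step concrete, here is a minimal sketch of how a classification task in such a framework might be scored. IberBench's actual evaluation code is not reproduced here; the metric shown (macro-averaged F1, a common choice for the shared tasks mentioned above) and the toy sentiment data are assumptions for illustration only.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores,
    so minority classes count as much as majority ones."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy sentiment-analysis split (hypothetical data, not from IberBench).
gold = ["pos", "neg", "neg", "pos", "neu", "neg"]
pred = ["pos", "neg", "pos", "pos", "neu", "neu"]
print(round(macro_f1(gold, pred, ["pos", "neg", "neu"]), 3))  # → 0.656
```

A leaderboard like the one described would run each submitted model over every dataset's test split and aggregate per-task scores of this kind into the rankings shown in the UI.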

An evaluation of 23 LLMs, ranging from 100 million to 14 billion parameters, revealed several key insights:

  • LLMs generally perform worse on industry-relevant tasks than on fundamental ones.
  • Galician and Basque present greater challenges than other Iberian languages for the evaluated models.
  • Some tasks, such as lexical borrowing detection, intent classification, and machine-generated text detection, remain largely unsolved, with top-performing LLMs scoring barely above a random guesser.
  • In other tasks, such as sentiment analysis, humor detection, and fake news detection, LLMs perform better than the random baseline but still worse than dedicated systems developed for shared tasks.
  • Models in the 3.1–10 billion parameter range tend to dominate the leaderboard, and model scale matters most for instruction-tuned models, particularly those exceeding 2 billion parameters.
  • European models focused on Iberian languages are competitive primarily when they are instruction-tuned.
  • Spanish varieties show varying performance, with some (like Peruvian, Costa Rican, Uruguayan) showing lower performance and more outliers compared to others (like Cuban, Mexican, Spanish from Spain). Multilingual models, including Basque-tuned ones, sometimes outperform Spanish-specific models across Spanish varieties.
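To put "barely above a random guesser" in perspective, the sketch below estimates the accuracy a uniform random guesser achieves on a balanced binary task such as machine-generated text detection. The data and simulation are illustrative assumptions, not the paper's actual baseline protocol.

```python
import random

random.seed(0)  # fixed seed so the estimate is reproducible

def random_baseline_accuracy(labels, num_classes, trials=2000):
    """Estimate the expected accuracy of a uniform random guesser
    by averaging over many simulated guessing runs."""
    classes = list(range(num_classes))
    total = 0.0
    for _ in range(trials):
        guesses = [random.choice(classes) for _ in labels]
        total += sum(g == y for g, y in zip(guesses, labels)) / len(labels)
    return total / trials

# Hypothetical balanced binary split (e.g. human vs. machine-generated text).
gold = [0, 1] * 50
print(round(random_baseline_accuracy(gold, 2), 2))  # ≈ 0.50
```

On such a task the random baseline sits at about 0.5 accuracy, so a top model scoring only a few points above it has learned very little that is task-specific.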

These results present a challenge for companies looking to fine-tune LLMs for these languages, and they highlight the need to explore specialized NLP solutions tailored to specific tasks, which can offer greater efficiency and scalability.

To delve deeper into the dataset composition, task details, evaluated model architectures, detailed performance results across tasks and languages, and the evaluation methodology and its limitations, we encourage you to read the full IberBench paper.

Alvaro Romo Herrero
Data Scientist at Keepler

I transform data and processes into actionable insights using data science and natural language processing to cut through complexity and drive smarter decisions. My consulting and research background helps me bridge strategy with technology, turning analysis into real world impact. I thrive on continuous improvement and cross-functional collaboration because data only matters when it drives decisions and change.
