TL;DR
Portugal announced a €5.5 million investment in AMÁLIA, a large language model dedicated to European Portuguese. The project is promising but faces open questions about training data, openness, and benchmarking. Next steps include releasing the model weights and expanding Portuguese-specific data.
Portugal’s government announced a €5.5 million investment in AMÁLIA, a large language model designed specifically for European Portuguese, marking a significant step in regional NLP development. The project brings together leading Portuguese universities and research labs with the aim of creating an open-source model that prioritizes Portuguese-language data.
AMÁLIA is a collaborative effort among top Portuguese research institutions, including NOVA, IST, IT, and FCT. It builds on EuroLLM, adapted to put more weight on Portuguese, chiefly through additional training on Portuguese data sources such as Arquivo.pt. The project aims to produce a fully open-source model, but for now the model weights, training logs, and datasets remain unreleased, which limits external validation.
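The training code has not been published, but the recipe described (continued pretraining of an existing base model on a Portuguese-heavy corpus) would look roughly like the sketch below. The EuroLLM-1.7B checkpoint ID matches the publicly listed Hugging Face release; the corpus file, sequence length, and hyperparameters are placeholders, and none of this is AMÁLIA’s actual pipeline.

```python
# Minimal continued-pretraining sketch; illustrative, not the AMÁLIA pipeline.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "utter-project/EuroLLM-1.7B"  # public EuroLLM base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Hypothetical plain-text Portuguese corpus (e.g. cleaned Arquivo.pt dumps).
corpus = load_dataset("text", data_files={"train": "pt_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="amalia-sketch",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```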
The team trained AMÁLIA on a total of 107 billion tokens, roughly 5.8 billion of which came from Arquivo.pt (about 5.4% of the total). Supervised fine-tuning relied on synthetic Portuguese data, and Portuguese is estimated to make up 17-18% of the overall training mix. Despite promising benchmark results, the model still trails larger models such as Qwen3-8B on certain Portuguese-specific tests, including the ALBA benchmark.
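For readers who want to sanity-check the reported proportions, the arithmetic is straightforward; the snippet below uses only the figures cited above (the 17-18% figure is the coverage’s estimate for Portuguese from all sources, not an official number).

```python
# Back-of-the-envelope check of AMÁLIA's reported data mix.
TOTAL_TOKENS = 107e9     # total training tokens
ARQUIVO_TOKENS = 5.8e9   # tokens sourced from Arquivo.pt

print(f"Arquivo.pt share: {ARQUIVO_TOKENS / TOTAL_TOKENS:.1%}")  # ~5.4%

# Estimated Portuguese tokens overall, at 17-18% of the mix.
for share in (0.17, 0.18):
    print(f"{share:.0%} of mix = {share * TOTAL_TOKENS / 1e9:.1f}B tokens")
# 17% of mix = 18.2B tokens; 18% of mix = 19.3B tokens
```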
Why It Matters
The project marks Portugal’s strategic effort to establish a native, regionally focused NLP model, which matters for local applications, government services, and cultural preservation. Its emphasis on open-source resources aligns with global trends toward transparency and community-driven AI development, but until the model weights and datasets are publicly available, community engagement and independent benchmarking will remain limited.
The project also highlights the data scarcity that models for smaller language communities face, raising the question of how much Portuguese data is needed for meaningful performance gains. It likewise underscores the need for benchmarks that measure intrinsic knowledge about Portugal, beyond purely linguistic and bias assessments.
Background
In 2024, several European countries invested in regional language models, Italy’s Minerva among them, underscoring the importance of local languages in AI. Portugal’s investment follows this pattern but faces distinct challenges given the limited size of European Portuguese data sources and the language variety’s regional scope. Earlier efforts such as EuroLLM laid the groundwork; AMÁLIA aims to be a dedicated, Portuguese-focused successor. The project’s emphasis on open resources echoes broader debates about transparency in AI research, especially for smaller language communities.
“AMÁLIA aims to treat European Portuguese as a first-class citizen in NLP, leveraging Portuguese data sources extensively.”
— Research team member
“Despite promising benchmarks, the lack of open weights and datasets raises questions about the model’s accessibility and real-world utility.”
— Hacker News analysis
What Remains Unclear
It remains unclear when model weights and datasets will be publicly released, as well as how much Portuguese data is sufficient for optimal performance. The effectiveness of AMÁLIA compared to larger models on real-world Portuguese tasks has yet to be fully demonstrated, and the impact of the current data limitations is still uncertain.
What’s Next
Next steps include the release of model weights and datasets, further benchmarking with Portuguese-specific tasks, and exploring ways to incorporate more Portuguese data into training. Continued collaboration between institutions and transparency about progress will be critical for the model’s adoption and impact.
Key Questions
Will the AMÁLIA model weights be publicly available?
As of now, the weights have not been released, but the team has indicated that they may share them in the future. The current focus is on completing benchmarks and expanding data sources.
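If and when the weights do land on a public hub, loading them should look like any other checkpoint. The repository ID below is purely hypothetical, since nothing has been published yet.

```python
# Hypothetical usage once (and if) AMÁLIA weights are released.
# "amalia-lm/amalia-7b" is a placeholder ID, not a real repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amalia-lm/amalia-7b")
model = AutoModelForCausalLM.from_pretrained("amalia-lm/amalia-7b")

# European Portuguese prompt: "The Portuguese language is"
inputs = tokenizer("A língua portuguesa é", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```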
How much Portuguese data was used in training AMÁLIA?
Approximately 5.8 billion tokens came from Arquivo.pt, about 5.4% of the 107 billion total training tokens; Portuguese from all sources is estimated at 17-18% of the mix. That share is still small relative to the full dataset.
How does AMÁLIA compare to other models like Qwen 3-8B?
AMÁLIA outperforms Qwen3-8B on most Portuguese benchmarks but still trails it on some, such as ALBA. This suggests room for improvement, particularly through more Portuguese-specific training data.
What are the main challenges facing AMÁLIA’s development?
Key challenges include limited Portuguese data sources, the need for more transparent open resources, and creating benchmarks that better measure Portuguese cultural and factual knowledge.