A Comprehensive Assessment of Lara's Translation Capabilities
To evaluate Lara's performance, we used Lara and several other MT systems to translate 300 sentences from English into the languages most commonly required in localization. We then asked professional translators to assess the accuracy of each translation. We also asked professional translators to flag errors in Lara's translations and in translations produced by their colleagues, specifically translators at the median level and in the top 1% of our network.
Assessing Lara's accuracy against other MT systems
We designed this evaluation to compare the performance of various machine translation engines on real-world, enterprise-level content. Our test set comprised 2,700 translations: 300 English source sentences, each translated by the machine translation systems into nine of the most frequently requested localization languages: Italian, French, Spanish, German, Portuguese, Japanese, Chinese, Russian, and Korean. The accuracy of these machine-generated translations was carefully assessed by professional translators selected for the review process. To ensure objectivity and eliminate bias, we employed a double-blind method: reviewers did not know which machine translation engine produced each translation, and they were not shown other reviewers' evaluations. This approach allowed for an unbiased and fair assessment of each system's performance.
Evaluation Setup
We selected 300 real-world sentences from active translation projects across three industries: tourism, finance, and technology. The evaluation focused on measuring the accuracy of the following machine translation models:
- Lara
- Google Translate
- DeepL
- OpenAI’s GPT-4o (using a 5-shot learning approach, which involves providing five example translations within the prompt to guide and enhance the model’s translation performance; see the sketch below)
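For context, a 5-shot prompt of this kind can be sketched roughly as follows. The actual examples, wording, and request parameters used in the evaluation are not published; the sentence pairs and model call below are purely illustrative.

```python
# Illustrative 5-shot translation prompt (English -> Italian) using the OpenAI
# Python SDK. The example pairs and the target language are placeholders, not
# the material used in the actual evaluation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXAMPLE_PAIRS = [
    ("The invoice is attached.", "La fattura è allegata."),
    ("Our office is closed on Mondays.", "Il nostro ufficio è chiuso il lunedì."),
    ("Please confirm your booking.", "Si prega di confermare la prenotazione."),
    ("The update will be released next week.", "L'aggiornamento sarà rilasciato la prossima settimana."),
    ("Contact support for assistance.", "Contattare l'assistenza per ricevere aiuto."),
]

def build_messages(source_sentence: str) -> list[dict]:
    """Assemble a chat prompt with five example translations followed by the input."""
    messages = [{"role": "system",
                 "content": "Translate the user's sentence from English to Italian."}]
    for english, italian in EXAMPLE_PAIRS:
        messages.append({"role": "user", "content": english})
        messages.append({"role": "assistant", "content": italian})
    messages.append({"role": "user", "content": source_sentence})
    return messages

response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_messages("The meeting has been rescheduled to Friday."),
)
print(response.choices[0].message.content)
```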
Evaluation Process
Selection of professional translators
To assess translation quality, we selected professional translators from a network of 500,000 using T-Rank, an AI-driven ranking system developed by Translated. T-Rank identifies top-performing, domain-qualified translators by evaluating their past performance and expertise across more than 30 criteria. This ensured that the translators selected for the evaluation were highly qualified native speakers of the target languages.
Human evaluation
For each target language, three professional translators, all native speakers, independently reviewed each translated sentence. They did not know which model produced each translation, ensuring an unbiased evaluation.
Majority agreement
If at least two of three translators agreed that a translation was suitable for professional use, the model received one point for that sentence. This method reduced subjectivity and emphasized consensus.
Scoring methodology
The final score for each engine is the percentage of sentences for which a majority of evaluators approved the translation. This approach reflects the consistency and reliability of each MT model in translating professional content.
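To make the scoring concrete, here is a minimal sketch of the calculation; the data layout and names are ours for illustration and do not reflect the actual evaluation tooling.

```python
# Minimal sketch of the majority-agreement scoring described above: each
# sentence receives three independent verdicts (True = suitable for
# professional use), and a model earns a point when at least two agree.
def score_models(judgements: dict[str, list[list[bool]]]) -> dict[str, float]:
    """Map each model to the percentage of sentences approved by >= 2 of 3 reviewers."""
    scores = {}
    for model, per_sentence in judgements.items():
        approved = sum(1 for verdicts in per_sentence if sum(verdicts) >= 2)
        scores[model] = 100.0 * approved / len(per_sentence)
    return scores

# Toy example with three sentences per model (invented verdicts).
example = {
    "Model A": [[True, True, False], [True, True, True], [True, False, True]],
    "Model B": [[True, False, False], [True, True, False], [False, False, True]],
}
print(score_models(example))  # {'Model A': 100.0, 'Model B': 33.33...}
```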
Results
The charts below visualize the performance of the four MT engines in the three domains. Lara achieved the highest accuracy with a score of 65%, while the other models (Google Translate, DeepL, and GPT-4o) scored between 54% and 58%. These results show Lara's consistently stronger performance across domains.
* Percentage of time that at least 2 out of 3 professional translators agreed that a translation was accurate in 2,700 translations from English to Italian, French, Spanish, German, Portuguese, Japanese, Chinese, Russian, and Korean.
Evaluating Lara's accuracy in comparison with professional translators
We track Lara's progress through regular human scoring. One of the primary metrics we use is errors per thousand words (EPT or EPTW). This metric helps us assess translation accuracy by calculating the number of errors per thousand words of translated content. Using EPT, we can objectively measure Lara's performance and identify areas for improvement.
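In other words, EPT = (number of errors flagged ÷ number of translated words) × 1,000. For example, 12 errors flagged in 3,000 translated words (an illustrative figure, not a result from this evaluation) would give an EPT of 4.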
Evaluation Setup
In this evaluation, we focused on user-generated content, including chats, reviews, and product descriptions. We translated the content using Lara and also enlisted professional translators, selected from the median level and the top 1% of our network, to translate the same content without using any machine translation. All translations were subsequently reviewed by professional translators specifically chosen for the review process to flag translation errors.
Evaluation Process
Content Selection
We selected a diverse range of user-generated materials, including chat transcripts, customer reviews, and detailed product descriptions, to comprehensively assess translation performance across different content types.
Translation
The selected content was first translated using Lara. In parallel, we engaged professional translators from our network to translate the same set of content without the assistance of any machine translation tools. These translators were carefully chosen from median performers and the top 1% to ensure a broad representation of human translation quality.
Error detection
Regardless of the method used, all translations underwent a rigorous review process conducted by a separate team of professional translators. These reviewers were specifically selected for their expertise and were tasked with highlighting translation errors without knowing the source of the translations. These errors included issues such as grammatical mistakes, mistranslations, and omissions. This step was applied consistently to both Lara's output and the professional translations.
EPT Calculation
The EPT score was computed by averaging results across all reviewed translations. This score represents the frequency of errors and allows us to monitor improvements in Lara's performance over time.
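As a sketch of the aggregation step, assuming errors are pooled over the total word count (our assumption; the text above only states that results are combined):

```python
# Sketch of computing an aggregate EPT score over a set of reviewed
# translations. Pooling total errors over total words is an assumption,
# not a published detail of the methodology.
def aggregate_ept(results: list[tuple[int, int]]) -> float:
    """results holds (errors_flagged, words_translated) pairs, one per translation."""
    total_errors = sum(errors for errors, _ in results)
    total_words = sum(words for _, words in results)
    return 1000.0 * total_errors / total_words

# Toy example: three reviewed translations (invented numbers).
print(aggregate_ept([(4, 900), (7, 1500), (2, 600)]))  # ~4.33 errors per 1,000 words
```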
Evaluating the Next Version of Lara
We applied the same EPT evaluation process to an alpha version of Lara's next model, planned for 2025. This allowed us to measure early improvements in the new version and compare its performance with the current iteration. Tracking this progress gives us valuable insight into how Lara is advancing toward higher translation accuracy.
Results
The EPT results show Lara's steady improvement in reducing translation errors across multiple domains, and they clearly reflect Lara's progress towards language singularity.
Language has been the most important factor in human evolution. Through language, we can understand each other and work together to build a better future. Complex language has enabled us to advance faster than any other species.
By enabling everyone to understand and be understood in their native language, we are unlocking the next stage of human evolution. We believe in humans.