Did you know Sprout.ai can process claims documents in over 100 languages, from Japanese to Greek?

Insurance companies often need to process claims that involve a variety of documents in multiple languages. Each language presents unique challenges due to different scripts, grammatical structures, and cultural nuances.

At Sprout.ai, our use of Large Language Models (LLMs) allows us to process claims documents written in many different languages, from Japanese to Greek to Arabic. This not only enhances the efficiency of claims processing but also ensures that the information extracted is accurate, reducing the likelihood of errors that could come from manual handling.

This blog post explains how Sprout.ai handles different document types in different languages.

What are Large Language Models (LLMs)?

Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand and generate human-like text after being trained on vast amounts of textual data.

Most LLMs are built on a deep learning architecture known as the transformer, which uses an attention mechanism to process each word in relation to all the other words in a sentence, rather than one word at a time.
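To make the idea of relating every word to every other word concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a toy illustration only: the three-dimensional embeddings are made-up numbers, the learned query, key, and value projections that real transformers use are left out, and nothing here reflects how Sprout.ai's models are actually built.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Every token vector is compared with every other token vector, and
    each output is a weighted average of all the vectors in the sentence.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        output = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings, invented purely for illustration.
tokens = ["the", "claim", "was", "approved"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 1.0],
    [0.0, 0.9, 0.6],
]

outputs, weights = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one word "attends" to every word in the sentence, which is the sense in which a transformer looks at the whole sentence at once.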

This allows LLMs to generate coherent, contextually relevant text based on the input they receive. They are highly effective in tasks such as language translation, content generation, summarisation, and, as in the case of Sprout.ai, processing unstructured data from documents in multiple languages.

Why are some languages more of a challenge?

Some languages pose greater challenges for both human claims handlers and large language models (LLMs) than others due to a variety of linguistic, technical, and data-related factors.

Sprout.ai’s LLMs are trained to process a wide range of languages, providing global coverage.

1. Complex writing systems

Languages with complex scripts or non-Latin alphabets often present challenges for optical character recognition (OCR), the step that converts scanned or photographed documents into text before an LLM can read and process them.

Example: Chinese and Japanese

Both languages use Chinese characters, a logographic writing system with thousands of characters. Japanese also combines them with two syllabic scripts, hiragana and katakana, and both languages make extensive use of homophones. This complexity requires LLMs to not only recognise a vast array of characters but also understand the context to disambiguate meanings effectively.
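As a small illustration of how several scripts can appear in a single phrase, the sketch below counts the scripts in a piece of text using Python's standard `unicodedata` module. The prefix table and the helper names `script_of` and `count_scripts` are our own simplifications for this post, not part of any library, and matching on Unicode character names is only a rough stand-in for the proper Unicode Script property.

```python
import unicodedata
from collections import Counter

# Rough mapping from Unicode character-name prefixes to script labels.
# This table is an assumption made for illustration and is far from complete.
SCRIPT_PREFIXES = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "ARABIC": "Arabic",
    "GREEK": "Greek",
    "CYRILLIC": "Cyrillic",
    "LATIN": "Latin",
}

def script_of(char):
    """Return a script label for one character, or None if unrecognised."""
    name = unicodedata.name(char, "")
    for prefix, script in SCRIPT_PREFIXES.items():
        if name.startswith(prefix):
            return script
    return None

def count_scripts(text):
    """Count how many characters of each recognised script appear in text."""
    labels = (script_of(char) for char in text)
    return Counter(label for label in labels if label is not None)

# "車の事故" is Japanese for "car accident": three Chinese characters
# joined by the hiragana particle "の".
print(count_scripts("車の事故"))
print(count_scripts("Αθήνα"))  # Greek for "Athens"
```

A system that knows which scripts are present can, for example, choose a suitable OCR model before any language understanding takes place.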

2. Morphological complexity

Languages with rich morphological systems, where words are formed by stringing together many prefixes, suffixes, and infixes, can be challenging for LLMs to parse and generate accurately.

Example: Arabic and Turkish

Arabic uses a root-based system where words are formed by applying vowel patterns and affixes to a set of root consonants (usually three) that convey the base meaning. Turkish, as an agglutinative language, forms words by combining a series of suffixes with a base word, creating long word forms that express meanings that might require a full sentence in English.
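A short sketch shows how Turkish suffix stacking packs a whole English phrase into one word. The suffixes are hard-coded in the form they take after the stem "ev" (house); real Turkish selects suffix vowels through vowel harmony, which this toy example does not model, and `build_forms` is a hypothetical helper written for this post.

```python
# Stem and suffixes for the Turkish word "evlerinizden" ("from your houses").
# Suffix forms are hard-coded for this stem; vowel harmony is not modelled.
STEM = "ev"  # "house"
SUFFIXES = [
    ("ler", "plural"),
    ("iniz", "your (plural or formal)"),
    ("den", "from"),
]

def build_forms(stem, suffixes):
    """Return every intermediate word form as suffixes are attached in order."""
    word = stem
    forms = [word]
    for suffix, _gloss in suffixes:
        word += suffix
        forms.append(word)
    return forms

forms = build_forms(STEM, SUFFIXES)
for form in forms:
    print(form)

turkish = forms[-1]
english = "from your houses"
# Splitting on whitespace yields one Turkish word but three English words.
print(len(turkish.split()), len(english.split()))
```

This is why a model cannot simply treat each whitespace-separated word as a single unit of meaning in languages like Turkish: one word can carry several pieces of information that English spreads across a phrase.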

3. Grammatical gender and case systems

Languages with complex case systems and grammatical gender rules require LLMs to learn and apply these rules correctly to avoid grammatical errors in translation and text generation.

Example: Russian and German

Russian has six grammatical cases, with nouns, adjectives, and pronouns changing form depending on their role in a sentence. German uses three genders (masculine, feminine, neuter) and four cases, affecting article and adjective forms. Both require LLMs to understand and apply these rules contextually.

4. Limited availability of training data

LLMs depend heavily on the availability of large datasets for training. Languages that are less represented on the internet and in digitised texts might not provide sufficient data to train models effectively. Documents in these languages can also be harder to translate manually.

Example: African and indigenous languages

Many African languages, such as Swahili or Zulu, and indigenous languages, like Quechua or Maori, historically have less digitised content available. This scarcity limits the ability of LLMs to learn and understand these languages deeply.

5. Dialectal variation

High variability within a language, due to regional dialects or colloquial usage, can complicate understanding and generation tasks for LLMs, which may be trained primarily on ‘standard’ forms of a language.

Example: Spanish and English

Spanish has significant variation between European Spanish and various Latin American dialects in terms of vocabulary, idiomatic expressions, and syntactic structures. English, too, varies widely from British and American to Indian and Singaporean English, each with its own idiomatic and lexical differences.

Types of documents

Insurance claims documents vary significantly in how complex they are to process, depending on how they were produced.

For instance, a handwritten report is often more challenging to handle than one filled out digitally. The handwritten text may vary in legibility, style, and ink consistency, which complicates optical character recognition (OCR) and requires more sophisticated handling by natural language processing (NLP) technologies. In contrast, digital documents typically provide clean, uniform text that is easier for systems to interpret.

Here’s an overview of the document types most commonly processed in insurance claims and the complexities involved. Sprout.ai can process all these documents in seconds.

– Claim forms

These are standard and custom forms filled out by policyholders, containing structured fields and unstructured narrative sections. Digital forms are generally straightforward for OCR to process, but handwritten entries can vary in legibility, requiring advanced interpretation capabilities.

– Medical reports

These documents might include handwritten doctor’s notes, typed summaries, and detailed medical charts, all rich with complex medical terminology. The handwritten notes, in particular, pose a significant challenge due to variations in individual handwriting styles and medical abbreviations.

– Police reports

Typically structured, these reports can also include narrative sections that describe an incident in detail, which are crucial for verifying claims related to theft or accidents. Handwritten sections may require contextual understanding to accurately interpret the information.

– Repair estimates

These documents usually include itemised lists and costs associated with vehicle or property repairs, described using industry-specific jargon. When filled out by hand, the technical terminology combined with potentially unclear handwriting increases the complexity of data extraction.

– Legal documents

These include court letters, lawyer correspondence, and legal notices, and are often written in dense legal language. The precision required to interpret these texts accurately makes both digital and handwritten versions complex, with the latter adding the challenge of deciphering handwriting.

– Emails

These are exchanges between claimants, insurers, or third parties that may contain important claim information in an informal layout. While usually digital, the informal and varied nature of email text requires sophisticated processing to extract relevant data accurately.

How Sprout.ai processes documents compared to a human claims handler

A claim involving a car accident in Japan

Sprout.ai: Uses OCR to digitise documents related to the accident in Japan, including handwritten notes. NLP is then employed to interpret the text and extract crucial details such as the date of the accident, the nature of damages, and the estimated repair costs.

Human claims handler: May need translation assistance to understand documents in Japanese and manually extract relevant information, which can be time-consuming and prone to errors if nuances are missed.
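To give a feel for the extraction step, here is a deliberately simplified sketch that pulls a date and a repair cost out of Japanese text that has already been through OCR. It uses regular expressions as a stand-in for the NLP stage and is not Sprout.ai's actual method; the sample text, the field names, and the `extract_fields` helper are all invented for illustration.

```python
import re
from datetime import date

# Japanese dates are commonly written as <year>年<month>月<day>日,
# and amounts in yen as <number>円.
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
YEN_PATTERN = re.compile(r"(\d[\d,]*)円")

def extract_fields(ocr_text):
    """Extract the first date and the first yen amount found in the text.

    Naive by design: it assumes the first date is the accident date and
    the first amount is the repair cost, which real documents will not
    always respect.
    """
    fields = {}
    date_match = DATE_PATTERN.search(ocr_text)
    if date_match:
        year, month, day = map(int, date_match.groups())
        fields["accident_date"] = date(year, month, day)
    yen_match = YEN_PATTERN.search(ocr_text)
    if yen_match:
        fields["repair_cost_yen"] = int(yen_match.group(1).replace(",", ""))
    return fields

# Hypothetical OCR output: "Accident date: 15 March 2024" and
# "Repair cost: 120,000 yen".
sample = "事故日: 2024年3月15日\n修理費用: 120,000円"
result = extract_fields(sample)
print(result)
```

Fixed patterns like these break down quickly on handwritten notes, free-form narratives, and unusual layouts, which is where language models that understand context become useful.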

A health insurance claim in Spain

Sprout.ai: Converts both handwritten and typed medical reports and prescriptions in Spain into machine-readable text. NLP identifies and extracts medical conditions, prescribed treatments, and billing amounts.

Human claims handler: Must be fluent in medical Spanish to accurately interpret the terminology and handwriting, potentially requiring consultation with medical experts.

A property insurance claim from the Middle East

Sprout.ai: Accurately captures and contextualises text from legal documents and contractor estimates submitted in Arabic. It recognises specific terms related to the insurance policy coverages.

Human claims handler: Faces challenges such as reading right-to-left script and complex legal jargon, often needing language experts to ensure accuracy.

A travel insurance claim from a holiday in Greece

Sprout.ai: Efficiently processes documents such as medical emergency reports from Greece, automatically translating and extracting key data like treatment details and associated costs.

Human claims handler: May face significant delays if not proficient in Greek, relying on translators and risking inaccuracies in emergency medical contexts.

Conclusion

Our use of LLMs, alongside technologies such as OCR and NLP, provides a significant advantage compared to human claims handlers, especially when dealing with documents in multiple languages and scripts.

This not only speeds up processing but also enhances accuracy by reducing the potential for human error. In contrast, human handlers often require language expertise and additional resources to manage the same tasks, which can introduce delays and potential inaccuracies in international settings.