Most multilingual websites parse the end user's query in a non-English language, normalize the text and then map the term to an English-driven taxonomy, leaving room for errors in direct translations. Blueprint's Localization team examines better ways to maximize search results for non-English markets.
We are all used to going to a website and typing a few words to retrieve information or images, or simply to find an item we want to buy. The process may seem straightforward to an end user, but hidden complexities determine whether this type of website search works well across languages. Non-English searches need a significant number of adaptations and dedicated localization approaches to handle differences in morphology and written scripts, as well as conceptual and cultural differences.
The search space is one of the most interesting and lesser-known areas of localization. It sits at the crossroads of search, taxonomies and translation, and it requires hybrid expertise in library science and translation to produce results relevant to the end user.
These complexities arise because most search systems are conceived for English and have to be customized for other languages. Most multilingual sites parse the end user’s query in a non-English language, then modify the query in certain ways (for instance, normalizing it to remove accent marks) to facilitate processing. After that, they map the normalized query term to an English-driven taxonomy whose keywords are localized into multiple languages. The matching localized keyword leads to the corresponding concept, which is tied to the unique identifier of the relevant asset. That asset is then retrieved and presented to the end user.
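The route described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword table, the concept IDs, the asset IDs and the function names are all hypothetical, and a real system would likely do more than strip accents during normalization.

```python
import unicodedata

# Hypothetical localized taxonomy: (language, normalized keyword) -> concept ID.
LOCALIZED_KEYWORDS = {
    ("es", "rompecabezas"): "C001",
    ("es", "puzle"): "C001",
    ("fr", "ecole"): "C002",
}

# Hypothetical asset index: concept ID -> asset identifiers.
ASSETS_BY_CONCEPT = {
    "C001": ["asset-17", "asset-42"],
    "C002": ["asset-08"],
}


def normalize(query: str) -> str:
    """Trim, lowercase and strip accent marks from a query."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", query.strip().lower())
    # Dropping the combining marks leaves only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query: str, lang: str) -> list[str]:
    """Map a localized query to a concept, then to the assets tied to it."""
    concept_id = LOCALIZED_KEYWORDS.get((lang, normalize(query)))
    if concept_id is None:
        return []  # no localized keyword matches, so the route is broken
    return ASSETS_BY_CONCEPT.get(concept_id, [])
```

In this sketch, a query whose concept is absent from the English-driven taxonomy, such as “Schultüte,” simply returns an empty list. Note also that blanket accent stripping can merge distinct words in some languages, which is one reason normalization usually needs per-language tuning.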
In this common setup, the route going from the end user’s query, through the localized keywords and ending in the asset does not always work perfectly. This is because there are many conceptual, linguistic and cultural differences among languages. Here are a few examples of problem areas that often produce inaccurate results in an English-centric setup, even though they contain accurately translated keywords and good taxonomies.
Conceptual issues: In Europe, an image search for “family” typically shows results with a couple of children; in parts of Africa and the Middle East it could show more children. But in China during the one-child policy period, you would expect, for the most part, to get search results with families with only one child. In other words, even a concept seemingly as simple as family can require adaptations for different countries or markets.
Missing concepts: Some cultures have unique concepts, like “Schultüte” in Germany. That is a large, colorful cone full of school supplies, sweets and little presents given to kids when they are about to start their very first day of school. This word has no equivalent in English or in many other cultures and languages, so an English-centric database will not have an entry for this concept, making it impossible to add localized keywords for it. This often clouds the search with wrong or irrelevant results.
Absence of synonyms or missing tags: English concepts often have multiple equivalents in a target language. For instance, Spanish users looking for a “puzzle” will use the Spanish translations “puzle” and “rompecabezas” interchangeably. Both terms can be mapped during a search as long as the taxonomy includes both synonyms. But to fully leverage the translated synonyms in the taxonomy, every single puzzle asset must also be tagged with the English keyword “puzzle.” Taxonomies, however, are very large databases that grow continuously, so their localization is almost never complete. To compensate for taxonomy shortcomings, it is common to pull more results by matching searches against the text in descriptions, captions, footnotes and other unstructured or free text associated with the assets.
Ambiguity issues: The last example of problematic searches is when a user includes homographs, or words with multiple meanings, like “bridge” in English. If the system lacks an effective disambiguation mechanism, these searches can pose a challenge for any language and return inaccurate results.
Localization and logic operators
From the perspective of the search engine, a set of best practices can mitigate or solve many of those issues. First, ensure that all assets are tagged with English keywords. Second, make sure the keywords in the taxonomy are fully localized. Third, create language-specific concepts where needed. Fourth, provide disambiguation prompts so users can clarify their search. Last, and probably most effective, convert complex inputs from international users into Boolean searches. This is one of the most flexible and clever localization features I have used. It simply means transforming the end user’s original query at runtime into a compound search sequence that includes logical operators (AND, OR, NOT) as defined by the English mathematician George Boole in his book The Laws of Thought.
Boolean conversions can be leveraged to improve both recall and precision, which are the cornerstones of search metrics. Precision is the share of returned results that are actually relevant: higher precision means fewer but more accurate results, with less noise from irrelevant output. Recall is the share of all relevant assets that the search returns: higher recall means a wider set of results, which is useful when you don’t have enough assets to show and want to drive users toward a related yet relevant set of results.
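For readers who prefer to see the two metrics computed, here is a small sketch using their standard definitions. The function name and the sample asset IDs are made up for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one search.

    retrieved: asset IDs the search returned.
    relevant:  asset IDs that should have been returned.
    """
    hits = len(retrieved & relevant)  # relevant assets that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four results returned, two of them relevant, out of three relevant assets overall.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here precision is 2 out of 4 returned results, and recall is 2 out of 3 relevant assets. Narrowing a query with AND or NOT tends to raise the first number, while widening it with OR tends to raise the second.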
This is how Boolean conversions work:
When users in mainland China search for “family,” they expect images of a small family with one or two children. Instead, they may receive images of large families, perhaps even the Octomom family. To prevent that type of result, a Chinese localizer can convert the query for “family” into (family AND (“1_child” OR “2_children”)). Doing so ensures the results show the kind of small family users in China expect to see.
Now consider users in Germany looking for a “Schultüte” image. Since the concept does not exist in English, this search may generate no results, or very few, based solely on free text matches (if free text matching is activated at all). One fix is for the database developers to create a German-specific concept in the taxonomy. But a simpler and cleverer alternative is to transform the German query on the backend into a sequence equivalent to ((school AND cone) NOT “traffic_cone”). Note the use of NOT to exclude more common cones irrelevant to this search.
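To make the mechanics concrete, here is a minimal sketch of how compounds like the two above could be evaluated against asset tags. It assumes that expressions are stored as nested tuples and that each asset carries a set of English tags. The asset IDs, the sample tags and the function names are hypothetical, and a production search engine would normally push these operators down into its own query language.

```python
# An expression is either a tag (str) or a tuple: ("AND", ...), ("OR", ...), ("NOT", x).
# The binary form (A NOT B) used in the text is written here as ("AND", A, ("NOT", B)).
FAMILY_ZH = ("AND", "family", ("OR", "1_child", "2_children"))
SCHULTUETE_DE = ("AND", "school", "cone", ("NOT", "traffic_cone"))

# Hypothetical assets with their English tags.
ASSETS = {
    "asset-1": {"school", "cone", "child"},
    "asset-2": {"cone", "traffic_cone", "road"},
    "asset-3": {"school", "cone", "traffic_cone"},
    "asset-4": {"family", "1_child"},
    "asset-5": {"family", "8_children"},
}


def matches(expr, tags: set[str]) -> bool:
    """Evaluate a Boolean expression against one asset's tags."""
    if isinstance(expr, str):
        return expr in tags
    op, *args = expr
    if op == "AND":
        return all(matches(arg, tags) for arg in args)
    if op == "OR":
        return any(matches(arg, tags) for arg in args)
    if op == "NOT":
        return not matches(args[0], tags)
    raise ValueError(f"unknown operator: {op}")


def run(expr, assets: dict[str, set[str]]) -> list[str]:
    """Return the IDs of all assets that satisfy the expression."""
    return sorted(aid for aid, tags in assets.items() if matches(expr, tags))
```

With this sample data, the Schultüte compound keeps only asset-1, because asset-2 lacks the school tag and asset-3 is excluded by the NOT clause. The family compound keeps asset-4 and drops the large family in asset-5.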
A Spanish user searching for puzzles with one of the translations, “rompecabezas” or “puzle,” would find every relevant asset only if both translations were in the localized taxonomy, all the assets were tagged, and the search also matched assets with either translation in free text. If not all the assets are tagged, if only one synonym was entered in the taxonomy or if the search does not expand to match free text, the results could miss many assets. In this situation, the localizer can convert the queries for both “puzle” and “rompecabezas” into the Boolean compound (puzle OR rompecabezas). The results for both queries would then include any asset containing either word, in free text or in the taxonomy, casting the widest net to capture all relevant results.
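The recall gain from the OR compound can be illustrated with a toy example. The sketch below assumes an incomplete taxonomy in which only “rompecabezas” has been localized, plus a handful of invented assets, one of which is untagged. All names and data are hypothetical.

```python
# Hypothetical, incomplete localized taxonomy: "puzle" was never entered.
LOCALIZED = {"rompecabezas": "puzzle"}

# Hypothetical assets: English tags plus free text such as a caption.
ASSETS = [
    {"id": "a1", "tags": {"puzzle"}, "text": "Niños armando un rompecabezas"},
    {"id": "a2", "tags": set(), "text": "Puzle de 1000 piezas"},  # untagged asset
    {"id": "a3", "tags": {"puzzle"}, "text": ""},
    {"id": "a4", "tags": {"chess"}, "text": "Tablero de ajedrez"},
]


def search_any(terms, assets, localized):
    """OR search: keep an asset if any term matches its tags or its free text."""
    hits = []
    for asset in assets:
        words = set(asset["text"].lower().split())
        if any(localized.get(term) in asset["tags"] or term in words for term in terms):
            hits.append(asset["id"])
    return hits


single = search_any(["puzle"], ASSETS, LOCALIZED)
expanded = search_any(["puzle", "rompecabezas"], ASSETS, LOCALIZED)
```

With this data, the single-term search for “puzle” finds only the untagged asset a2 through its caption, while the expanded search also reaches a1 and a3 through the taxonomy.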
Similarly, when a search uses homographs, as in “game bridge,” the user most likely wants images of the card game. But free text matching can return images of a deer or another type of game animal next to a road bridge. In that case, Boolean conversions can be used in any language to improve the precision of the search results. The localizer may convert end users’ queries like “game of bridge,” “bridge game” and “bridge card game” into ((bridge NOT (“dental_bridge” OR “road_bridge”)) AND “card_game”). The results would be more accurate than those of a blunt free text search over descriptions, captions, footnotes, etc.
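The conversion step itself can be as simple as a per-locale lookup table maintained by localizers. The following sketch assumes such a table; the rule format, the locale codes and the function name are hypothetical, and the bridge compound is written with the same explicit grouping as the Schultüte example.

```python
# Hypothetical Boolean compounds maintained by localizers.
PUZZLE = "(puzle OR rompecabezas)"
BRIDGE = '((bridge NOT ("dental_bridge" OR "road_bridge")) AND "card_game")'
SCHULTUETE = '((school AND cone) NOT "traffic_cone")'

# (locale, normalized query) -> Boolean compound sent to the search backend.
REWRITE_RULES = {
    ("es", "puzle"): PUZZLE,
    ("es", "rompecabezas"): PUZZLE,
    ("en", "game of bridge"): BRIDGE,
    ("en", "bridge game"): BRIDGE,
    ("en", "bridge card game"): BRIDGE,
    ("de", "schultüte"): SCHULTUETE,
}


def rewrite(query: str, locale: str) -> str:
    """Return the Boolean compound for a query, or the query itself if no rule applies."""
    # Lowercase and collapse whitespace so trivial variants hit the same rule.
    key = (locale, " ".join(query.lower().split()))
    return REWRITE_RULES.get(key, query)
```

Several phrasings map to the same compound, so rewrite("  Bridge   Game ", "en") and rewrite("game of bridge", "en") produce identical searches, while a query without a rule passes through unchanged.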
Boolean conversion is a very smart and extremely flexible mechanism for search localization. It helps queries in all languages overcome a wide range of issues and drastically improve both precision and recall, the top two measures in the search world. Knowing how Boolean conversions work to achieve better search results is also useful in other contexts. When our localization team at Blueprint was recently asked to localize tags for software products to enable filtering by keyword and browsing by facet, it was exactly this kind of hybrid expertise that informed our approach to provide high-quality localization.
The Blueprint Localization team is a hidden gem in the U.S. West Coast technology landscape. Its collective knowledge, especially the combination of linguistic, cultural and research expertise, is a game-changing differentiator for the localization industry. Are you interested in leveraging your technical and language skills as part of a talented and quality-driven team? Check out our Careers page to learn more.