Spelling Standardization: How Many Ways Can You Spell a Word?

Published on
May 3, 2021

Spelling every word the same in your training data is harder than you might think, but it’s essential to producing reliable data for your machine learning models.

Here’s a topic that’s near and dear to our hearts here at 色虎视频: spelling standardization.

If you’re providing training data to a computer system to produce machine translation, speech recognition, or a computer voice, it’s important to spell each word the same way every time it comes up. Otherwise, you’re watering down your training data and the language model gets confused.

Even if you’re not going that high-tech and you just want reliable search through your database of client questions or your fieldwork notes, spelling words consistently matters. Standardized spelling matters!

This is especially true for the kind of annotation we do at 色虎视频, so we’re a little biased. Human-annotated data is, by definition, entered by a human being, and every person has different dialects, habits, and styles. Spelling a word correctly is vital to consistent, reliable data. How hard can that be? Every word has a “correct” way it should be spelled, right? Just look it up in the dictionary if you’re not sure.

Oh boy. Follow us down the rabbit hole.

Dialects

Here’s a problem straight away: is it “standardisation” or “standardization”? This one’s region-based, so it’s not too hard to settle on the relevant spelling for your database. But some cases may be more complex than this: in Norwegian there are two entirely separate written standards (Bokmål and Nynorsk) intended to reflect different sets of dialects.

Usually this area isn’t too hard: you decide in advance which spelling convention to follow for your chosen language and dialect. Stray spellings from other systems can then be identified through automated checks and corrected in post-editing.
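
As a rough illustration of what such an automated check might look like, here is a minimal Python sketch. It assumes the project has chosen US spellings and keeps a small lookup table of off-convention variants; the table contents and the names `VARIANT_TO_STANDARD` and `find_stray_spellings` are hypothetical, and a real project would load a far larger list for its own language and dialect.

```python
import re

# Hypothetical lookup table: off-convention spelling -> chosen convention.
# (Assumption: the project standardizes on US English spellings.)
VARIANT_TO_STANDARD = {
    "standardisation": "standardization",
    "colour": "color",
    "organise": "organize",
}

# Runs of letters only (no digits, underscores, or punctuation).
WORD_RE = re.compile(r"[^\W\d_]+")


def find_stray_spellings(text):
    """Return (position, found_word, suggested_spelling) for each stray word."""
    hits = []
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        suggestion = VARIANT_TO_STANDARD.get(word.lower())
        if suggestion is not None:
            hits.append((match.start(), word, suggestion))
    return hits


if __name__ == "__main__":
    sample = "Standardisation of colour names helps search."
    for position, found, suggested in find_stray_spellings(sample):
        print(position, found, "->", suggested)
```

The check only flags candidates and leaves the actual correction to post-editing, since a word that looks like a stray spelling may be a name or a quotation that should stay as written.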

Register

Is it “gonna”, “goin’ a”, “gon’ to”, or “going to”? This one’s more difficult: the last spelling is the formally correct option, but in some cases it can be a long way removed from the sounds coming from a person speaking. What if you need to search later for one of the less standard pronunciations of a phrase? How do you separate the pronunciations in your lexicon, if you’re producing a speech database?

In some cases, the difference may be minimal enough that you can standardize to the dictionary form. In others it may be more sensible to adopt an informal representation.

No matter how you choose to approach the subject, the conclusion is the same: pick one representation for each form and apply it every time. Standardization is vital.
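
One possible way to get both benefits, sketched below in Python, is to keep the verbatim transcription and store it under a standardized key, so a search for the dictionary form still finds every informal variant. This is only an illustration: the table `INFORMAL_TO_FORMAL` and the functions `standardize` and `build_index` are hypothetical names, and the three entries are just the examples from this post.

```python
# Hypothetical table: informal spelling -> dictionary form.
INFORMAL_TO_FORMAL = {
    "gonna": "going to",
    "goin' a": "going to",
    "gon' to": "going to",
}


def standardize(phrase):
    """Map a transcribed phrase to its standardized (dictionary) form."""
    # Lowercase and treat curly and straight apostrophes as the same character.
    key = phrase.lower().replace("\u2019", "'")
    return INFORMAL_TO_FORMAL.get(key, key)


def build_index(transcribed_phrases):
    """Group verbatim transcriptions under their standardized form."""
    index = {}
    for phrase in transcribed_phrases:
        index.setdefault(standardize(phrase), set()).add(phrase)
    return index


if __name__ == "__main__":
    index = build_index(["gonna", "going to", "Gon' to"])
    # Every variant is reachable through the dictionary form.
    print(sorted(index["going to"]))
```

Keeping the verbatim form alongside the standardized key means the choice between formal and informal representation does not have to discard information that a speech lexicon may need later.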

Low-resource languages

It’s all well and good to refer to a dictionary, but some languages don’t have such handy arbiters of spelling. 色虎视频 has worked with Australian and Papua New Guinean languages with no written tradition at all, with languages such as KiSwahili where many alternate spellings may be equally acceptable, and with languages where spelling reform is recent or incomplete. It can be difficult to build a team to work in regions with fewer speakers or less ready access to the Internet.

The key here is often working with university researchers and linguists. At the same time, it’s important to achieve consensus on acceptable spellings through consultation with speakers of the language living in their communities. You may find your database contributes to giving speakers of the language new access to writing resources!

Codepoints

Even when the spelling of the word is totally clear, we can run into trouble. Take a look at these two words:

café    саfé

How many letters do these have in common? To a human, the whole thing. To a computer? Only the “f”! The “c” and the “a” on the right come from the Cyrillic alphabet, and the “é” on the right is made out of two characters instead of one (a plain “e” followed by a combining accent).

Codepoint errors, as these are known, will look just fine when you’re reading them, but if you search your database, the text editor isn’t going to find all instances of the word you searched for. It’s even more trouble when your database is for automatic speech recognition or a speech synthesis program: the alternate spelling might not show up in your lexicon and the whole segment of audio could be discarded!

Okay, so it’s fairly unlikely that someone’s going to be entering Cyrillic characters in your Latin-alphabet database, but for some languages there really are ambiguous cases, identical to a human eye but distinct to a computer. That’s the case for “é” shown above, and it’s widespread in many other writing systems too. In Arabic, for example, the letters in the main Unicode range also have separate equivalent “presentation form” characters, so ‘beh’ may appear as ب or as ﺏ, and the same invisible variation can occur for every letter in the Arabic alphabet.

So if these kinds of errors are so persistent, and so hard for the human eye to catch, what can be done to mitigate them? Standardization only works if everybody is working from the same sources. In cases like this, a few automated scripts that normalize or flag the problem characters can back up a human-centric process.
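
To make the idea concrete, here is a minimal Python sketch of two such checks, using only the standard library’s `unicodedata` module. It is an illustration under stated assumptions, not a description of any particular production pipeline: the function names `normalize_text` and `scripts_used` are hypothetical, and guessing a script from the first word of a character’s Unicode name is a rough heuristic.

```python
import unicodedata


def normalize_text(text, fold_compatibility=False):
    """Normalize text to a single codepoint representation.

    NFC composes sequences such as 'e' + combining acute into one 'é'.
    NFKC additionally folds compatibility characters, which includes
    Arabic presentation forms, back to their base letters.
    """
    form = "NFKC" if fold_compatibility else "NFC"
    return unicodedata.normalize(form, text)


def scripts_used(word):
    """Rough script guess: first word of each letter's Unicode name."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


if __name__ == "__main__":
    composed = "caf\u00e9"                 # 'é' as one codepoint
    decomposed = "cafe\u0301"              # 'e' + combining acute accent
    lookalike = "\u0441\u0430fe\u0301"     # Cyrillic 'с', 'а', then 'f', 'e', accent

    print(composed == decomposed)                   # the raw strings differ
    print(composed == normalize_text(decomposed))   # equal once normalized
    print(scripts_used(normalize_text(lookalike)))  # more than one script: flag it

    beh_isolated = "\ufe8f"                # Arabic presentation form of 'beh'
    print(normalize_text(beh_isolated, fold_compatibility=True) == "\u0628")
```

Normalization fixes the “two characters for one letter” case automatically, while the mixed-script check only flags a word for a human to review. NFKC also folds other compatibility characters (ligatures and full-width forms, for example), so whether to apply it across a whole database is a decision to make per language.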

Quite a bit to take in, right?

These are just a few of the challenges you face when you’re working on transcripts and text databases. We hope you discovered some new things about the trials and tribulations of maintaining all this text. At 色虎视频, we’ve helped clients all over the world tackle these issues. If you’d like to discuss how we can help you or your organization, we’d be happy to hear from you!
