Convert accented Unicode characters to non-accented in Python
A guide on how to convert accented Unicode characters in the Vietnamese alphabet to non-accented letters using Python. The code relies only on the standard library's unicodedata module, so Vietnamese text can be processed without any third-party packages.
import unicodedata

def remove_accents(text):
    # Đ/đ (U+0110/U+0111) have no Unicode decomposition, so map them explicitly
    text = text.replace('Đ', 'D').replace('đ', 'd')
    # Normalize the string to Unicode NFKD form
    normalized_text = unicodedata.normalize('NFKD', text)
    # Remove combining characters (accents) and keep the base letters
    return ''.join(c for c in normalized_text if not unicodedata.combining(c))

# Example usage
text = "Đây là ví dụ về chữ cái có dấu: ắ, à, ê, ơ, đ."
no_accent_text = remove_accents(text)
print(no_accent_text)
Detailed explanation:
- Import the unicodedata library:
  - This standard-library module gives access to Unicode character properties and can normalize strings.
- remove_accents function:
  - The text is normalized to Unicode NFKD form using unicodedata.normalize, which breaks accented characters into their base letters and combining diacritic marks.
  - The function then iterates over the characters and removes the combining characters (accents), leaving only the base letters.
  - The letters Đ and đ have no Unicode decomposition, so normalization alone leaves them unchanged; they must be replaced with D and d explicitly.
- Function usage:
  - When passed a string with accented characters, the function returns the equivalent string without accents.
- Example:
  - The string "Đây là ví dụ về chữ cái có dấu: ắ, à, ê, ơ, đ." is converted to "Day la vi du ve chu cai co dau: a, a, e, o, d."
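To make the normalization step concrete, the short sketch below lists what each character turns into under NFKD. The helper name describe_decomposition is hypothetical and used only for illustration; it relies solely on unicodedata.normalize, unicodedata.name, and unicodedata.combining from the standard library.

```python
import unicodedata

def describe_decomposition(ch):
    """Return (Unicode name, combining class) for each code point in the NFKD form of ch."""
    decomposed = unicodedata.normalize('NFKD', ch)
    return [(unicodedata.name(c), unicodedata.combining(c)) for c in decomposed]

# "\u1eaf" is ắ, "\u01a1" is ơ, "\u0111" is đ
for ch in ("\u1eaf", "\u01a1", "\u0111"):
    print(ch, describe_decomposition(ch))
```

A combining class of 0 marks a base character, while any non-zero value marks a combining mark that the accent filter drops. Here ắ splits into a base a followed by COMBINING BREVE and COMBINING ACUTE ACCENT, and ơ into o plus COMBINING HORN. By contrast, đ comes back as the single code point LATIN SMALL LETTER D WITH STROKE, which is why it needs an explicit replacement.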
Python Version:
This code works with Python 3.0 and above: unicodedata is part of the standard library, and strings are Unicode by default in these versions.