How to convert accented Unicode characters to non-accented characters in Java
A guide on how to convert accented Unicode characters to non-accented characters in Java using `Normalizer` and regular expressions.
In this article, we will explore how to use the `Normalizer` class in Java to remove accents from Unicode characters, particularly Vietnamese letters. This method is useful in string processing when text must be compared or searched without regard to accents.
Java code:
import java.text.Normalizer;
import java.util.regex.Pattern;

public class RemoveDiacritics {
    public static void main(String[] args) {
        String textWithDiacritics = "Chào mừng bạn đến với Java!";
        String textWithoutDiacritics = removeDiacritics(textWithDiacritics);
        System.out.println(textWithoutDiacritics); // Chao mung ban den voi Java!
    }

    public static String removeDiacritics(String text) {
        // Normalize to NFD so each accented letter becomes a base letter plus combining marks
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Regular expression matching runs of combining diacritical marks
        Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        // Remove the marks, map đ/Đ (which NFD does not decompose) to d/D,
        // then drop any characters that are still non-ASCII
        return pattern.matcher(normalized).replaceAll("")
                .replace('đ', 'd').replace('Đ', 'D')
                .replaceAll("[^\\p{ASCII}]", "");
    }
}
Detailed explanation:
- `import java.text.Normalizer;`: Imports the `Normalizer` class for Unicode string processing.
- `String textWithDiacritics = "Chào mừng bạn đến với Java!";`: Declares a string with accents.
- `String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);`: Normalizes the string to NFD form to separate base characters from diacritics.
- `Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");`: Creates a regular expression to find combining diacritical marks.
- `return pattern.matcher(normalized).replaceAll("");`: Removes all combining marks from the string.
- `replaceAll("[^\\p{ASCII}]", "");`: Removes all remaining non-ASCII characters.
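To make the normalization step more concrete, here is a minimal sketch that prints the code points of a single accented letter before and after NFD. The class `NfdDemo` and its helpers `describe` and `decompose` are hypothetical names chosen for this illustration; only `Normalizer.normalize` comes from the standard library.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class NfdDemo {
    // Formats every code point of the string as U+XXXX, separated by spaces
    public static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Thin wrapper around the NFD normalization used in the article
    public static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E0"; // the letter a with grave accent as one precomposed character
        String decomposed = decompose(composed);

        System.out.println(describe(composed));   // U+00E0
        System.out.println(describe(decomposed)); // U+0061 U+0300

        // The combining grave accent (U+0300) is what the regular expression removes
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        System.out.println(stripped);             // a
    }
}
```

After NFD the string holds two code points, the plain letter and a separate combining mark, which is why a pattern that targets only combining marks leaves the base letter intact.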
System Requirements:
- Java version 8 or higher
How to install Java:
Download Java from the official Oracle website and follow the installation instructions.
Tips:
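- This method works for other Latin-script languages as well as Vietnamese, as long as their accented letters decompose under NFD (for example French é or Spanish ñ). Letters with no canonical decomposition, such as Vietnamese đ, Polish ł, or Danish ø, are not split by NFD and need an explicit mapping if they should survive as plain letters.
- Test with representative input strings: the final `[^\\p{ASCII}]` filter silently deletes every character that is still non-ASCII, including non-Latin scripts and emoji.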
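One way to act on these tips is a small check over sample strings from several languages. The sketch below is only an illustration: `DiacriticsCheck`, `stripMarks`, and `isAscii` are hypothetical names, and `stripMarks` deliberately applies just the NFD and combining-mark steps, without the final ASCII filter, so that letters NFD cannot decompose remain visible instead of being deleted.

```java
import java.text.Normalizer;

public class DiacriticsCheck {
    // Hypothetical helper: NFD normalization followed by removal of combining marks only
    public static String stripMarks(String text) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
        return normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // True if every character of the string is in the ASCII range
    public static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this file independent of the source encoding
        String[] samples = {
            "cr\u00E8me br\u00FBl\u00E9e", // French, becomes "creme brulee"
            "Espa\u00F1a",                 // Spanish, becomes "Espana"
            "\u0141\u00F3d\u017A",         // Polish, the L with stroke is not decomposed
            "\u0111\u1EBFn"                // Vietnamese, the d with stroke is not decomposed
        };
        for (String sample : samples) {
            String stripped = stripMarks(sample);
            String note = isAscii(stripped) ? "" : "  <- still contains non-ASCII letters";
            System.out.println(stripped + note);
        }
    }
}
```

The French and Spanish samples come out as plain ASCII, while the Polish and Vietnamese samples are flagged because one letter in each has no decomposition. Inputs flagged this way are the ones that need an explicit replacement rule before any ASCII-only filtering.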