Unicode 15.0 was officially released on September 13, 2022. It added 4,489 characters, bringing the total to 149,186. The additions include 2 new scripts, for a total of 161, and 20 new emoji characters. Several important Unicode specifications were updated alongside version 15.0, including Unicode Security Mechanisms (UTS #39), which is intended to reduce homoglyph attacks, that is, attacks that rely on the visual deception made possible by lookalike Unicode characters.
The homoglyph attack is a very old form of visual deception. In the era of mechanical typewriters, many machines had no separate keys for 1 and 0, in order to simplify the design and reduce manufacturing and maintenance costs. A typist would use a lowercase L or an uppercase I for the digit 1, and an uppercase O for the digit 0. When these same typists became computer keyboard operators in the 1970s and early 1980s, their old habits carried over into the new profession and became a source of great confusion. This was probably the first period in which visual confusion and homoglyph problems broke out on a large scale.
After that, the typewriter was replaced by the word processor, the information age gradually arrived, and character encoding expanded from the ASCII character set to the Unicode character set. We began to rely on browsers and other client applications to render text. In contexts where lookalike characters are unacceptable, such as URLs, formulas, source code, and IDs, their similar appearance continued to make them visually indistinguishable to users.
Unicode visual spoofing depends on visually confusable strings: two Unicode strings that look so similar that, in a normal font at a small size and common screen resolutions, one can easily be mistaken for the other. There are no hard-and-fast rules for visual confusability: at a small enough size, many characters look like other characters. "Small at screen resolution" here means fonts of about 9-12 pixels for most scripts. Confusability also depends on the style of the font: in traditional Hebrew fonts, many characters are distinguished only by subtle differences that can be lost at small sizes. In some cases, sequences of characters can also be used to deceive: for example, "rn" ("r" followed by "n") is visually confusable with "m" in many sans-serif fonts.
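One way to reason about confusable strings is to reduce each string to a comparison key (a "skeleton") and compare the keys, which is the general idea behind the confusable detection described in UTS #39. The following is only a minimal sketch of that idea: the `PROTOTYPES` table is a tiny hand-picked assumption for illustration, not the real confusables data file, and the function names are hypothetical.

```python
import unicodedata

# Illustrative, hand-picked subset of confusable mappings (an assumption for
# this sketch, NOT the real UTS #39 data): source character -> prototype.
PROTOTYPES = {
    "\u0430": "a",   # CYRILLIC SMALL LETTER A
    "\u043e": "o",   # CYRILLIC SMALL LETTER O
    "\u03bf": "o",   # GREEK SMALL LETTER OMICRON
    "m": "rn",       # "m" and the sequence "rn" share one skeleton
}

def skeleton(text: str) -> str:
    """Reduce a string to a comparison key: NFD, map prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def confusable(a: str, b: str) -> bool:
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)

if __name__ == "__main__":
    print(confusable("modern", "rnodern"))       # sequence-based confusion
    print(confusable("p\u0430ypal", "paypal"))   # Cyrillic a vs Latin a
    print(confusable("apple", "apply"))          # genuinely different strings
```

A real implementation would load the full mapping published with UTS #39 rather than a hand-written table.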
In recent years, many malicious attacks have relied on deceptive Unicode encoding. Humans, compilers, and AI systems can all be led to wrong judgments and analyses by it. For example, in 2021 researchers used such special Unicode characters to mount adversarial attacks on NLP models in commercial systems such as Google's. They injected imperceptible encodings, such as invisible characters, homoglyphs, reordering characters, or deletion control characters, which can significantly degrade the performance of some models and could leave most models functionally broken (https://arxiv.org/abs/2106.09898).
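A simple defensive check against the invisible and reordering characters mentioned above is to scan input for format (Cf) and control (Cc) characters before it reaches a model or a compiler. The sketch below is an assumption about how such a scan could look, not the method used in the cited research; `find_invisible` is a hypothetical helper name.

```python
import unicodedata

def find_invisible(text: str) -> list:
    """Return (index, 'U+XXXX', name) for format (Cf) and control (Cc)
    characters, ignoring ordinary newline, carriage return, and tab."""
    hits = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t":
            continue
        if unicodedata.category(ch) in {"Cf", "Cc"}:
            name = unicodedata.name(ch, "<unnamed>")  # controls have no name
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits

if __name__ == "__main__":
    # Zero width space hidden inside a word, then a right-to-left override.
    for hit in find_invisible("pay\u200bpal \u202etxt.exe"):
        print(hit)
```

Some format characters are legitimate in certain scripts, so a real filter would likely flag rather than silently strip them.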
In 2022, a large number of phishing websites targeted Trezor, a well-known hardware wallet, using links such as https://suite.trẹzor.com. This phishing link looks very similar to the genuine Trezor website, trezor.io. (This case comes from the "Blockchain Dark Forest Selfguard Handbook": https://github.com/slowmist/Blockchain-dark-forest-selfguard-handbook/blob/main/README_CN.md)
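The deceptive character in such a hostname is easy to miss by eye but trivial to find programmatically. The following sketch lists every non-ASCII character in a hostname together with its code point and Unicode name. It assumes the lookalike letter is U+1EB9 (the hostname is NFC-normalized first, so a decomposed "e" plus combining dot below is reported the same way); `non_ascii_report` is a hypothetical helper name.

```python
import unicodedata

def non_ascii_report(hostname: str) -> list:
    """List (char, 'U+XXXX', name) for each non-ASCII character in a hostname."""
    normalized = unicodedata.normalize("NFC", hostname)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in normalized
        if ord(ch) > 0x7F
    ]

if __name__ == "__main__":
    # Assumed spelling of the phishing host, written with an explicit escape.
    for entry in non_ascii_report("suite.tr\u1eb9zor.com"):
        print(entry)
    print(non_ascii_report("trezor.io"))  # the genuine host is pure ASCII
```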
Although client-side defenses against visual spoofing, especially in browsers, have been improving, there is no way to block this attack completely. First, visual deception is hard to eliminate because the human visual system cannot tell similar glyphs apart at sufficiently small sizes. Second, the Unicode character set is very large and keeps growing. Third, a complete defense requires standards organizations, browser developers, domain name registrars, and other parties to work together.
This article studies the visual deception problems caused by glyph rendering, mixed scripts, Punycode, bidirectional text, and combining characters in Unicode. Drawing on the different types of Unicode visual deception vulnerabilities the author has discovered, it shares the vulnerability-hunting process and offers some ideas for defense.
2. Security risks brought by glyph rendering
When fonts or rendering engines have insufficient support for certain characters or character sequences, characters that should be visually distinguishable can become visually confusable. The harm of this kind of deception is most obvious when the characters carry critical information, for example in the browser's address bar, which is an important security indicator. Browser vendors have been actively defending against IDN spoofing in the address bar, usually by maintaining a list of important domain names in the browser. If a domain name is confusable with one in the list, it is displayed in Punycode instead.
Because of this lack of rendering support, the glyph actually shown for a character often differs from what we expect, and that uncertainty makes visual deception possible at any time. So what exactly is a glyph, and how does it relate to typefaces, fonts, and characters? And is an emoji, which is increasingly popular in Unicode, a character? These basic terms are often confused by non-specialists, so let's briefly review them.
A typeface is the design of a set of characters, usually including letters, a set of digits, and a set of punctuation marks, and often ideographic characters and cartographic symbols as well. Each font, such as Arial or Helvetica, is a collection of glyphs. Some font technologies, such as TrueType/OpenType, are very powerful: they can choose the best shape to display based on resolution, platform, language, and so on. That power can also be used for attacks, since it is enough to make "$100.00" on the screen appear as "$200.00" when printed. In addition, Cascading Style Sheets (CSS) can select different fonts for print and for screen display, which opens up further opportunities for confusion. These issues are not specific to Unicode. To reduce the risk of such exploits, programmers and users should allow only trusted fonts.
A font (traditional British English: fount) refers, in the printing industry, to a set of type of one style and one size, for example a set of size-5 Song type for body text or a set of size-10 type for headings. In the early days of computing, bitmap fonts kept this concept: a single style, such as Zhongyi Song, needed a separate pixel font for each size. After vector fonts appeared, there was no longer any need to make different pixel fonts for the same style; a single set could be scaled at will, and the boundary between "font" and "typeface" began to blur. Ordinary English speakers often cannot tell the difference between "font" and "typeface".
A glyph, also called a character image, is the shape of a character. The Chinese national standard GB/T 16964 "Information Technology – Font Information Exchange" defines a glyph as "a recognizable abstract graphic symbol that does not depend on any specific design". In linguistics, a character is the most basic unit of semantic meaning, that is, a morpheme; a glyph is a specific form used to express that meaning. The same character can have different glyphs without affecting its meaning. For example, the first letter of the Latin alphabet can be written as a or ɑ, and among Chinese characters there are the variants 強/强 and 户/戶/戸.
In complex scripts, such as Arabic and South Asian scripts, characters may change shape based on the surrounding characters.
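(1) A glyph can change shape depending on its surroundings
Three instances of ARABIC LETTER HEH (U+0647) written together are rendered with three different shapes (initial, medial, and final forms).
(2) Multiple characters can produce a single glyph
f = LATIN SMALL LETTER F (U+0066)
i = LATIN SMALL LETTER I (U+0069)
In many fonts, "f" followed by "i" is rendered as a single ligature glyph.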
Now consider a case that matters for visual deception: sequences of different characters can combine to produce the same visual appearance. For example, the Tamil characters U+0BB6 SHA and U+0BB8 SA normally look very different, yet two combining sequences, one built on each of them, are visually identical even though the underlying code points differ.
Multi-character combinations are not only a vehicle for visual deception; sometimes they can also corrupt memory. CVE-2018-4124 is a vulnerability that can cause heap corruption in macOS High Sierra 10.13.3 when a maliciously crafted string is processed. The underlying sequence of జ్ఞా is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, a sequence of Telugu characters: the consonant ja (జ), the virama (్), the consonant nya (ఞ), a zero-width non-joiner, and the vowel sign aa (ా). When macOS processes the character sequence జ్ఞా, the system can crash.
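Because a rendered cluster hides its underlying code points, it helps to dump a string character by character when analyzing a case like this. The sketch below does that for the sequence listed above; `describe` is a hypothetical helper name, and the code only inspects the string, it does not reproduce the crash.

```python
import unicodedata

# The five code points of the sequence described above.
SEQUENCE = "\u0c1c\u0c4d\u0c1e\u200c\u0c3e"

def describe(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per code point in the string."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

if __name__ == "__main__":
    for line in describe(SEQUENCE):
        print(line)
    # One visible cluster on screen, five code points, fifteen UTF-8 bytes.
    print(len(SEQUENCE), len(SEQUENCE.encode("utf-8")))
```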
Emoji are pictographs (graphical symbols), usually presented as colorful cartoon images and used inline in text. They represent faces, weather, vehicles and buildings, food and drink, animals and plants, or icons that stand for emotions, feelings, or activities. Emoji have become ubiquitous and are turning into a common language across cultures. However, differences in how people understand emoji and the fragmented support across operating systems have also caused problems. For example, you won't see the rifle emoji, U+1F946, any time soon: Unicode Consortium members, including Apple and Microsoft, opposed the inclusion of the rifle symbol. Emoji for pistols and other weapons have landed people in legal trouble. A French court ruled that the pistol emoji could constitute a death threat and sentenced a man to three months in prison for sending the gun emoji to his ex-girlfriend.
It is also quite dangerous to render U+1F512 (the lock emoji) directly in the browser address bar, because it looks very similar to the HTTPS security padlock and lets an attacker forge that indicator. Both Chrome and Firefox have had security issues of this kind involving U+1F512.
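One possible mitigation is to refuse to display, in Unicode form, any hostname that contains symbol characters. The sketch below flags characters whose general category is a symbol class; the choice of `BLOCKED_CATEGORIES` is an assumption made for illustration and is not the policy of any particular browser, and `symbol_chars` is a hypothetical helper name.

```python
import unicodedata

# Assumed policy for this sketch: treat every symbol category as suspicious.
BLOCKED_CATEGORIES = {"So", "Sk", "Sm", "Sc"}

def symbol_chars(hostname: str) -> list:
    """Return ('U+XXXX', name) for each symbol character in the hostname."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in hostname
        if unicodedata.category(ch) in BLOCKED_CATEGORIES
    ]

if __name__ == "__main__":
    print(symbol_chars("\U0001F512example.com"))  # lock emoji prepended
    print(symbol_chars("example.com"))            # nothing to report
```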
3. Security risks brought by mixed scripts
There are many legitimate uses of mixed scripts, such as Ωmega. But visually confusable characters do not normally occur within a single script, so mixing scripts creates many opportunities for deception. For example, two domain names can differ only in that one uses the Greek lowercase letter omicron where the other uses the Latin letter o, and the two letters look very similar.
Originally, domain names could contain only the Latin letters A-Z, digits, and a few other characters from the ASCII character set. It later became clear that ASCII-only domain names were a problem, because many countries and regions do not use the Latin alphabet and wanted to use the symbols of their own languages in domain names. After several rounds of proposals, the specification for Internationalized Domain Names [rfc3490] was released in 2003; it allows most of Unicode to be used in domain names and is generally referred to as IDNA2003. A revision was approved for publication in 2010 and is known as IDNA2008 (defined in RFC 5890 through RFC 5893; see also the mapping document [rfc5895]). But IDNA2003 and IDNA2008 did not effectively solve some of the problems of internationalized domain names, so the Unicode Consortium subsequently released [UTS-46] to address some compatibility issues.
IDNA uses the Punycode algorithm to convert non-ASCII domain names into ASCII domain names. Punycode uniquely and reversibly maps any Unicode string to a string that uses only ASCII letters, digits, and hyphens. The encoded label is prefixed with xn-- to indicate that it is Punycode-encoded.
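The round trip between the Unicode form and the xn-- form can be sketched with Python's built-in `idna` codec. Note the assumption: that codec implements IDNA2003 (RFC 3490), so its results can differ from IDNA2008 or UTS #46 processing for some inputs. The helper names `to_ascii` and `to_unicode` and the sample hostname are illustrative only.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (xn--) form."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname: str) -> str:
    """Convert an ASCII (xn--) hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")

if __name__ == "__main__":
    ace = to_ascii("b\u00fccher.example")
    print(ace)               # xn--bcher-kva.example
    print(to_unicode(ace))   # back to the Unicode spelling
    print(to_ascii("example.com"))  # pure ASCII names are unchanged
```

Displaying the xn-- form instead of the Unicode form is exactly the fallback browsers use when a hostname looks suspicious.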
An Internationalized Domain Name (IDN) is a second- or third-level domain name or web address registered in any character set or script defined in Unicode. Until the end of 2009, top-level domain names were limited to the Latin letters a-z. Later, as the Web became more global, IDN TLDs were gradually promoted and adopted, which accelerated globalization but also brought some security risks. We'll come back to that later.
As noted earlier in this article, Unicode 15.0 contains 149,186 characters and 161 scripts. Most of these scripts can be used for domain registration, and the number is growing. Taking the top-level domain (TLD) COM as an example, Verisign has developed a policy for IDN registrations that specifies allowed and prohibited code points. It defines five validation rules: IETF standards, restrictions on specific languages, restrictions on script obfuscation, ICANN-restricted Unicode code points, and special characters. IDNs that follow these five rules are considered valid registrations.
Of these five IDN registration rules, we'll focus on the "Restrictions on script obfuscation" rule, because it is helpful for spotting visual spoofing issues in IDNs.
[ Restrictions on script obfuscation ]
Verisign does not allow registrations that mix Unicode scripts. Registration is denied if the IDN contains code points from two or more Unicode scripts. For example, characters from the Latin script cannot be used in the same IDN as any Cyrillic character. All code points in the IDN must come from the same Unicode script. This is done to keep confusable code points from appearing in the same IDN.
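A single-script rule of this kind can be sketched in a few lines. Python's standard library does not expose the Unicode Script property, so the sketch below approximates a character's script by the first word of its Unicode name; that shortcut is an assumption for illustration only, and a real registry would use the actual Script property. The function names are hypothetical.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Approximate the script by the first word of the character name
    (for example LATIN, GREEK, CYRILLIC). Not the real Script property."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def scripts_in_label(label: str) -> set:
    """Collect the approximate scripts of the letters in a label;
    digits and hyphens are ignored."""
    return {rough_script(ch) for ch in label if ch.isalpha()}

def is_single_script(label: str) -> bool:
    return len(scripts_in_label(label)) <= 1

if __name__ == "__main__":
    print(is_single_script("example"))        # all Latin
    print(is_single_script("ex\u0430mple"))   # Cyrillic a among Latin letters
    print(scripts_in_label("\u03a9mega"))     # Greek capital omega plus Latin
```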
The following table lists the allowed Unicode scripts.
One example is a vulnerability I found earlier: [CVE-2018-4277] Spoof All Domains Containing 'd' in Apple Products. During my research I found that Apple products render LATIN SMALL LETTER DUM (U+A771) with a glyph very similar to LATIN SMALL LETTER D (U+0064). According to the reference glyph for U+A771 in the Unicode code charts (http://www.unicode.org/charts/PDF/UA720.pdf), there should be a small apostrophe-like mark after the d, but Apple products ignored it completely.
Next, I registered a real domain name to make this IDN spoof work. As described above, Verisign's IDN registration rules do not allow mixed Unicode scripts: registration is denied if the IDN contains code points from two or more scripts. But U+A771 also belongs to the Latin script, so it satisfies the registrar's rules, and the domain name was registered successfully.
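The reason a single-script rule does not stop this character can be checked directly. The sketch below shows that U+A771 carries a Latin name, is left unchanged by NFKC normalization, and still differs from "d" at the code point level. The label "\ua771emo" is a made-up example and not the domain that was actually registered, and the raw `punycode` codec is used only to show the ASCII form of such a label.

```python
import unicodedata

DUM = "\ua771"  # the lookalike character discussed above

def looks_latin(ch: str) -> bool:
    """Name-based approximation of 'belongs to the Latin script'."""
    return unicodedata.name(ch, "").startswith("LATIN ")

def survives_nfkc(ch: str) -> bool:
    """True when compatibility normalization leaves the character unchanged."""
    return unicodedata.normalize("NFKC", ch) == ch

def to_ace(label: str) -> str:
    """ASCII form of one non-ASCII label using the raw punycode codec."""
    return "xn--" + label.encode("punycode").decode("ascii")

if __name__ == "__main__":
    print(unicodedata.name(DUM), looks_latin(DUM), survives_nfkc(DUM))
    print(DUM == "d")            # different code points
    print(to_ace(DUM + "emo"))   # hypothetical label that resembles "demo"
```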
I also obtained a TLS certificate to make this IDN spoof look more realistic. As a result, Safari did not convert the domain name to Punycode for display, so the spoof succeeded.
At this point we have confirmed that the whole deception is feasible, and an attacker can spoof any domain name that contains the letter d. Among the top 10,000 domain names in Google's statistics, more than 25% contain the character d, so the domain names of all of these websites can be spoofed.
[ Apple's fixes ]
watchOS 4.3.2 https://support.apple.com/zh-cn/HT208935
iOS 11.4.1 https://support.apple.com/zh-cn/HT208938
tvOS 11.4.1 https://support.apple.com/zh-cn/HT208936
macOS High Sierra 10.13.5 https://support.apple.com/zh-cn/HT208937
4. Security risks brought by bidirectional text
Certain characters, such as those used in Arabic and Hebrew scripts, have an inherent right-to-left writing direction. When these characters are mixed with characters from other scripts or symbol sets displayed from left to right, the resulting text is called bidirectional (abbreviated bidi). The relationship between the in-memory representation of a document (logical order) and the display appearance of bidirectional text (visual order) is governed by UAX#9: The Unicode Bidirectional Algorithm [UAX9].
Because some characters have weak or neutral directionality rather than strong left-to-right or right-to-left directionality, the Unicode Bidirectional Algorithm uses a precise set of rules to determine the final visual order. Even so, arbitrary sequences of text can be rendered in ways that are hard to read or visually confusing.
URLs are especially prone to this, because a single URL often mixes characters of several directional classes: letters (strong), digits and separators such as "." and "/" (weak), and whitespace (neutral). When right-to-left characters are present, the displayed order of the URL's components can differ from their logical order, so the address a user reads may not be the address the browser actually visits.
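The directional classes involved can be inspected with `unicodedata.bidirectional`. The sketch below groups the classes into strong, weak, and neutral and reports whether a string mixes left-to-right and right-to-left strong characters. It only classifies characters; it does not implement the reordering rules of UAX #9. The helper names and the sample string are hypothetical.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}

def bidi_strength(ch: str) -> str:
    """Classify one character as strong, weak, neutral, or other."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    return "other"  # explicit embedding/override/isolate controls, unassigned

def has_mixed_direction(text: str) -> bool:
    """True when both left-to-right and right-to-left strong characters occur."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return "L" in classes and bool(classes & {"R", "AL"})

if __name__ == "__main__":
    sample = "apple.com/\u0647 1"   # Latin letters, Arabic heh, space, digit
    for ch in sample:
        print(repr(ch), unicodedata.bidirectional(ch), bidi_strength(ch))
    print(has_mixed_direction(sample))
```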
Chrome has had a vulnerability of this kind. Visiting the following URL in Chrome:
It is actually rendered in the address bar as:
CVE-2018-4205 is a vulnerability I discovered earlier that uses RTL text and whitespace to spoof the URL shown in the address bar. The following proofs of concept illustrate it.
(1) Constructing POC-1
Visit http://www.apple.com.xn--ggbla3j.xn--ngbc5azd/.
The address bars of Chrome, Firefox, and Safari all display it the same way, while Edge displays Punycode.
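To see why this hostname behaves as bidirectional text, it helps to decode each xn-- label and look at the directional class of its characters. The sketch below does this with Python's raw `punycode` codec, skipping the full IDNA validation a browser would perform; `decode_label` and `label_directions` are hypothetical helper names.

```python
import unicodedata

def decode_label(label: str) -> str:
    """Decode one xn-- label with the raw punycode codec; pass others through."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label

def label_directions(hostname: str) -> list:
    """For each label, return (label, sorted bidi classes of its decoded letters)."""
    result = []
    for label in hostname.split("."):
        decoded = decode_label(label)
        classes = sorted({unicodedata.bidirectional(ch) for ch in decoded})
        result.append((label, classes))
    return result

if __name__ == "__main__":
    for label, classes in label_directions("www.apple.com.xn--ggbla3j.xn--ngbc5azd"):
        print(label, classes)
```

The first three labels consist of left-to-right letters and the last two of Arabic letters, which is what puts the Unicode Bidirectional Algorithm in play for the displayed hostname.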
(2) Constructing POC-2
Opening POC-2 in the four browsers reveals the problem: in Safari, the RTL text and whitespace appear in the address bar.
(3) Constructing POC-3
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/deep-analysis-of-unicode-visual-deception-attacks/