Unicode Security Notes Page


Download Talk Slides

Corrupt Unicode Example

Scrap Unicode Notes

Every Unicode Character Blob Page or TXT file
Every Unicode Character 80 Column Page or TXT file
Every Unicode Character With Hex Page or TXT file

Text below is to help with search indexing and copy and pasting, but it is missing some items from the Power Point slides.

Character Assassination:
Fun and games with Unicode

Adrian Crenshaw

About Adrian

I run Irongeek.com

I have an interest in InfoSec education

I don’t know everything - I’m just a geek with time on my hands

Sr. Information Security Engineer at a Fortune 1000

Co-Founder of Derbycon
http://www.derbycon.com/

To be clear concerning what this talk is about

Why this subject?

Lots of research has been done, but not many people talk about it

Complexity is the damnable enemy of security, but human language is complex so what can you do?

Act as a setup for future research

To encourage others who are better at exploit development than me to look into it

Because I wanted to make an animation with cartoon letters stabbing each other

Why Unicode

There are more than English Speakers out there

ASCII: American Standard Code for Information Interchange

What about other languages? Cyrillic, Chinese, Hebrew, Arabic, Klingon… ( ok, sort of http://wazu.jp/gallery/Test_Klingon.html )

Unicode lets computer systems support more languages, allowing for world wide use

Unicode History

ASCII is 7 bit and has just 95 printable characters, but an 8th bit was added to make other standards:

Extended ASCII

ISO/IEC 8859

ISO/IEC 8859 uses the 8th bit to add another 96 printable characters, plus more control characters

You have to specify a part/character set/language to specify those 96

This still was not enough, and did not allow for a lot of mixed languages

The need was to represent all of the characters as unique code points, and not get confused amongst languages

Unicode History

Joe Becker (Xerox), Lee Collins & Mark Davis (Apple) started working on Unicode in 1987 to do this, version 1.0.0 released in Oct 1991

Unicode started as a 16bit character model (0x0-0xFFFF), with the first 256 code points the same as ISO-8859-1

Each character has a code point associated with it:
A = U+0041 $=U+0024 U+265E=♞

This has since been expanded, so Unicode has points from 0x0 to 0x10FFFF (1,114,112 points dec), though support varies

Most used points will be in Basic Multilingual Plane (BMP) represented as U+0000 to U+FFFF

Encodings

UTF-8 (UCS Transformation Format 8-bit), meant to be backward compatible with ASCII

UTF-16 (Unicode Transformation Format 16-bit) which superseded UCS-2

UTF-32 (Unicode Transformation Format 32-bit )

BOM (Byte Order Marks)

UTF-8 prepends EFBBBF to data

UTF-16 FEFF Unicode Big Endian, FFFE Little Endian

UTF-32 generally does not use one

Encoding Examples

Omega U+03A9

AΩB

UTF-8
41 CE A9 42

UTF-16
00 41 03 A9 00 42

UTF-32
00 00 00 41 00 00 03 A9 00 00 00 42
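
The byte strings above can be reproduced with a few lines of Python. This is just a quick sketch; the `hex_bytes` helper is my own name for illustration, and the big-endian codecs are used so that no BOM is prepended.

```python
# Encode "AΩB" three ways and show the bytes as hex.
text = "A\u03a9B"  # A, Greek capital letter omega (U+03A9), B


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


utf8 = hex_bytes(text.encode("utf-8"))
utf16 = hex_bytes(text.encode("utf-16-be"))  # big endian, no BOM added
utf32 = hex_bytes(text.encode("utf-32-be"))  # big endian, no BOM added

print(utf8)
print(utf16)
print(utf32)
```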

I hate Smart Quotes!

“Smart” "Not so smart" �Smart when dumb� Why?

Microsoft extended ISO 8859-1, making some control characters in 80 to 9F printable for Windows-1252

“ ” ‚ ‘ ’ —
93 94 82 91 92 97

If Windows-1252 is confused for ISO 8859-1, you get � for these characters

Makes copying and pasting commands in tutorials a pain!
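
A small sketch of the confusion, using the six bytes listed above. Decoded as Windows-1252 they are the smart punctuation; decoded as ISO 8859-1 they land on C1 control characters, which most software shows as a replacement glyph or nothing at all.

```python
import unicodedata

# The Windows-1252 bytes for the smart punctuation listed above.
raw = bytes([0x93, 0x94, 0x82, 0x91, 0x92, 0x97])

as_cp1252 = raw.decode("cp1252")   # curly quotes and an em dash
as_latin1 = raw.decode("latin-1")  # U+0080..U+009F control characters

for ch in as_latin1:
    # Category "Cc" means control character: nothing printable to show.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```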

Related:
Some Email J

UTF-8 Encoding

Lower ASCII is the same in UTF-8; higher code points use a lead byte followed by continuation bytes (table bogarted from Wikipedia)
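
Since the table did not survive the copy and paste, here is a hand-rolled encoder that shows the same bit layout. It is a teaching sketch only (the function name is mine, and it does not reject surrogates or out-of-range values); real code should just call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (no validity checks)."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x03A9).hex())  # omega
print(utf8_encode(0x270C).hex())  # victory hand
```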

UTF-16 Encoding

In UTF-16, U+10000 to U+10FFFF use surrogate pairs taken from the range 0xD800 to 0xDFFF

Steps
based on: http://en.wikipedia.org/wiki/UTF-16

0x10000 is subtracted from the code point, leaving a 20 bit number in the range 0..0xFFFFF.
The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF.
The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF (previous versions of the Unicode Standard referred to these as low surrogates).
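
The three steps above can be sketched in Python as follows. The function name is mine, and the result is only checked against Python's own UTF-16 codec for one sample code point.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogate pairs")
    v = cp - 0x10000              # step 1: 20 bit number
    lead = 0xD800 + (v >> 10)     # step 2: top ten bits
    trail = 0xDC00 + (v & 0x3FF)  # step 3: low ten bits
    return lead, trail


lead, trail = to_surrogates(0x1F600)
print(f"{lead:04X} {trail:04X}")
```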

Mojibake!

Mojibake = "character" "transform"

AΩB✌C

Code Points:

U+0041 U+03a9 U+0042 U+270C U+0043

UTF-8 byte string:

EF BB BF 41 CE A9 42 E2 9C 8C 43

Mangled by reading the bytes as Windows-1252 (often mislabeled as ISO 8859-1):

ï»¿AÎ©BâœŒC
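
A quick way to reproduce the mangling in Python. The œ and Œ in the mangled output only exist in Windows-1252 (in strict ISO 8859-1 those bytes are control characters), so this sketch decodes with `cp1252`. As long as every byte maps to something, the damage can be reversed.

```python
# The UTF-8 byte string from above, including the BOM.
raw = bytes.fromhex("EF BB BF 41 CE A9 42 E2 9C 8C 43")

correct = raw.decode("utf-8-sig")  # "utf-8-sig" strips the BOM
mangled = raw.decode("cp1252")     # wrong guess at the encoding

# Undo the damage by reversing the wrong decode.
repaired = mangled.encode("cp1252").decode("utf-8-sig")

print(correct)
print(mangled)
print(repaired == correct)
```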

Find Your Character

Wikipedia List
https://en.wikipedia.org/wiki/List_of_Unicode_characters

Unicode Table
http://unicode-table.com/

File Format
http://www.fileformat.info/info/unicode/

Unicode Code Converter v7.05
http://rishida.net/tools/conversion/

Typing Unicode

Windows:

Alt, + key on keypad, type hex number

May have to edit HKEY_Current_User/Control Panel/Input Method and set EnableHexNumpad to "1".
Help from http://www.fileformat.info/tip/microsoft/enter_unicode.htm

OS X

Option+Command+t will let you select some

System Preferences ->Language & Text->Input Sources

Enable “Unicode Hex Input”

Select U+ from the menu bar

Hold Option Key, type in Hex code

Obligatory XKCD Slide

http://xkcd.com/1209/

Homoglyph/Visual Attacks

Confusables and Look-a-likes

Classic Phishing Obfuscations

Would you follow a link in email to AdriansHouseOfPwnage.com?

Text says one thing, link says another:
<a href="http://irongeek.com">http://www.microsoft.com</a>

Confuse user with credentials section of a URL:
http://www.microsoft.com@irongeek.com

Firefox pops up a warning

IE just refuses to connect

Other ideas?

Homographs

Homographs = words that look the same

Homoglyphs = characters that look the same

Examples:

rnicrosoft.com vs. microsoft.com

paypa1.com vs. paypal.com

IR0NGEEK.COM vs. IRONGEEK.COM

Now, what about Unicode?

Problem: DNS is ASCII

DNS labels (the parts separated by dots) follow the LDH rule:

Letters

Digits

Hyphen

This would not allow for international characters in DNS labels

Enter Punycode and IDNA

IDNA

Internationalized Domain Names in Applications (IDNA) allows non-ASCII characters in the host section of a URL to map to DNS host names

café.com = xn--caf-dma.com

北京大学.中國 = xn--1lq90ic7fzpc.xn--fiqz9s

What about Homoglyphs in Unicode?

There are homoglyphs in Unicode that look the same as normal Latin characters, and these could be used for spoofing names. Examples:

googlе.com = xn--googl-3we.com
(е is a Cyrillic small letter ie U+0435)

іucu.org = xn--ucu-ihd.org
(і is a Cyrillic small letter Byelorussian-Ukrainian і U+0456)

pаypal.com = xn--pypal-4ve.com
(the 2nd character, а, is Cyrillic small letter a U+0430)
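
The Punycode forms above can be checked with Python's built-in `idna` codec. Note that the standard library implements the older IDNA 2003 rules, so treat this as a sketch for poking at names rather than what a browser or registrar does; `to_ascii` is just my helper name.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode host name to its xn-- form (stdlib, IDNA 2003)."""
    return hostname.encode("idna").decode("ascii")


# Escapes are used so the homoglyphs are unambiguous in the source.
samples = [
    "caf\u00e9.com",     # e with acute
    "googl\u0435.com",   # Cyrillic small letter ie
    "\u0456ucu.org",     # Cyrillic Byelorussian-Ukrainian i
    "p\u0430ypal.com",   # Cyrillic small letter a
]

for name in samples:
    print(to_ascii(name))
```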

Likely Sources for Homoglyphs

Cyrillic script: a, c, e, o, p, x and y

Latin alphabet appears twice, U+0021-007E (Basic Latin) & U+FF01-FF5E (Full width Latin):
!"$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Even some slashes
/(U+002f), ̸ (U+0338), ⁄ (U+2044), ∕(U+2215),
╱ (U+2571), ／ (U+ff0f), ﾉ (U+ff89)

Slashes?

Can other domains be used?

www.microsoft.com⁄index.html.irongeek.com
Slash is U+2044

Mouse over it

Homoglyph Attack Generator
Demo

http://www.irongeek.com/homoglyph-attack-generator.php

Combination of JavaScript and PHP libraries created by phlyLabs as part of phlyMail

Protections Implemented by Browsers

Firefox shows Punycode if

Not in TLD White List (about:config→network.IDN.whitelist)
.ac, .ar, .asia, .at, .biz, .br, .cat, .ch, .cl, .cn, .de, .dk, .ee, .es, .fi, .gr, .hu, .il, .info, .io, .ir, .is, .jp, .kr, .li, .lt, .lu, .lv, .museum, .no, .nu, .nz, .org, .pl, .pr, .se, .sh, .si, .tel, .th, .tm, .tw, .ua, .vn, .xn--0zwm56d, .xn--11b5bs3a9aj6g, .xn--80akhbyknj4f, .xn--90a3ac, .xn--9t4b11yi5a, .xn--deba0ad, .xn--fiqs8s, .xn--fiqz9s, .xn--fzc2c9e2c, .xn--g6w251d, .xn--hgbk6aj7f53bba, .xn--hlcj6aya9esc7a, .xn--j6w193g, .xn--jxalpdlp, .xn--kgbechtv, .xn--kprw13d, .xn--kpry57d, .xn--mgba3a4f16a, .xn--mgba3a4fra, .xn--mgbaam7a8h, .xn--mgbayh7gpa, .xn--mgberp4a5d4a87g, .xn--mgberp4a5d4ar, .xn--mgbqly7c0a67fbc, .xn--mgbqly7cvafr, .xn--o3cw4h, .xn--ogbpf8fl, .xn--p1ai, .xn--wgbh1c, .xn--wgbl6a, .xn--xkc2al3hye2a, .xn--zckzah

network.IDN_show_punycode set to true (default false)

Any of these blacklisted characters appear:
¼½¾ǃː̷̸։׃״؉؊٪۔܁܂܃܄ᅟᅠ᜵ ․‧   ‹›⁁⁄⁒ ⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟∕∶⎮╱⧶⧸⫻⫽⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻　。〔〕〳ㅤ㈝㈞㎮㎯㏆㏟꞉︔︕︿﹝﹞．／｡ﾠ￹￺￻�

Updated at
http://kb.mozillazine.org/Network.IDN.blacklist_chars

Protections Implemented by Browsers

IE 9, and I assume 10 shows Punycode if

If there is a mismatch between the characters used in the URL and the language expectation

If character is not used in any language

Mixed set of scripts that do not belong together

Info may be out of date, most material references IE 7
http://msdn.microsoft.com/en-us/library/bb250505%28v=vs.85%29.aspx

Protections Implemented by Browsers

Chrome shows Punycode if

Configured language of the browser (configured in the “Fonts and Languages” options) does not match

Incompatible set of scripts that do not belong

But there is a whitelist, so hard-to-confuse script combinations like Latin with Chinese can be used

Characters in a black list
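
A rough sketch of the "mixed scripts" idea. Python's standard library has no script property, so this guesses the script from the first word of the Unicode character name. That is my own approximation for illustration, not how any of the browsers actually do it.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Guess the scripts used in a label from Unicode character names."""
    found = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER IE" -> "CYRILLIC"
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def looks_mixed(label: str) -> bool:
    """True if the letters in the label come from more than one script."""
    return len(scripts_in(label)) > 1


print(scripts_in("google"))
print(scripts_in("googl\u0435"))  # last letter is Cyrillic ie
```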

Defenses by Registrar

Registrars may not allow the character

For example, one registrar gave the following error when an attempt was made to register іucu.org (Cyrillic small letter Byelorussian-Ukrainian i U+0456):
“Error: You used an invalid international character! Please note that for some reason .org and .info only support Danish, German, Hungarian, Icelandic, Korean, Latvian, Lithuanian, Polish, Spanish, and Swedish international characters.”

May be gotten around by using a / homoglyph in a subdomain of a domain you already own; ノ Katakana Letter No (U+30CE) seems to work best

Approach

How different browsers show the Punycode in the URL bar.
How different mail systems show the URL when email is displayed.
How social networks render the URL.

Used domain we control, and Local Hosts file to map the DNS entries

IE 10.0.8

FireFox 23.0.1

Chrome 28.0.1500.95 mg

Some Results

Other odd balls

іucu.org [xn--ucu-ihd.org](і U+0456 ) could not be registered

These seemed to pass Registrar’s tests
Íucu.org [xn--ucu-2ia.org](Latin capital letter i with acute Í U+00CD)
íucu.org [xn--ucu-qma.org](Latin small letter i with acute í U+00ED)
įucu.org [xn--ucu-9ta.org](Latin small letter i with ogonek į U+012F)

ノ Katakana Letter No (U+30ce) seems to work in Firefox for subdomain trick, but not in Chrome or IE

Display of IDNA in Web Apps

What does the webapp display?

How does it parse links?

Test Strings

Ω U+03A9
http://Ω.com
ɡ U+0261
http://ɡoogle.com
http://ɡoogle.org
і U+0456
іucu.org
http://іucu.org
⁄ U+2044
http://www.microsoft.com⁄index.html.irongeek.com
http://www.microsoft.com⁄index.html.irongeek.org

Outlook 2010

Sent from Gmail to campus mail

Pink phishing warning that must be clicked past to use links

4th, 7th and 8th link had parse errors

Gmail

Sent from Outlook mail to Gmail

2nd and 3rd links used to have problem with ɡ (Latin small letter script G U+0261) but now work

4th link had problems with Cyrillic і (U+0456) if no http:// in front

7th and 8th link had parse errors because of ⁄ (fraction slash U+2044) and were split in two

Facebook

Seemed to render all but the fourth link as it was entered; the Punycode versions show

іucu.org without the preceding http:// gave issues. Cyrillic і (U+0456) seemed to confuse the parser

The ⁄ (fraction slash U+2044) in the last two links seems to also cause no oddities

Twitter

Twitter had the effect of rendering all of the URLs as a truncated, URL shortened (using t.co), Punycode version

Except іucu.org without the preceding http://. Again, the soft-dotted Cyrillic і (U+0456) seemed to confuse the parser.

Twitter makes it pretty obvious that there is something funny about the URLs

Fonts Matter

Calibri:
@dave_rel1k
@dave_reI1k
AΑᎪＡaаａɑα
BΒВᏴᛒＢｂbЬßʙβ
CϹСᏟⅭＣ𐒨сcϲⅽｃ

Courier New:
@dave_rel1k
@dave_reI1k
AΑᎪＡaаａɑα
BΒВᏴᛒＢｂbЬßʙβ
CϹСᏟⅭＣ𐒨сcϲⅽｃ

Ok, besides Homoglyphs?

Steganography

“Covered Writing”

Hide Text in text

Easy to detect by looking at the bytes, but may fool the human eye

Some examples look better than others, since Unicode support varies.

Can be used in Botnets:
http://www.irongeek.com/i.php?page=security/steganographic-command-and-control

Play with it here:
http://www.irongeek.com/i.php?page=security/unicode-steganography-homoglyph-encoder

Stego Examples

Alternate between Latin and Full-width Latin, easy, just add/subtract 65248 decimal. Use U+205F as space
Ｔhiｓ iｓ ｍｙｃoveｒ text ｔｏｕｓｅ． Dｏ ｙｏｕ ｔhiｎk ｉt wｉｌｌ woｒk？ Ｉ hｏpe ｔｈａｔ ｉt will.
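
A minimal sketch of the full-width trick. The bit mapping is my assumption for illustration (full-width or U+205F means 1, plain ASCII means 0), and the cover text is assumed to be printable ASCII so every character can carry a bit.

```python
OFFSET = 0xFEE0  # 65248: distance from Basic Latin to full-width forms
MMSP = "\u205f"  # medium mathematical space, stands in for a "1" space


def hide(cover: str, bits: list) -> str:
    """Hide one bit per cover character; leftover characters carry 0."""
    it = iter(bits)
    out = []
    for ch in cover:
        bit = next(it, 0)
        if bit and ch == " ":
            out.append(MMSP)
        elif bit and 0x21 <= ord(ch) <= 0x7E:
            out.append(chr(ord(ch) + OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def reveal(stego: str) -> list:
    """Read the bits back: full-width or U+205F is 1, anything else 0."""
    return [1 if ch == MMSP or 0xFF01 <= ord(ch) <= 0xFF5E else 0
            for ch in stego]


bits = [1, 0, 0, 1, 0, 1, 1, 0]
stego = hide("This is my cover text", bits)
print(stego)
```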

Use very close homoglyphs to encode single bits, skip if there are no close homoglyphs, use 8 types of space like characters (U+0020, U+2004, U+2005, U+2006, U+2008, U+2009, U+202F, U+205F) to encode 3 bits each (000,001,010,011,100,101,110,111)
Τhiѕ іѕ my cover tехt tο usе. Dο yοu thіnk іt wіll wοrk? I һοре that it will.
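
The space half of that scheme is easy to sketch: each of the eight space-like characters stands for a value from 0 to 7, in the order listed above. The function names are mine, and the homoglyph half is left out to keep this short.

```python
# The eight space-like characters, indexed by the 3 bit value they carry.
SPACES = ["\u0020", "\u2004", "\u2005", "\u2006",
          "\u2008", "\u2009", "\u202f", "\u205f"]


def hide_in_spaces(cover: str, values: list) -> str:
    """Replace each space in the cover with the one encoding a 0..7 value."""
    it = iter(values)
    out = []
    for ch in cover:
        if ch == " ":
            out.append(SPACES[next(it, 0)])  # pad with 000 when we run out
        else:
            out.append(ch)
    return "".join(out)


def reveal_from_spaces(stego: str) -> list:
    """Recover the 0..7 values from the space-like characters."""
    return [SPACES.index(ch) for ch in stego if ch in SPACES]


stego_spaces = hide_in_spaces("Do you think it will work", [5, 2, 7])
print(reveal_from_spaces(stego_spaces))
```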

Use non printable Tags in U+E0000 to U+E007F, also easy just add/subtract 0xE0000
This 󠁉is 󠁴my 󠀠cover 󠁷text 󠁯to 󠁲use. 󠁫Do 󠁥you 󠁤think 󠀿it will work? I hope that it will.

Examples:
“It worked?”
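
The tag trick is the simplest of the three to code up. A sketch, assuming the secret is plain ASCII; the function names are mine.

```python
TAG_BASE = 0xE0000  # start of the Tags block


def to_tags(secret: str) -> str:
    """Map each ASCII character to its invisible tag twin."""
    if any(ord(c) > 0x7F for c in secret):
        raise ValueError("secret must be ASCII")
    return "".join(chr(TAG_BASE + ord(c)) for c in secret)


def from_tags(text: str) -> str:
    """Pull out any tag characters and map them back to ASCII."""
    return "".join(chr(ord(c) - TAG_BASE)
                   for c in text
                   if 0xE0000 <= ord(c) <= 0xE007F)


carrier = "This " + to_tags("It worked?") + "is my cover text."
print(len(carrier), from_tags(carrier))
```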

Name Spoofing

IP Boards let me spoof Darren from Hak5’s screen name:
Darren Κitchen (U+039A Greek Capital Letter Kappa)
vs
Darren Kitchen
(Post count and admin status will give it away)

Twitter returned the error
“Invalid username! Alphanumerics only.”

Gmail/Google returned the error
“Please use only letters (a-z), numbers, and periods.” when non-ASCII characters were attempted.

More research needs to be done in these areas.

Right to left?

Josh Kelley mentioned this one to me

What about left to right mixed with right to left scripts?

Takes U+202E (Right-to-Left Override), U+202C stops it

http://irongeek.com

http://irongeek.com/moc.tfosorcim//:ptth
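
A sketch of how a string like the one above can be built, plus a crude check for it. The helper names and the list of bidi control characters to look for are my own choices, not an official list.

```python
RLO = "\u202e"  # Right-to-Left Override
PDF = "\u202c"  # Pop Directional Formatting, ends the override

# Bidi controls worth flagging in URLs and file names (my pick).
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}


def spoof(real_prefix: str, fake: str) -> str:
    """Append `fake` reversed inside an RLO run so it displays forwards."""
    return real_prefix + RLO + fake[::-1] + PDF


def has_bidi_controls(text: str) -> bool:
    """True if the text contains any explicit bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in text)


link = spoof("http://irongeek.com/", "http://microsoft.com")
print(repr(link))
```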

More details at:
http://digitalpbk.blogspot.com/2006/11/fun-with-unicode-and-mirroring.html
&
http://dl.packetstormsecurity.net/papers/general/righttoleften-override.pdf

What about file names?

Just how they are displayed

Non Visual

http://www.unicode.org/reports/tr36/

UTF-8 Exploits

Text Comparison

Buffer Overflows

Property and Character Stability

Deletion of Code Points

Secure Encoding Conversion

Enabling Lossless Conversion

Canonicalization Errors?

Remember when the full width Latin forms were turned to normal Latin in the URL bar?

< or > filtered?

What if it also tries to canonicalize similar characters like < (U+003c), >(U+003e), ‹ (U+2039), ﹤ (U+FE64), ﹥ (U+FE65) › (U+203a), ＜(U+ff1c), ＞(U+ff1e) afterwards?
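
A sketch of the filter-then-normalize problem, using NFKC as the canonicalization step. The `naive_filter` function is a made-up stand-in for whatever filter the app uses. Not every look-alike folds down: the single angle quotation marks survive NFKC, the full-width and small forms do not.

```python
import unicodedata


def naive_filter(s: str) -> str:
    """Hypothetical filter that strips only the ASCII angle brackets."""
    return s.replace("<", "").replace(">", "")


payload = "\uff1cscript\uff1e"  # full-width less-than / greater-than

filtered = naive_filter(payload)                      # nothing removed
normalized = unicodedata.normalize("NFKC", filtered)  # now plain ASCII

print(filtered == payload)
print(normalized)
```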

Other Transforms

Case changes

ß (U+00DF) upper case becomes SS

İ (U+0130) to lower case becomes i (U+0069)

ſ (U+017F) to upper becomes S (U+0053)

ẞ (U+1E9E) to lower becomes ß (U+00DF)

ı (U+0131) to upper becomes I (U+0049)

Apparently, locale matters too, French upper case may drop diacritics, Turkish handles “iIıİ” differently

http://www.w3.org/International/wiki/Case_folding
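
These are easy to poke at in Python. Its `str.upper()` and `str.lower()` use the locale-independent full case mappings, so results may differ slightly from other languages; for İ in particular the result starts with a plain i and may carry a combining dot after it. Note the length changes, which matter for fixed-size buffers and for filters that run before the case change.

```python
# (character, operation) pairs from the list above
cases = [
    ("\u00df", str.upper),  # sharp s
    ("\u0130", str.lower),  # capital I with dot above
    ("\u017f", str.upper),  # long s
    ("\u1e9e", str.lower),  # capital sharp s
    ("\u0131", str.upper),  # dotless i
]

for ch, op in cases:
    result = op(ch)
    codes = " ".join(f"U+{ord(c):04X}" for c in result)
    print(f"U+{ord(ch):04X} -> {codes} (length {len(result)})")
```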

UTF-8 Exploits

Overly long encoding, will it bypass filters?

< = 3C = 00111100
11000000 10111100 = C0 BC

> = 3E = 00111110
11000000 10111110 = C0 BE

a1 13 a1 03 a1 12 a1 09 a1 10 a1 14

MS00-057 Was this Problem, but with ../
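
A sketch of why overly long forms are a problem. The `naive_decode_2byte` function is a deliberately sloppy decoder of my own that skips the shortest-form check; Python's real UTF-8 codec refuses these bytes.

```python
def naive_decode_2byte(b1: int, b2: int) -> str:
    """Sloppy 2 byte UTF-8 decode: no check for the shortest form."""
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))


def strict_rejects(data: bytes) -> bool:
    """True if Python's UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong_lt = bytes([0xC0, 0xBC])  # overly long form of "<"
overlong_gt = bytes([0xC0, 0xBE])  # overly long form of ">"

print(naive_decode_2byte(*overlong_lt), naive_decode_2byte(*overlong_gt))
print(strict_rejects(overlong_lt))
```

A filter that looks for the byte 3C before decoding never sees it, and a sloppy decoder afterwards hands back the bracket anyway.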

Text Comparison
(Normalization)

Various characters have both their own code point, and can be made with “Combining” characters

Diacritical marks too: A (U+0041) followed by the combining grave accent U+0300 renders as À, but À is also the single code point U+00C0

We want text searches to treat these as equivalent

NFKC - Normalization Form Compatibility Composition

"Ⓓⓔⓛⓔⓣⓔ" into "Delete" ("delete" once lower cased).

International Phonetic Alphabet has examples in U+0300 to U+036F. Even more in U+1DC0 to U+1DFF
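
The normalization forms are available in Python's `unicodedata` module, so the examples from this slide can be tried directly. Escapes are used for the circled letters so the code points are unambiguous; NFKC by itself keeps the capital D.

```python
import unicodedata

decomposed = "A\u0300"  # A followed by combining grave accent
precomposed = "\u00c0"  # the single code point for the same letter

# Raw comparison fails even though they render the same.
print(decomposed == precomposed)

# After normalizing both sides to the same form they compare equal.
nfc_equal = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))
print(nfc_equal)

# Compatibility normalization folds circled letters to plain ones.
circled = "\u24b9\u24d4\u24db\u24d4\u24e3\u24d4"
folded = unicodedata.normalize("NFKC", circled)
print(folded, folded.lower())
```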

Real-life Example: Spotify

The canonical_username function was not “idempotent” (running it a second time changed the result again); a function like “toLower” would be.

A user signs up with username IronGeek, normalized to irongeek

Another user signs up as ᴵᴿᴼᴺᴳᴱᴱᴷ (U+1D35 U+1D3F U+1D3C U+1D3A U+1D33 U+1D31 U+1D31 U+1D37 in Phonetic Extensions block)
Which also gets normalized to IRONGEEK the first time, but irongeek the next time.

ᴵᴿᴼᴺᴳᴱᴱᴷ requests a password reset email, and with it can reset IronGeek’s account

Full story here:
http://labs.spotify.com/2013/06/18/creative-usernames/
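
To see what a non-idempotent canonicalization looks like, here is a toy version. This is NOT Spotify's code, just a hypothetical function of mine that makes the same kind of mistake by lower casing before the compatibility normalization instead of after.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Hypothetical buggy canonicalizer: lower case first, then NFKC."""
    return unicodedata.normalize("NFKC", name.lower())


def safer_username(name: str) -> str:
    """One possible repair: normalize, lower case, normalize again."""
    step = unicodedata.normalize("NFKC", name).lower()
    return unicodedata.normalize("NFKC", step)


# The modifier-letter name from above, as escapes.
fancy = "\u1d35\u1d3f\u1d3c\u1d3a\u1d33\u1d31\u1d31\u1d37"

once = canonical_username(fancy)
twice = canonical_username(once)
print(once, twice)
print(safer_username(fancy), safer_username(safer_username(fancy)))
```

The modifier letters have no lower case mapping, so the first pass only unfolds them to capitals; the second pass then lands on the victim's name. The repaired version is stable for this input, but I have not checked it against every code point.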

Thwart Searches/Obscenity Filters

What if you want to be public, but hard to search for?

What if you want to search for filtered words?

Classic example, no Unicode needed: pr0n

Porn != Pοrn != Pоrn

o=U+006f, ο=U+03bf, о=U+043e

Latin Small o, Greek Small Omicron, Cyrillic Small Letter o

Searches for the above turn up different results in Google

Some items with mixed scripts just get flagged as spam

Just plain fun too

Buffer Overflows

Some expand out

Complexities With Buffer Overflows

Try to overwrite EIP with 0x41414141, you get 0x00410041

Chris Anley came up with “Venetian Shellcode”

Links:
http://www.ngssoftware.com/papers/unicodebo.pdf
https://www.corelan.be/index.php/2009/11/06/exploit-writing-tutorial-part-7-unicode-from-0x00410041-to-calc/

FX of Phenoelit also did some work on this
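
The 0x00410041 value mentioned above falls straight out of the ASCII to UTF-16 conversion. A quick sketch, assuming the vulnerable app widens the input to little endian UTF-16 before it reaches the overflowed buffer:

```python
payload = b"AAAA"  # what the attacker sends

# What lands in memory after the app converts the input to UTF-16LE.
wide = payload.decode("ascii").encode("utf-16-le")

# Read the first four bytes the way a 32 bit little endian CPU would.
first_dword = int.from_bytes(wide[:4], "little")

print(wide.hex(" "))
print(hex(first_dword))
```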

Fuzzing

Suggestions:

Combining Diacritics

Invisible Characters

Malformed UTF-8

Bad Surrogate Pairs

Multiple levels of RTL, LTR reversing

Chris Weber’s Blog:
http://web.lookout.net/2011/06/special-unicode-characters-for-error.html

In recent news, Apple's CoreText API Bug:
سمَـَّوُوُحخ ̷̴̐خ ̷̴̐خ ̷̴̐خ امارتيخ ̷̴̐خ
http://arstechnica.com/apple/2013/08/rendering-bug-crashes-os-x-and-ios-apps-with-string-of-arabic-characters/
&
MS13-060 Vulnerability in Unicode Scripts Processor Could Allow Remote Code Execution (2850869)

Big Thanks

J. Abolins
@jabolins

Chris Weber
@w3be http://www.casaba.com

Michal Zalewski
@lcamtuf http://nostarch.com/tangledweb

William Coppola
@SubINacls

Useful Sites

Unicode Security Considerations
http://unicode.org/reports/tr36/

Unicode Security Mechanisms
http://www.unicode.org/reports/tr39/

Unicode Converter
http://www.rishida.net/tools/conversion/

Unicode Character Info and List
http://www.fileformat.info/

Homoglyph Attack Generator
http://www.irongeek.com/homoglyph-attack-generator.php

Unicode-HAX
https://github.com/cweb/unicode-hax

OWASP XSS Filter Evasion Cheat Sheet
https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet

Fun

Unicode “Fonts”
http://www.panix.com/~eli/unicode/convert.cgi

Other Fun
http://txtn.us

Art

Hands are based on
http://www.newthinktank.com/2010/10/cartoon-hands/

References

A. Costello, March 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3492.txt

J. Abolins, December 2010. [Online]. Available: http://www.irongeek.com/i.php?page=videos/dojocon-2010-videos#Internationalized%20Domain%20Names%20&%20Investigations%20in%20the%20Networked%20World

M. Zalewski, The Tangled Web: A Guide to Securing Modern Web Applications, 1st ed., No Starch Press, 2011.

E. Gabrilovich and A. Gontmakher, "The Homograph Attack," Communications of the ACM , vol. 45, no. 2, 2002.

V. Krammer, "Phishing defense against IDN address spoofing attacks," in Proceedings of the 2006 International Conference on Privacy, Security and Trust: Bridge the Gap Between PST Technologies and Business Services , New York, NY, USA, 2006

E. Johanson, "The state of homograph attacks," 2005. [Online]. Available: http://www.shmoo.com/idn/. [Accessed 24 4 2012].

D. Kennedy. [Online]. Available: http://www.secmaniac.com/download/

A. Crenshaw, 2012. [Online]. Available: http://www.irongeek.com/homoglyph-attack-generator.php

phlyLabs, 2012. [Online]. Available: http://phlymail.com

Microsoft, September 2006. [Online]. Available: http://msdn.microsoft.com/en-us/library/bb250505%28VS.85%29.aspx

Chromium Project, [Online]. Available: http://www.chromium.org/developers/design-documents/idn-in-google-chrome

C. Weber, July 2009. [Online]. Available: http://www.blackhat.com/presentations/bh-usa-09/WEBER/BHUSA09-Weber-UnicodeSecurityPreview-SLIDES.pdf.

C. Weber, seems to be longer version of presentation above http://www.casaba.com/files/Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf

C. Weber, July 2009. [Online]. Available: http://www.blackhat.com/presentations/bh-usa-09/WEBER/BHUSA09-Weber-UnicodeSecurityPreview-PAPER.pdf

A. Crenshaw, "Steganographic Command and Control: Building a communication channel that withstands hostile scrutiny," 2010. [Online]. Available: http://www.irongeek.com/i.php?page=security/steganographic-command-and-control [Accessed 23rd April 2012]

Events

Derbycon
Sept 25th-29th 2013
http://www.derbycon.com

Others

Questions?

Twitter: @Irongeek_ADC
