
Unicode Security Notes Page

Download Talk Slides

Corrupt Unicode Example

Scrap Unicode Notes

Every Unicode Character Blob Page or TXT file
Every Unicode Character 80 Column Page or TXT file
Every Unicode Character With Hex Page or TXT file

Text below is to help with search indexing and copy and pasting, but it is missing some items from the Power Point slides.

Character Assassination:
Fun and games with Unicode

Adrian Crenshaw

About Adrian

I run Irongeek.com

I have an interest in InfoSec education

I don’t know everything - I’m just a geek with time on my hands

Sr. Information Security Engineer at a Fortune 1000

Co-Founder of Derbycon

To be clear concerning what this talk is about

Why this subject?

Lots of research has been done, but not many people talk about it

Complexity is the damnable enemy of security, but human language is complex so what can you do?

Act as a setup for future research

To encourage others who are better at exploit development than me to look into it

Because I wanted to make an animation with cartoon letters stabbing each other

Why Unicode

There are more than just English speakers out there

ASCII: American Standard Code for Information Interchange

What about other languages? Cyrillic, Chinese, Hebrew, Arabic, Klingon… ( ok, sort of http://wazu.jp/gallery/Test_Klingon.html )

Unicode lets computer systems support more languages, allowing for world wide use

Unicode History

ASCII is 7 bit and has just 95 printable characters, but an 8th bit was added to make other standards:

Extended ASCII

ISO/IEC 8859

ISO/IEC 8859 uses the eighth bit to add another 96 printable characters, plus more control characters

You have to specify a part/character set/language to specify those 96

This still was not enough, and did not allow for a lot of mixed languages

The need was to represent all of the characters as unique code points, and not get confused amongst languages

Unicode History

Joe Becker (Xerox), Lee Collins & Mark Davis (Apple) started working on Unicode in 1987 to do this, version 1.0.0 released in Oct 1991

Unicode started as a 16bit character model (0x0-0xFFFF), with the first 256 code points the same as ISO-8859-1

Each character has a code point associated with it:
A = U+0041          $ = U+0024          ♞ = U+265E

This has since been expanded, so Unicode has points from 0x0 to 0x10FFFF (1,114,112 points dec), though support varies

Most used points will be in Basic Multilingual Plane (BMP) represented as U+0000 to U+FFFF


UTF-8 (UCS Transformation Format 8-bit), meant to be backward compatible with ASCII

UTF-16 (Unicode Transformation Format 16-bit) which superseded UCS-2

UTF-32 (Unicode Transformation Format 32-bit )

BOM (Byte Order Marks)

UTF-8 prepends EFBBBF to data

UTF-16 FEFF Unicode Big Endian, FFFE Little Endian

UTF-32 generally does not use one

Encoding Examples

AΩB (Omega is U+03A9)

UTF-8: 41 CE A9 42

UTF-16 (big endian): 00 41 03 A9 00 42

UTF-32 (big endian): 00 00 00 41 00 00 03 A9 00 00 00 42
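The three byte sequences above can be reproduced with a short Python sketch. The sample string "AΩB" is inferred from the bytes themselves, and the big-endian codecs are used here only so that no BOM gets prepended.

```python
# Encode "A", Omega (U+03A9), "B" in the three Unicode encoding forms.
text = "A\u03a9B"


def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


utf8 = hex_bytes(text.encode("utf-8"))
utf16 = hex_bytes(text.encode("utf-16-be"))  # big endian, no BOM
utf32 = hex_bytes(text.encode("utf-32-be"))  # big endian, no BOM

print("UTF-8 :", utf8)
print("UTF-16:", utf16)
print("UTF-32:", utf32)
```

Only the Omega changes width between encodings: two bytes in UTF-8 and UTF-16, four in UTF-32.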

I hate Smart Quotes!

“Smart” "Not so smart" Smart when dumb Why?

Microsoft extended ISO 8859-1, making some control characters in 80 to 9F printable for Windows-1252

“   ”    ‚     ‘    ’    —
93 94 82 91 92 97

If Windows-1252 is confused for ISO 8859-1, you get unprintable C1 control characters for these characters

Makes copying and pasting commands in tutorials a pain!
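A quick way to see the Windows-1252 versus ISO 8859-1 confusion is to decode the same bytes both ways. This is only a sketch; the word "Smart" is a sample, and the byte values 93 and 94 come from the table above.

```python
import unicodedata

# 0x93 and 0x94 are the curly double quotes in Windows-1252.
raw = bytes([0x93]) + b"Smart" + bytes([0x94])

as_cp1252 = raw.decode("cp1252")    # curly quotes around the word
as_latin1 = raw.decode("latin-1")   # same bytes read as ISO 8859-1

# In ISO 8859-1 the range 0x80-0x9F holds control characters (category Cc),
# so the "quotes" become invisible or show up as boxes.
first_category = unicodedata.category(as_latin1[0])

print(ascii(as_cp1252))
print(ascii(as_latin1), first_category)
```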

Some Email J

UTF-8 Encoding

Lower ASCII is the same in UTF-8; higher code points use a lead byte followed by continuation bytes (table bogarted from Wikipedia)
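Since the table itself did not survive in these notes, here is a minimal sketch of the standard UTF-8 bit layout in Python. The function name utf8_encode is made up for illustration, and the sketch skips validation (it does not reject surrogates or values above U+10FFFF), so it is not a replacement for the built-in codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the standard UTF-8 bit layout."""
    if cp < 0x80:          # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's own encoder for a few sample code points.
for cp in (0x41, 0x3A9, 0x270C, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), chr(cp).encode("utf-8").hex())
```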

UTF-16 Encoding

In UTF-16 U+10000 to U+10FFFF use surrogate pairs in range 0xD800 to 0xDFFF

based on: http://en.wikipedia.org/wiki/UTF-16

    1. 0x10000 is subtracted from the code point, leaving a 20 bit number in the range 0..0xFFFFF.
    2. The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF.
    3. The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF (previous versions of the Unicode Standard referred to these as low surrogates).
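The three steps above can be sketched in a few lines of Python. The function name to_surrogates and the sample code point U+1F600 are chosen here for illustration only.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (lead, trail) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need surrogate pairs")
    value = cp - 0x10000                 # step 1: 20-bit number
    lead = 0xD800 + (value >> 10)        # step 2: top ten bits
    trail = 0xDC00 + (value & 0x3FF)     # step 3: low ten bits
    return lead, trail


lead, trail = to_surrogates(0x1F600)
print(f"{lead:04X} {trail:04X}")

# Cross-check against Python's UTF-16 encoder (big endian, no BOM).
builtin = chr(0x1F600).encode("utf-16-be")
```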


Mojibake = "character" + "transform"


Code Points:

U+0041 U+03a9 U+0042 U+270C U+0043

UTF-8 byte string:

EF BB BF 41 CE A9 42 E2 9C 8C 43

Mangled by reading as just ISO 8859-1 bytes:
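The mangling can be reproduced in Python by decoding the UTF-8 byte string above with the wrong codec. This is a sketch: "latin-1" is Python's name for ISO 8859-1, and "utf-8-sig" is used for the correct decode so the BOM is stripped.

```python
# The UTF-8 byte string from above, including the EF BB BF BOM.
utf8_bytes = bytes.fromhex("EF BB BF 41 CE A9 42 E2 9C 8C 43")

correct = utf8_bytes.decode("utf-8-sig")  # BOM removed, 5 characters
mangled = utf8_bytes.decode("latin-1")    # one character per byte, 11 total

# ascii() keeps the output printable on any console.
print(ascii(correct))
print(ascii(mangled))

# Nothing is lost: re-encoding the mojibake and decoding it properly
# recovers the original text.
recovered = mangled.encode("latin-1").decode("utf-8-sig")
```

Read as ISO 8859-1, the BOM turns into the familiar "ï»¿" prefix and each multi-byte character becomes two or three unrelated characters.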


Find Your Character

Wikipedia List

Unicode Table

File Format

Unicode Code Converter v7.05

Typing Unicode


Alt, + key on keypad, type hex number

May have to edit HKEY_CURRENT_USER\Control Panel\Input Method and set EnableHexNumpad to "1".
Help from http://www.fileformat.info/tip/microsoft/enter_unicode.htm


Option+Command+t will let you select some

System Preferences ->Language & Text->Input Sources

Enable “Unicode Hex Input”

Select U+ from the menu bar

Hold Option Key, type in Hex code

Obligatory XKCD Slide


Homoglyph/Visual Attacks

Confusables and Look-a-likes

Classic Phishing Obfuscations

Would you follow a link in email to AdriansHouseOfPwnage.com?

Text says one thing, link says another:
<a href="http://irongeek.com">http://www.microsoft.com</a>

Confuse user with credentials section of a URL:

Firefox pops up a warning

IE just refuses to connect

Other ideas?
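To illustrate the credentials trick, here is a small Python sketch using urllib.parse. The example URL is hypothetical (the slide's own example is not in these notes); it reuses the two domains from the link example above.

```python
from urllib.parse import urlsplit

# Hypothetical URL: everything before the "@" is the credentials (userinfo)
# part, so the host actually contacted is the one after it.
url = "http://www.microsoft.com@irongeek.com/"
parts = urlsplit(url)

shown_first = parts.username   # what the eye reads first
real_host = parts.hostname     # where the browser would really connect

print("looks like:", shown_first)
print("goes to   :", real_host)
```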


Homographs = words that look the same

Homoglyphs = characters that look the same


rnicrosoft.com vs. microsoft.com

paypa1.com vs. paypal.com


Now, what about Unicode?

Problem: DNS is ASCII

DNS labels (the parts separated by dots) follow the LDH rule: Letters, Digits, Hyphen




This would not allow for international characters in DNS labels

Enter Punycode and IDNA


Internationalized Domain Names in Applications (IDNA) allows non-ASCII characters in the host section of a URL to map to DNS host names

café.com = xn--caf-dma.com

北京大学.中國 = xn--1lq90ic7fzpc.xn--fiqz9s
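Python's built-in "idna" codec can do this mapping, which makes for an easy way to experiment. A sketch only: the built-in codec implements the older IDNA 2003 rules, and the helper names to_ace and from_ace are made up here.

```python
def to_ace(hostname: str) -> str:
    """Convert a Unicode host name to its ASCII (xn--) form."""
    return hostname.encode("idna").decode("ascii")


def from_ace(hostname: str) -> str:
    """Convert an ASCII (xn--) host name back to Unicode."""
    return hostname.encode("ascii").decode("idna")


ace = to_ace("caf\u00e9.com")   # "café.com"
print(ace)
print(ascii(from_ace(ace)))
```

Each label is converted on its own, and pure ASCII labels such as "com" pass through unchanged.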

What about Homoglyphs in Unicode?

There are homoglyphs in Unicode that look the same as normal Latin characters, and these could be used for spoofing names, examples:

googlе.com = xn--googl-3we.com
(е is a Cyrillic small letter ie U+0435)

іucu.org  = xn--ucu-ihd.org
(і is a Cyrillic small letter Byelorussian-Ukrainian і U+0456)

pаypal.com  = xn--pypal-4ve.com
(2nd а is Cyrillic small letter a U+0430)
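The spoofed labels above can be checked with Python's raw "punycode" codec, and unicodedata will name the odd character out. This is a sketch; punycode_label and non_ascii_chars are illustrative helper names, and no IDNA preprocessing (case folding, normalization) is applied.

```python
import unicodedata


def punycode_label(label: str) -> str:
    """Raw Punycode of a single label, with the xn-- prefix added."""
    return "xn--" + label.encode("punycode").decode("ascii")


def non_ascii_chars(label: str) -> list:
    """List (code point, Unicode name) for every non-ASCII character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch))
            for ch in label if ord(ch) > 0x7F]


spoofs = ["googl\u0435", "\u0456ucu", "p\u0430ypal"]
for label in spoofs:
    print(punycode_label(label), non_ascii_chars(label))
```

Listing the non-ASCII characters by name is a cheap way to spot a single Cyrillic letter hiding in an otherwise Latin label.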

Likely Sources for Homoglyphs

Cyrillic script: a, c, e, o, p, x and y

Latin alphabet appears twice, U+0021-007E (Basic Latin) & U+FF01-FF5E (Full width Latin):

Even some slashes
/ (U+002f),  ̸ (U+0338), ⁄ (U+2044), ∕ (U+2215),
╱ (U+2571), ／ (U+ff0f), ﾉ (U+ff89)


Can other domains be used?

Slash is U+2044

Mouse over it

Homoglyph Attack Generator


Combination of JavaScript and PHP libraries created by phlyLabs as part of phlyMail

Protections Implemented by Browsers

Firefox shows Punycode if

Not in TLD White List (about:config→network.IDN.whitelist)
.ac, .ar, .asia, .at, .biz, .br, .cat, .ch, .cl, .cn, .de, .dk, .ee, .es, .fi, .gr, .hu, .il, .info, .io, .ir, .is, .jp, .kr, .li, .lt, .lu, .lv, .museum, .no, .nu, .nz, .org, .pl, .pr, .se, .sh, .si, .tel, .th, .tm, .tw, .ua, .vn, .xn--0zwm56d, .xn--11b5bs3a9aj6g, .xn--80akhbyknj4f, .xn--90a3ac, .xn--9t4b11yi5a, .xn--deba0ad, .xn--fiqs8s, .xn--fiqz9s, .xn--fzc2c9e2c, .xn--g6w251d, .xn--hgbk6aj7f53bba, .xn--hlcj6aya9esc7a, .xn--j6w193g, .xn--jxalpdlp, .xn--kgbechtv, .xn--kprw13d, .xn--kpry57d, .xn--mgba3a4f16a, .xn--mgba3a4fra, .xn--mgbaam7a8h, .xn--mgbayh7gpa, .xn--mgberp4a5d4a87g, .xn--mgberp4a5d4ar, .xn--mgbqly7c0a67fbc, .xn--mgbqly7cvafr, .xn--o3cw4h, .xn--ogbpf8fl, .xn--p1ai, .xn--wgbh1c, .xn--wgbl6a, .xn--xkc2al3hye2a, .xn--zckzah

network.IDN_show_punycode set to true (default false)

Any of these blacklisted characters appear:
 ¼½¾ǃː̷̸։׃״؉؊٪۔܁܂܃܄ᅟᅠ᜵           ​․‧

 ‹›⁄⁒ ⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟∕∶⎮⧶⧸⫻⫽⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻ 。〔〕〳㈝㈞㎮㎯㏆︔︕︿﹝﹞./。ᅠ

Updated at

Protections Implemented by Browsers

IE 9 (and, I assume, 10) shows Punycode if

If there is a mismatch between the characters used in the URL and the language expectation

If character is not used in any language

Mixed set of scripts that do not belong together

Info may be out of date, most material references IE 7

Protections Implemented by Browsers

Chrome shows Punycode if

 Configured language of the browser (configured in the “Fonts and Languages” options) does not match

 Incompatible set of scripts that do not belong

But there is a whitelist, so hard-to-confuse script combinations like Latin with Chinese can be used

 Characters in a black list

Defenses by Registrar

Registrars may not allow the character

For example, one registrar gave the following error when an attempt was made to register іucu.org (Cyrillic small letter Byelorussian-Ukrainian i U+0456): 
“Error: You used an invalid international character! Please note that for some reason .org and .info only support Danish, German, Hungarian, Icelandic, Korean, Latvian, Lithuanian, Polish, Spanish, and Swedish international characters.”

May be gotten around by using "/" homoglyphs in a subdomain of a domain you already own; Katakana Letter No (U+30ce) seems to work best


  1. How different browsers show the Punycode in the URL bar.
  2. How different mail systems show the URL when email is displayed.
  3. How social networks render the URL.

Used domain we control, and Local Hosts file to map the DNS entries

IE 10.0.8

FireFox 23.0.1

Chrome 28.0.1500.95 mg

Some Results

Other odd balls

іucu.org [xn--ucu-ihd.org](і U+0456 ) could not be registered

These seemed to pass Registrar’s tests
Íucu.org [xn--ucu-2ia.org](Latin capital letter i with acute Í U+00CD)
íucu.org [xn--ucu-qma.org](Latin small letter i with acute í U+00ED)
įucu.org [xn--ucu-9ta.org](Latin small letter i with ogonek į U+012F)

Katakana Letter No (U+30ce) seems to work in Firefox for subdomain trick, but not in Chrome or IE

Display of IDNA in Web Apps

What does the webapp display?

How does it parse links?

Test Strings

Ω U+03A9
ɡ U+0261
і U+0456
⁄ U+2044

Outlook 2010

Sent from Gmail to campus mail

Pink phishing warning that must be clicked past to use links

4th, 7th and 8th link had parse errors


Sent from Outlook mail to Gmail

2nd and 3rd links used to have problem with ɡ (Latin small letter script G U+0261) but now work

4th link had problems with Cyrillic і (U+0456) if no http:// in front

7th and 8th link had parse errors because of ⁄ (fraction slash U+2044)  and were split in two


Seemed to render all but the fourth link as it was input. Punycode versions show

іucu.org without the preceding http:// gave issues. Cyrillic і (U+0456) seemed to confuse the parser

The ⁄ (fraction slash U+2044) in the last two links seems to also cause no oddities


Twitter had the effect of rendering all of the URLs as a truncated, URL shortened (using t.co), Punycode version

 Except іucu.org without the preceding http://. Again, the soft-dotted Cyrillic і (U+0456) seemed to confuse the parser.

Twitter makes it pretty obvious that there is something funny about the URLs

Fonts Matter


Courier New:

Ok, besides Homoglyphs?


“Covered Writing”

Hide Text in text

Easy to detect by looking at the bytes, but may fool the human eye

Some examples look better than others, as Unicode support varies.

Can be used in Botnets:

Play with it here:

Stego Examples

Alternate between Latin and Full-width Latin, easy, just add/subtract 65248 decimal. Use U+205F as space
hi imy ove text to use. Dyouhik t will wok hpe thatt will.

Use very close homoglyphs to encode single bits, skip if there are no close homoglyphs, use 8 types of space like characters (U+0020, U+2004, U+2005, U+2006, U+2008, U+2009, U+202F, U+205F) to encode 3 bits each (000,001,010,011,100,101,110,111)
Τhiѕ іѕ my cover tехt tο usе. Dο yοu thіnk іt wіll wοrk? I һοре that it will.

Use non printable Tags in U+E0000 to U+E007F, also easy just add/subtract 0xE0000
This 󠁉is 󠁴my 󠀠cover 󠁷text 󠁯to 󠁲use. 󠁫Do 󠁥you 󠁤think 󠀿it will work? I hope that it will.

“It worked?”
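The Tags approach is simple enough to sketch in Python. This is not the code behind the examples above, just an illustration of the add/subtract 0xE0000 idea; the names hide and reveal are made up, and one hidden character is tucked in after each space of the cover text.

```python
TAG_BASE = 0xE0000  # start of the Tags block (U+E0000..U+E007F)


def hide(cover: str, secret: str) -> str:
    """Insert one invisible tag character after each space of the cover."""
    words = cover.split(" ")
    if len(secret) > len(words) - 1:
        raise ValueError("cover text has too few spaces for the secret")
    if any(ord(ch) > 0x7F for ch in secret):
        raise ValueError("secret must be plain ASCII")
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i < len(words) - 1:
            out.append(" ")
            if i < len(secret):
                out.append(chr(TAG_BASE + ord(secret[i])))
    return "".join(out)


def is_tag(ch: str) -> bool:
    return TAG_BASE <= ord(ch) <= TAG_BASE + 0x7F


def reveal(stego: str) -> str:
    """Collect the tag characters and subtract 0xE0000 again."""
    return "".join(chr(ord(ch) - TAG_BASE) for ch in stego if is_tag(ch))


cover = ("This is my cover text to use. Do you think it will work? "
         "I hope that it will.")
stego = hide(cover, "It worked?")
visible = "".join(ch for ch in stego if not is_tag(ch))
print(reveal(stego))
```

As noted above, this is trivial to detect in the bytes: the stego text is longer than the cover even though it renders the same where tags are not displayed.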

Name Spoofing

IP Boards let me spoof Darren from Hak5’s screen name:
Darren Κitchen (U+039A Greek Capital Letter Kappa)
Darren Kitchen
(Post count and admin status will give it away)

Twitter returned the error
“Invalid username! Alphanumerics only.”

Gmail/Google returned the error
“Please use only letters (a-z), numbers, and periods.” when non-ASCII characters were attempted.

More research needs to be done in these areas.

Right to left?

Josh Kelley mentioned this one to me

What about left to right mixed with right to left scripts?

Takes U+202E (Right-to-Left Override), U+202C stops it



More details at:

What about file names?

Just how they are displayed
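A sketch of the file name trick and one way to flag it, in Python. The file name here is hypothetical, and the set of characters checked is an assumption about which directional controls are worth flagging, not an official list.

```python
# Directional formatting characters: embeddings, overrides, isolates, marks.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
    "\u200e", "\u200f",
}


def has_bidi_controls(name: str) -> bool:
    """True if the name contains any directional control character."""
    return any(ch in BIDI_CONTROLS for ch in name)


def real_extension(name: str) -> str:
    """The extension as the file system sees it (text after the last dot)."""
    return name.rsplit(".", 1)[-1]


# Hypothetical name: after U+202E the rest is displayed right to left,
# so a bidi-aware file manager shows something like "evilexe.txt".
name = "evil\u202etxt.exe"

print(ascii(name), has_bidi_controls(name), real_extension(name))
```

The bytes and the real extension never change; only the display order does.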

Non Visual


UTF-8 Exploits

Text Comparison

Buffer Overflows

Property and Character Stability

Deletion of Code Points

Secure Encoding Conversion

Enabling Lossless Conversion

Canonicalization Errors?

Remember when the full width Latin forms were turned to normal Latin in the URL bar? 

< or > filtered?

What if it also tries to canonicalize similar characters like < (U+003c), > (U+003e), ‹ (U+2039), › (U+203a), ﹤ (U+FE64), ﹥ (U+FE65), ＜ (U+ff1c), ＞ (U+ff1e) afterwards?
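The order-of-operations problem can be shown in a few lines of Python. This is a sketch with a deliberately naive filter (naive_filter is a made-up name); NFKC normalization stands in for whatever canonicalization a real application might perform after filtering.

```python
import unicodedata


def naive_filter(text: str) -> str:
    """Strip only the ASCII angle brackets."""
    return text.replace("<", "").replace(">", "")


# Full-width less-than / greater-than around the word "script".
payload = "\uff1cscript\uff1e"

filtered = naive_filter(payload)                       # nothing removed
normalized = unicodedata.normalize("NFKC", filtered)   # brackets appear

print(ascii(filtered))
print(normalized)

# Not every look-alike folds: the small form U+FE64 does, but the
# single angle quotation mark U+2039 has no compatibility mapping.
small_lt = unicodedata.normalize("NFKC", "\ufe64")
angle_quote = unicodedata.normalize("NFKC", "\u2039")
```

Normalizing first and filtering second closes this particular gap.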

Other Transforms

Case changes

ß (U+00DF) upper case becomes SS

İ (U+0130) to lower case becomes i (U+0069)

ſ (U+017F) to upper becomes S (U+0053)

ẞ (U+1E9E) to lower becomes ß (U+00DF)

ı (U+0131) to upper becomes I (U+0049)

Apparently, locale matters too, French upper case may drop diacritics, Turkish handles “iIıİ” differently
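Python's str.upper() and str.lower() follow the default (locale-independent) Unicode case mappings, so most of the list above can be checked directly. A sketch; other languages and libraries may behave differently, and Turkish-specific rules are not applied here.

```python
# (character, operation) pairs taken from the list above.
sharp_s_upper = "\u00df".upper()      # sharp s grows to two letters
long_s_upper = "\u017f".upper()       # long s becomes plain S
cap_sharp_s_lower = "\u1e9e".lower()  # capital sharp s becomes sharp s
dotless_i_upper = "\u0131".upper()    # dotless i becomes plain I

# Capital I with dot above: Python's full case mapping keeps the dot as a
# separate combining character (U+0307) after the plain "i".
dotted_i_lower = "\u0130".lower()

for value in (sharp_s_upper, long_s_upper, cap_sharp_s_lower,
              dotless_i_upper, dotted_i_lower):
    print(ascii(value), len(value))
```

Note that case changes can alter the length of a string, which matters for fixed-size buffers and for comparisons done before versus after the change.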


UTF-8 Exploits

Overly long encoding, will it bypass filters?

< = 3C = 00111100
11000000 10111100 = C0 BC

> = 3E = 00111110
11000000 10111110 = C0 BE

a1 13 a1 03 a1 12 a1 09 a1 10 a1 14

MS00-057 Was this Problem, but with ../
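A Python sketch of why the overlong form is dangerous: a decoder that only does the bit arithmetic happily produces "<" from C0 BC, while a strict decoder refuses it. The function lenient_decode_2byte is a made-up stand-in for a sloppy decoder, not code from any real product.

```python
def lenient_decode_2byte(b0: int, b1: int) -> str:
    """Do the 2-byte UTF-8 bit arithmetic without checking for overlongs."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


sloppy_lt = lenient_decode_2byte(0xC0, 0xBC)
sloppy_gt = lenient_decode_2byte(0xC0, 0xBE)
print(sloppy_lt, sloppy_gt)

# Python's own decoder is strict and rejects the overlong sequence.
try:
    bytes([0xC0, 0xBC]).decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
print("strict decoder rejects C0 BC:", strict_rejects)
```

A filter that inspects the raw bytes for 3C never sees it, yet a lenient decoder further down the line recreates the character.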

Text Comparison

Various characters have both their own code point, and can be made with “Combining” characters

 Diacritical marks also A (U+0041) next to U+0300 = À but  À is also U+00C0

We want text searches to be equivalent,

NFKC - Normalization Form Compatibility Composition

"Ⓓⓔⓛⓔⓣⓔ" into "Delete".

International Phonetic Alphabet has examples in U+0300 to U+036F. Even more in U+1DC0 to U+1DFF
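Both cases can be tried with Python's unicodedata module. A sketch using the examples above: composed versus combining À, and the circled letters under NFKC.

```python
import unicodedata

combining = "A\u0300"    # A followed by combining grave accent
precomposed = "\u00c0"   # single code point for the same letter

# The raw strings differ, so a naive comparison or search fails.
raw_equal = combining == precomposed

# NFC composes, NFD decomposes; either one makes the two forms comparable.
nfc_equal = unicodedata.normalize("NFC", combining) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == combining

# NFKC also folds compatibility characters such as circled letters.
circled = "Ⓓⓔⓛⓔⓣⓔ"
folded = unicodedata.normalize("NFKC", circled)

print(raw_equal, nfc_equal, nfd_equal, folded)
```

NFKC keeps case, so a case-insensitive match still needs a separate case-folding step after normalization.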

Real-life Example: Spotify

The canonical_username function was not “idempotent” (applying it a second time could change the result again). A function like “toLower” would be.

A user signs up with username IronGeek, normalized to irongeek

Another user signs up as ᴵᴿᴼᴺᴳᴱᴱᴷ (U+1D35 U+1D3F U+1D3C U+1D3A U+1D33 U+1D31 U+1D31 U+1D37 in Phonetic Extensions block)
Which also gets normalized to IRONGEEK the first time, but irongeek the next time.

ᴵᴿᴼᴺᴳᴱᴱᴷ requests a password reset email, and with it can reset IronGeek’s account

Full story here:
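The effect can be imitated in Python. To be clear, this is not Spotify's code: canonical_username_sketch is a hypothetical stand-in that lower-cases first and applies NFKC second, which happens to reproduce the same "different on the second pass" behavior.

```python
import unicodedata


def canonical_username_sketch(name: str) -> str:
    """Hypothetical stand-in: lower-case, then NFKC. Not idempotent."""
    return unicodedata.normalize("NFKC", name.lower())


# Modifier letters from the Phonetic Extensions block. They have no
# lower-case mapping, but NFKC folds them to plain capital letters.
attacker = "\u1d35\u1d3f\u1d3c\u1d3a\u1d33\u1d31\u1d31\u1d37"

once = canonical_username_sketch(attacker)    # first pass
twice = canonical_username_sketch(once)       # second pass differs
victim = canonical_username_sketch("IronGeek")

print(once, twice, victim)
```

After one pass the attacker's name looks distinct from the victim's, so sign-up succeeds; after a second pass the two collide.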

Thwart Searches/Obscenity Filters

What if you want to be public, but hard to search for?

What if you want to search for filtered words?

Classic example, no Unicode needed: pr0n

Porn != Pοrn != Pоrn

o=U+006f, ο=U+03bf, о=U+043e

Latin Small o, Greek Small Omicron, Cyrillic Small Letter o

Searches for the above turn up different results in Google

Some items with mixed scripts just get flagged as spam

Just plain fun too

Buffer Overflows

Some expand out

Complexities With Buffer Overflows

Try to overwrite EIP with 0x41414141, you get 0x00410041

Chris Anley came up with “Venetian Shellcode”


FX of Phenoelit also did some work on this
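The 0x00410041 value falls straight out of how UTF-16 stores "A". A small Python sketch, assuming a little-endian (x86) target and UTF-16LE as the wide-character format:

```python
import struct

payload = "AAAA"

narrow = payload.encode("ascii")     # 41 41 41 41
wide = payload.encode("utf-16-le")   # 41 00 41 00 41 00 41 00

# What a 32-bit little-endian register would hold after being
# overwritten with the first four bytes of each buffer.
eip_narrow = struct.unpack("<I", narrow[:4])[0]
eip_wide = struct.unpack("<I", wide[:4])[0]

print(f"0x{eip_narrow:08X} 0x{eip_wide:08X}")
```

Every other byte being zero is what makes ordinary shellcode unusable here and why the alternating-byte techniques mentioned above were developed.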



Combining Diacritics

Invisible Characters

Malformed UTF-8

Bad Surrogate Pairs

Multiple levels of RTL, LTR reversing

Chris Weber’s Blog:

In recent news, Apple's CoreText API Bug:
سمَـَّوُوُحخ ̷̴̐خ ̷̴̐خ ̷̴̐خ امارتيخ ̷̴̐خ
MS13-060 Vulnerability in Unicode Scripts Processor Could Allow Remote Code Execution (2850869)

Big Thanks

J. Abolins

Chris Weber
@w3be http://www.casaba.com

Michal Zalewski
@lcamtuf http://nostarch.com/tangledweb

William Coppola

Useful Sites

Unicode Security Considerations

Unicode Security Mechanisms

Unicode Converter

Unicode Character Info and List

Homoglyph Attack Generator


OWASP XSS Filter Evasion Cheat Sheet


Unicode “Fonts”

Other Fun


Hand are based on


A. Costello, March 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3492.txt

J. Abolins, December 2010. [Online]. Available: http://www.irongeek.com/i.php?page=videos/dojocon-2010-videos#Internationalized%20Domain%20Names%20&%20Investigations%20in%20the%20Networked%20World

M. Zalewski, The Tangled Web: A Guide to Securing Modern Web Applications, 1st ed., No Starch Press, 2011.

E. Gabrilovich and A. Gontmakher, "The Homograph Attack," Communications of the ACM, vol. 45, no. 2, 2002.

V. Krammer, "Phishing defense against IDN address spoofing attacks," in Proceedings of the 2006 International Conference on Privacy, Security and Trust: Bridge the Gap Between PST Technologies and Business Services , New York, NY, USA, 2006

E. Johanson, "The state of homograph attacks," 2005. [Online]. Available: http://www.shmoo.com/idn/. [Accessed 24 4 2012].

D. Kennedy. [Online]. Available: http://www.secmaniac.com/download/

A. Crenshaw, 2012. [Online]. Available: http://www.irongeek.com/homoglyph-attack-generator.php

phlyLabs, 2012. [Online]. Available: http://phlymail.com

Microsoft, September 2006. [Online]. Available: http://msdn.microsoft.com/en-us/library/bb250505%28VS.85%29.aspx

Chromium Project, [Online]. Available: http://www.chromium.org/developers/design-documents/idn-in-google-chrome

C. Weber, July 2009. [Online]. Available: http://www.blackhat.com/presentations/bh-usa-09/WEBER/BHUSA09-Weber-UnicodeSecurityPreview-SLIDES.pdf.

C. Weber, seems to be longer version of presentation above http://www.casaba.com/files/Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf

C. Weber, July 2009. [Online]. Available: http://www.blackhat.com/presentations/bh-usa-09/WEBER/BHUSA09-Weber-UnicodeSecurityPreview-PAPER.pdf

A. Crenshaw, "Steganographic Command and Control: Building a communication channel that withstands hostile scrutiny," 2010. [Online]. Available: http://www.irongeek.com/i.php?page=security/steganographic-command-and-control [Accessed 23rd April 2012]


Sept 25th-29th 2013




Twitter: @Irongeek_ADC



Copyright 2020, IronGeek
Louisville / Kentuckiana Information Security Enthusiast