Monday, March 21, 2022

When is true not equal to true?

An investigation of homoglyphs, their impact on code, and how to detect them.

 

Statement of the problem

I put the following together for fun, but something like it would be very difficult to debug if it made it into real code. In the session below, it looks as though the values of $true and $false can be changed at will.

 

 

 

 

PS C:\> $true -eq $true
True

PS C:\> $true -eq $truе
False

PS C:\> $false -eq $false
True

PS C:\> $fаlѕе -eq $false
False 

 

 

 

 

Summary

This article walks through homoglyphs and a type of attack that I have not yet seen in the wild but presume has occurred. Every programming language I’m aware of is affected, but I don’t know every programming language, so I’ll stick to PowerShell for the proofs of concept below. I’ll also show code I wrote to detect this vulnerability in PowerShell code, which can be built upon to create scanners for other languages. The problems I present here can be detected if proper unit testing is in place. I don’t like writing unit tests either, but this is me Pestering you to consider adding unit testing to your pipeline.

Homoglyphs are two or more characters that have similar or identical shapes but are stored by the computer as different characters. Homoglyphs are nothing new. People were familiar with them, and used them for good and evil, long before computers: some old typewriters had no “1” key because the lowercase “L” already looked the same.

 

Disclaimer: The sample scripts are not supported under any Microsoft standard support program or service. The sample scripts are provided AS IS without warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages. 

 

Disclaimer: Microsoft does not endorse any third-party products or services. The third-party products or services are not supported under any Microsoft standard support program or service. The third-party products or services are available AS IS without warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the third-party products or services and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the third-party products or services be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages. 

 

If you’re not familiar with Trojan Source, it is an interesting read available at https://trojansource.codes/, and the original whitepaper is at https://trojansource.codes/trojan-source.pdf. The Trojan Source whitepaper mentions homoglyphs; I investigate them a little further here.

Leet is an example of homoglyph use: 1337 5p34k 15 4150 4 7yp3 0f h0m0g1yph

Microsoft responded quickly to the Trojan Source publication by adding notifications that warn users when a GitHub repository, or a file being edited in VS Code, contains bidirectional (bidi) Unicode characters. Homoglyphs were also mentioned in the Trojan Source paper, but I was unable to find a tool to scan my repositories for homoglyphs, so I wrote one. The module can be installed by running Install-Module HomoglyphDetection or by going to the GitHub repository https://github.com/PaulHCode/HomoglyphDetection.

Open-source projects and modern tools like GitHub make it easy for hundreds of developers to contribute to the same project, which is great for development speed and agility but makes it hard to know that the entire code base is secure. No single developer owns every variable and function, and tweaking just one character in one place can introduce major problems that are very difficult to detect.

Many characters look identical but are stored as different characters; some examples are shown below.

 

 

 

 

PS C:\> [int][char]'i'
105

PS C:\> [int][char]'і'
1110

PS C:\> [int][char]'a'
97

PS C:\> [int][char]'а'
1072


 

 

 

 

Here you can see how someone could make two identical-looking variables with different values. I have enough trouble debugging code when I can tell my variables apart; if someone introduced identical-looking variables into a large project, I might never find the problem.

 

 

 

 

PS C:\> $i = $true

PS C:\> $і = $false

PS C:\> $i
True

PS C:\> $і
False

PS C:\> $i -eq $і
False

 

 

 

 

Functions can also have identical-looking names. Projects load many functions from many files, and the functions and variables used in one place may be defined in many others. Below are four different functions with indistinguishable names. The first is the original function that the developers intended to be in the project. A different function that looks identical could be contributed to the project, and then different versions of the function could be called in different parts of the project, yielding anomalous behavior.

 

 

 

 

Function Foo{"Hello world"}
Function Fοο{"Goodbye world"}
Function Ϝoo{"Weird world"}
Function Foo‚{"Invisible world"}
Foo;Fοο;Ϝoo;Foo‚
Hello world
Goodbye world
Weird world
Invisible world 

 

 

 

 

In a larger script or program, there could be something like the snippet below.

 

 

 

 

If(!$Prevented){
    #grant access and do things not everyone should be allowed to do here
} 

 

 

 

 

If one of the characters in “Prevented” were changed to a homoglyph, then $Prevented would be an undefined variable, which evaluates to $Null. In an if statement, $Null evaluates to $False and Not $Null evaluates to $True, so the protected block would run.
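
A minimal sketch of that failure mode follows. The tampered name is built from its code point so the substitution stays visible; $lookalikeName and $lookalikeValue are names made up for this illustration, and the sketch assumes strict mode is off, as in the other snippets in this article.

```powershell
# The real flag, spelled with Latin letters only.
$Prevented = $true

# Hypothetical tampered spelling: the last 'e' is Cyrillic U+0435.
$lookalikeName = 'Prevent' + [char]0x0435 + 'd'

# Reading a variable that was never assigned yields nothing, i.e. $null.
$lookalikeValue = Get-Variable -Name $lookalikeName -ValueOnly -ErrorAction SilentlyContinue

$realCheck     = -not $Prevented        # the check as intended
$tamperedCheck = -not $lookalikeValue   # the check after the character swap

"Intended check grants access: $realCheck"
"Tampered check grants access: $tamperedCheck"
```

The intended check denies access, while the tampered one grants it, even though both names render the same on screen.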

 

Problem detection

Homoglyphs are easy to generate, for example with http://homoglyphs.net/. It is harder to detect whether one word is a homoglyph of another, since each character could be any of several different characters. Even if each character had only 3 other characters that looked like it, the number of possible look-alike spellings of an N-character word would be 4^N - 1 (four choices per position, minus the original spelling). A 5-character variable could therefore be represented by 1023 different combinations of characters, and that doesn’t even include options like invisible characters.
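
The arithmetic can be checked directly. This sketch assumes, as in the text, that every position has exactly 3 look-alikes plus the original character; $lookalikesPerChar and $nameLength are illustrative names.

```powershell
$lookalikesPerChar = 3   # assumed number of look-alikes for each character
$nameLength        = 5   # length of the name being spoofed

# Each position can hold the original character or one of its look-alikes.
$choicesPerPosition = $lookalikesPerChar + 1

# All possible spellings, minus the one that is the untouched original.
$spoofedSpellings = [long][math]::Pow($choicesPerPosition, $nameLength) - 1

"{0} look-alike spellings for a {1}-character name" -f $spoofedSpellings, $nameLength
```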

 

 

 

 

PS C:\> $truе -eq $true
False 

 

 

 

 

In the example above, the reflexive property of equality is not breaking down; rather, it is impossible to see the difference between the two names that both look like $true. The one on the right is the expected automatic variable, whereas the one on the left uses character 1077 for ‘e’ instead of character 101. An undefined variable is not going to be equal to $true or $false, so modifying a single character in a PowerShell script can dramatically change what the code does while it appears identical in the text editor and PowerShell host.

 

 

 

 

PS C:\> ("true").ToCharArray() | %{"$_ - $([int][char]$_)"}
t - 116
r - 114
u - 117
e - 101

PS C:\> ("truе").ToCharArray() | %{"$_ - $([int][char]$_)"}
t - 116
r - 114
u - 117
е - 1077


 

 

 

 

My first thought was that characters with a high numerical value were probably not used in production code; after all, I only use the characters I can see on my keyboard, which top out at around 126. If I discovered high-value characters in code, that might be suspicious.
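
A rough sketch of that first idea: walk a piece of script text and report every character whose value is above 126. The sample text and the names $sample and $suspects are made up for illustration, and the look-alike character is added by code point so it is visible.

```powershell
# Sample script text; the last letter of the variable name is Cyrillic U+0435.
$sample = 'if ($done -eq $tru' + [char]0x0435 + ') { "ok" }'

# Report the position and value of every character above 126.
$suspects = for ($index = 0; $index -lt $sample.Length; $index++) {
    $code = [int]$sample[$index]
    if ($code -gt 126) {
        [pscustomobject]@{ Index = $index; Code = $code; Char = $sample[$index] }
    }
}

$suspects
```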

I checked which characters are used in common PowerShell by simply scanning all the PowerShell scripts on my computer. My computer isn’t a great reflection of what is used by everyone, but at least it was a good start. I did the quick check using the following:

 

 

 

 

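# Collect the numeric value of every character in every .ps1 file under the folder
$bigArray = @()
$count = 0
$files = @(Get-ChildItem '<my scripts location>' -Include '*.ps1' -Recurse)
$max = $files.Count
ForEach ($file in $files) {
    Write-Progress -Status "checking file $count of $max" -Activity "$($file.FullName)" -PercentComplete (100 * ($count / $max))
    # Use the full path so the file is found regardless of the current directory;
    # Where-Object skips empty lines and empty files, whose content would otherwise be $null
    $bigArray += Get-Content $file.FullName | Where-Object { $_ } | ForEach-Object { $_.ToCharArray() } | ForEach-Object { [int]$_ }
    $count++
}
# Count how often each character value occurs, then list the values in numeric order
$analysis = $bigArray | Group-Object
$analysis | Select-Object Count, @{N = 'Name2'; E = { [int]$_.Name } }, @{N = 'char'; E = { [char][int]($_.Name) } } | Sort-Object Name2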

 

 

 

 

To my surprise, high-numbered characters were being used; not a ton, but I couldn’t just dismiss them. I’m from the United States and only work with the English language, but developers working in other languages probably use many higher-numbered Unicode characters, so I needed a better way to detect unusual characters than just alerting on high-numbered ones.

 

Automating detection

I devised a plan to detect homoglyphs in my code:

  1. Pull every variable and function name out of the code to be examined.
  2. Convert each character of the names to the more common character I know that looks like it.
  3. Compare all the converted values to each other to see if any of the converted values match while the actual values are different.

If any of the items meet those criteria, then they should be examined by a human for validation.

 

Get every variable and function name

Getting every variable and function can be achieved easily in PowerShell using abstract syntax trees. Since there are other languages out there, my module allows you to specify custom parsing using regular expressions and is easily extensible for you to add in your own parser for other file types. The parsing for PowerShell I used was simple but effective, as seen below.

 

 

 

 

$AST = [System.Management.Automation.Language.Parser]::ParseFile(
        $file,
        [ref]$null,
        [ref]$Null
    ) 
$AST.FindAll({$args[0] -is [System.Management.Automation.Language.FunctionDefinitionAst]}, $true) 
$AST.FindAll({$args[0] -is [System.Management.Automation.Language.VariableExpressionAst]}, $true) 
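
The FindAll calls above return AST nodes rather than names. One way to pull the names out is sketched here; it parses a small in-memory script with ParseInput instead of a file, and the sample script and the variables $code, $functionNames, and $variableNames are made up for illustration.

```powershell
# Sample script; the second variable name is Cyrillic U+0456, added by code point.
$code = '$i = 1; $' + [char]0x0456 + ' = 2; function Foo { "Hello world" }'

$tokens = $null
$errors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput($code, [ref]$tokens, [ref]$errors)

# Function definitions expose their name directly.
$functionNames = $ast.FindAll({ $args[0] -is [System.Management.Automation.Language.FunctionDefinitionAst] }, $true) |
    ForEach-Object { $_.Name }

# Variable expressions expose the name (without the $) through VariablePath.
$variableNames = $ast.FindAll({ $args[0] -is [System.Management.Automation.Language.VariableExpressionAst] }, $true) |
    ForEach-Object { $_.VariablePath.UserPath }

$functionNames
# Show each variable name next to its character codes so look-alikes can be told apart.
$variableNames | ForEach-Object { '{0} ({1})' -f $_, (($_.ToCharArray() | ForEach-Object { [int]$_ }) -join ',') }
```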

 

 

 

 

Convert each character to a Latin character that looks the same

I decided to compare variable and function names, so I needed a mapping from every character to the Latin character it looks like. Unfortunately, a single .NET [char] can hold 65,536 different values (Unicode as a whole is larger still) and I’m lazy, so I didn’t want to examine each one by hand and decide which others it resembles. I found a function someone else wrote and slightly modified it to create a picture of each character, then wrote a simple function to have OCR tell me which character the picture looks like; this automated the process of building a lookup table.

 

 

 

 

function Convert-TextToImage
{
# code from:  https://community.idera.com/database-tools/powershell/powertips/b/tips/posts/converting-text-to-image
  param
  (
    [String]
    [Parameter(Mandatory)]
    $Text,
    
    [String]
    $Font = 'Microsoft Sans Serif', #'Times New Roman', #'Consolas',
    
    [ValidateRange(5,400)]
    [Int]
    $FontSize = 24,

    [string]
    $filename="$env:temp\$(Get-Random).png"
  )
  #I slightly modified this function from: https://community.idera.com/database-tools/powershell/powertips/b/tips/posts/converting-text-to-image

  [System.Windows.Media.Brush]$Foreground = [System.Windows.Media.Brushes]::Black
  [System.Windows.Media.Brush]$Background = [System.Windows.Media.Brushes]::White

  
  If($Text.Length -gt 1){
    $psm = 13
    $font = 'Consolas' #'Mitra'#'Microsoft Sans Serif'
    $fontSize = 50
}Else{
    $psm = 10
    $font = 'Times New Roman'
    $fontSize = 20
}

  # take a simple XAML template with some text  
  $xaml = @"
<TextBlock
   xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
   xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">$Text</TextBlock>
"@

  Add-Type -AssemblyName PresentationFramework
  
  # turn it into a UIElement
  $reader = [XML.XMLReader]::Create([IO.StringReader]$XAML)
  $result = [Windows.Markup.XAMLReader]::Load($reader)
  
  # refine its properties
  $result.FontFamily = $Font
  $result.FontSize = $FontSize
  $result.Foreground = $Foreground
  $result.Background = $Background
  
  # render it in memory to the desired size
  $result.Measure([System.Windows.Size]::new([Double]::PositiveInfinity, [Double]::PositiveInfinity))
  $result.Arrange([System.Windows.Rect]::new($result.DesiredSize))
  $result.UpdateLayout()
  
  # write it to a bitmap and save it as PNG
  $render = [System.Windows.Media.Imaging.RenderTargetBitmap]::new($result.ActualWidth, $result.ActualHeight, 96, 96, [System.Windows.Media.PixelFormats]::Default)
  $render.Render($result)
  Start-Sleep -Seconds 1
  $encoder = [System.Windows.Media.Imaging.PngBitmapEncoder]::new()
  $encoder.Frames.Add([System.Windows.Media.Imaging.BitmapFrame]::Create($render))
  $filestream = [System.IO.FileStream]::new($filename, [System.IO.FileMode]::Create)
  $encoder.Save($filestream)
  
  # clean up
  $reader.Close() 
  $reader.Dispose()
  
  $filestream.Close()
  $filestream.Dispose()
  
}
 

function Get-OCRByChar
{
    [CmdletBinding()]
    Param
    (
        [Parameter(Mandatory=$true,
                   ValueFromPipelineByPropertyName=$true,
                   Position=0,
                    ParameterSetName='Parameter Set text')]
        $Text,
        [Parameter(Mandatory=$true,
            ValueFromPipelineByPropertyName=$true,
            Position=0,
            ParameterSetName='Parameter Set char')]
        $Char
    )

    Begin
    {
        If($Char){
            $Text = [string]$Char
        }
    }
    Process
    {
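        $count = 0
        # Each character is recognized on its own, so use Tesseract's single-character page segmentation mode
        $psm = 10
        $word = ForEach($character in $Text.ToCharArray()){
            Write-Progress -Activity "Testing $Text" -Status "Working on $character" -PercentComplete ($count/($Text.Length)*100) -Id 0
            # Convert-TextToImage chooses its own font and size based on the text length
            Convert-TextToImage -Text $character -filename 'C:\temp\pic1.png'
            # Write the OCR result to C:\temp\text1.txt so the next line reads the file Tesseract just produced
            & 'C:\Program Files\Tesseract-OCR\tesseract.exe' 'C:\temp\pic1.png' 'C:\temp\text1' --psm $psm --oem 1 -l eng
            [char](Get-Content 'C:\temp\text1.txt').Trim()
            $count++
        }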
        ([string]$word).replace(' ','')
    }
    End
    {
    }
}
 

$keys = @()
$min = 0
$max = 65535
ForEach($i in $min..$max){
    Write-Progress -Activity "Checking all chars" -Status $i -PercentComplete (($i-$min)/($max-$min)*100) -Id 1
        try{
        $temp = Get-OCRByChar -Char $([char]$i)
        }Catch{
        $temp = [char]0
        }

        $keys += [pscustomobject]@{
        CharNum = $i
        Char = [char]$i
        LooksLike = $temp
        }
}
$keys | Export-Clixml 'C:\temp\Keys.xml'

 

 

 

 

At this point, we have the data for a hash table with the 65,536 character values (0 through 65535) as the keys and the characters they look like (according to Tesseract) as the values. A few of them were obviously wrong, like ‘-’ being detected as ‘:’, so I tweaked a few values by hand. If you want to get ambitious, you could manually generate a better file.

Convert each item of interest to their Latin character equivalent

Now that we have a list of words to convert and a hash table of key/value pairs, we simply need to convert every word to the Latin characters it looks like. The short code below converts the $Text string to $result using the $KeyHash hash table.

 

 

 

 

        $result = [string]''
        ForEach ($char in ($Text.ToCharArray())) {
            $result += $KeyHash[[char]$char]
        }
        $result
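
The snippet above adds nothing to $result for a character that has no entry in $KeyHash. Below is a self-contained sketch of the same conversion that falls back to the original character instead. The function name ConvertTo-LookalikeText and the three table entries are made up for illustration; the real table comes from the OCR pass.

```powershell
# Tiny stand-in for the full lookup table (hypothetical entries, built by code point).
$KeyHash = @{}
$KeyHash[[char]0x0430] = [char]'a'   # Cyrillic a -> Latin a
$KeyHash[[char]0x0435] = [char]'e'   # Cyrillic e -> Latin e
$KeyHash[[char]0x043E] = [char]'o'   # Cyrillic o -> Latin o

function ConvertTo-LookalikeText {
    param(
        [Parameter(Mandatory)][string]$Text,
        [Parameter(Mandatory)][hashtable]$Table
    )
    $builder = [System.Text.StringBuilder]::new()
    foreach ($char in $Text.ToCharArray()) {
        # Fall back to the character itself when the table has no entry for it.
        if ($Table.ContainsKey($char)) { [void]$builder.Append($Table[$char]) }
        else { [void]$builder.Append($char) }
    }
    $builder.ToString()
}

# 'tru' followed by Cyrillic U+0435 converts to the all-Latin spelling.
$spoofed = 'tru' + [char]0x0435
ConvertTo-LookalikeText -Text $spoofed -Table $KeyHash
```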

 

 

 

 

Compare the converted values

Now that we have the Latin equivalent of every variable, function, and any other item we’re interested in, we can determine whether any of them look alike. All words are grouped by their Latin equivalent; if a group contains more than one word, those words share a Latin equivalent but differ in their actual characters, so they look the same but are actually different.

 

 

 

 

Get-HomoglyphsInFile -FullName .\TestFile.ps1 -Predefined PowerShell -RemoveUninteresting

Name   OCRValue Type     File
----   -------- ----     ----
$false $false   Variable TestFile.ps1
$falsе $false   Variable TestFile.ps1
$true  $true    Variable TestFile.ps1
$truе  $true    Variable TestFile.ps1
Foo2   Foo2     Function TestFile.ps1
Foо2   FoO2     Function TestFile.ps1
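
The grouping step on its own can be sketched as follows. This is not the module's implementation, only an illustration of the idea; the input objects and the names $names and $collisions are hypothetical, and the look-alike characters are added by code point.

```powershell
$cyrE = [char]0x0435   # Cyrillic e
$cyrO = [char]0x043E   # Cyrillic o

# Hypothetical scan results: each name with its look-alike (OCR) value.
$names = @(
    [pscustomobject]@{ Name = 'true';       OCRValue = 'true' }
    [pscustomobject]@{ Name = "tru$cyrE";   OCRValue = 'true' }
    [pscustomobject]@{ Name = 'Foo2';       OCRValue = 'Foo2' }
    [pscustomobject]@{ Name = "Fo${cyrO}2"; OCRValue = 'FoO2' }
    [pscustomobject]@{ Name = 'count';      OCRValue = 'count' }
)

# Group-Object compares case-insensitively by default, so 'Foo2' and 'FoO2' land together.
$collisions = $names | Group-Object -Property OCRValue | Where-Object {
    # Keep only groups that hold more than one distinct spelling (ordinal comparison).
    $distinct = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::Ordinal)
    foreach ($item in $_.Group) { [void]$distinct.Add($item.Name) }
    $distinct.Count -gt 1
}

$collisions | ForEach-Object { $_.Group } | Format-Table Name, OCRValue
```

The name that appears only once ('count') drops out, leaving just the pairs that a human should review.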

 

 

 

 

Other uses

Malware samples can be rewritten with homoglyphs so that the code still reads the same but no longer works; the code can then be analyzed but not executed.

Students looking to cheat could use a script to automatically alter a paper written by someone who took the class previously. By substituting homoglyphs, they may be able to make the paper appear identical while avoiding detection by anti-cheating software.

 

Extensibility

The homoglyph detection module is extensible. You can write your own parser to parse another language or file format for Get-HomoglyphsInFile to use.

 

Possible additional work

  • Expand parsers to work on other popular languages
  • Create VSCode extension to automatically detect homoglyphs
  • Create Azure Pipeline to automatically scan commits
  • Possibly use GitHub actions to scan automatically

While homoglyphs have been around forever and have been used in attacks on DNS entries, I was unable to find automated scanning to help warn users of homoglyphs. It has been a fun experiment writing this PowerShell module, and I hope you enjoy using it and building on it. If you find homoglyph detection useful, I’m interested in what you are using it for and what you found, so post it in the comments below.

 

Have fun scripting!

 

Additional reading

Posted at https://sl.advdat.com/3Nbv7Fx