To Complicate or Not to Complicate: What Is a Fuzzy Hash?

I was inspired to write this article when I came across TryHackMe’s Pyramid of Pain room (mentioned under the tools heading).

I have always thought hashes are a very good technology for their use cases, but the fact that the slightest change in the input produces a completely different result always made me feel that something was missing. Surely there should be a technology somewhere that doesn’t produce a completely different set of letters and numbers when a file changes just a little bit. So when I learned about it in this mostly text-based THM room, I thought: this is it. And since I hadn’t heard many people talk about it, I decided I should definitely write about it and tell more people. Let’s dive in if you’re ready.

So here comes the question: WHAT is this technology?

The English technical name is “context triggered piecewise hashes” (CTPH), also known in the industry as fuzzy hashing or similarity hashing.

When I search in Turkish, only a handful of results come up on this topic, and when I run the term through translation tools, I get the following:

DeepL: Context-triggered partial hashes
Google Translate: Context-triggered part-by-part hashes

I was not satisfied with these neutral translations, which are far removed from the technical vocabulary of our field. I think the term is better translated as “hashes triggered from the relevant parts”. Since hash usually appears in Turkish simply as “hash”, with no popular translated alternative in the industry, and since we often mix English and Turkish in technical conversations, I find this rendering the most appropriate one, at least among industry people. That way, the name alone conveys its full meaning to someone with a little cyber security knowledge. This is a rather distinctive feature of the Turkish language, and I think it’s great.

In general, a hash function generates a fixed-size identifier that is, for practical purposes, unique to its input, and it generates that same identifier every time the same input is hashed. The identifier can then be used to reference or verify the original input. Hashes mainly serve the security principle of integrity rather than confidentiality, and they are certainly not intended to produce matching or similar outputs for similar inputs.
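
The flip side of that uniqueness is that a one-character change gives a digest with no visible relation to the old one. The short sketch below shows this with SHA-256 from Python’s standard library; the two sentences are arbitrary example inputs chosen for illustration, not anything from the THM room.

```python
import hashlib

# Two inputs that differ in a single character ("dog" vs "cog").
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

digest_original = hashlib.sha256(original).hexdigest()
digest_modified = hashlib.sha256(modified).hexdigest()

# Count how many of the 64 hex characters differ position by position.
differing = sum(a != b for a, b in zip(digest_original, digest_modified))

print(digest_original)
print(digest_modified)
print(f"{differing} of {len(digest_original)} hex characters differ")
```

Almost every position changes, so comparing the two digests tells you nothing about how similar the inputs were.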

I’m sure that whenever you have worked with hashes, you have wondered whether someone somewhere needs hashes that match partially. I had never come across the subject myself. Since digital content service providers usually protect their platforms by hashing malicious content and automatically blocking anything that matches the hash, I used to wonder what happens when the content is only partially modified. It seemed like a waste of time and processing power if the same review procedure had to be performed all over again.

The use case I will talk about in this article is in the world of malware analysis. In malware analysis, taking the hash of the related software and blocking it on systems is a backbone tactic. But in this mental chess game, it is very difficult to predict the next move in a sequence.

In this image, we see a picture of a radio that has been hashed. When it is processed, the output is a result that represents it. I’m using the visual similarity of the outputs here to represent fuzzy hashes conceptually, and I’d like to draw your attention to the similarities in shape and color.

In this image, we have a radio with the same shape but different colors. With traditional hashes, this difference would result in a completely different output. With context triggered piecewise hashing, however, we get results that show the part that differs but also clearly show the parts that stayed the same.

As nice as it sounds, hashing malware is one of the most basic and easiest approaches in the Pyramid of Pain, where hash values sit at the very bottom. Analysts and malware hunters hash the malware, block it and make sure that exact file doesn’t come back in the future.

The most expected move in this turn-based mental battle is for the threat actors to alter their malware slightly, whether they developed it themselves or purchased it off the shelf. Or, since the first move of the defenders is so predictable, they ship a slightly different build to each target. Either way, the hash changes completely and the blocked entry no longer matches.
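
To make this move concrete, here is a minimal sketch of an exact-match blocklist and how a trivially altered variant slips past it. The sample bytes are a harmless made-up stand-in for a malware file, and the names `blocklist` and `is_blocked` are hypothetical helpers for this illustration, not part of any real product.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


# Harmless placeholder bytes standing in for a known malicious file.
known_sample = b"MZ\x90\x00" + b"pretend payload " * 64

# The defender blocks the exact hash of the sample that was analyzed.
blocklist = {sha256_hex(known_sample)}


def is_blocked(payload: bytes) -> bool:
    """Exact-match check: blocked only if the digest is on the list."""
    return sha256_hex(payload) in blocklist


# The attacker appends one padding byte; behavior is unchanged, the hash is not.
variant = known_sample + b"\x00"

print(is_blocked(known_sample))  # True
print(is_blocked(variant))       # False
```

One extra byte is enough to get a brand-new hash, which is why this indicator costs the attacker so little to change.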

The response to this move is fuzzy hashing. These hashes produce the same output for the parts of the input that did not change and different output only for the parts that did. Analysts use these similarity hashes to counter this kind of detection evasion.
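
The piecewise idea can be sketched in a few lines of Python. This is a toy illustration only, not the algorithm used by real CTPH tools, which use more carefully designed rolling hashes and scoring. The window size, the trigger rule, the four-character piece codes and the function names `toy_fuzzy_signature` and `similarity` are all assumptions made for this example. A rolling sum over the last few bytes decides where each piece ends, each piece is hashed on its own, and two signatures are compared by how many pieces line up.

```python
import hashlib
import random
from difflib import SequenceMatcher

WINDOW = 7         # rolling window size in bytes (illustrative value)
TRIGGER_MOD = 32   # roughly one boundary per 32 bytes on random-looking data


def toy_fuzzy_signature(data: bytes) -> list:
    """Split data at content-defined boundaries and hash each piece."""
    pieces = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Maintain the sum of the last WINDOW bytes.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        # The "context trigger": nearby bytes, not the file offset,
        # decide where a piece ends.
        if rolling % TRIGGER_MOD == TRIGGER_MOD - 1:
            pieces.append(hashlib.sha256(data[start:i + 1]).hexdigest()[:4])
            start = i + 1
    if start < len(data):
        pieces.append(hashlib.sha256(data[start:]).hexdigest()[:4])
    return pieces


def similarity(sig_a: list, sig_b: list) -> float:
    """Return a 0.0-1.0 score based on how many pieces match in order."""
    return SequenceMatcher(None, sig_a, sig_b, autojunk=False).ratio()


rng = random.Random(2024)
sample = bytes(rng.randrange(256) for _ in range(8192))     # stand-in for a binary
variant_buffer = bytearray(sample)
variant_buffer[4000] ^= 0xFF                                # flip one byte in the middle
variant = bytes(variant_buffer)
unrelated = bytes(rng.randrange(256) for _ in range(8192))  # a different "file"

sig_sample = toy_fuzzy_signature(sample)
print("variant:  ", similarity(sig_sample, toy_fuzzy_signature(variant)))
print("unrelated:", similarity(sig_sample, toy_fuzzy_signature(unrelated)))
```

Because the boundaries depend on the content around them, the change only disturbs the pieces near the flipped byte. Everything before it and everything after the window has resynchronized keeps the same piece codes, so the variant scores close to 1 while the unrelated data scores close to 0.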

If you are curious about the tools that can be used for this purpose, you can find their names on the related Wikipedia page. Not many tools seem to do the job. I’ll leave it to you to decide whether this is because the technique has limited use or because it is too specialized.

The THM room ends with a very meaningful sentence. As David Bianco puts it, the amount of pain you inflict on a threat actor depends on the types of indicators you can use. From this perspective, these fuzzy hashes are a very basic and lightweight defense mechanism. To use them efficiently, an organization that consumes threat intelligence needs subscription-like services from security companies that maintain large libraries of hashes.

This topic is also closely related to “Detection Maturity”. For those who are interested, I am adding the link to the Intro to Detection Engineering room, which is related to this topic and can be considered a sub-topic of the room that inspired this article. Looking at this topic and at today’s cyber security industry, one cannot help thinking that this approach is quite new, and that most organizations are ready for it only with a high budget and a lot of effort, if they are ready at all. The funny thing is that the blog post introducing the Detection Maturity Level Model, which is perhaps the ancestor of the second THM room I mentioned, was written in 2014.

I personally feel that I am learning information that is a few years old at most, so it is quite funny to realize that its foundation was laid in 2014, and that the author, Ryan Stillions, laments with a slight smile that “there are still a lot of organizations that are still not mature enough to identify”.

Finally, I would like to end this post with a crazy question that comes to my mind, maybe silly, maybe wicked: as a form of proactive threat hunting, could the defenders themselves carry out the next development step of a piece of malware, improving its code in a controlled way to get ahead of the threat actors and prepare the systems they protect? I’m sure someone somewhere has thought of this and tried it. But I don’t know how much effort it takes or how efficient or feasible it is. Maybe it creates a rabbit hole problem? Maybe it’s an interesting idea where AI and offensive security can be combined?

While I was searching the internet for this article, I came across a Microsoft blog post about the results they obtained by combining fuzzy hashing and deep learning, written before the words artificial intelligence were on everyone’s lips. I definitely recommend that you check it out.

In addition, writing such a long post for such a simple topic is right up my alley.

Until next time dear readers, please keep being awesome!