Using AI to detect Bitcoin addresses involved in ransomware transactions

Yogesh Chandrasekharuni · Published in Analytics Vidhya · Apr 19, 2021


Abstract

Online scams have been on the rise over the past decade, and with the introduction of cryptocurrency and its pseudo-anonymity, it has never been more convenient for scammers.

This blog will go through an infamous online money-grabbing opportunity for hackers: the 'Bitcoin Heist'. It's straightforward. Fancy website, and hmm, download music for free? What could go wrong? *click*. That is when you download a special kind of malware called 'ransomware'.

Ransomware spreads through your computer like wildfire and encrypts every last file. Want your sensitive data back? Just pay the hackers a hefty ransom and they will (hopefully) give you a decryption key, which should unlock your data.

Wondering how Bitcoin comes into play and not something like PhonePe? Bitcoin is a cryptocurrency based on peer-to-peer technology with no central authority like a bank. Although every transaction is recorded on a public ledger, addresses are not tied to real-world identities, so for all practical purposes payments are very hard to trace back to a person. Now you might understand why this is perfect for scammers.

In this project, we will use AI to analyze how these transactions take place and try to build models that predict if a given Bitcoin address is being used for malicious intent or not.

Contents

  1. Introduction
  2. Data cleaning and Feature Engineering
  3. Exploratory Data Analysis
  4. Modeling
  5. Conclusion
  6. Future work and References

Note: You can find the accompanying jupyter-notebook and all other code in the GitHub repository. For a thorough overview, it is recommended to follow the notebook as well.

1. Introduction

Problem statement

  • Given a bitcoin address along with some meta-data pertaining to that address, we are challenged to predict if that address has been used to receive ransoms in the past.

Real-world usage

  • Upon finding that an address has indeed been used for malicious intent, quick action can be taken against that bitcoin address, such as banning it from any future transactions or blacklisting it to prevent further online scams.

Business objectives

  • Although Bitcoin is open source, the project runs on donations. It is in Bitcoin’s best interest that the platform has a good reputation: as more people start using it, donations grow. Thus, to maintain good PR, Bitcoin has to keep its platform from being advertised as a convenient means of scamming.

Assumption

  • Since the cost of a misclassification by our model can be extremely high, we will assume that Bitcoin or a third party has employed domain expert(s) who will verify the outputs of our model.
  • We will have our final model minimize its False Negatives, that is, minimize the number of times an address is flagged as white when in reality it has been used for malicious intent. This poses a trade-off: in return, the rate of False Positives increases. Hence, the assumption of having a domain expert who filters out the false-positive data points is important.

Business constraints

Note that in this project, we only focus on banning/blacklisting bitcoin addresses used for malicious intent in the past and NOT in real time. This project can very easily be extended to work in real time as well, by having an API that checks if a receiver's bitcoin address has been flagged by our model before any new transaction is made, and if so, immediately notifies the authorities, who can take further action. However, for this project, we will not be extending our use-case to real time.

Now let us see the main business constraints.

  • The cost of misclassifying a positive data point can be high. This is because if an address that has been used for malicious intent is flagged by our model as white, that address can continue to be used to receive ransoms and scam people.
  • There are no latency concerns since we are not considering real-time use cases. However, if we do, latency becomes a very important constraint.
  • Interpretability is important because our domain expert would require logical reasoning behind the model’s predictions so they can either verify the prediction or conduct further investigation.

The data

  • We use a dataset from the UCI Machine Learning Repository that contains parsed Bitcoin transaction graphs from January 2009 to December 2018. This dataset contains labeled transactions indicating whether they are white or belong to a family of ransomware.
  • We load the .csv file into a Pandas’ DataFrame
The head of the dataset
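A minimal loading sketch follows; the file name is assumed to match the UCI download, so adjust the path to wherever your copy lives.

```python
import pandas as pd

# Load the raw UCI data into a DataFrame.
# The file name is an assumption; point this at your local copy of the dataset.
df = pd.read_csv("BitcoinHeistData.csv")

print(df.shape)   # number of rows and columns
df.head()         # first few transactions, as shown above
```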

2. Data-cleaning and Feature Engineering

The raw data we got from the UCI Repository needs to be processed before we do any modeling. We also need to modify the distributions for our feature columns because some of the predictive models we will use might assume that the features are normally distributed.

Understanding the dataset

Our dataset has ~3,000,000 rows and 10 columns. Out of these 10 columns, we have 9 predictors and one target column.

  • Each row relates to a particular transaction. This transaction has nine features, where each feature is:

1. address [String]:
Stores the address of the bitcoin transaction’s recipient.

2. year [int] :
Indicates the year in which the transaction has been done.

3. day [int] :
Indicates day of the year.

4. length [int] :
Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.

5. weight [float] :
Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address.

6. count [int] :
Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount transacted.

7. looped [int] :
Looped is intended to count how many transactions i) split their coins, ii) move these coins in the network using different paths, and iii) finally merge them in a single address. Coins at this final address can then be sold and converted to fiat currency.

8. neighbors [int] :
Indicates the number of neighbors a transaction had.

9. income [int] :
Income in terms of Satoshis, where a Satoshi is the smallest unit of a bitcoin, equivalent to one hundred-millionth of a bitcoin.

  • If some of the above features seem too technical, do not worry. These features were engineered by domain experts who carefully analyzed each transaction. As AI engineers, we just need to understand their distributions, their correlations with other features, and how important they are in predicting the target; the low-level technicalities behind each feature are not as relevant.

The target

The target, or the entity we want to predict for a given transaction, is a set of labels. A label can either be white, indicating that the address corresponding to a particular transaction was NOT used for malicious intent or, it can belong to a set of labels such as paduaCryptoWall, montrealSam, princetonLocky, etc, each of which is the name of a particular family of ransomware.

For the task at hand, we do not need to worry about what family a particular ransomware belongs to, but only whether the address corresponding to a transaction has been used for malicious intent or not. Thus, we convert this multi-class classification task into a binary classification task. Another reason this is beneficial lies in the distribution of our target: it is extremely skewed, i.e., we have an extremely imbalanced dataset.
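A minimal sketch of this relabeling, assuming the target column is named label as in the UCI file and taking "target" as our own name for the new binary column:

```python
# Collapse the ransomware family names into a single positive class:
# 1 = ransomware (any family), 0 = white.
df["target"] = (df["label"] != "white").astype(int)

# The class imbalance becomes obvious immediately.
print(df["target"].value_counts(normalize=True))
```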

Graph indicating the class-imbalance

Distribution of our features

Let’s take a look at how some of our features are distributed. Let’s start with the feature ‘length’.

Length

PDF of Length

We notice that this distribution is not ideal. Empirically, we know that some models (such as Logistic Regression) have better predictive power when the features are normally distributed and not extremely skewed. Thus, we transform such features to be more Gaussian by using Boxcox or other transformations.

PDF of length after transformation

We see that the PDF is much closer to the bell curve than earlier. We perform similar transformations on all features and construct new and slightly transformed features.
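A sketch of such a transformation using SciPy's Box-Cox implementation; the shift by 1 is needed because Box-Cox requires strictly positive inputs, and the new column name is our own choice.

```python
from scipy import stats

# Box-Cox needs strictly positive values, so shift by 1 first.
transformed, fitted_lambda = stats.boxcox(df["length"] + 1)
df["boxcox_length"] = transformed

# Compare skewness before and after the transform.
print("skew before:", df["length"].skew())
print("skew after :", df["boxcox_length"].skew())
```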

Income

Let’s see how our income feature is distributed:

PDF of Income

We notice that even our income feature is extremely skewed. Let’s try to fix this by applying a box-cox transformation.

PDF of Income after boxcox transformation

After applying the boxcox transformation, we see a great improvement in the overall distribution of data.

We perform similar analyses and transformations for other features as well.

Feature Engineering

Fixing skewness

  • Some of our newer features are boxcox or other transformations of our vanilla features.

Engineering new features

Creating new features that might correlate with the target will give our models more predictive power. Let us take a look at a few of our newly engineered features.

Number of addresses:

  • This feature keeps track of how many times a given Bitcoin address has been encountered in the training dataset.
  • The hypothesis is that if a Bitcoin address has been used for malicious intent, it might be involved in more transactions than usual in order to receive ransoms. We will see later whether or not this hypothesis holds (a minimal sketch of this feature follows below).
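A minimal sketch of this count feature; the column name num_addresses is our own.

```python
# How many times does each address appear in the training data?
address_counts = df["address"].value_counts()

# Map the per-address count back onto every row.
df["num_addresses"] = df["address"].map(address_counts)
```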

Day of week:

  • Another feature we engineer is the day of the week on which the transaction took place. We generate this information by combining the year and day columns and using Python's datetime utilities to retrieve the weekday. We will see later whether or not this feature correlates with the target.

Is close to a holiday:

  • Using the datetime object we generated earlier, we can see if the transaction took place around an important holiday or festival. The hypothesis is that if a transaction took place within a week of a major holiday, it is more likely to be white; people may want to transfer money to their family or friends during holidays. Once again, during EDA we will verify or disprove this assumption (a sketch of both date-derived features follows below).
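A sketch of both date-derived features; the holiday list here is a small illustrative set, not the one used in the notebook, and the new column names are our own.

```python
import pandas as pd

# Rebuild calendar dates from 'year' and 'day' (day of the year).
dates = pd.to_datetime(df["year"].astype(str), format="%Y") \
        + pd.to_timedelta(df["day"] - 1, unit="D")

# 0 = Monday, ..., 6 = Sunday
df["day_of_week"] = dates.dt.dayofweek

# Flag transactions within a week of a (hypothetical) list of major holidays.
holidays = pd.to_datetime(
    [f"{y}-{md}" for y in sorted(df["year"].unique()) for md in ("01-01", "12-25")]
)
df["is_close_to_holiday"] = dates.apply(
    lambda d: int(abs(holidays - d).min() <= pd.Timedelta(days=7))
)
```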

We engineer nine more features, including features that fix skewness, engineered features, and even interaction features. For the complete list, please refer to 3.1.2 Engineered Features in the jupyter notebook.

3. Exploratory Data Analysis

In this section, we shall go over the critical process of performing initial investigations on our dataset, to discover any patterns or anomalies. We will be doing both Uni-variate and Multi-variate analysis on our features.

Univariate analysis:

Balancing our dataset

  • Before we do any analysis, we need to fix the major class imbalance present in our dataset. For the sake of conserving RAM, and since we have enough data points to begin with, we will down-sample our negative-class (white) data points (a minimal sketch follows after this list). However, we could also have over-sampled our positive data points, which would retain all of the information in our majority class.
  • Another approach would have been to artificially synthesize more positive class data points by using methods like SMOTE.
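A minimal down-sampling sketch using scikit-learn's resample helper; the random_state values are arbitrary.

```python
import pandas as pd
from sklearn.utils import resample

white = df[df["target"] == 0]
ransom = df[df["target"] == 1]

# Down-sample the majority (white) class to the size of the minority class.
white_down = resample(white, replace=False, n_samples=len(ransom), random_state=42)

# Shuffle the balanced frame so the classes are mixed.
balanced = pd.concat([white_down, ransom]).sample(frac=1, random_state=42)
print(balanced["target"].value_counts())
```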

Let us take a look at the class-separated PDFs of some of our features and understand how they differ. Note that we want to see separability in these PDFs because that helps the model to distinguish between the two classes.

Number of addresses

  • We notice a very clear difference in the PDFs of the two classes.
  • It is apparent that addresses behind white transactions tend to appear far fewer times in the data than addresses behind malicious transactions.

Year

  • Before we see how the nature of the transactions changed over the years, first let us see how the feature itself is distributed.
  • We see that we have a uniform number of transactions from each year.
  • We immediately see that there is a clear distinction in the nature of transactions in the years 2012, 2015, and 2017. This feature should correlate well with our target. However, we will confirm that during our multivariate analysis.

Weight

  • We notice a not-so-great separability between the two PDFs but not anything our models cannot work with.
  • We understand that white transactions are more likely to have a higher weight-value.

Neighbors

  • We immediately notice a difference in distributions for the number of neighbors in our classes. Positive class data points tend to have far fewer transactions with 2 neighbors, and more with 4 and 3.

Is close to a holiday

  • We see that our hypothesis holds: transactions done close to a major holiday are more likely to be white.

Quarter number

  • Our engineered feature quarter number clearly shows that transactions that took place in quarters 3 and 4 are much more likely to be white than to be part of a ransomware transaction.

Please refer to 3.0 Uni-variate analysis in the accompanying jupyter notebook for a detailed analysis of the remaining features.

Multi-variate analysis

  • Multivariate analysis (MVA) is based on the principles of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. This will help us see how our features are correlated with each other as well as with our target.

Pair plots

Let us take a look at how some of our features change with respect to each other.

Pair plot
  • We immediately notice a different pattern of change in almost all pairs of features based on the target. For example, when gaussian_income is paired with log_count, we notice that far more white data points cluster towards a higher value than our positive class datapoints.
  • Pair plots also help us see the class-separated PDFs of our individual features. For example, we can see how different the class-separated PDFs of gaussian_income are.

Correlation map

  • Correlation is a measure of the strength of a linear relationship between two quantitative variables (e.g., height, weight). A correlation map shows how correlated sets of features are. Below, the darker the color in the box, the higher the correlation between the corresponding features.
Correlation map between the features
  • Ideally, all boxes other than the main diagonal should be as light as possible. But in reality, some features are always going to be dependent on each other. For example, we see that the features day and quarter number are extremely correlated (which should be expected, because that’s how we constructed our quarter feature). A short sketch of how such a map can be drawn follows after this list.
  • To know why having a high multi-collinearity between our predictors is bad for modeling, please refer to this discussion.
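A sketch of how such a correlation map can be drawn with seaborn, continuing with the balanced frame from the down-sampling sketch; the exact set of columns depends on the engineered dataset.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations between the numeric features.
corr = balanced.select_dtypes("number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="Blues")
plt.title("Correlation map between the features")
plt.tight_layout()
plt.show()
```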

Now that we have a high-level overview of how our features are correlated with each other, let us take a closer look at each pair of features to get a deeper understanding.

Income and count

KDE or Kernel Density Estimate plots show the probability density of a continuous variable. In the plot below, the smaller the ‘circle’, the higher the probability density of the combined variables (since we’re using a 2D KDE plot to measure the effect of two variables at the same time). To know more about KDE plots, please refer to this article.

  • In our plot, we want the blue lines and orange lines to be as separable as possible. If they are completely overlapping, that indicates the feature’s distribution does not change between target values 1 and 0.
  • In our case, however, there is a considerable amount of separability (a sketch of such a plot follows below).
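A sketch of one such class-separated 2D KDE plot; gaussian_income and log_count are our engineered column names, and the call assumes seaborn 0.11 or newer.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# 2D KDE of transformed income vs. transformed count, split by class.
sns.kdeplot(data=balanced, x="gaussian_income", y="log_count",
            hue="target", levels=5, common_norm=False)
plt.show()
```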

Length and weight

  • When length and weight are paired and class-separated, we see a slight difference in how the data points are scattered across the plot. This indicates that these two predictors, taken together, provide reasonable information for our model to work with.

Income and day of week

  • We want to analyze and see if there is an increase or decrease in income depending on what day it is.
  • From the plot, it is evident that the income generated is maximum on Fridays and Sundays.

Income, years and interaction_count_income

  • Let us take a closer look at how income and its corresponding interaction feature count_income has changed over the years.
Swarm plot between income, year and their effect on interaction_count_income
  • We can see that our interaction feature has changed over the years. Around 2014, most of the higher-income data points belong to type 0, but by 2018 most of them belong to type 2.

4. Modeling

Before we can build predictive models and train them on our dataset, we need to establish a metric that defines our model’s performance.

Metrics

  • Metrics measure the performance of our classification model, and choosing the right Key Performance Indicator (KPI) is integral to finding the right model. Some models might score well on accuracy but poorly on log-loss.

KPI

  • In our case, the design choice we made was to go with Recall as our KPI. We want to maximize our recall and choose the hyper-parameters that generate the fewest false negatives.
  • This is because we want to correctly classify as many positive data points (ransomware transactions) as possible even if it means we incorrectly classify a few white transactions to be of type ransomware also.
  • The assumption throughout the rest of our modeling is that we have a domain expert who will be verifying the outputs of our model and removing any false positives before investigating the transactions.

Other metrics used

  • Log-loss: A loss function that has no specific range. Lower is better.
  • Precision: Tells us what percentage of data points that we classified as positive, are in fact positive. Higher is better.
  • ROC AUC: Uses True Positive Rate and False Positive Rate to get the area under the curve. Higher is better.
  • F1-Score: Harmonic mean of precision and recall. Higher is better.
  • Accuracy: The fraction of all predictions that are correct. Higher is better.

Now that we understand all of our features well and know exactly what metric we want to maximize, we can start building our models.

Random model

  • Before building any proper models, it is always a good idea to establish what worst-case performance looks like, since some of our metrics (such as log-loss) do not have a fixed range. This random model predicts class labels uniformly at random.
  • The metrics calculated on our random model should give us a good idea as to what the worst-case scenario should look like. Let’s take a look at its outputs.
Random model outputs
  • As can be seen from above, we want our log-loss to be no more than 17 and our recall to be at least 50% (a sketch of such a random baseline follows below).
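A sketch of such a random baseline, assuming y_test is the held-out target from an earlier train/test split.

```python
import numpy as np
from sklearn.metrics import log_loss, recall_score

rng = np.random.default_rng(42)

# Random class probabilities (rows sum to 1) and random hard labels.
random_probs = rng.uniform(size=(len(y_test), 2))
random_probs /= random_probs.sum(axis=1, keepdims=True)
random_labels = rng.integers(0, 2, size=len(y_test))

print("worst-case log-loss:", log_loss(y_test, random_probs))
print("worst-case recall  :", recall_score(y_test, random_labels))
```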

Note: We will be using two types of machine-learning algorithms for the rest of the modeling: distance-based and tree-based. Fine-tuned datasets for both of these model families have been constructed separately. For example, the distance-based datasets have been One-Hot-Encoded, scaled, and so on, because this increases their predictive power, whereas tree-based models actually fare worse when the features are one-hot-encoded. This distinction has been clearly highlighted in the accompanying notebook, so for further clarification it is recommended that you take a look at it.

Distance-based models

  • Distance-based models are based on the geometry of data. As the name implies, distance-based models work on the concept of distance.

Logistic Regression

  • Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. It simply tries to fit a hyper-plane in d-dimensions that best separates the positive and negative data points.
  • We use sklearn’s SGDClassifier with a logistic loss, which is more efficient than the vanilla LogisticRegression class because it uses Stochastic Gradient Descent.
  • For all models, we use Randomized Search to find the best hyper-parameters. After optimizing, we find that the best alpha to go with is 0.001. Fitting the model with that value and predicting on the test set yields the following metrics (a minimal sketch of this setup follows after this list).
  • We see that our recall is very poor at only 52.3%. This is only about 2% more than our random model. Let’s take a look at the other metrics.
  • We see that the ROCAUC of our model is pretty good and far better than our base or random model.
  • Let us take a look at all of our other metrics
  • Since our KPI (Recall) is unacceptable, we will not be using Logistic Regression as our best model. Let us investigate how other models are going to perform.
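A minimal sketch of this setup. X_train and y_train are assumed to come from an earlier split of the encoded, scaled dataset, and the alpha grid and scoring choice here are illustrative rather than the notebook's exact configuration.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV

# SGD with logistic loss behaves like Logistic Regression but scales to large data.
# (Use loss="log" instead of "log_loss" on scikit-learn versions older than 1.1.)
sgd_lr = SGDClassifier(loss="log_loss", penalty="l2", random_state=42)

search = RandomizedSearchCV(
    sgd_lr,
    param_distributions={"alpha": [10 ** i for i in range(-5, 2)]},
    n_iter=7,
    scoring="recall",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)   # the notebook lands on alpha = 0.001

# Calibrate the decision function so we can also report log-loss.
clf = CalibratedClassifierCV(search.best_estimator_, cv=3).fit(X_train, y_train)
```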

Support Vector Machines

  • One takeaway from our failure with Logistic Regression is that our data may not be linearly separable. As powerful as Logistic Regression is, it can only do so much with non-linear data. This also hints that a linear SVM would fare just as poorly as Logistic Regression, because it makes the same naive assumption that the data is linearly separable. However, we shall still investigate this.
  • For SVMs too, we will be using an SGDClassifier but this time, we will define our loss function to be the hinge-loss, as we only want to find the support vectors.
  • Note that we do not use a kernel SVM, as the time complexity of fitting a model like an RBF-kernel SVM is quadratic in the number of data points, which is not feasible here.
  • After hyperparameter optimization, we find that the best value for C is 10. (Note that alpha = 1/C, so we will set our hyper-parameter as 0.1).
  • Taking a look at our confusion matrix and our output table, we see that it is the same outcome as with Logistic Regression. Although our AUC is good, our precision and recall are just unacceptable (as we expected). Thus, we will not consider either of our distance-based models to be adequate predictors.

Tree-based models

  • Tree-based models are a family of classifiers that build a tree with one attribute at each level and perform a series of condition checks on one attribute at a time.
  • The failure of both our distance-based models is a good indicator that our data is non-linear. Tree-based models make no such naive assumption about the data, which should translate to better performance.

Random Forest

  • Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees.
  • After optimization, we find that 500 is the optimum number of base estimators. Note that since these models have much more horsepower, we optimize for F1-Score and not just Recall (a sketch of this search follows after this list).
  • Let us take a look at how they perform.
  • We see a significantly more area under the curve than either of our previous models.
  • Let us see what its confusion matrix looks like.
  • We see an incredible boost in Precision; however, our Recall is very poor. Since our KPI was chosen to be Recall, Random Forest also seems unacceptable.
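A sketch of the search. X_train_tree is assumed to be the tree-friendly, label-encoded version of the training data, and the parameter grid is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier(random_state=42, n_jobs=-1)

rf_search = RandomizedSearchCV(
    rf,
    param_distributions={"n_estimators": [100, 200, 500, 1000],
                         "max_depth": [None, 10, 20]},
    n_iter=5,
    scoring="f1",   # tree-based models are tuned on F1, as noted above
    cv=3,
)
rf_search.fit(X_train_tree, y_train)
print(rf_search.best_params_)   # the notebook finds 500 estimators to work best
```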

Gradient Boosted Decision Trees

  • The power of GBDTs is no secret. They are powerful and relatively fast. Since the base predictor is still an individual tree, non-linearity is not a problem.
  • Let us see how GBDTs fare against our dataset (a sketch follows after this list). After hyperparameter optimization, we find the following results.
  • We see an incredible AUC of 0.98 which is the highest we have encountered so far.
  • Moving on to the confusion matrix and our output table, we see a very high Recall of 92.6% on unseen data points. As mentioned before, we prioritize recall over all of our other metrics and thus choose GBDTs as our best model so far.
  • Our precision is at 17%, which means our model generates quite a lot of false positives. However, we established earlier that a domain expert will validate the model’s outputs before any further action is taken, which reduces the impact of these false positives.
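A sketch using scikit-learn's GradientBoostingClassifier as one possible GBDT implementation; the notebook may use a different library, and the hyper-parameters would be tuned with the same randomized search as above. X_test_tree and y_test are assumed to be the held-out split.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, precision_score, roc_auc_score

gbdt = GradientBoostingClassifier(random_state=42)
gbdt.fit(X_train_tree, y_train)

preds = gbdt.predict(X_test_tree)
probs = gbdt.predict_proba(X_test_tree)[:, 1]

print("recall   :", recall_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("roc auc  :", roc_auc_score(y_test, probs))
```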

Stacking Classifier

  • We saw that GBDTs had excellent Recall while Random Forest had great Precision. Will stacking both these classifiers give us a better result?
  • Stacked generalization consists of stacking the outputs of individual estimators and using a classifier to compute the final prediction. Stacking allows us to use the strength of each individual estimator by feeding their outputs into a final estimator.
  • We build our own class where the base estimators are a GBDT and a Random Forest and the final classifier is a Logistic Regression model. The Logistic Regression model is trained on the outputs of the two base models, treated as two separate features (an equivalent sketch follows after this list).
  • Let us see if this gives us any more predictive power.
  • Once again, we see that our meta-classifier, that is, the Logistic Regression model, has completely overfitted to the Random Forest outputs and, for the most part, left out the GBDT’s outputs. Hence, the recall is low but the precision is high.
  • We do not consider our Stacking Classifier acceptable and still consider our GBDT to be the best performer so far.
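The notebook builds its own stacking class; an equivalent sketch with scikit-learn's built-in StackingClassifier, under the same assumed train/test splits, looks like this.

```python
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score

# Base estimators feed their predictions to a Logistic Regression meta-classifier.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=500, random_state=42)),
                ("gbdt", GradientBoostingClassifier(random_state=42))],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_train_tree, y_train)

stack_preds = stack.predict(X_test_tree)
print("stacked recall   :", recall_score(y_test, stack_preds))
print("stacked precision:", precision_score(y_test, stack_preds))
```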

Modeling summary

  • As we have seen, some of our models failed, some fared decently, while one performed great. Let us look at their numbers all in one place to get an idea of where and how they stand.
  • As can be seen, our GBDT stands above all and by a great margin in terms of recall. Since our most important concern is to predict ransomware transactions and misclassify positive class data points as little as possible, we go with the model that gives us the best Recall. Hence, we conclude our modeling by selecting GBDT.

Comparison with stock data

  • We have spent a lot of time engineering new features as well as fixing the skewness on existing ones among other data preprocessing steps such as scaling, encoding, and so on. A good question to ask is, how important was all of that for prediction? Let us find the answer to that question by training a GBDT on a dataset that has not been preprocessed (other than label encoding, etc.) and comparing it with our current GBDT.
  • Thankfully, we see that all of the efforts were not in vain as we see an increment of ~8% accuracy, 13% F1, 9% Precision, 4% Recall, and a log-loss that is better by ~2.6. Hence, we see that our data preprocessing along with all the feature engineering has played a very important role in predicting the target.

5. Conclusion

  • We started out with a dataset with extremely skewed features and with the goal of predicting if a bitcoin address was used for malicious intent or not. We performed a lot of data analysis and found out that the existing features needed some tweaking and new features had to be engineered to give our models some more juice.
  • We performed various transformations on our existing features and used the same to come up with brand new features which would later increase our predictive power substantially.
  • We looked at various kinds of models; distance-based, tree-based, and even stacked models. On investigating each model’s performance, we concluded that GBDT was the best performer.
  • Going back to our initial business constraints, we satisfy all three. Firstly, we maximized recall so as to minimize misclassification of a positive data point. The second constraint was that we did not have any strict latency constraint, which is why we were able to build such complicated models. The final constraint was that interpretability is very important. We can very easily have our model give out an explanation behind its outputs because its base-estimators (Decision Trees) are extremely interpretable. For feature importances and other graphs, please go through the notebook.

6. Future work

  • On close examination, we notice a trend in how our models performed. The more complicated a model is, the better it performed. This indicates that our data is not linearly separable and that the underlying function is very complex. Training a powerful Neural-network on this dataset might yield great performance because the network will be able to capture the non-linearity and fit to the underlying function very well.
  • Extending this project to work in real-time could also be researched and implemented.

References

  1. Goldsmith, D., Grauer, K., & Shmalo, Y. (2020). Analyzing hack subnetworks in the bitcoin transaction graph. Applied Network Science, 5, 1–20. link
  2. Rivera-Castro, R., Pilyugina, P., & Burnaev, E. (2019, November). Topological Data Analysis for Portfolio Management of Cryptocurrencies. In 2019 International Conference on Data Mining Workshops (ICDMW) (pp. 238–243). IEEE. link

All of the code is available on the GitHub repository.

Thank you for reading and for any corrections, suggestions, or questions, reach out to me via email: yogeshchandrasekharuni@gmail.com or on LinkedIn.
