Abstract
In a previous article, we saw the ease with which we could install and use Apache Spark within the SingleStore notebook environment. Continuing our series on Spark, we'll now use it to classify fraudulent credit card transactions.
The notebook file used in this article is available on GitHub.
Fraud dataset selection
We can find actual credit card data on Kaggle. The data are anonymised credit card transactions containing genuine and fraudulent cases.
The transactions occurred over two days during September 2013, and the dataset includes a total of 284,807 transactions, of which 492 are fraudulent, representing just 0.172% of the total.
This dataset, therefore, presents some challenges for analysis as it is highly unbalanced.
The dataset consists of the following fields:
- Time: The number of seconds elapsed between a transaction and the first transaction in the dataset
- V1 to V28: Details not available due to confidentiality reasons
- Amount: The monetary value of the transaction
- Class: The response variable (0 = no fraud, 1 = fraud)
One method to prepare the data for analysis is to keep all the fraudulent transactions and randomly sample 1% of the non-fraudulent transactions without replacement. The data would then be sorted on the Time column, giving a total of 3265 rows. However, many other approaches are possible.
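The sampling approach described above can be sketched with pandas as follows. This is a minimal illustration, not the code used to build the reduced dataset: the helper name undersample, the seed, and the synthetic stand-in data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def undersample(df: pd.DataFrame, frac: float = 0.01, seed: int = 42) -> pd.DataFrame:
    """Keep every fraudulent row, sample `frac` of the genuine rows, sort on Time."""
    fraud = df[df["Class"] == 1]
    # Sample the genuine rows without replacement
    genuine = df[df["Class"] == 0].sample(frac=frac, replace=False, random_state=seed)
    return pd.concat([fraud, genuine]).sort_values("Time").reset_index(drop=True)


# Synthetic stand-in for the full dataset (values are made up for illustration)
rng = np.random.default_rng(0)
n_genuine, n_fraud = 10_000, 20
full_df = pd.DataFrame({
    "Time": rng.integers(0, 172_800, size=n_genuine + n_fraud),  # two days in seconds
    "Amount": rng.exponential(80.0, size=n_genuine + n_fraud).round(2),
    "Class": [0] * n_genuine + [1] * n_fraud,
})

reduced_df = undersample(full_df)
print(reduced_df["Class"].value_counts())
```

Fixing the seed makes the sample reproducible. A different seed would give a different set of genuine transactions while keeping every fraudulent one.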
We'll show the following metrics, all derived from the confusion matrix:

                |      Predicted      |
Actual          | Positive | Negative |
----------------+----------+----------+
Positive        |    TP    |    FN    |
----------------+----------+----------+
Negative        |    FP    |    TN    |
----------------+----------+----------+
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Where
- Accuracy: Measures the proportion of correctly classified instances among all instances
- Precision: Quantifies the proportion of correctly identified positive cases out of all cases identified as positive
- Recall: Evaluates the proportion of correctly identified positive cases out of all actual positive cases
- F1 Score: Combines precision and recall into a single metric, balancing both measures to provide a comprehensive evaluation of a model's performance
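The four formulas above translate directly into code. The following is a small sketch: the function name classification_metrics is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example. The baseline case uses the dataset totals quoted earlier (492 fraudulent out of 284,807) to show why accuracy alone is misleading on unbalanced data.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A trivial model that labels every transaction in the full dataset as genuine:
# all 492 frauds become false negatives, all 284,315 genuine rows true negatives
baseline = classification_metrics(tp=0, fn=492, fp=0, tn=284_315)
print(baseline)
```

This baseline reaches an accuracy of roughly 99.8% while detecting no fraud at all (recall of 0), which is why precision, recall and F1 are reported alongside accuracy.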
Create a SingleStore Cloud account
A previous article showed the steps to create a free SingleStore Cloud account. We'll use the following settings:
- Workspace Group Name: Spark Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: spark-demo
- Size: S-00
Create a new notebook
From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.
We'll call the notebook spark_fraud_demo, select a Blank notebook template from the available options, and save it in the Personal location.
Fill out the notebook
First, let's install Java:
!conda install -y --quiet -c conda-forge openjdk
Next, we'll obtain the reduced dataset, already prepared, and load it into a Pandas DataFrame:
import pandas as pd

url = "https://raw.githubusercontent.com/VeryFatBoy/gpt-workshop/main/data/creditcard.csv"
pandas_df = pd.read_csv(url)
We can check the number of rows:
pandas_df.shape[0]
The output should be:
3265
We can check the distribution of the Class column:
pandas_df.groupby("Class").size()
The output should be:
Class
0 2773
1 492
dtype: int64
We can also output the first 5 rows, as follows:
pandas_df.head(5)
Since the details for the columns V1 to V28 are not available, the only feature we can meaningfully inspect is the Amount column:
pandas_df["Amount"].describe()
The output should be:
count 3265.000000
mean 86.715210
std 195.568876
min 0.000000
25% 4.490000
50% 21.900000
75% 80.310000
max 2917.640000
Name: Amount, dtype: float64
We can produce a quick plot of the Amount values using the following:
import plotly.express as px

fig = px.scatter(
    pandas_df,
    y = "Amount",
    color = pandas_df["Class"].astype(str),
    hover_data = ["Amount"]
)
fig.update_layout(
    # yaxis_type = "log",
    title = "Amount and Class"
)
fig.show()
The output is shown in Figure 2.
Another way we can look at the data is as a histogram:
fig = px.histogram(
    pandas_df,
    x = "Amount",
    nbins = 50
)
fig.show()
The output is shown in Figure 3.
Figures 2 and 3 show that the vast majority of transactions were small in value.
Next, let's create a SparkSession:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Fraud Detection").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
Then we'll train and evaluate a Logistic Regression model:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# Convert pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
# Select features (V1 to V28 and Amount, skipping Time at index 0) and the label
features = spark_df.columns[1:30]
labels = "Class"
# Assemble features into vector
assembler = VectorAssembler(
    inputCols = features,
    outputCol = "features"
)
spark_df = assembler.transform(spark_df).select("features", labels)
# Split the data into training and testing sets
train, test = spark_df.cache().randomSplit([0.7, 0.3], seed = 42)
# Initialise logistic regression model
lr = LogisticRegression(
    maxIter = 1000,
    featuresCol = "features",
    labelCol = labels
)
# Train the logistic regression model
train_model = lr.fit(train)
# Make predictions on the test set
predictions = train_model.transform(test)
# Calculate the accuracy, precision, recall, and F1 score of the model
# Note: the *ByLabel metrics use the evaluator's metricLabel, which defaults to 0.0 (the genuine class)
accuracy = predictions.filter(predictions.Class == predictions.prediction).count() / float(test.count())
evaluator = MulticlassClassificationEvaluator(
    labelCol = labels,
    predictionCol = "prediction"
)
precision = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "precisionByLabel"}
)
recall = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "recallByLabel"}
)
f1 = evaluator.evaluate(
    predictions,
    {evaluator.metricName: "fMeasureByLabel"}
)
Next, we'll create a Confusion Matrix:
# Create confusion matrix
cm = predictions.select("Class", "prediction")
cm = cm.groupBy("Class", "prediction").count()
cm = cm.toPandas()
# Pivot the confusion matrix
cm = cm.pivot(
    index = "Class",
    columns = "prediction",
    values = "count"
)
# Generate and plot the confusion matrix
fig = px.imshow(
    cm,
    x = ["Genuine (0)", "Fraudulent (1)"],
    y = ["Genuine (0)", "Fraudulent (1)"],
    color_continuous_scale = "Reds",
    labels = dict(x = "Predicted Label", y = "True Label")
)
# Add annotations to the heatmap
for i in range(len(cm)):
    for j in range(len(cm)):
        fig.add_annotation(
            x = j,
            y = i,
            text = str(cm.iloc[i, j]),
            font = dict(color = "white" if cm.iloc[i, j] > cm.values.max() / 2 else "black"),
            showarrow = False
        )
fig.update_layout(
    title_text = "Confusion Matrix - Logistic Regression",
    coloraxis_showscale = False
)
fig.show()
The output is shown in Figure 4.
Overall, most test transactions fall on the diagonal of the matrix, meaning they were classified correctly, with relatively few false positives and false negatives.
We can also print some metrics:
# Print the accuracy, precision, recall and f1 of the model
print("Accuracy: %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("F1: %.4f" % f1)
Example output:
Accuracy: 0.9817
Precision: 0.9862
Recall: 0.9924
F1: 0.9893
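One point worth keeping in mind when reading these figures: the precisionByLabel, recallByLabel and fMeasureByLabel metrics are computed for a single label, selected by the evaluator's metricLabel parameter, which defaults to 0.0. As far as we can tell, the values above therefore describe the genuine class rather than the fraudulent one. The sketch below shows how per-label precision and recall could be derived from grouped (actual, predicted, count) rows, similar in shape to the output of the groupBy used for the confusion matrix. The function name per_label_metrics is hypothetical, and the counts are invented for illustration; they are not the results behind Figure 4.

```python
def per_label_metrics(rows, label):
    """Return (precision, recall) for `label` from (actual, predicted, count) rows."""
    tp = sum(c for a, p, c in rows if a == label and p == label)
    fp = sum(c for a, p, c in rows if a != label and p == label)
    fn = sum(c for a, p, c in rows if a == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only: (actual Class, prediction, count)
rows = [(0, 0, 90), (0, 1, 2), (1, 0, 3), (1, 1, 5)]

genuine_precision, genuine_recall = per_label_metrics(rows, label=0)
fraud_precision, fraud_recall = per_label_metrics(rows, label=1)
print("Genuine: precision %.4f, recall %.4f" % (genuine_precision, genuine_recall))
print("Fraud:   precision %.4f, recall %.4f" % (fraud_precision, fraud_recall))
```

With unbalanced classes, the metrics for the majority class can look strong even when the minority class is detected less reliably, so it is worth reporting the fraud-class values as well.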
Finally, we'll stop Spark:
spark.stop()
Summary
In this short article, we've been able to use Apache Spark to build the first iteration of a fraud detection model using SingleStore notebooks. In the next article in this series, we'll use the SingleStore Spark Connector to read and write data using the SingleStore Data Platform. Stay tuned.