Applied Data Science on data breaches + Bonus

#cybersecurity #datascience #learning #python

Hello!

Today I decided to combine two domains: data science and cybersecurity.

Follow along and you'll see what I'm writing about.

What did I do?

I analyzed the number of attacks by organization type.
I downloaded the dataset from Kaggle.
Then, I started working on the data using JupyterLab and Python.

The notebook is for practice purposes: testing, observing, and playing with data.

As usual, first and foremost I imported the data. Then I loaded and cleaned the dataset.

Cleaning the data is a step that may be repeated several times, because EDA (Exploratory Data Analysis) is an iterative, non-sequential process. I therefore returned to it later on, in order to uncover meaningful insights.

A few words about statistics

I drew a simple random sample of n=40 to find out which organization type is more prone to cyberattacks, based on the number of attacks. Simple random sampling means that every member of the population has an equal chance of being selected.
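
To make the sampling step concrete, here is a minimal sketch using only Python's standard library. The record layout (`id`, `organization_type`) and the toy population are made up for illustration and are not the Kaggle dataset; a notebook could just as well do this with pandas.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n records without replacement; every record is equally likely to be picked."""
    if n > len(population):
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)

# Toy population (hypothetical values, NOT the real dataset).
org_types = ["healthcare", "banking", "education", "retail"]
population = [
    {"id": i, "organization_type": org_types[i % len(org_types)]}
    for i in range(200)
]

sample = simple_random_sample(population, n=40, seed=42)
print(len(sample))  # 40
```

Fixing the seed is a design choice for reproducibility: rerunning the notebook gives the same 40 records, which makes the later tests easier to compare.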

The hypotheses

Null Hypothesis (H0): There is no significant difference in the number of cyberattacks experienced by different types of organizations.
Alternative Hypothesis (H1): The number of cyberattacks differs significantly across different types of organizations.

Judging by the maximum number of attacks, the healthcare industry appeared the most prone, with 6 attacks. At the other end, banking had the lowest number of attacks, i.e. 1.
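
A comparison like this boils down to grouping the sampled records by organization type and taking the maximum per group. The sketch below shows one way to do it with the standard library; the `(organization_type, attacks)` pairs are hypothetical and were chosen only so that the extremes match the 6 and 1 mentioned above.

```python
from collections import defaultdict

def max_attacks_by_type(records):
    """Return {organization_type: highest attack count seen for that type}."""
    groups = defaultdict(list)
    for org_type, attacks in records:
        groups[org_type].append(attacks)
    return {org_type: max(counts) for org_type, counts in groups.items()}

# Hypothetical (organization_type, attacks) pairs, NOT the real sample.
records = [
    ("healthcare", 2), ("healthcare", 6), ("banking", 1),
    ("education", 3), ("banking", 1), ("retail", 4),
]

maxima = max_attacks_by_type(records)
most_prone = max(maxima, key=maxima.get)
least_prone = min(maxima, key=maxima.get)
print(most_prone, maxima[most_prone])    # healthcare 6
print(least_prone, maxima[least_prone])  # banking 1
```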

Finally, I performed a Shapiro-Wilk test to check whether the data are normally distributed. Its null hypothesis (normality) was rejected, so the data did not look normally distributed. I therefore applied the non-parametric Kruskal-Wallis test, and I failed to reject the Null Hypothesis (H0), meaning that no significant difference was detected between groups. In simpler terms, there was not enough evidence to confidently say that one organization type is more prone to cyberattacks than another.
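
For readers curious about what the Kruskal-Wallis test computes, here is a hedged sketch that spells it out with the standard library only. In a notebook one would normally call `scipy.stats.shapiro` and `scipy.stats.kruskal` instead. The function names and the toy groups below are illustrative assumptions, and the p-value uses the usual chi-square approximation with k-1 degrees of freedom.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square distribution with integer df >= 1."""
    y = max(x, 0.0) / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def kruskal_wallis(groups):
    """Return (H, p_value), with tie correction and the chi-square approximation."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1.0 - ties / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    h /= correction
    return h, chi2_sf(h, df=len(groups) - 1)

# Toy attack counts for three organization types (made up, unrelated to
# the result reported in this article).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h, p = kruskal_wallis(toy_groups)
print(round(h, 4), round(p, 4))  # 7.2 0.0273
```

With these toy numbers p is below 0.05, so H0 would be rejected; with the real sample the opposite happened, which is why no organization type could be singled out.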

Limitations and future considerations

No confidence level, margin of error, or confidence interval was set. The sample size was small, which makes it harder to detect statistically significant differences. In the future, sample selection will take these into account and a larger sample will be considered.

You can find the entire work on my GitHub page. 🧾

BONUS 🌟

As promised, this article has a bonus. The combination of data science and cybersecurity goes on: I created a write-up for the TryHackMe room Attacktive Directory!
At first glance, one could say that these topics are unrelated. Well, it's actually a demonstration of how a breach could take place! 😉 After all, data breaches happen somehow and for some reason.

Curious? Well, check out the write-up on my GitHub page.

What are your thoughts?

DEV Community
