sinanw/ml-classification-malicious-network-traffic

ML Classification - Network Traffic Analysis

This project analyzes and classifies a real network traffic dataset to separate malicious from benign traffic records. It compares several Machine Learning algorithms and tunes their performance to reach high accuracy while keeping False Positive and False Negative rates low.

Data Set (Aposemat IoT-23)

The dataset used in this demo is: CTU-IoT-Malware-Capture-34-1.

  • It is part of the Aposemat IoT-23 dataset.
  • It is a labeled dataset with malicious and benign IoT network traffic.
  • It was created as part of the Avast AIC laboratory, with funding from Avast Software.

Data Classification Details

The project is implemented in four distinct steps simulating the essential data processing and analysis phases.

  • Each step has a corresponding notebook inside the notebooks folder.
  • Intermediary data files are stored inside the data folder.
  • Trained models are stored inside the models folder.

PHASE 1 - Initial Data Cleaning

Corresponding notebook: initial-data-cleaning.ipynb

Implemented data exploration and cleaning tasks:

  1. Loading the raw dataset file into a pandas DataFrame.
  2. Exploring dataset summary and statistics.
  3. Fixing combined columns.
  4. Dropping irrelevant columns.
  5. Fixing unset values and validating data types.
  6. Checking the cleaned version of the dataset.
  7. Storing the cleaned dataset to a CSV file.
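
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline sample, the column names, the tab separator, the whitespace-joined last column, and the use of `-` as the unset marker are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the raw capture file (assumed layout: tab-separated,
# with the last three fields fused into one whitespace-separated column).
RAW = (
    "ts\tproto\tduration\tuid\ttunnel_parents   label   detailed-label\n"
    "1.0\ttcp\t0.5\tC1\t-   Benign   -\n"
    "2.0\tudp\t-\tC2\t-   Malicious   DDoS\n"
    "3.0\ttcp\t1.5\tC3\t-   Malicious   C&C\n"
)


def clean(raw_text: str) -> pd.DataFrame:
    # Step 1: load the raw data into a DataFrame.
    df = pd.read_csv(io.StringIO(raw_text), sep="\t")

    # Step 3: split the combined column into separate columns.
    combined = "tunnel_parents   label   detailed-label"
    parts = df[combined].str.split(expand=True)  # split on runs of whitespace
    parts.columns = ["tunnel_parents", "label", "detailed_label"]
    df = pd.concat([df.drop(columns=[combined]), parts], axis=1)

    # Step 4: drop columns assumed irrelevant for this example.
    df = df.drop(columns=["uid", "tunnel_parents", "detailed_label"])

    # Step 5: turn the assumed unset marker into NaN and fix data types.
    df = df.replace("-", np.nan)
    df["duration"] = pd.to_numeric(df["duration"])
    return df


cleaned = clean(RAW)

# Steps 2 and 6: inspect the result.
print(cleaned.dtypes)
print(cleaned.isna().sum())

# Step 7: serialize to CSV (returned as text here; pass a path to write a file).
csv_text = cleaned.to_csv(index=False)
```

Which columns count as irrelevant depends on the analysis in the notebook; the three dropped here are placeholders.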

PHASE 2 - Data Processing

Corresponding notebook: data-preprocessing.ipynb

Implemented data processing and transformation tasks:

  1. Loading the dataset file into a pandas DataFrame.
  2. Exploring dataset summary and statistics.
  3. Analyzing the target attribute.
  4. Encoding the target attribute using LabelEncoder.
  5. Handling outliers using IQR (Inter-quartile Range).
  6. Handling missing values:
    1. Imputing missing categorical features using KNeighborsClassifier.
    2. Imputing missing numerical features using KNNImputer.
  7. Scaling numerical attributes using MinMaxScaler.
  8. Encoding categorical features: handling rare values and applying One-Hot Encoding.
  9. Checking the processed dataset and storing it to a CSV file.
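
The following sketch walks through most of the transformations listed above on a toy DataFrame. The column names, the rarity threshold of 20%, the choice to clip (rather than drop) outliers at 1.5 × IQR, and `n_neighbors=2` are all assumptions for illustration; the notebook may handle each step differently. Categorical imputation with KNeighborsClassifier is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "duration": [0.5, 1.0, np.nan, 1.5, 2.0, 500.0],
    "orig_bytes": [10.0, 20.0, 30.0, np.nan, 50.0, 60.0],
    "proto": ["tcp", "tcp", "udp", "tcp", "icmp", "udp"],
    "label": ["Benign", "Malicious", "Malicious", "Benign", "Malicious", "Benign"],
})
num_cols = ["duration", "orig_bytes"]

# Encode the target; LabelEncoder assigns integers in sorted class order.
df["label"] = LabelEncoder().fit_transform(df["label"])

# Handle outliers with the IQR rule (assumed variant: clip to the fences).
for col in num_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Impute missing numerical values from the nearest rows.
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# Scale numerical attributes into [0, 1].
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Group rare categories (assumed threshold: under 20% of rows) into "other".
freq = df["proto"].value_counts(normalize=True)
rare = freq[freq < 0.2].index
df["proto"] = df["proto"].where(~df["proto"].isin(rare), "other")

# One-Hot Encode the categorical feature.
df = pd.get_dummies(df, columns=["proto"])

print(df.head())
```

Grouping rare values before One-Hot Encoding keeps the number of generated columns small when a categorical feature has a long tail of infrequent values.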

PHASE 3 - Model Training

Corresponding notebook: model-training.ipynb

Trained and analyzed classification models:

  1. Naive Bayes: ComplementNB
  2. Decision Tree: DecisionTreeClassifier
  3. Logistic Regression: LogisticRegression
  4. Random Forest: RandomForestClassifier
  5. Support Vector Classifier: SVC
  6. K-Nearest Neighbors: KNeighborsClassifier
  7. XGBoost: XGBClassifier
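
A comparison loop over these classifiers might look like the sketch below. It runs on synthetic data from `make_classification` rather than the project's dataset, uses mostly default hyperparameters, and includes XGBoost only if the `xgboost` package is installed; the metric choices (accuracy, False Positive rate, False Negative rate) follow the project description, but the exact evaluation code in the notebook is not shown here and may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed dataset (assumption for this example).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
# Features are scaled to [0, 1], as in Phase 2; ComplementNB needs non-negative input.
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Naive Bayes": ComplementNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Classifier": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
try:
    # Optional dependency; skipped when xgboost is not installed.
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier()
except ImportError:
    pass

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "fpr": fp / (fp + tn),  # benign records flagged as malicious
        "fnr": fn / (fn + tp),  # malicious records missed
    }

for name, scores in results.items():
    print(f"{name}: {scores}")
```

Reporting the False Positive and False Negative rates next to accuracy matters for traffic classification, where the two kinds of error usually carry different costs.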

Evaluation method:

Results were analyzed and compared for each considered model.

PHASE 4 - Model Tuning

Corresponding notebook: model-tuning.ipynb

Model tuning details:

  • Tuned model: Support Vector Classifier - SVC
  • Tuning method: GridSearchCV
  • Results were analyzed before/after tuning.
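
As a rough illustration of this tuning step, the sketch below runs GridSearchCV over an SVC on synthetic data and compares test accuracy before and after tuning. The parameter grid, the 5-fold cross-validation, and the accuracy scoring are assumptions chosen for the example, not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the processed dataset (assumption for this example).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: SVC with default hyperparameters.
baseline = SVC().fit(X_train, y_train)

# Hypothetical search space; the notebook's grid may be different.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

before = baseline.score(X_test, y_test)
after = search.best_estimator_.score(X_test, y_test)
print("best parameters:", search.best_params_)
print(f"test accuracy before tuning: {before:.3f}, after tuning: {after:.3f}")
```

The search only sees the training split, so the held-out test score gives a fair before/after comparison; tuning is not guaranteed to improve it on every dataset.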
