Portfolio

Data Science for Cybersecurity

Master Thesis: Unsupervised Machine Learning for Intrusion Detection Systems.

ML Intrusion Detection implemented using Autoencoders, One Class SVM and Isolation Forest.

This thesis explores anomaly detection of web-based attack on microservices based applications by modeling application performance metrics and service logs. The general idea is that a normal activity profile can be built upon the (simulated) normal activity on the web application and then the anomalies such as web attacks can be detected as different behaviour with respect to the normal activity. This task will be carried out by generating a dataset only containing normal activity and then train machine learning models to distinguish between the learnt behaviour and different behaviours.

Adversarial Attacks in Deep Learning Systems.

Adversarial Attack detection using SVD as a proactive measure.

Brief description

Machine learning models, crucial in various applications like autonomous driving and security systems, are vulnerable to adversarial attacks. These attacks manipulate data to degrade model performance or breach confidentiality.

Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) are two prominent adversarial techniques. FGSM performs one-step updates to maximize loss, while PGD is an iterative, robust version of FGSM.

Defensive strategies include denoising to clean inputs and Singular Value Decomposition (SVD) to reduce dimensionality and remove perturbations. Practical demonstrations using the CleverHans library showcased these methods on datasets like MNIST.

While these defenses are effective, they require access to training data and can be computationally intensive, especially for large datasets.

Computer Vision and Signals

Image and Audio SuperResolution using CNN and GANs.

CNN and GANs for super resolution.

The main objective is to train Convolutional Neural Networks and Generative Adversarial Network for the task of super resolution: the enhancement of 1D (audio) and 2D(images) signals. This repository contains the demo that uses our trained models to apply super resolution to images and audio.

Pill Quality Control / Classification and Augmentation.

Image classification and augmentation using traditional techniques and Generative Adversarial Neural Networks.

The first objective of this project is to perform classification on pills, specifically trying to detect if in a quality control scenario is possibile to detect pills with cosmetics defects like chips or dirt. This task has been carried out training CNNs from scratch and comparing them with pre-trained nets. The second objective is to remedy for the lack of training data using generative adversarial neural networks (GANs), combined with traditional data augmentation.

Natural Language Processing

AppointMate: Appointment Agent App.

A conversational AI agent built with LangChain that allows users to book appointments with a professional via Telegram. The agent can check availability based on a schedule stored in an SQLite database and confirm bookings, optionally sending email notifications.

Functionalities

Conversational Interface: Interacts with users naturally through Telegram.
Availability Checking:
- Understands natural language date queries (e.g., “today”, “tomorrow”, “next Friday”, “July 10th”) using dateparser.
- Consults an SQLite database for existing appointments.
- Checks against predefined working hours.
- Presents available time slots to the user.
Appointment Booking:
- Guides the user to select a specific available slot.
- Prompts the user for their name if not already provided in the conversation.
- Saves the confirmed appointment to the SQLite database, preventing double booking.
Edit Appointments:
- Allows the user to edit exisiting appointment, checking for available slots.
LLM Integration: Supports using different Language Models:
- OpenAI models (e.g., GPT-4o-mini) via API.
- Local models via Ollama (e.g., Llama 3).
(Optional) Email Notifications:
- Sends a confirmation email to the professional upon successful booking.
- Attaches an .ics calendar invite file for easy addition to calendars (like Google Calendar, Outlook).
- Requires SMTP configuration in the .env file to function.

RAGademic: LLM+RAG for your notes.

This project provides a Retrieval-Augmented Generation (RAG) system to query your university notes and books efficiently (could be used to retrieve any kind of document really, but this is the use case I've opted for! :) ). It allows users to upload PDF documents, store them in a Chroma vector database, and interact with them through an LLM-powered chatbot. The project also includes a visualization tool for document embeddings.

Features

Chat with your notes: Ask questions, and the system retrieves relevant information using embeddings.
Database management: Upload, delete, and visualize document embeddings.
Arxiv source addition: Search for papers on specific topics and add them to your knowledge base.
Three LLM options:
- Use OpenAI's API
- Use local models like LLaMA3.2
- Use the new Gemma3 via ollama chat
Three Embedding options:
- Use OpenAI's embedder
- Use HF embedder from the hub
- Use chroma builtin local embedder
Persistent memory: Keeps track of conversations in between sessions, using llama3.2 as a summarizer to keep context windows occupation optimized.
Web Search Agent: If activated, it evaluates (using local llama3.2) the pertinence of the local knowledge wrt the query of the user and if necessary it scrapes the web for more context.

NIPS Papers: Topic Modelling and Text Summarization.

The Neural Information Processing System (NIPS) is a machine learning and computational neuroscience competition held every year from 1987 to date. We tackled a dataset containing all the papers submitted from 1987 to 2017, with the aim of applying unsupervised NLP models for topic modelling (LDA/PLSA) and a supervised approach for extractive text summarization (topic representation and indicator representation).

7 Sins Diachronical Analysis

The aim of the work is to perform a diachronic analysis of the 7 deadly sins to find out how their meaning and use has changed from the 19th century to the 21st century. Additionally, we tried implementing some comparison metrics for the embedding models we used.
This work involved using different word embedding techniques: Word2Vec and GloVe, while using CADE to align the text corpora and analyze the semantic difference of the words between the 1800s and 2000s. Additionally, we implemented a geometrical comparison technique to evaluate how different the embeddings are built between W2V and GloVe.

Time Series Analysis

Restaurant’s Revenue Loss during first COVID-19 pandemic lockdown (ITA).

Time Series Analysis and Forecasting (using ARIMA , and Mixture models ) of a restaurant's revenue during the first lockdown of the COVID-19 pandemic in Italy, to estimate the loss incurred..

One of the sectors most affected by the Covid-19 pandemic has certainly been the restaurant industry. Due to the related restrictions, restaurant owners saw their revenues plummet dramatically. In such a historical period, it can be very useful to analyze historical data to try to study and predict what the daily or weekly revenues will be in order to adjust the supply of raw materials accordingly. In this paper, models from the ARIMA family and the Cluster-Weighted Model with and without cross-validation were used for forecasting time series.

Energy Consumption Forecast (ITA).

Time Series Analysis and Forecasting (using ARIMA , UCM and LSTMs models) of energy consumption.

Energy consumption forecasts are crucial for various purposes, from purchasing energy from the producer to managing overloads. This paper analyzes a univariate time series of energy consumption sampled every 10 minutes. The provided period spans from 01/01/2017 to 30/11/2017, with the aim of estimating December's consumption. The dataset, in total, consists of 48,096 observations. There are no additional details such as weather, holidays, and other typical location-specific characteristics.

Data Visualization

Netflix Top 10 Quality: Data Analysis & Interactive Visualization (ITA).

Netflix calculates the "Weekly Top 10" simply by ordering movies or TV series based on the hours viewed in the last 7 days in descending order. But does this ranking truly reward titles of higher quality or the most popular ones? Are movies and TV series treated equally? Is the Top 10 genuinely helpful for users in selecting the best titles, or does it feature lower-quality content compared to what is available in Netflix's catalog?
Netflix is one of the most widely used streaming platforms, boasting over 214 million accounts. The platform employs one of the most effective recommendation systems, featuring on each user's homepage the most popular movies and TV series that align with the subscriber's preferences. A common feature among all accounts is the "Weekly Top 10," which appears at the top of the homepage and is updated every Sunday. Millions of people see this list of the top ten movies or TV series every day, inevitably influencing users' choices. Furthermore, it serves as an invaluable showcase for every actor and director, both emerging and established. The objective of this research is to analyze data related to each movie and TV series that made it into the Top 10 in the last six months to answer these research questions.

Infographics: PROM score and the possible relationship with weather conditions (ITA).

PROMs are patient-reported outcome measures following an operation or health treatment, often used to assess the quality of health care.
We evaluated, through some infographics made through Python, using the matplotlib and Seaborn libraries, the possible presence of a relationship between the outcomes of mental and physical health status assessments of a sample of patients, following surgery, and the gender and time(day and night) relative to the time of questionnaire completion.

Data Management

Data Acquisition and Modeling: Movies and Tv series using Netflix and IMDB Data (ITA).

Data Acquisition and Modeling: Document-based database containing information related to IMDB and Netflix, scraped from various sources and obtained via API.

The following study focuses on the acquisition, aggregation, integration, cleaning, and storage of a series of datasets related to the Netflix streaming platform in MongoDB. In particular, various data acquisition techniques such as web scraping and APIs were employed. Once the necessary data to answer research questions were obtained, they underwent cleaning and enrichment. Dozens of attributes related to titles that made it into Netflix's weekly Top 10 were derived from data provided by IMDb. Through the comprehensive enrichment process, a more enriched dataset with a more "flexible" structure was obtained and stored in the document-based MongoDB database.