Telco Data Anonymization Techniques and Tools
Explore the sensitive data involved in telco anonymization, techniques such as GANs and Autoencoders, and tools like Microsoft's Presidio and Python libraries for effective data anonymization in the telecommunications field.
Presentation Transcript
Telco Data Anonymization
Sridhar, Rohith
Dataset: What constitutes sensitive data?
PII:
- Names (systems, domains, individuals, organizations, places, etc.)
- Addresses (IP and MAC)
- Telco fields: IMSI, IMEI, MSIN, MSISDN, MCC+MNC
- Location data (GPS, Cell-ID, Count, etc.)
Open question: is anything missing?
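Several of the telco fields listed above are related: an IMSI is the concatenation MCC + MNC + MSIN, so the subscriber-specific part can be masked while the country and network codes are kept. The following is a minimal sketch of that idea, not part of the original slides. The names `ImsiParts`, `split_imsi`, and `mask_msin` are hypothetical, the sample IMSI is made up, and the MNC length (2 or 3 digits) is an assumption the caller has to supply because the IMSI alone does not reveal it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImsiParts:
    mcc: str   # Mobile Country Code, 3 digits
    mnc: str   # Mobile Network Code, 2 or 3 digits
    msin: str  # Mobile Subscription Identification Number (subscriber part)


def split_imsi(imsi: str, mnc_len: int = 2) -> ImsiParts:
    """Split an IMSI into MCC, MNC and MSIN.

    mnc_len is supplied by the caller (assumption): the IMSI itself does not
    say whether the MNC has 2 or 3 digits.
    """
    if mnc_len not in (2, 3):
        raise ValueError("mnc_len must be 2 or 3")
    if not (imsi.isascii() and imsi.isdigit()):
        raise ValueError("IMSI must contain decimal digits only")
    if not 3 + mnc_len < len(imsi) <= 15:
        raise ValueError("IMSI has an unexpected length")
    return ImsiParts(imsi[:3], imsi[3:3 + mnc_len], imsi[3 + mnc_len:])


def mask_msin(imsi: str, mnc_len: int = 2, fill: str = "X") -> str:
    """Keep MCC and MNC, replace every MSIN digit with a fill character."""
    parts = split_imsi(imsi, mnc_len)
    return parts.mcc + parts.mnc + fill * len(parts.msin)


if __name__ == "__main__":
    sample = "404451234567890"  # made-up value for illustration
    print(split_imsi(sample))
    print(mask_msin(sample))    # 40445XXXXXXXXXX
```

Masking keeps the network-level information that an analysis may still need, but it is not reversible and it collapses all subscribers of one network to the same value, so it suits only some use cases.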
Dataset: Sources?
PII type | Dataset (links)
Names (systems, domains, individuals, organizations, places, etc.) | ServerLOG1, ServerLog2, LogHub, Campus
Address (IP and MAC) | Internet Traffic Dataset: EX1, EX2
Telco fields (IMSI, IMEI, MSIN, MSISDN, MCC+MNC) | Adult Dataset enhanced with telco fields: generate random IMEI/IMSI* fields and add them to the Adult Dataset
Location data (GPS, Cell-ID, Count, etc.) | OpenCellID, GPS, IEEE-dataport (crawdad)
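The enhancement step for the Adult Dataset (generate random IMEI/IMSI fields and attach them to each record) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual script: the function names are hypothetical, the sample rows are made up, and the default MCC "001" and MNC "01" are placeholder values. The IMEI is built as 14 random digits plus a Luhn check digit, which is the standard IMEI layout.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the rightmost payload digit; double every second digit,
    # starting with the rightmost one (the check digit will follow it).
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    """Check that the last digit of number is the Luhn digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def random_imei(rng: random.Random) -> str:
    """14 random digits (TAC + serial) followed by the Luhn check digit."""
    payload = "".join(str(rng.randrange(10)) for _ in range(14))
    return payload + str(luhn_check_digit(payload))


def random_imsi(rng: random.Random, mcc: str = "001", mnc: str = "01") -> str:
    """MCC + MNC + random MSIN, padded to 15 digits. mcc/mnc are placeholders."""
    msin_len = 15 - len(mcc) - len(mnc)
    msin = "".join(str(rng.randrange(10)) for _ in range(msin_len))
    return mcc + mnc + msin


def add_telco_fields(rows, seed=0):
    """Return copies of the rows with synthetic 'imei' and 'imsi' columns."""
    rng = random.Random(seed)  # seeded so the enhanced dataset is reproducible
    return [dict(row, imei=random_imei(rng), imsi=random_imsi(rng)) for row in rows]


if __name__ == "__main__":
    # Made-up rows that mimic two Adult Dataset columns.
    adult_rows = [{"age": 39, "education": "Bachelors"},
                  {"age": 50, "education": "HS-grad"}]
    for row in add_telco_fields(adult_rows):
        print(row)
```

Because the identifiers are random, they carry no relationship to the other columns. That is enough for testing whether a tool detects and masks the fields, but not for studying correlations between identifiers and attributes.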
What is the plan?
1. Agree on what constitutes sensitive data.
2. Agree on the problem set (the questions we want to answer).
3. Try available tools (libraries) and techniques (implementations) on the available datasets.
4. Find the gaps in datasets, tools, and techniques.
5. Fill those gaps with the problem set in mind.
6. Publish the results.
What are the techniques we are trying?
4. GANs: synthetic data generation as a perfect anonymization solution.
3. Autoencoders: unsupervised techniques for anonymization.
2. Natural Language Processing: NLP techniques for logs.
1. Classic techniques: K-Anonymity, L-Diversity, T-Closeness, Differential Privacy.
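For the classic techniques, a quick way to see what K-Anonymity and L-Diversity measure is to compute them on a small table. The sketch below is illustrative only: the function names, column names, and rows are made up, and it uses the simplest ("distinct") form of L-Diversity, which counts distinct sensitive values per group. Records are grouped by their quasi-identifier values; k is the size of the smallest group, and l is the smallest number of distinct sensitive values in any group.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same values on all quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest equivalence class (the dataset's k)."""
    if not rows:
        raise ValueError("rows must not be empty")
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any class (distinct l)."""
    if not rows:
        raise ValueError("rows must not be empty")
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in groups.values())


if __name__ == "__main__":
    # Made-up records: cell_id and age_band act as quasi-identifiers,
    # plan is treated as the sensitive attribute.
    records = [
        {"cell_id": "A", "age_band": "20-29", "plan": "prepaid"},
        {"cell_id": "A", "age_band": "20-29", "plan": "postpaid"},
        {"cell_id": "B", "age_band": "30-39", "plan": "prepaid"},
        {"cell_id": "B", "age_band": "30-39", "plan": "prepaid"},
        {"cell_id": "B", "age_band": "30-39", "plan": "postpaid"},
    ]
    qi = ["cell_id", "age_band"]
    print(k_anonymity(records, qi))          # 2
    print(l_diversity(records, qi, "plan"))  # 2
```

These functions only measure a dataset. Reaching a target k or l still requires generalizing or suppressing quasi-identifier values, which is where the choice of technique matters.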
Questions we are trying to answer
1. Do we have datasets that contain all of the sensitive information?
   a. Are they well used, freely available, of significant size, etc.?
2. What kind of sensitive information is well suited to each technique?
   a. Map each type of sensitive information to a technique.
3. Is there a single technique that applies to all kinds of sensitive information?
4. Can we build a tool that takes in a dataset and anonymizes it automatically (with no manual intervention) using the best technique?
Tools and Libraries
Python-based:
- Manual: randomization, perturbation, aggregation, etc. Implementation of classic techniques in Python.
- Libraries: Faker, mimesis, radar, AnonymizeDF, pynonymizer
Microsoft's Presidio
Tensorflow-privacy
Flair (NLP)
Artlabss open-data-anonymizer
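The "manual" techniques named above (randomization, perturbation, aggregation) can each be written in a few lines of standard-library Python. The sketch below is one possible reading of those terms, not the project's implementation: randomization is shown as shuffling a column, perturbation as adding zero-mean Laplace noise, and aggregation as binning values. All function names and sample values are hypothetical.

```python
import random


def randomize(values, rng):
    """Randomization (here: permutation). Shuffle a column so that its
    values no longer line up with the rows they came from."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def perturb(values, scale, rng):
    """Perturbation: add zero-mean Laplace noise with the given scale."""
    # The difference of two independent exponential samples with rate
    # 1/scale follows a Laplace(0, scale) distribution.
    rate = 1.0 / scale
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


def aggregate(values, bin_width):
    """Aggregation: replace each value with the lower bound of its bin."""
    return [(v // bin_width) * bin_width for v in values]


if __name__ == "__main__":
    rng = random.Random(42)       # seeded for reproducibility
    ages = [23, 37, 41, 58]       # made-up sample column
    print(randomize(ages, rng))
    print(perturb(ages, scale=2.0, rng=rng))
    print(aggregate(ages, 10))    # [20, 30, 40, 50]
```

Laplace noise is also the building block of the Laplace mechanism in Differential Privacy, where the scale is set to the query's sensitivity divided by the privacy budget epsilon. The sketch above does not track a privacy budget, so it should not be read as a differentially private release on its own.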
GANs
Refer to the Synthetic Data Generation problem at ITU by the Thoth Project, and its implementation:
https://github.com/ITU-AI-ML-in-5G-Challenge/ML5G-PS-009-Synthetic-Observability-Data-Generation-using-GANs
Refer to this paper:
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=5V852JcAAAAJ&sortby=pubdate&citation_for_view=5V852JcAAAAJ:t7izwRedFcYC
https://cni.iisc.ac.in/seminars/2022-09-20/
Other similar works:
https://github.com/ydataai/ydata-synthetic
https://github.com/Pushkar-v/Generating-Synthetic-Data-using-GANs
Current status and resource requirements
https://github.com/sknrao/anonymization
Phases:
- Experimental phase: answering the initial questions. [WE ARE HERE]
- Development phase: filling the gaps, unified tool*
- Evaluation phase: testing with large datasets, assessment of anonymization.
Resource requirements: We have access to servers with a good amount of RAM and fast processors. However, we need a server with GPUs, mainly to run GAN-based anonymization techniques.