Telco Data Anonymization Techniques and Tools

 
Telco Data Anonymization
 
Sridhar, Rohith
 
Dataset: What constitutes Sensitive data?
 
PII
Names (Systems, Domain, Individuals, Organizations, Places, etc.)
Address (IP and MAC)
Telco Fields - IMSI, IMEI, MSIN, MSISDN, MCC+MNC
Location Data (GPS, Cell-ID, Count, etc.)
<Anything missing?>
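
One way to treat the identifier fields listed above is to pseudonymize or generalize each one according to its structure. The sketch below is illustrative only and is not taken from the slides: it assumes a 15-digit IMSI with a 3-digit MCC and a 2-digit MNC, and the function names and `SECRET_KEY` are hypothetical. Keyed hashing is deterministic, so records stay linkable; this is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical key, for illustration only; a real deployment would load it from a secret store.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize_imsi(imsi: str, key: bytes = SECRET_KEY, keep: int = 5) -> str:
    """Keep the first `keep` digits (assumed MCC + 2-digit MNC) and replace
    the remaining MSIN digits with digits derived from a keyed hash."""
    prefix, msin = imsi[:keep], imsi[keep:]
    digest = hmac.new(key, imsi.encode(), hashlib.sha256).hexdigest()
    # Reduce the digest to a decimal number with as many digits as the MSIN.
    number = int(digest, 16) % (10 ** len(msin))
    return prefix + str(number).zfill(len(msin))


def truncate_ip(address: str, prefix_len: int = 24) -> str:
    """Generalize an IPv4 address by zeroing its host bits."""
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return str(network.network_address)


def mask_mac(mac: str) -> str:
    """Keep the vendor prefix (first three octets) and blank the device-specific half."""
    parts = mac.lower().replace("-", ":").split(":")
    return ":".join(parts[:3] + ["00", "00", "00"])
```

For example, `truncate_ip("192.168.10.57")` returns `"192.168.10.0"`, and `pseudonymize_imsi` returns the same 15-digit value every time it is given the same input and key.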
 
Dataset: Sources?
 
What is the plan?

1. Agree on what constitutes 'sensitive' data.
2. Agree on the problem set (the questions we want to answer).
3. Try available tools (libraries) and techniques (implementations) on the available datasets.
4. Find the gaps in datasets, tools and techniques.
5. Fill those gaps with the problem set in mind.
6. Publish the results.
 
What are the techniques we are trying?
 
Questions we are trying to answer
 
1. Do we have datasets that contain all the types of sensitive information?
   a. Are they widely used, freely available, of significant size, etc.?
2. Which kinds of sensitive information are well suited to each technique?
   a. Map each type of sensitive information to a technique.
3. Is there a single technique that applies to all kinds of sensitive information?
4. Can we build a tool that takes in a dataset and anonymizes it automatically (with no manual intervention) using the best technique?
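
To make questions 2 and 4 concrete, here is a minimal sketch of what an automatic planner might look like: it guesses a column's type from its values and looks up a technique. Everything in it is an assumption for illustration. The regular expressions are simplistic, the names `PATTERNS`, `TECHNIQUE_FOR`, `detect_type` and `plan` are hypothetical, and the type-to-technique mapping is a placeholder for whatever the experiments actually find.

```python
import re

# Simplistic value patterns (assumptions); a 15-digit number could be an IMSI or an IMEI.
PATTERNS = {
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "mac": re.compile(r"^(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$"),
    "imsi_or_imei": re.compile(r"^\d{15}$"),
}

# Placeholder mapping; finding the real one is the purpose of question 2.
TECHNIQUE_FOR = {
    "ipv4": "prefix truncation",
    "mac": "vendor-prefix masking",
    "imsi_or_imei": "keyed pseudonymization",
    "unknown": "manual review",
}


def detect_type(values):
    """Return the first pattern name that matches every value, else 'unknown'."""
    for name, pattern in PATTERNS.items():
        if values and all(pattern.match(v) for v in values):
            return name
    return "unknown"


def plan(table):
    """Map each column name to the technique chosen for its detected type."""
    return {col: TECHNIQUE_FOR[detect_type(vals)] for col, vals in table.items()}
```

Free-text columns such as log messages fall through to "manual review" here; handling them is where the NLP-based tools would come in.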
 
Tools and Libraries
 
Python-based:
Manual: Randomization, perturbation, aggregation, etc.
Implementation of classic techniques in Python.
Libraries: Faker, mimesis, radar, AnonymizeDF, pynonymizer
Microsoft’s Presidio
Tensorflow-privacy
Flair (NLP)
Artlabss open-data-anonymizer
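
The "manual" techniques named above (randomization, perturbation, aggregation) can be sketched with the Python standard library alone. This is a minimal illustration under assumed data shapes (rows as dictionaries, locations as latitude/longitude floats); the function names are hypothetical and the noise range and rounding precision are arbitrary choices.

```python
import random


def perturb(values, spread, rng):
    """Perturbation: add uniform noise in [-spread, +spread] to each value."""
    return [v + rng.uniform(-spread, spread) for v in values]


def shuffle_column(rows, column, rng):
    """Randomization: permute one column across rows. The column keeps its
    overall distribution, but values are no longer tied to their original rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: v} for row, v in zip(rows, values)]


def aggregate_location(lat, lon, decimals=2):
    """Aggregation: snap a GPS fix to a coarse grid by rounding
    (two decimals is roughly 1 km in latitude)."""
    return round(lat, decimals), round(lon, decimals)
```

Passing an explicit `random.Random(seed)` as `rng` keeps experiments repeatable; for a real release the seed would have to stay secret or be dropped.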
 
GANs
 
Refer to "Synthetic Data Generation Problem at ITU by Thoth Project" and its implementation:
https://github.com/ITU-AI-ML-in-5G-Challenge/ML5G-PS-009-Synthetic-Observability-Data-Generation-using-GANs
Refer to this paper:
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=5V852JcAAAAJ&sortby=pubdate&citation_for_view=5V852JcAAAAJ:t7izwRedFcYC
https://cni.iisc.ac.in/seminars/2022-09-20/
Other similar works:
https://github.com/ydataai/ydata-synthetic
https://github.com/Pushkar-v/Generating-Synthetic-Data-using-GANs
 
Current status and Resource requirements
 
https://github.com/sknrao/anonymization
Phase:
Experimental Phase - answering the initial questions. [WE ARE HERE]
Development Phase - filling the gaps, unified tool*
Evaluation Phase - testing with large datasets, assessment of anonymization.
Resource requirements:
We have access to servers with a good amount of RAM and processing power. However, we need a server with GPUs, mainly to run GAN-based anonymization techniques.

Explore the sensitive data involved in telco anonymization, techniques such as GANs and Autoencoders, and tools like Microsoft's Presidio and Python libraries for effective data anonymization in the telecommunications field.

  • Telco Anonymization
  • Data Privacy
  • GANs
  • Autoencoders
  • Python Libraries

Uploaded on Jul 02, 2024 | 1 Views



Presentation Transcript



  3. Dataset: Sources?
     PII Type: Dataset (links)
     Names (Systems, Domain, Individuals, Organizations, Places, etc.): ServerLOG1, ServerLog2, LogHub, Campus
     Address (IP and MAC): Internet Traffic Dataset: EX1, EX2
     Telco Fields (IMSI, IMEI, MSIN, MSISDN, MCC+MNC): Adult Dataset enhanced with Telco-Fields. Adult Dataset: generate random IMEI/IMSI* fields and add them to this dataset.
     Location Data (GPS, Cell-ID, Count, etc.): OpenCellID, GPS, IEEE-dataport (crawdad)
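
The step of generating random IMEI/IMSI fields and adding them to a dataset could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' generator: the IMEI is built as 14 random digits plus a Luhn check digit, the IMSI as MCC + MNC + random MSIN padded to 15 digits, and the default `001`/`01` codes are used here as test-network values. The helper names are hypothetical.

```python
import random


def luhn_check_digit(digits: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the rightmost digit, doubling every second digit starting with it.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def random_imei(rng: random.Random) -> str:
    """14 random digits followed by their Luhn check digit."""
    body = "".join(str(rng.randrange(10)) for _ in range(14))
    return body + str(luhn_check_digit(body))


def random_imsi(rng: random.Random, mcc: str = "001", mnc: str = "01") -> str:
    """MCC + MNC + random MSIN, padded to 15 digits in total."""
    msin_len = 15 - len(mcc) - len(mnc)
    msin = "".join(str(rng.randrange(10)) for _ in range(msin_len))
    return mcc + mnc + msin


def add_telco_fields(rows, seed=0):
    """Return copies of the rows with synthetic 'imei' and 'imsi' columns added."""
    rng = random.Random(seed)
    return [{**row, "imei": random_imei(rng), "imsi": random_imsi(rng)} for row in rows]
```

Because the identifiers are drawn independently of the other columns, they carry no real correlation with the records; that is fine for exercising anonymization tools but worth remembering when evaluating utility.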


  5. What are the techniques we are trying?
     1. Classic Techniques: K-Anonymity, L-Diversity, T-Closeness, Differential Privacy
     2. Natural Language Processing: NLP techniques for the logs.
     3. Autoencoders: Unsupervised techniques for the anonymization.
     4. GANs: Synthetic data generation as a perfect anonymization solution.
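
Two of the classic techniques listed above can be sketched briefly: a k-anonymity check over generalized quasi-identifiers, and a differentially private count using the Laplace mechanism. The sample rows, the column names and the generalization rule (10-year age bands, cell IDs cut to three characters) are made up for illustration, and the Laplace noise is sampled as the difference of two exponential draws.

```python
import random
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def generalize(row):
    """Coarsen quasi-identifiers: 10-year age bands and a truncated cell ID."""
    low = (row["age"] // 10) * 10
    return {**row, "age": f"{low}-{low + 9}", "cell_id": row["cell_id"][:3] + "*"}


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: add Laplace(0, sensitivity / epsilon) noise to a count."""
    rate = epsilon / sensitivity
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


# Hypothetical sample data.
rows = [
    {"age": 23, "cell_id": "A1017", "plan": "prepaid"},
    {"age": 27, "cell_id": "A1042", "plan": "postpaid"},
    {"age": 34, "cell_id": "B2203", "plan": "prepaid"},
    {"age": 38, "cell_id": "B2277", "plan": "prepaid"},
]
QI = ["age", "cell_id"]

print(k_anonymity(rows, QI))                           # 1: every row is unique
print(k_anonymity([generalize(r) for r in rows], QI))  # 2 after generalization
```

The second group (ages 30-39, cells B22*) is 2-anonymous but every member has the same plan, which is the kind of leak L-Diversity is meant to catch.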

