Telco Data Anonymization Techniques and Tools

Telco Data Anonymization

Sridhar, Rohith

Dataset: What constitutes Sensitive data?

● PII
  ○ Names (Systems, Domain, Individuals, Organizations, Places, etc.)
  ○ Address (IP and MAC)
  ○ Telco Fields - IMSI, IMEI, MSIN, MSISDN, MCC+MNC
  ○ Location Data (GPS, Cell-ID, Count, etc.)
  ○ <Anything missing?>
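As an illustration for the address and telco-field items listed above, here is a minimal sketch of two common treatments: keyed-hash pseudonymization (the same input always maps to the same token) and IP prefix truncation. The function names, the key, and the prefix lengths are assumptions made for this example and do not come from the slides. Keyed hashing is pseudonymization rather than full anonymization, since anyone holding the key can re-link tokens to known identifiers.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical key for illustration only; a real deployment would load it
# from a secret store and rotate it.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY, length: int = 16) -> str:
    """Return a deterministic keyed token for an identifier (MAC, IMSI, name)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def truncate_ip(address: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero the host bits of an IP address and keep only the network prefix."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(truncate_ip("192.168.17.42"))       # host octet is zeroed
    print(pseudonymize("AA:BB:CC:11:22:33"))  # 16 hex characters
```

Truncation keeps coarse network structure for traffic analysis, while the keyed token keeps joinability across tables. Which trade-off is acceptable depends on the problem set.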

Dataset: Sources?

What is the plan?

1. Agree on what constitutes ‘sensitive’ data.
2. Agree on the problem set (the questions we want to answer).
3. Try available tools (libraries) and techniques (implementations) on the available datasets.
4. Find the gaps in datasets, tools, and techniques.
5. Fill those gaps with the problem set in mind.
6. Publish the results.

What are the techniques we are trying?

Questions we are trying to answer

1. Do we have datasets that contain all of the sensitive information?
   a. Are they well used, freely available, of significant size, etc.?
2. What kind of sensitive information is well suited to each of the techniques?
   a. Map each type of sensitive information to a technique.
3. Is there a single technique that is applicable to all kinds of sensitive information?
4. Can we build a tool that takes in a dataset and anonymizes it automatically (with no manual intervention) using the best technique?
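Questions 2 and 4 can be pictured with a small sketch: detect the type of a column from its values, then dispatch to a technique mapped to that type. Everything here is an assumption made for illustration (the regular expressions, the 90% match threshold, the masking handlers, and the choice to keep the first five digits of a 15-digit identifier). It is not the project's tool, and real data would need far more robust detection.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical detectors: one regular expression per sensitive type.
PATTERNS = {
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "mac": re.compile(r"^(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}$"),
    "imsi_or_imei": re.compile(r"^\d{15}$"),
}


def mask_ipv4(value: str) -> str:
    """Zero the last octet."""
    return ".".join(value.split(".")[:3] + ["0"])


def mask_mac(value: str) -> str:
    """Keep the first three octets (vendor prefix), zero the rest."""
    sep = value[2]
    return value[:8] + sep + sep.join(["00"] * 3)


def mask_subscriber_id(value: str) -> str:
    """Keep the first five digits (assumed MCC plus two-digit MNC), zero the rest."""
    return value[:5] + "0" * (len(value) - 5)


# Hypothetical mapping of sensitive type to technique.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "ipv4": mask_ipv4,
    "mac": mask_mac,
    "imsi_or_imei": mask_subscriber_id,
}


def detect_type(values: List[str], threshold: float = 0.9) -> Optional[str]:
    """Return the first type whose pattern matches at least `threshold` of values."""
    if not values:
        return None
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return name
    return None


def anonymize_column(values: List[str]) -> Tuple[Optional[str], List[str]]:
    """Detect the column type and apply the mapped handler to matching values."""
    kind = detect_type(values)
    if kind is None:
        return None, list(values)
    handler, pattern = HANDLERS[kind], PATTERNS[kind]
    return kind, [handler(v) if pattern.match(v) else v for v in values]
```

A 15-digit pattern cannot tell an IMSI from an IMEI, which is one concrete example of why fully automatic selection of the "best technique" is an open question rather than a solved step.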

Tools and Libraries

● Python-based:
  ○ Manual: Randomization, perturbation, aggregation, etc.
  ○ Implementation of classic techniques in Python.
  ○ Libraries: Faker, mimesis, radar, AnonymizeDF, pynonymizer
  ○ Microsoft’s Presidio
  ○ Tensorflow-privacy
  ○ Flair (NLP)
  ○ Artlabss open-data-anonymizer
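The "manual" techniques in the list above (randomization, perturbation, aggregation) can be sketched with the standard library alone. The function names, the Gaussian noise model, and the fixed-width buckets are assumptions chosen for this example, not the project's implementation.

```python
import random
from typing import List, Optional, Sequence


def shuffle_column(values: Sequence, seed: Optional[int] = None) -> List:
    """Randomization: permute a column so values no longer align with their rows."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def perturb(values: Sequence[float], sigma: float,
            seed: Optional[int] = None) -> List[float]:
    """Perturbation: add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


def aggregate(value: int, width: int) -> str:
    """Aggregation: replace an integer with the fixed-width bucket containing it."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


if __name__ == "__main__":
    ages = [23, 37, 41, 58]
    print(shuffle_column(ages, seed=1))
    print(perturb(ages, sigma=2.0, seed=1))
    print([aggregate(a, 10) for a in ages])  # ['20-29', '30-39', '40-49', '50-59']
```

Each function trades utility for privacy in a different way: shuffling preserves the column's distribution but breaks row-level links, noise preserves approximate values, and bucketing preserves only ranges.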

GANs

● Refer to “Synthetic Data Generation Problem at ITU by Thoth Project” and its implementation.
  ○ https://github.com/ITU-AI-ML-in-5G-Challenge/ML5G-PS-009-Synthetic-Observability-Data-Generation-using-GANs
● Refer to this paper:
  ○ https://scholar.google.com/citations?view_op=view_citation&hl=en&user=5V852JcAAAAJ&sortby=pubdate&citation_for_view=5V852JcAAAAJ:t7izwRedFcYC
● https://cni.iisc.ac.in/seminars/2022-09-20/
● Other similar works:
  ○ https://github.com/ydataai/ydata-synthetic
  ○ https://github.com/Pushkar-v/Generating-Synthetic-Data-using-GANs

Current status and Resource requirements

● https://github.com/sknrao/anonymization
● Phase:
  ○ Experimental Phase - answering the initial questions. [WE ARE HERE]
  ○ Development Phase - filling the gaps, unified tool*
  ○ Evaluation Phase - testing with large datasets, assessment of anonymization.
● Resource requirements
  ○ We have access to servers with a good amount of RAM and processor speed.
  ○ However, we need a server with GPUs, mainly to run GAN-based anonymization techniques.


Explore the sensitive data involved in telco anonymization, techniques such as GANs and Autoencoders, and tools like Microsoft's Presidio and Python libraries for effective data anonymization in the telecommunications field.

Uploaded by goldie on Jul 02, 2024


Presentation Transcript


Dataset: Sources?

PII type, followed by dataset (links):

● Names (Systems, Domain, Individuals, Organizations, Places, etc.)
  ○ ServerLOG1, ServerLog2, LogHub, Campus
● Address (IP and MAC)
  ○ Internet Traffic Dataset: EX1, EX2
● Telco Fields - IMSI, IMEI, MSIN, MSISDN, MCC+MNC
  ○ Adult Dataset enhanced with Telco-Fields
  ○ Adult Dataset: generate random IMEI/IMSI* fields and add them to this dataset
● Location Data (GPS, Cell-ID, Count, etc.)
  ○ OpenCellID, GPS, IEEE-dataport (crawdad)
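The step of generating random IMEI/IMSI fields for the Adult dataset could look roughly like the sketch below. It relies on two facts about the formats: an IMEI is 15 digits ending in a Luhn check digit, and an IMSI is MCC (3 digits) plus MNC (2 or 3 digits) plus MSIN, up to 15 digits in total. The function names and the default MCC/MNC values (001/01, used here as placeholder test values) are assumptions for illustration.

```python
import random


def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # double every second digit, starting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(number: str) -> bool:
    """Check that the last digit of `number` is a correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def random_imei(rng: random.Random) -> str:
    """Generate a synthetic 15-digit IMEI: 14 random digits plus check digit."""
    payload = "".join(str(rng.randrange(10)) for _ in range(14))
    return payload + str(luhn_check_digit(payload))


def random_imsi(rng: random.Random, mcc: str = "001", mnc: str = "01") -> str:
    """Generate a synthetic 15-digit IMSI with a random MSIN.

    The default MCC/MNC are placeholder values assumed for this sketch.
    """
    msin_length = 15 - len(mcc) - len(mnc)
    msin = "".join(str(rng.randrange(10)) for _ in range(msin_length))
    return mcc + mnc + msin


if __name__ == "__main__":
    rng = random.Random(42)
    print(random_imei(rng), random_imsi(rng))
```

Because the values are random, they carry no relation to real devices or subscribers beyond format validity, which is the point when enriching a public dataset with telco-like columns.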


What are the techniques we are trying?

1. Classic Techniques: K-Anonymity, L-Diversity, T-Closeness, Differential Privacy
2. Natural Language Processing: NLP techniques for the logs.
3. Autoencoders: Unsupervised techniques for the anonymization.
4. GANs: Synthetic data generation as a perfect anonymization solution.
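Two of the classic techniques named above can be sketched briefly: measuring the k-anonymity level of a table (the size of its smallest group of records sharing the same quasi-identifier values) and answering a count query with the Laplace mechanism from differential privacy. The record layout, column names, and function names are hypothetical, and the sketch only measures k; it does not generalize or suppress values to reach a target k.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def k_anonymity_level(records: List[Dict[str, str]],
                      quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest equivalence class over the quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def laplace_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1).

    The difference of two independent exponential samples with mean `scale`
    follows a Laplace(0, scale) distribution.
    """
    scale = 1.0 / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    # Hypothetical records: cell ID and age band act as quasi-identifiers.
    table = [
        {"cell_id": "A1", "age_band": "20-29", "plan": "prepaid"},
        {"cell_id": "A1", "age_band": "20-29", "plan": "postpaid"},
        {"cell_id": "B7", "age_band": "30-39", "plan": "prepaid"},
        {"cell_id": "B7", "age_band": "30-39", "plan": "prepaid"},
        {"cell_id": "B7", "age_band": "30-39", "plan": "postpaid"},
    ]
    print(k_anonymity_level(table, ["cell_id", "age_band"]))  # 2
    print(laplace_count(len(table), epsilon=1.0, rng=random.Random(0)))
```

A table is k-anonymous for a chosen k when the returned level is at least k. For the noisy count, a smaller epsilon means more noise and stronger privacy.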


Tools and Libraries

● Python-based:
  ○ Manual: Randomization, perturbation, aggregation, etc.
  ○ Implementation of classic techniques in Python.
  ○ Libraries: Faker, mimesis, radar, AnonymizeDF, pynonymizer
  ○ Microsoft’s Presidio
  ○ Tensorflow-privacy
  ○ Flair (NLP)
  ○ Artlabss open-data-anonymizer

