Detecting Sensitive Data Disclosure via Text Analysis

This article discusses techniques for detecting sensitive data disclosure, including taint analysis and bi-directional text correlation analysis. It covers existing detection methods, challenges with generic APIs, and solutions using text correlation analysis to determine data sensitiveness. Examples and motivating scenarios are provided to illustrate the concepts presented.


Uploaded on Sep 12, 2024 | 0 Views


Presentation Transcript


  1. Detecting Sensitive Data Disclosure via Bi-directional Text Correlation Analysis
     Jianjun Huang, Xiangyu Zhang, Lin Tan

  2. Introduction
     Sensitive data disclosure is a data security problem. Taint analysis is commonly used to detect such problems; it looks for routes between data sources and sinks (Haris et al., CoRR abs/1410.4978).

  3. Existing Detection Techniques
     Predefined sensitive data sources. Example: getDeviceId() in Android apps (FlowDroid, PLDI '14).
     Generic APIs obtaining user inputs. The user interface context can reveal the data sensitiveness. Example: EditText.getText() associated with the text label "Password" (SUPOR/UIPicker, Security '15).
     Forward data flow between sources and sinks. Any flow of data from a source to a sink without the user's consent can indicate a problem (Haris et al., CoRR abs/1410.4978).

  4. Challenges
     Generic APIs, such as those that read data from the network or files, do not reveal data sensitiveness. This differs from APIs that acquire user inputs. We found that up to 50% of the disclosures have data sources generated by such generic APIs. Treating all such APIs as sensitive data sources produces many false warnings; ignoring all of them misses warnings in certain cases.
     There may be no forward data flow between sources and sinks: data sensitiveness (the source) may be recognized only after the data has been passed to a sink point.

  5. Motivating Example: Data Flow
     A Message arrives from the network:
         bundle = message.getData();
         dt = bundle.getString("data");
         sink(dt)  ?
     dt is also passed into a thread (Thread: data, json), where json.getString("username") is called.

  6. Challenge 1 & Our Solution
     Challenge: Generic APIs. When data is read from the network, its sensitiveness cannot be determined.
         Network Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt)
     Solution: Text Correlation Analysis. Examine text labels to determine the sensitiveness of the correlated variables. Sources of text: text in the code, textual keys in key-value pairs, text in API calls, and text from the user interface (UI). Correlated labels include, e.g., USERNAME and PASSWORD.

  7. Example
         un = json.getString("username");
     Variables json and un are both associated with {username}. Textual analysis on the text label set tells that both json and un hold sensitive data.
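The slide's example can be sketched in a few lines of Java. This is a minimal illustration, not BidText's implementation: the class name, the method names, and the two-word keyword list are all hypothetical, and the sensitiveness check is reduced to a substring match.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from BidText.
final class TextCorrelationSketch {
    // Assumed keyword list, for illustration only.
    static final List<String> KEYWORDS = Arrays.asList("username", "password");

    // Type[var]: the set of text labels correlated with each variable.
    final Map<String, Set<String>> types = new HashMap<>();

    // Models `result = receiver.getString("key")`: the key text is tagged
    // to both the result variable and the receiver variable.
    void bindKeyedRead(String result, String receiver, String key) {
        types.computeIfAbsent(result, v -> new HashSet<>()).add(key);
        types.computeIfAbsent(receiver, v -> new HashSet<>()).add(key);
    }

    // A variable is flagged when any of its labels contains a keyword.
    boolean isSensitive(String var) {
        Set<String> labels = types.getOrDefault(var, Collections.<String>emptySet());
        for (String label : labels) {
            String lower = label.toLowerCase(Locale.ROOT);
            for (String kw : KEYWORDS) {
                if (lower.contains(kw)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TextCorrelationSketch s = new TextCorrelationSketch();
        s.bindKeyedRead("un", "json", "username");
        System.out.println("json sensitive: " + s.isSensitive("json"));
        System.out.println("un sensitive: " + s.isSensitive("un"));
    }
}
```

A variable with no recorded labels, such as dt before any propagation, is reported as not sensitive.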

  8. Challenge 2 & Our Solution
     Challenge: Suppose json is recognized as sensitive. Data sensitiveness is revealed after the sink, so no forward data flow from the sensitiveness recognition point to the sink is observed.
         Network: dt = bundle.getString("data"); sink(dt); json.getString("username")
     Solution: Bi-directional Propagation. Backward propagation allows our technique to capture cases in which data sensitiveness is revealed after the data is sent to some sink.

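  9. Example
     Network Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt)
     Thread: data, json; json.getString("username")
     Forward propagation carries dt into the thread; backward propagation carries the sensitiveness found at json.getString("username") back to dt, so sink(dt) is a sensitive data disclosure.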

  10. BidText
      Static bi-directional text correlation analysis to detect sensitive data disclosures.
      Text labels are tagged to correlated variables and treated as the type of those variables: json → {username}; un → {username, <b>, </b>}. Notation: Type[json] = {username}.
      Types (sets of text tags) are bi-directionally propagated. At the beginning: dt → {data}, json → {username}. At the end: dt → {data, username}, json → {data, username}.

  11. Text Binding
      Text can come from either the code or the UI.
      Code: for var = Constant_Text, var → {Constant_Text}, i.e., Type[var] = {Constant_Text}.
      UI: if var corresponds to a UI resource, extract the corresponding UI text; var → {UI_Text}, i.e., Type[var] = {UI_Text}.

  12. Unary Assignment Propagation
      For a unary assignment lhs = rhs, union the old types to form the new types for both lhs and rhs.
      Forward:  Type'[lhs] = Type[lhs] ∪ Type[rhs]
      Backward: Type'[rhs] = Type[lhs] ∪ Type[rhs]
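The forward and backward union rules can be run to a fixed point over a list of assignments. The sketch below is an illustration only: the class and method names are hypothetical, and the two assignments in main (data = dt; json = data) are an assumed stand-in for the flow in the motivating example, not code from the slides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of bi-directional propagation for unary assignments.
final class UnaryPropagationSketch {
    // Each assignment is {lhs, rhs}. Both rules are applied to every
    // assignment repeatedly until no type set grows any further.
    static Map<String, Set<String>> propagate(Map<String, Set<String>> initial,
                                              String[][] assignments) {
        Map<String, Set<String>> types = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : initial.entrySet()) {
            types.put(e.getKey(), new TreeSet<>(e.getValue()));
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] a : assignments) {
                Set<String> lhs = types.computeIfAbsent(a[0], v -> new TreeSet<>());
                Set<String> rhs = types.computeIfAbsent(a[1], v -> new TreeSet<>());
                changed |= lhs.addAll(rhs); // forward: Type[lhs] gains Type[rhs]
                changed |= rhs.addAll(lhs); // backward: Type[rhs] gains Type[lhs]
            }
        }
        return types;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> init = new HashMap<>();
        init.put("dt", new TreeSet<>(Arrays.asList("data")));
        init.put("json", new TreeSet<>(Arrays.asList("username")));
        // Assumed flow, for illustration: data = dt; json = data
        String[][] flow = { {"data", "dt"}, {"json", "data"} };
        Map<String, Set<String>> result = propagate(init, flow);
        System.out.println("dt: " + result.get("dt"));
        System.out.println("json: " + result.get("json"));
    }
}
```

Because both rules only ever add labels to finite sets, the loop terminates; under the assumed flow, dt and json end with the same label set.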

  13. Binary Assignment Propagation
      Binary assignment: lhs = r1 op r2
      Challenge (backward propagation): propagate the type of lhs to both r1 and r2, or to just one of them?

  14. Binary: Unification Propagation
      Propagate the type of lhs to all rhs variables. Example: x = key mod M
      At the beginning:  x {}, key {secret key}, M {}
      Pass 1 (forward):  x {secret key}, key {secret key}, M {}
      Pass 2 (backward): x {secret key}, key {secret key}, M {secret key}

  15. Binary: Our Bi-directional Propagation
      Intuition: the sensitiveness of r1 does not influence the sensitiveness of r2. Example: x = key mod M
      At the beginning:  x {}, key {secret key}, M {}
      Pass 1 (forward):  x {secret key}, key {secret key}, M {}
      Pass 2 (backward): x {secret key}, key {secret key}, M {}
      The backward pass works from Type[x] together with Type[M], and from Type[x] together with Type[key].
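The difference between unification and the restricted backward pass can be sketched as follows. The operator that combines Type[x] with Type[M] and Type[key] is not legible in the transcript, so this sketch assumes set difference: each operand receives only those labels of x that the other operand does not already carry. That choice reproduces the states shown on the slide, but it remains an assumption, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of one forward and one backward pass over x = r1 op r2.
final class BinaryPropagationSketch {
    // Returns the final [Type[x], Type[r1], Type[r2]]; inputs are not modified.
    static List<Set<String>> propagate(Set<String> x, Set<String> r1,
                                       Set<String> r2, boolean unify) {
        Set<String> tx = new TreeSet<>(x);
        Set<String> t1 = new TreeSet<>(r1);
        Set<String> t2 = new TreeSet<>(r2);

        // Pass 1 (forward): the result inherits the labels of both operands.
        tx.addAll(t1);
        tx.addAll(t2);

        // Pass 2 (backward): decide what flows back to each operand.
        Set<String> toR1 = new TreeSet<>(tx);
        Set<String> toR2 = new TreeSet<>(tx);
        if (!unify) {
            // ASSUMPTION: set difference, so labels contributed by one
            // operand are not pushed onto the other operand.
            toR1.removeAll(t2);
            toR2.removeAll(t1);
        }
        t1.addAll(toR1);
        t2.addAll(toR2);
        return Arrays.asList(tx, t1, t2);
    }

    public static void main(String[] args) {
        Set<String> key = new TreeSet<>(Arrays.asList("secret key"));
        Set<String> empty = new TreeSet<>();
        // x = key mod M
        System.out.println("unification: " + propagate(empty, key, empty, true));
        System.out.println("restricted:  " + propagate(empty, key, empty, false));
    }
}
```

With unification, M ends up tagged with the key's label even though M is not derived from the key; the restricted pass leaves Type[M] empty.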

  16. Propagation for Method Calls
      Normal method calls: the method body exists in the analysis scope. Passing arguments and returning values are handled the same as unary assignment.
      API method calls: system/framework/library APIs are modelled for efficiency. Both forward and backward propagation depend on the model of the specific API method.

  17. Practical Enhancement: Check-And-Alert
      When a condition check succeeds or fails, real programs may prompt an alert to the user or write to a log file. We can use the alert or log message to infer what the variables involved in the condition check may hold.
          if (str == null || str.isEmpty()) { Toast.makeText(this, "Please Enter Password", 1); }
      Result: str → {Please Enter Password}

  18. Practical Enhancement: String Concatenation
      Concatenating strings usually introduces constant texts. If the resulting string has a specific format (e.g., a URL), the involved variables can receive a more precise type set, instead of tagging all constant texts to all variables.
          url = "http://.../page?cookie=" + ck + "&date=" + dt;
      Here url is built from "http://.../page?cookie=", ck (sensitive), "&date=", and dt.

  19. Practical Enhancement: String Concatenation
      Partition the resulting string representation and map texts to the correlated variables: ck → {http://.../page?cookie}, dt → {&date}.
          url = "http://.../page?cookie=" + ck + "&date=" + dt;
      ck is paired with http://.../page?cookie (sensitive); dt is paired with &date (insensitive).
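One simple way to realize this partitioning is to give each variable only the constant text that immediately precedes it in the concatenation. The sketch below assumes that rule and an ad hoc input encoding (constants are passed wrapped in double quotes, variables as bare names); both are illustrative choices, not BidText's actual representation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of mapping concatenated constants to variables.
final class ConcatPartitionSketch {
    // parts: the operands of a string concatenation, in order. An operand
    // wrapped in double quotes is a constant; anything else is a variable.
    static Map<String, String> partition(List<String> parts) {
        Map<String, String> labels = new LinkedHashMap<>();
        String lastConstant = null;
        for (String p : parts) {
            boolean isConstant = p.length() >= 2 && p.startsWith("\"") && p.endsWith("\"");
            if (isConstant) {
                String text = p.substring(1, p.length() - 1);
                // Drop a trailing '=' so the label is the parameter name itself.
                lastConstant = text.endsWith("=")
                        ? text.substring(0, text.length() - 1)
                        : text;
            } else if (lastConstant != null) {
                // The variable is labeled with the constant right before it.
                labels.put(p, lastConstant);
                lastConstant = null;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // url = "http://.../page?cookie=" + ck + "&date=" + dt;
        List<String> parts = Arrays.asList(
                "\"http://.../page?cookie=\"", "ck", "\"&date=\"", "dt");
        System.out.println(partition(parts));
    }
}
```

A variable that is not preceded by a constant receives no label in this sketch, which keeps unrelated constants from being tagged to it.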

  20. Evaluation
      Implementation: built on top of WALA; targets Android apps (dex bytecode).
      Evaluation subjects: 10,000 Android apps downloaded from the Google Play store.
      Evaluation setup: a sensitive keyword set; tags are assigned to certain APIs that return sensitive data and are commonly used as sensitive sources, e.g., getDeviceId() → IMEI.

  21. Results
      4,406 apps are reported with sensitive data disclosure problems.
      Categorized by sinks:
          Sink Type      %Reported Apps
          Logging        96.8%
          Non-logging    38.3%
      Categorized by sources (where the text labels are discovered). Code: constant strings in the code. API: manually tagged APIs (e.g., getDeviceId()).
          Source Type    %Reported Apps
          UI             53.9%
          Code           80.4%
          API            17.5%

  22. Comparison
      BidText reports a superset of the results of SUPOR (Security '15) and traditional taint analysis tools.
      The comparison table has columns Traditional, SUPOR, and BidText. Its rows cover the types of sources (traditional sources; generic APIs obtaining user inputs; other generic APIs such as network, file, ...), bi-directional propagation, and results.

  23. Accuracy
      Manually inspect 100 randomly selected apps reported by BidText. False positives are found in 10 apps. False negatives are ignored due to the lack of an oracle.
          Source Type    Reported Apps    False Positives
          Code text      84               3
          API            22               0
          UI             39               7

  24. False Positives
      False positive rate: 10%.
      Imprecise model of APIs:
          uidx = cursor.getColumnIndex("username"); x = cursor.getLong(y); sink(x);
      Incorrect recognition of text. Code text: Apps_lang[apps_lng_iso2], where lng is mostly used as a short form of longitude. UI text: "Pin to desktop".
      Possible solution: integrate more advanced NLP techniques with program analysis.

  25. Artifacts Evaluated
      Source code: https://bitbucket.org/hjjandy/toydroid.bidtext
      Executables and partial evaluation subjects: https://github.com/hjjandy/FSE16-BidText-Artifacts

  26. Conclusion
      BidText is a static technique to detect sensitive data disclosures. It identifies text labels, either from the code or the UI; treats them as types and associates them with the corresponding variables; propagates the types through data flow bi-directionally; attributes the types to sink points; and applies textual analysis to the type sets to determine whether variables at sink points may hold sensitive data.
      Evaluated on 10,000 Android apps, it reports 4,406 apps, with a false positive rate of 10%.

  27. END Thanks! & Questions?

  28. END

  29. Propagation for Normal Method Calls
      The method body exists in the analysis scope; usually only the app's code is analyzed.
      Passing arguments and return values is handled the same as unary assignment: there is a 1-to-1 mapping between actual arguments and formal parameters, and between the return value and the resultant variable.
          ret = method(arg);
          method(param) { return val; }

  30. Propagation for API Method Calls
      The method body does not exist in the analysis scope. System/framework/library APIs are modelled for efficiency.
      Example: ret = api_m(..., arg), with m for an API method and arg for an actual argument.
      Forward:  Type'[ret] = Type[ret] ∪ model_fwd(m, arg)
      Backward: Type'[arg] = Type[arg] ∪ model_bwd(m, ret)

  31. Keyword Set Construction
      Sensitive keyword set: randomly select 2,000 apps (exclusively from the 10,000 apps), extract all texts discovered for each sink, and manually inspect the texts to construct the keyword set. Add the keyword set for user input (SUPOR, Security '15).

  32. Analysis Performance
      587.6 hours for 10,000 apps; 3.5 minutes per app on average.
      8,852 apps finish normally (42% of total time). Average: 99.9 seconds per app. Minimum: 0.2 seconds. Maximum: 1197.4 seconds.
      1,148 apps run out of memory or time out (20 min).
      Figure: distribution of accumulative analysis time for all apps.
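The reported averages are consistent with each other, as a quick check using only the figures on this slide shows:

```latex
\begin{align*}
\text{average per app} &= \frac{587.6 \times 60\ \text{min}}{10{,}000} \approx 3.5\ \text{min} \\
\text{time for apps that finish normally} &= 8{,}852 \times 99.9\ \text{s} = 884{,}314.8\ \text{s} \approx 245.6\ \text{h} \\
\text{share of total time} &= \frac{245.6}{587.6} \approx 0.42 \\
\text{apps accounted for} &= 8{,}852 + 1{,}148 = 10{,}000
\end{align*}
```

The remaining share of the total time, roughly 58%, is therefore spent on the 1,148 apps that run out of memory or time out.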

  33. Analysis Performance
      4,406 apps are reported with disclosure problems; 93.0% of them are analyzed within 10 minutes.
      Figure: distribution of the analysis time (in minutes) of the apps reported with sensitive data disclosures.

  34. Related Work
      Type-based taint analysis by Huang et al. (FASE '14, ISSTA '15) and Ernst et al. (CCS '14).
      Type-based taint analysis: uses predefined types; associates initial types to APIs; requires forward data flows between sources and sinks.
      BidText: automatically collects text as types; associates texts to variables; propagates bi-directionally.
