Detecting Sensitive Data Disclosure via Text Analysis

Detecting Sensitive Data Disclosure via Bi-directional Text Correlation Analysis
Jianjun Huang, Xiangyu Zhang, Lin Tan
Introduction
Data security problem
Sensitive data disclosure
 
Taint analysis is commonly used to detect such problems.
This technique looks for routes between data sources and sinks (Haris et al., CoRR abs/1410.4978).
Existing Detection Techniques
 
Predefined sensitive data sources
Example: getDeviceId() in Android apps (FlowDroid, PLDI’14)

Generic APIs obtaining user inputs
The context of the user interface can indicate the data sensitiveness
Example: EditText.getText() associated with a text label “Password” (SUPOR/UIPicker, Security’15)

Forward data flow between sources and sinks
Any flow of data from source to sink without the user’s consent can indicate a problem (Haris et al., CoRR abs/1410.4978)
Challenges
 
Generic APIs, such as those reading data from the network or files, do not reveal the data sensitiveness
This differs from the cases that acquire user inputs.
We found that up to 50% of the disclosures have data sources generated by such generic APIs.

Treat all such APIs as sensitive data sources
Many false warnings
Ignore all such APIs
Missed warnings in certain cases

No forward data flow between sources and sinks
Data sensitiveness (sources) is recognized after the data is passed to a sink point.
Motivating Example: Data Flow
 
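[Figure: data flow. Network, Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt) ?]
“dt” is passed into a thread
[Figure, continued. Thread: data, json, json.getString("username")]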
Challenge 1 & Our Solution
 
Challenge: Generic APIs
Read data from network
Data sensitiveness cannot be determined

Solution: Text Correlation Analysis
Examine text labels to determine sensitiveness of the correlated variables
Text in the code
Textual keys in key-value pairs
Text in API calls
Text from the user interface (UI)
Correlated labels (e.g., USERNAME, PASSWORD)
Example
un = json.getString("username");

Variables json and un are associated with {username}.
Textual analysis on the text label set tells that both variables json and un hold sensitive data.
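
The textual analysis step above can be illustrated with a minimal sketch. It assumes the analysis is a case-insensitive substring match of each tag against a keyword list; the class name TextLabelCheck and the three keywords are hypothetical and are not taken from the slides.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: decides whether a variable's text-tag set suggests sensitive data.
final class TextLabelCheck {
    // Assumed keyword list, for illustration only; the slides do not list the real set.
    static final List<String> KEYWORDS = List.of("username", "password", "imei");

    // A tag set counts as sensitive if any tag contains any keyword, ignoring case.
    static boolean isSensitive(Set<String> tags) {
        for (String tag : tags) {
            String lower = tag.toLowerCase(Locale.ROOT);
            for (String keyword : KEYWORDS) {
                if (lower.contains(keyword)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSensitive(Set.of("username"))); // prints true
        System.out.println(isSensitive(Set.of("data")));     // prints false
    }
}
```

Plain substring matching is the simplest choice and is also the kind of matching that can misread short tokens, which is why a real keyword set needs care.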
Challenge 2 & Our Solution
 
Challenge: Suppose json is recognized as sensitive
Data sensitiveness is revealed after sink.
No forward data flow from sensitiveness recognition point to sink is observed.

Solution: Bi-directional Propagation
Backward propagation allows our technique to capture cases in which data sensitiveness is revealed after the data is sent to some sink.
Example
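[Figure: the motivating data flow annotated with forward propagation and backward propagation. Network, Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt). Thread: data, json, json.getString("username"). sink(dt) is marked as a sensitive data disclosure.]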
BidText
Static bi-directional text correlation analysis to detect sensitive data disclosures.
Text labels are tagged to correlated variables and treated as the type of the variables.
json: {username}
un: {username, <b>, </b>}
Notation: Type[json] = {username}
Types (sets of text tags) are bi-directionally propagated.
At the beginning:
dt: {data}, json: {username}
At the end:
dt: {data, username}, json: {data, username}
Text Binding
Text can be from either the code or the UI
var = Constant_Text: Type[var] = {Constant_Text}
var corresponds to a UI resource: extract the corresponding UI text, Type[var] = {UI_Text}
Unary Assignment Propagation
Unary assignment: lhs = rhs
Propagation: union old types to form new types for both lhs and rhs

Backward: Type’[rhs] = Type[lhs] ∪ Type[rhs]

Forward: Type’[lhs] = Type[lhs] ∪ Type[rhs]
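
The two union rules can be sketched as a small fixpoint loop. This is only an illustration under stated assumptions: the class UnaryPropagation, the array encoding of assignments, and the single assignment json = dt standing in for the thread hand-off are all hypothetical, not BidText's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of bi-directional propagation over unary assignments "lhs = rhs".
final class UnaryPropagation {
    // Each assignment is encoded as {lhs, rhs}; types map a variable to its set of text tags.
    static Map<String, Set<String>> propagate(Map<String, Set<String>> initial, String[][] assigns) {
        Map<String, Set<String>> type = new HashMap<>();
        initial.forEach((variable, tags) -> type.put(variable, new TreeSet<>(tags)));
        boolean changed = true;
        while (changed) { // repeat until no type set grows any further
            changed = false;
            for (String[] assign : assigns) {
                Set<String> lhs = type.computeIfAbsent(assign[0], k -> new TreeSet<>());
                Set<String> rhs = type.computeIfAbsent(assign[1], k -> new TreeSet<>());
                changed |= lhs.addAll(rhs); // forward:  Type'[lhs] = Type[lhs] ∪ Type[rhs]
                changed |= rhs.addAll(lhs); // backward: Type'[rhs] = Type[lhs] ∪ Type[rhs]
            }
        }
        return type;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> initial = new HashMap<>();
        initial.put("dt", Set.of("data"));
        initial.put("json", Set.of("username"));
        // Assumption: the hand-off of dt into the thread is modelled as "json = dt".
        Map<String, Set<String>> result = propagate(initial, new String[][] {{"json", "dt"}});
        System.out.println(result.get("dt"));   // prints [data, username]
        System.out.println(result.get("json")); // prints [data, username]
    }
}
```

Because type sets only grow and the set of tags is finite, the loop terminates.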
Binary Assignment Propagation
Binary assignment
lhs = r1 op r2
Challenge: Backward Propagation
Propagate type of lhs to both r1 and r2, or just one of them?
Binary: Unification Propagation
Propagate type of lhs to all rhs variables
x = key mod M
                   x             key           M
At the beginning   {}            {secret key}  {}
Pass 1: Forward    {secret key}  {secret key}  {}
Pass 2: Backward   {secret key}  {secret key}  {secret key}
Binary: Our Bi-directional Propagation
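Intuition: sensitiveness of r1 does not influence sensitiveness of r2
x = key mod M
                   x             key           M
At the beginning   {}            {secret key}  {}
Pass 1: Forward    {secret key}  {secret key}  {}
Pass 2: Backward   {secret key}  {secret key}  {}
Backward: key receives Type[x] – Type[M]; M receives Type[x] – Type[key]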
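
A minimal sketch of the backward rule for x = key mod M follows. It assumes that the forward step unions both operand types into the result and that each operand keeps its old tags when the difference is added; the slides show only the two difference expressions, so both points are assumptions, and the class BinaryPropagation is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of propagation for a binary assignment "lhs = r1 op r2".
final class BinaryPropagation {
    // Forward (assumed): Type'[lhs] = Type[lhs] ∪ Type[r1] ∪ Type[r2]
    static Set<String> forward(Set<String> lhs, Set<String> r1, Set<String> r2) {
        Set<String> result = new HashSet<>(lhs);
        result.addAll(r1);
        result.addAll(r2);
        return result;
    }

    // Backward for one operand: Type'[r] = Type[r] ∪ (Type[lhs] − Type[other operand])
    static Set<String> backward(Set<String> operand, Set<String> lhs, Set<String> other) {
        Set<String> fromLhs = new HashSet<>(lhs);
        fromLhs.removeAll(other); // tags explained by the other operand are not pushed back
        Set<String> result = new HashSet<>(operand);
        result.addAll(fromLhs);
        return result;
    }

    public static void main(String[] args) {
        Set<String> key = Set.of("secret key");
        Set<String> m = Set.of();
        Set<String> x = forward(Set.of(), key, m);
        // Both backward results are computed from the pre-update operand types.
        System.out.println(x);                   // prints [secret key]
        System.out.println(backward(key, x, m)); // prints [secret key]
        System.out.println(backward(m, x, key)); // prints []
    }
}
```

In contrast with unification, M stays untagged because the only tag on x is already accounted for by key.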
Propagation for Method Calls
Normal Method Calls
Method body exists in analysis scope
Passing arguments and returning values are handled the same as unary assignment.

API Method Calls
System/framework/library APIs are modelled for efficiency
Both forward and backward propagation depend on the model of the specific API method.
Practical Enhancement
 
Check-And-Alert
When a condition check succeeds or fails, real programs may show an alert to the user or write to a log file.
We can use the alert/log message to infer what the variables involved in the condition check may hold.
if (str == null || str.isEmpty()) {
   Toast.makeText(this, "Please Enter Password", 1);
}
str: {Please Enter Password}
Practical Enhancement
String Concatenation
Concatenating strings usually introduces constant texts.
If the resultant string has some specific format (e.g., URL address), the involved variables can have a more precise type set, instead of tagging all constant texts to all variables.
url = "http://.../page?cookie=" + ck
      + "&date=" + dt;
[Figure: url is composed of "http://.../page?cookie=", ck, "&date=", and dt; ck is marked sensitive.]
Practical Enhancement
String Concatenation
Partition the resultant string representation and map texts to correlated variables.
url = "http://.../page?cookie=" + ck
      + "&date=" + dt;
ck: {http://…/page?cookie}, dt: {&date}
ck: sensitive
dt: insensitive
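
The partitioning idea can be sketched as follows, assuming the concatenation is available as an ordered list of constants and variables and that each variable is bound to the constant immediately before it, with a trailing "=" dropped. The class ConcatPartition and its Part helper are hypothetical names for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: bind each concatenated variable to the constant text right before it.
final class ConcatPartition {
    // One piece of a concatenation: a constant string or a variable name.
    static final class Part {
        final String text;
        final boolean isVariable;
        private Part(String text, boolean isVariable) { this.text = text; this.isVariable = isVariable; }
        static Part constant(String text) { return new Part(text, false); }
        static Part variable(String name) { return new Part(name, true); }
    }

    // Returns variable name -> text tag, in order of appearance.
    static Map<String, String> bind(List<Part> parts) {
        Map<String, String> tags = new LinkedHashMap<>();
        String lastConstant = null;
        for (Part part : parts) {
            if (!part.isVariable) {
                lastConstant = part.text;
            } else if (lastConstant != null) {
                // Drop the trailing '=' so the tag is the key text, e.g. "&date".
                String tag = lastConstant.endsWith("=")
                        ? lastConstant.substring(0, lastConstant.length() - 1)
                        : lastConstant;
                tags.put(part.text, tag);
                lastConstant = null; // a constant is bound to at most one variable
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> tags = bind(List.of(
                Part.constant("http://.../page?cookie="), Part.variable("ck"),
                Part.constant("&date="), Part.variable("dt")));
        System.out.println(tags); // prints {ck=http://.../page?cookie, dt=&date}
    }
}
```

With this binding only ck carries the cookie text, so dt is not tagged with it.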
Evaluation
Implementation
Built on top of WALA
Targeted to Android apps (dex bytecode)
Evaluation subjects
10,000 Android apps downloaded from the Google Play store.
Evaluation Setup
Sensitive keyword set
Assign tags to certain APIs that return sensitive data and are commonly used as sensitive sources
E.g., getDeviceId(): “IMEI”
Results
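4,406 apps are reported with sensitive data disclosure problems.

Categorized by sources (where text labels are discovered)
Code: constant string in the code
API: manually tagged APIs (e.g., getDeviceId())
Source Type    %Reported Apps
UI             53.9%
Code           80.4%
API            17.5%

Categorized by sinks
Sink Type      %Reported Apps
Logging        96.8%
Non-logging    38.3%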
Comparison
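BidText reports a superset of SUPOR (Security’15) and traditional taint analysis tools.
[Figure: results of Traditional, SUPOR, and BidText by type of sources: traditional sources; generic APIs (obtaining user inputs); other generic APIs (network, file, …); bi-directional propagation.]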
Accuracy
Manually inspect 100 randomly selected apps reported by BidText
False positives are found in 10 apps
False negatives are ignored due to the lack of an oracle
False Positives
False Positive rate: 10%
Imprecise model of APIs
uidx = cursor.getColumnIndex("username");
x    = cursor.getLong(y);
sink(x);

Incorrect recognition of text
Code text: “Apps_lang[apps_lng_iso2]”
lng is mostly used as a short form of “longitude”
UI Text: “Pin to desktop”
Possible solution: integrate more advanced NLP techniques with program analysis.
Artifacts Evaluated
Source Code
https://bitbucket.org/hjjandy/toydroid.bidtext
Executables and Partial Evaluation Subjects
https://github.com/hjjandy/FSE16-BidText-Artifacts
Conclusion
BidText, a static technique to detect sensitive data disclosures
Identifies text labels, either from the code or the UI
Treats them as types and associates them with corresponding variables
Propagates the types through data flow bi-directionally
Attributes the types to sink points
Applies textual analysis to the type sets to determine if variables at sink points may hold sensitive data
Evaluated on 10,000 Android apps and reports 4,406 apps, with a false positive rate of 10%.
END
Thanks! & Questions?
END
Propagation for Normal Method Calls
Method body exists in the analysis scope
Usually only the app’s code is analyzed
Passing arguments and return values are handled the same as unary assignment
1-to-1 mapping between actual arguments and formal parameters, and between return value and resultant variable.
ret = method(arg);
method(param) {
  return val;
}
Propagation for API Method Calls
Method body does not exist in the analysis scope
System/framework/library APIs are modelled for efficiency
Example: ret = api(arg), with m for an API method and arg for an actual argument.

Forward: Type’[ret] = Type[ret] ∪ model_fwd(m, arg)

Backward: Type’[arg] = Type[arg] ∪ model_bwd(m, ret)
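
The role of model_fwd and model_bwd can be sketched with a tiny model table. Everything here is an assumption made for illustration: the class ApiModels, the two-valued Flow model, and the two table entries are hypothetical and do not describe how BidText actually models APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of per-API propagation models.
final class ApiModels {
    // Assumed, deliberately coarse model: tags flow both ways or not at all.
    enum Flow { BOTH, NONE }

    static final Map<String, Flow> MODELS = new HashMap<>();
    static {
        MODELS.put("String.trim", Flow.BOTH);   // assumed: result carries the argument's text
        MODELS.put("String.length", Flow.NONE); // assumed: a length says nothing about the text
    }

    // Unmodelled methods default to no flow in this sketch.
    static boolean flows(String m) {
        return MODELS.getOrDefault(m, Flow.NONE) == Flow.BOTH;
    }

    // model_fwd(m, arg): tags that flow from the argument to the return value.
    static Set<String> modelFwd(String m, Set<String> argType) {
        return flows(m) ? argType : Set.of();
    }

    // model_bwd(m, ret): tags that flow from the return value back to the argument.
    static Set<String> modelBwd(String m, Set<String> retType) {
        return flows(m) ? retType : Set.of();
    }

    // Type' = Type ∪ model(...)
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> arg = Set.of("password");
        System.out.println(union(Set.of(), modelFwd("String.trim", arg)));   // prints [password]
        System.out.println(union(Set.of(), modelFwd("String.length", arg))); // prints []
    }
}
```

A model that lets tags flow where the real API does not carry the data is one way false positives can arise.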
Keyword Set Construction
Sensitive keyword set
Randomly select 2,000 apps (exclusively from the 10,000 apps)
Extract all texts discovered for each sink
Manually inspect the texts to construct keyword set
Plus the keyword set for user input (SUPOR, Security’15)
Analysis Performance
587.6 hours for 10,000 apps. 3.5 minutes per app on average.
8,852 apps finish normally (42% of total time)
Average: 99.9 seconds per app
Minimum: 0.2 seconds
Maximum: 1197.4 seconds
1,148 apps run out of memory or time out (20 min)
[Figure: distribution of accumulative analysis time for all apps.]
Analysis Performance
4,406 apps are reported with disclosure problems
93.0% are analyzed within 10 minutes.
[Figure: distribution of the analysis time (in minutes) of the apps reported with sensitive data disclosures.]
Related Work
Type-based taint analysis by Huang et al. (FASE’14, ISSTA’15), Ernst et al. (CCS’14)
Slide Note

Good afternoon, everyone. My name is Jianjun Huang, from Purdue University. Today I will present our paper Detecting Sensitive Data Disclosure via Bi-directional Text Correlation Analysis. The artifacts of our work have been evaluated by the conference.


This article discusses techniques for detecting sensitive data disclosure, including taint analysis and bi-directional text correlation analysis. It covers existing detection methods, challenges with generic APIs, and solutions using text correlation analysis to determine data sensitiveness. Examples and motivating scenarios are provided to illustrate the concepts presented.

  • Data Security
  • Sensitive Data
  • Text Analysis
  • Information Disclosure

Uploaded on Sep 12, 2024


Presentation Transcript


  1. Detecting Sensitive Data Disclosure via Bi-directional Text Correlation Analysis Jianjun Huang, Xiangyu Zhang, Lin Tan

  2. Introduction Data security problem Sensitive data disclosure Taint analysis is commonly used to detect such problems. This technique looks for routes between data sources and sinks. (Haris et al, CoRR abs/1410.4978)

  3. Existing Detection Techniques Predefined sensitive data sources. Example: getDeviceId() in Android apps (FlowDroid, PLDI’14). Generic APIs obtaining user inputs: the context of the user interface can indicate the data sensitiveness. Example: EditText.getText() associated with a text label “Password” (SUPOR/UIPicker, Security’15). Forward data flow between sources and sinks: any flow of data from source to sink without the user’s consent can indicate a problem (Haris et al., CoRR abs/1410.4978).

  4. Challenges Generic APIs, such as those reading data from the network or files, do not reveal the data sensitiveness. This differs from the cases that acquire user inputs. We found that up to 50% of the disclosures have data sources generated by such generic APIs. Treat all such APIs as sensitive data sources: many false warnings. Ignore all such APIs: missed warnings in certain cases. No forward data flow between sources and sinks: data sensitiveness (sources) is recognized after the data is passed to a sink point.

  5. Motivating Example: Data Flow Network, Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt) ? “dt” is passed into a thread. Thread: data, json, json.getString("username").

  6. Challenge 1 & Our Solution Challenge: Generic APIs. Read data from network; data sensitiveness cannot be determined. Network, Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt). Solution: Text Correlation Analysis. Examine text labels to determine sensitiveness of the correlated variables: text in the code; textual keys in key-value pairs; text in API calls; text from the user interface (UI); correlated labels (e.g., USERNAME, PASSWORD).

  7. Example un = json.getString("username"); Variables json and un are associated with {username}. Textual analysis on the text label set tells that both variables json and un hold sensitive data.

  8. Challenge 2 & Our Solution Challenge: Suppose json is recognized as sensitive. Data sensitiveness is revealed after sink. No forward data flow from sensitiveness recognition point to sink is observed. Network: dt = bundle.getString("data"); sink(dt); json.getString("username"). Solution: Bi-directional Propagation. Backward propagation allows our technique to capture cases in which data sensitiveness is revealed after the data is sent to some sink.

  9. Example Network, Message: bundle = message.getData(); dt = bundle.getString("data"); sink(dt). Thread: data, json, json.getString("username"). Forward propagation and backward propagation together mark sink(dt) as a sensitive data disclosure.

  10. BidText Static bi-directional text correlation analysis to detect sensitive data disclosures. Text labels are tagged to correlated variables and treated as the type of the variables. json: {username}; un: {username, <b>, </b>}. Notation: Type[json] = {username}. Types (sets of text tags) are bi-directionally propagated. At the beginning: dt: {data}, json: {username}. At the end: dt: {data, username}, json: {data, username}.

  11. Text Binding Text can be from either the code or the UI. var = Constant_Text: Type[var] = {Constant_Text}. var corresponds to a UI resource: extract the corresponding UI text, Type[var] = {UI_Text}.

  12. Unary Assignment Propagation Unary assignment: lhs = rhs. Propagation: union old types to form new types for both lhs and rhs. Backward: Type’[rhs] = Type[lhs] ∪ Type[rhs]. Forward: Type’[lhs] = Type[lhs] ∪ Type[rhs].

  13. Binary Assignment Propagation Binary assignment: lhs = r1 op r2. Challenge: Backward Propagation. Propagate type of lhs to both r1 and r2, or just one of them?

  14. Binary: Unification Propagation Propagate type of lhs to all rhs variables. x = key mod M. Types of (x, key, M): at the beginning {}, {secret key}, {}; after Pass 1 (Forward) {secret key}, {secret key}, {}; after Pass 2 (Backward) {secret key}, {secret key}, {secret key}.

  15. Binary: Our Bi-directional Propagation Intuition: sensitiveness of r1 does not influence sensitiveness of r2. x = key mod M. Types of (x, key, M): at the beginning {}, {secret key}, {}; after Pass 1 (Forward) {secret key}, {secret key}, {}; after Pass 2 (Backward) {secret key}, {secret key}, {}. Backward: key receives Type[x] – Type[M]; M receives Type[x] – Type[key].

  16. Propagation for Method Calls Normal Method Calls Method body exists in analysis scope Passing arguments and returning values are same as unary assignment. API Method Calls System/framework/library APIs are modelled for efficiency Both forward and backward propagation depend on the model of the specific API method.

  17. Practical Enhancement Check-And-Alert When a condition check succeeds or fails, real programs may show an alert to the user or write to a log file. We can use the alert/log message to infer what the variables involved in the condition check may hold. if (str == null || str.isEmpty()) { Toast.makeText(this, "Please Enter Password", 1); } str: {Please Enter Password}

  18. Practical Enhancement String Concatenation Concatenating strings usually introduces constant texts. If the resultant string has some specific format (e.g., URL address), the involved variables can have a more precise type set, instead of tagging all constant texts to all variables. url = "http://.../page?cookie=" + ck + "&date=" + dt; url is composed of "http://.../page?cookie=", ck, "&date=", and dt; ck is marked sensitive.

  19. Practical Enhancement String Concatenation Partition the resultant string representation and map texts to correlated variables. url = "http://.../page?cookie=" + ck + "&date=" + dt; ck: {http://…/page?cookie}, dt: {&date}. ck: sensitive; dt: insensitive.

  20. Evaluation Implementation: built on top of WALA; targeted to Android apps (dex bytecode). Evaluation subjects: 10,000 Android apps downloaded from the Google Play store. Evaluation Setup: sensitive keyword set; assign tags to certain APIs that return sensitive data and are commonly used as sensitive sources, e.g., getDeviceId(): “IMEI”.

  21. Results 4,406 apps are reported with sensitive data disclosure problems. Categorized by sinks (sink type: % of reported apps): Logging 96.8%, Non-logging 38.3%. Categorized by sources (where text labels are discovered); Code: constant string in the code; API: manually tagged APIs (e.g., getDeviceId()). Source type: % of reported apps: UI 53.9%, Code 80.4%, API 17.5%.

  22. Comparison BidText reports a superset of SUPOR (Security’15) and traditional taint analysis tools. [Figure: results of Traditional, SUPOR, and BidText by type of sources: traditional sources; generic APIs (obtaining user inputs); other generic APIs (network, file, …); bi-directional propagation.]

  23. Accuracy Manually inspect 100 randomly selected apps reported by BidText. False positives are found in 10 apps. False negatives are ignored due to the lack of an oracle. By text source (Code, API, UI): reported apps 84, 22, 39; false positives 3, 0, 7.

  24. False Positives False Positive rate: 10%. Imprecise model of APIs: uidx = cursor.getColumnIndex("username"); x = cursor.getLong(y); sink(x); Incorrect recognition of text. Code text: “Apps_lang[apps_lng_iso2]”; lng is mostly used as a short form of “longitude”. UI Text: “Pin to desktop”. Possible solution: integrate more advanced NLP techniques with program analysis.

  25. Artifacts Evaluated Source Code https://bitbucket.org/hjjandy/toydroid.bidtext Executables and Partial Evaluation Subjects https://github.com/hjjandy/FSE16-BidText-Artifacts

  26. Conclusion BidText, a static technique to detect sensitive data disclosures. Identifies text labels, either from the code or the UI. Treats them as types and associates them with corresponding variables. Propagates the types through data flow bi-directionally. Attributes the types to sink points. Applies textual analysis to the type sets to determine if variables at sink points may hold sensitive data. Evaluated on 10,000 Android apps and reports 4,406 apps, with a false positive rate of 10%.

  27. END Thanks! & Questions?

  28. END

  29. Propagation for Normal Method Calls Method body exists in the analysis scope. Usually only the app’s code is analyzed. Passing arguments and return values are handled the same as unary assignment. 1-to-1 mapping between actual arguments and formal parameters, and between return value and resultant variable. ret = method(arg); method(param) { return val; }

  30. Propagation for API Method Calls Method body does not exist in the analysis scope. System/framework/library APIs are modelled for efficiency. Example: ret = api(arg), with m for an API method and arg for an actual argument. Forward: Type’[ret] = Type[ret] ∪ model_fwd(m, arg). Backward: Type’[arg] = Type[arg] ∪ model_bwd(m, ret).

  31. Keyword Set Construction Sensitive keyword set. Randomly select 2,000 apps (exclusively from the 10,000 apps). Extract all texts discovered for each sink. Manually inspect the texts to construct keyword set. Plus the keyword set for user input (SUPOR, Security’15).

  32. Analysis Performance 587.6 hours for 10,000 apps. 3.5 minutes per app on average. 8,852 apps finish normally (42% of total time). Average: 99.9 seconds per app. Minimum: 0.2 seconds. Maximum: 1197.4 seconds. 1,148 apps run out of memory or time out (20 min). [Figure: distribution of accumulative analysis time for all apps.]

  33. Analysis Performance Analysis Performance 4,406 apps are reported with disclosure problems 93.0% are analyzed within 10 minutes. Distribution for the analysis time (in minutes) of the apps reported with sensitive data disclosures.

  34. Related Work Type-based taint analysis by Huang et al. (FASE’14, ISSTA’15), Ernst et al. (CCS’14). Type-based taint analysis: predefined types; associates initial types to APIs; requires forward data flows between sources and sink. BidText: automatically collects text as types; associates texts to variables; propagates bi-directionally.
