Factors of Income: A Machine Learning Analysis

 
“Factors” of Income: A Machine Learning Analysis of US Census Salary Data
 
By Jordan Evans-Kaplan
 
 
Econ 490: Machine Learning
 
Motivation
 
Studying Census data helps economists understand the driving forces behind income.
My goal was to classify income as a two-level factor, “<=$50K” or “>$50K”, and to identify
the variables that matter most for predicting which side of that threshold a person falls on.
Some of the most effective predictors were:
Marital Status
Age
Capital Gains
Future applications of this work include estimating economic health in regions
where income data is missing but the predictor variables are available.
This would let data scientists analyze the economic health of entire populations, rather
than just one component.
Potential downside:
High computing/processing cost due to large datasets.
 
Literature Review
 
Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: A
Decision-Tree Hybrid," Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining.
Landmark paper that began the study of Census income
classification.
Chakrabarty, “A Statistical Approach to Adult Census Income
Level Prediction.”
Recent paper that reports the current record accuracy on this dataset, 88.16%.
 
 
What is XGBoost?
 
Stands for: eXtreme Gradient Boosting.
XGBoost is a powerful iterative
learning algorithm based on
gradient boosting.
Robust and highly versatile,
with support for custom objective
(loss) functions.
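As a rough illustration of what a custom objective involves: gradient boosting libraries such as XGBoost ask a custom objective for the first and second derivatives of the loss with respect to the raw prediction. The sketch below computes those two quantities for the binary log loss in plain Python. The function names are hypothetical, chosen for illustration, and this is not the code used in the presentation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_grad_hess(raw_scores, labels):
    """Per-example gradient and Hessian of the binary log loss,
    taken with respect to the raw (pre-sigmoid) score.

    labels are 0 (income <= 50K) or 1 (income > 50K); this encoding
    is an assumption made for the example.
    """
    grads, hessians = [], []
    for z, y in zip(raw_scores, labels):
        p = sigmoid(z)                   # predicted probability of class 1
        grads.append(p - y)              # first derivative of log loss
        hessians.append(p * (1.0 - p))   # second derivative of log loss
    return grads, hessians


if __name__ == "__main__":
    g, h = logistic_grad_hess([0.0, 2.0, -1.0], [1, 1, 0])
    print(g, h)
```

Each boosting round fits a new tree to these gradients and Hessians, which is why any twice-differentiable loss can be plugged in.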
 
 
 
 
How does XGBoost work?
 
Tree-based boosting algorithm.
4 critical parameters for tuning:
η (eta): the “learning rate”, which shrinks each new tree's contribution.
max_depth: the maximum depth of each tree, i.e. how many levels of splits it may have.
γ (gamma): the minimum loss reduction required for the model to justify a split.
λ (lambda): L2 (Ridge) regularization on the leaf weights.
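To make the roles of γ and λ concrete, here is a minimal sketch of the split-gain and leaf-weight formulas from the XGBoost paper (Chen and Guestrin), written in plain Python. G and H stand for the sums of gradients and Hessians of the examples falling in a node. The function names and the numbers in the example are made up for illustration.

```python
def leaf_weight(G: float, H: float, lam: float) -> float:
    """Optimal leaf weight; lambda shrinks it toward zero."""
    return -G / (H + lam)


def leaf_score(G: float, H: float, lam: float) -> float:
    """Structure score of a single leaf (larger is better)."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction from splitting a node into left/right children.

    The split is only worth making when the result is positive,
    so gamma acts as the minimum loss reduction required.
    """
    children = leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
    parent = leaf_score(GL + GR, HL + HR, lam)
    return 0.5 * (children - parent) - gamma


if __name__ == "__main__":
    # Illustrative gradient/Hessian sums, not values from the Census data.
    gain = split_gain(GL=-4.0, HL=3.0, GR=4.0, HR=3.0, lam=1.0, gamma=1.0)
    print("gain:", gain)                      # positive, so the split is kept
    eta = 0.3                                 # learning rate
    step = eta * leaf_weight(-4.0, 3.0, 1.0)  # shrunken update from the left leaf
    print("update added to left-leaf predictions:", step)
```

Raising gamma to 5.0 in the same example makes the gain negative, so the split would be pruned; raising lam reduces both the gain and the size of each leaf weight.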
 
 
 
 
 
 
 
Why use XGBoost?
 
All of the advantages of gradient boosting, plus more.
Frequent Kaggle data competition champion.
Utilizes CPU Parallel Processing by default.
Two main reasons for use:
1. Low runtime
2. High model performance
 
Tuning XGBoost
 
To find the best XGBoost model, a grid search
was run over a grid of candidate
hyperparameter values.
After tuning the model against these
possibilities, the following graph was
produced:
The figure shows the accuracy in each of the 9 panels
of the grid.
The highest accuracy was obtained in the bottom-right
panel, where gamma is 1 and the maximum
tree depth is 3.
The blue line represents the shrinkage rate, otherwise
called eta, of 1/3.
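The grid-search procedure described above can be sketched in a few lines of plain Python. The candidate values in the grid are hypothetical (the slides do not list them), and `toy_score` is a stand-in for cross-validated accuracy, deliberately built to peak at the setting the slide reports; it does not reproduce the presentation's results.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate every combination in `grid` and return the best one.

    grid: dict mapping parameter name -> list of candidate values.
    score_fn: callable taking a params dict and returning a score
              (higher is better), e.g. cross-validated accuracy.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder score: NOT real accuracy, just a function that
    peaks at gamma=1, max_depth=3 so the example has a clear winner."""
    return -((params["gamma"] - 1) ** 2 + (params["max_depth"] - 3) ** 2)


if __name__ == "__main__":
    # Hypothetical 3 x 3 grid, giving 9 gamma/max_depth combinations.
    grid = {"gamma": [0, 0.5, 1], "max_depth": [1, 2, 3]}
    best, score = grid_search(grid, toy_score)
    print(best)
```

In a real run, `score_fn` would train a model with the given parameters and return its held-out accuracy, which is what makes grid search expensive on large datasets.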
 
 
Model Results:
 
Conclusion:
 
XGBoost accuracy before tuning: 84.66% (Accuracy: 0.8466)
 
Tuned XGBoost accuracy: 87.0946% (Accuracy: 0.870946)
 
Models such as these allow for robust
income classification and identify key
variables associated with higher income
brackets.
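For reference, accuracy here is simply the share of test examples whose predicted income bracket matches the true one. The sketch below shows that calculation on a few made-up labels, then uses the two accuracies reported above to express the effect of tuning in percentage points. The helper name and the sample labels are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Made-up labels, only to show the calculation.
    y_true = [">50K", "<=50K", "<=50K", ">50K"]
    y_pred = [">50K", "<=50K", ">50K", ">50K"]
    print(accuracy(y_true, y_pred))

    # Accuracies reported on the slide.
    untuned, tuned = 0.8466, 0.870946
    gain_points = (tuned - untuned) * 100
    print(f"tuning gain: {gain_points:.2f} percentage points")
```

On the reported figures, tuning adds roughly 2.43 percentage points of accuracy.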
 
Thank You!
 
Special appreciation goes out to:
Econ 490 Machine Learning Staff:
Nazanin Khazra
Abdollah Farhoodi
 
Designer of XGBoost:
Tianqi Chen
 
 
 
 
Personal Contact Information:
Jordan Evans-Kaplan
Email: Evanska2@Illinois.edu

A machine learning analysis of US Census salary data aiming to predict income levels based on critical variables like marital status, age, and capital gains. The study explores XGBoost as a powerful algorithm for predicting annual income thresholds, with potential applications in analyzing the economic health of populations. Literature reviews on related data mining works are discussed, highlighting the importance of accurate classification methods.


Uploaded on Jul 30, 2024 | 0 Views



