Analysis of Yelp Reviews for Useful Insights

Yelp Review Analysis

CS 455: Introduction to Distributed Systems

Computer Science Department, Colorado State University

Adam Briles

Sam Westra

Background Information

● User reviews appear on almost every major site that sells a product
● In many cases they decide which product we buy
● Scrolling through useless reviews wastes everyone's time
● Sellers would rather have useful reviews be easily accessible
● The quality of reviews can make or break a sale

Problem Characterization

● How do you define a useful Yelp review?
    ○ We set out to find the characteristics of one
● Doing so involves reading and processing gigabytes of text
● Splitting review data isn’t easy
● All review types (negative, positive, and average) use different rhetoric
    ○ How do you split up the data to avoid conflicting results?
    ○ Users might prefer different characteristics in a negative review than in a positive one

Methodology

● Reading the file in
    ○ CSV can’t be used because the review text would be split apart
    ○ The JSON format had to be used to read the data into a dataframe
● Categorizing data
    ○ Split into positive, negative, or average reviews based on stars given
    ○ Then split into useful or not useful by useful score
        ■ A review is considered useful if its useful score is above the dataset average
    ○ Six total categories
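The categorization steps above can be sketched in plain Python. This is a minimal illustration, not the authors' cluster code: the function names are hypothetical, the star cut-offs (4-5 positive, 3 average, 1-2 negative) are an assumption the slides do not state, and the input is assumed to be line-delimited JSON with `stars`, `useful`, and `text` fields.

```python
import json
from collections import defaultdict


def load_reviews(lines):
    """Parse line-delimited JSON: one review object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def star_group(stars):
    # Assumed cut-offs (not given in the slides):
    # 4-5 stars positive, 3 stars average, 1-2 stars negative.
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "average"


def categorize(reviews):
    """Group reviews into (star group, usefulness) categories."""
    if not reviews:
        return {}
    # A review counts as useful when its score is above the dataset average.
    mean_useful = sum(r["useful"] for r in reviews) / len(reviews)
    groups = defaultdict(list)
    for r in reviews:
        label = "useful" if r["useful"] > mean_useful else "not useful"
        groups[(star_group(r["stars"]), label)].append(r)
    return dict(groups)


# Review text often contains commas, which is what makes naive CSV
# splitting fragile; JSON keeps each review's text in one field.
sample = [
    '{"stars": 5, "useful": 4, "text": "Great, friendly staff"}',
    '{"stars": 1, "useful": 0, "text": "Cold food, slow service"}',
    '{"stars": 3, "useful": 2, "text": "Fine, nothing special"}',
]
categories = categorize(load_reviews(sample))
for key, members in sorted(categories.items()):
    print(key, len(members))
```

With three star groups and two usefulness labels, a full dataset yields up to six categories; the tiny sample here populates only three of them.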

Methodology Continued

● Each review category is then analyzed for:
    ○ Unique word count across all reviews
    ○ A list of words and their frequencies, sorted in descending order
    ○ Average cool score
    ○ Average funny score
    ○ Average review length
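The per-category measurements listed above could be computed along these lines. This is a single-machine sketch rather than the distributed job: `category_stats` is a hypothetical name, the `text`, `cool`, and `funny` fields are assumed, and length is measured in characters.

```python
from collections import Counter


def category_stats(reviews):
    """Summarize one review category; returns None for an empty category."""
    n = len(reviews)
    if n == 0:
        return None
    word_counts = Counter()
    for r in reviews:
        word_counts.update(r["text"].split())  # whitespace tokens
    return {
        "unique_words": len(word_counts),
        # (word, count) pairs sorted by descending count
        "word_frequencies": word_counts.most_common(),
        "avg_cool": sum(r["cool"] for r in reviews) / n,
        "avg_funny": sum(r["funny"] for r in reviews) / n,
        "avg_length": sum(len(r["text"]) for r in reviews) / n,
    }


sample_reviews = [
    {"text": "good food good service", "cool": 2, "funny": 0},
    {"text": "bad food", "cool": 0, "funny": 1},
]
stats = category_stats(sample_reviews)
print(stats["unique_words"], stats["avg_cool"], stats["avg_funny"], stats["avg_length"])
```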

Methodology Continued

● Word frequency was found by:
    ○ Splitting on whitespace
        ■ This has the effect of listing tokens such as “and” and “and,” separately
            ● Lets us know the context the word was used in
    ○ Removing common words
        ■ Common word list created by finding the 100 most common words in the dataset
            ● Then filtering them out
    ○ Counting unique words to find a “vocabulary”
        ■ We can then compare the size of vocabularies across each grouping
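A small sketch of the word-frequency procedure described above, under the assumption that it runs on an in-memory list of review texts. The helper names are hypothetical, and `top_n` is lowered to 2 in the example only so the toy data leaves something to count; the slides use the 100 most common words.

```python
from collections import Counter


def common_words(texts, top_n=100):
    """Return the top_n most frequent whitespace tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return {word for word, _ in counts.most_common(top_n)}


def filtered_frequencies(texts, stop_words):
    """Count tokens that are not in stop_words, sorted by descending count."""
    counts = Counter(
        word for text in texts for word in text.split() if word not in stop_words
    )
    return counts.most_common()


texts = [
    "the food and the service",
    "the food and, sadly, the wait",
]
stop_words = common_words(texts, top_n=2)
frequencies = filtered_frequencies(texts, stop_words)
vocabulary = {word for word, _ in frequencies}
print(sorted(stop_words))
print(len(vocabulary), sorted(vocabulary))
```

Because tokens are split on whitespace only, "and" and "and," remain separate vocabulary entries, which is the behavior the slide describes.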

Performance Benchmarks

● Was our data reasonable?
    ○ We can only look at our data and see whether the information we gathered makes logical sense
● Were the reviews that had the found characteristics useful?
● Can we exclude the useful rating and use all of our other found characteristics to find useful reviews?
● Cluster info
    ○ Ran on a Linux-based OS
    ○ Yelp Academic Dataset
    ○ Used 11 machines in our cluster

Performance Benchmarks Continued

● We asked 5 of our peers whether the reviews matching our found characteristics were more useful than those that didn’t match our criteria
    ○ 5/5 said the reviews we found were more useful than the reviews that didn’t match our criteria
        ■ We used only some of the review characteristics we found to create a filter
            ● Length within 10% of 1000 characters
            ● Cool score greater than or equal to 3
            ● Funny score greater than or equal to 2
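The three filter conditions above translate directly into a predicate. The sketch below assumes each review is a dict with `text`, `cool`, and `funny` fields; the function and parameter names are hypothetical, while the default thresholds come from the slide.

```python
def matches_filter(review, target_length=1000, tolerance=0.10,
                   min_cool=3, min_funny=2):
    """True when a review satisfies all three filter conditions."""
    length = len(review["text"])  # length in characters
    within_length = abs(length - target_length) <= tolerance * target_length
    return (
        within_length
        and review["cool"] >= min_cool
        and review["funny"] >= min_funny
    )


long_review = {"text": "x" * 950, "cool": 3, "funny": 2}
short_review = {"text": "x" * 400, "cool": 5, "funny": 4}
print(matches_filter(long_review))   # length 950 is within 900-1100
print(matches_filter(short_review))  # length 400 is outside the window
```

Note that the useful score itself is never consulted, which matches the benchmark question of whether useful reviews can be found without it.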

Performance Benchmarks Continued

● Reviewed findings against studies performed on reviews and news articles.
    ○ Extensive vocabularies with big words were often considered pompous and off-putting.
        ■ "Consequences of Erudite Vernacular Utilized Irrespective of Necessity: Problems with Using Long Words Needlessly" by D. Oppenheimer
    ○ Longer entries allow for more social engagement and interaction, leading to the review becoming more useful.
        ■ "The data's in! Should you write short posts or long ones?" by R. Marsh

Sample Review Matching our Criteria

BEAUTIFUL..BEAUTIFUL..BEAUTIFUL!!!

I admit I'm no high roller, but I know good taste. This was the nicest hotel I've ever stayed in and one of the best views of Las Vegas from the hotel suite (especially @ night). For a group of nine females (Bachelorette party), it was a decent size room and the bathroom was the perfect fit..roomy and luxuriously designed for us BETCHolerettes!

The decor throughout Encore was poppin' BRIGHT RED, but I didn't think it was too overpowering as some mentioned. Everything was very detailed from the floor to the ceiling, similar to Wynn! The staff was also friendly, willing to open the door with a greeting and welcoming smile.

Unfortunately for me, it was too damn HOT and my sensitive skin couldn't tolerate the 115 degree HEAT sooo I didn't fully experience the outdoor pool so much..BUUUT I did have an unforgettable time @ their Club, XS (check on my review). The BEST CLUB evers!!!! I don't think I can ever club here in San Diego again...it just wouldn't be the same.

I'd be so happy to stay @ Encore again!!! Anyone up for a Vegas trip? HOLLLAAA!!!

Sample Reviews That Didn’t Match Criteria

● I'm loving this place. Great atmosphere, friendly knowledgable service, and close to my house. Also I am very picky about my Chai tea, it's rare for me to find one that I truly like, but the Masala Chai is spot on Perfect. You also can't go wrong with the Rooibus tea, any flavor. Very happy I found this place.

● I sampled a few different flavors that I loved all of, but chose campfire smores and strawberry rhubarb. I wish I lived closer so I could come here more often.

Graphs

[Figure panels: High Star Not Useful, High Star Useful, Low Star Not Useful, Low Star Useful]

Insights and Conclusions

● Users prefer long reviews
● There was little difference in words used between the useful and not useful categories
● The tone of words used differed dramatically between positive, negative, and average reviews
● If a review made someone think it was cool or funny, it was likely to be found useful
● A model to predict which reviews are going to be useful could be built using:
    ○ High review length, about 1000 characters
    ○ Simple language (low distinct word count)
    ○ Cool score
    ○ Funny score
    ○ Certain words used


User reviews play a crucial role in influencing purchase decisions; however, sorting through reviews can be time-consuming. This project aims to define characteristics of useful Yelp reviews by analyzing gigabytes of text data. The methodology involves categorizing reviews, analyzing unique word counts, word frequencies, cool/funny scores, and review lengths to extract meaningful insights.


Uploaded on Sep 13, 2024 | 1 Views


Category             | Average Funny Score | Average Cool Score  | Average Length     | Total Reviews | Unique Words In Category
High Star Not Useful | 0.12667185159802408 | 0.23533594890019136 | 429.4961297824254  | 2597270       | 1531164
High Star Useful     | 1.949581908925232   | 3.0825044965635535  | 883.9967511583862  | 343372        | 759415
Mid Star Not Useful  | 0.18357741844882236 | 0.301990932070778   | 547.4958215303134  | 1246150       | 1121796
Mid Star Useful      | 2.732936926425382   | 3.986451858998811   | 1053.309691026993  | 222835        | 647131
Low Star Not Useful  | 0.30134342784309776 | 0.10835669228867256 | 668.4578543949663  | 1201181       | 1225543
Low Star Useful      | 1.730071176450031   | 0.9639341588714281  | 1085.7988595459153 | 343372        | 797528
