Intuitive fred888: Predicting the Future by Studying Social Media

Wednesday, March 27, 2013

Predicting the Future by Studying Social Media

Scholarly articles for predicting the future with social media
Predicting the future with social media - Asur - Cited by 264
Predicting tie strength with social media - Gilbert - Cited by 335
Predicting the present with google trends - Choi - Cited by 185

Search Results

Tech Report: predicting the Future With Social Media - HP Labs

www.hpl.hp.com/research/scl/papers/socialmedia/socialmedia.pdf
by S Asur - Cited by 255 - Related articles
Predicting the Future With Social Media. Sitaram Asur. Social Computing Lab. HP Labs. Palo Alto, California. Email: sitaram.asur@hp.com. Bernardo A.
Predicting the Future with Social Media - ACM Digital Library

dl.acm.org/citation.cfm?id=1914092
by S Asur - 2010 - Cited by 254 - Related articles
In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these ...
Predicting the Future with Social Media

arxiv.org › cs
by S Asur - 2010 - Cited by 255 - Related articles
Mar 29, 2010 – Abstract: In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that ...
Social media pros predict the future - Grand Rapids Business Journal

www.grbj.com/articles/75926-social-media-pros-predict-the-future
Jan 14, 2013 – What will the social media landscape look like throughout 2013? The Grand Rapids Business Journal asked local public relations, marketing ...
The Future of Social Media: 50+ Experts Share Their 2013 Predictions

Note: Many scholarly articles on this subject are listed above. I found the box-office predictions in the paper quoted below impressive. If you can predict box-office outcomes, you can likely also predict, to a greater or lesser degree, which countries will rise or fall through social media. The number of things that could potentially be discovered through social media seems, at this point at least, unlimited. End note. The quote below may not come through very well because of the formatting used in the original PDF, so going to this link directly might be more useful:
Predicting the future with social media

arXiv:1003.5699v1 [cs.CY] 29 Mar 2010

Predicting the Future With Social Media

Sitaram Asur

Social Computing Lab

HP Labs

Palo Alto, California

Email: sitaram.asur@hp.com

Bernardo A. Huberman

Social Computing Lab

HP Labs

Palo Alto, California

Email: bernardo.huberman@hp.com

Abstract

In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter.com to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media.

I. INTRODUCTION

Social media has exploded as a category of online discourse where people create content, share it, bookmark it and network at a prodigious rate. Examples include Facebook, MySpace, Digg, Twitter and JISC listservs on the academic side. Because of its ease of use, speed and reach, social media is fast changing the public discourse in society and setting trends and agendas in topics that range from the environment and politics to technology and the entertainment industry.

Since social media can also be construed as a form of collective wisdom, we decided to investigate its power at predicting real-world outcomes. Surprisingly, we discovered that the chatter of a community can indeed be used to make quantitative predictions that outperform those of artificial markets. These information markets generally involve the trading of state-contingent securities, and if large enough and properly designed, they are usually more accurate than other techniques for extracting diffuse information, such as surveys and opinion polls. Specifically, the prices in these markets have been shown to have strong correlations with observed outcome frequencies, and thus are good indicators of future outcomes [4], [5].

In the case of social media, the enormity and high variance of the information that propagates through large user communities presents an interesting opportunity for harnessing that data into a form that allows for specific predictions about particular outcomes, without having to institute market mechanisms. One can also build models to aggregate the opinions of the collective population and gain useful insights into their behavior, while predicting future trends. Moreover, gathering information on how people converse regarding particular products can be helpful when designing marketing and advertising campaigns [1], [3].

This paper reports on such a study. Specifically, we consider the task of predicting box-office revenues for movies using the chatter from Twitter, one of the fastest growing social networks on the Internet. Twitter¹, a micro-blogging network, has experienced a burst of popularity in recent months leading to a huge user-base, consisting of several tens of millions of users who actively participate in the creation and propagation of content.

We have focused on movies in this study for two main reasons.

• The topic of movies is of considerable interest among the social media user community, characterized both by a large number of users discussing movies, as well as a substantial variance in their opinions.

• The real-world outcomes can be easily observed from box-office revenue for movies.

Our goals in this paper are as follows. First, we assess how buzz and attention is created for different movies and how that changes over time. Movie producers spend a lot of effort and money in publicizing their movies, and have also embraced the Twitter medium for this purpose. We then focus on the mechanism of viral marketing and pre-release hype on Twitter, and the role that attention plays in forecasting real-world box-office performance. Our hypothesis is that movies that are well talked about will be well-watched.

Next, we study how sentiments are created, how positive and negative opinions propagate and how they influence people. For a bad movie, the initial reviews might be enough to discourage others from watching it, while on the other hand, it is possible for interest to be generated by positive reviews and opinions over time. For this purpose, we perform sentiment analysis on the data, using text classifiers to distinguish positively oriented tweets from negative.

Our chief conclusions are as follows:

• We show that social media feeds can be effective indicators of real-world performance.

• We discovered that the rate at which movie tweets are generated can be used to build a powerful model for predicting movie box-office revenue. Moreover our predictions are consistently better than those produced by an information market such as the Hollywood Stock Exchange, the gold standard in the industry [4].

• Our analysis of the sentiment content in the tweets shows that they can improve box-office revenue predictions based on tweet rates only after the movies are released.

¹ http://www.twitter.com

This paper is organized as follows. Next, we survey recent related work. We then provide a short introduction to Twitter and the dataset that we collected. In Section 5, we study how attention and popularity are created and how they evolve. We then discuss our study on using tweets from Twitter for predicting movie performance. In Section 6, we present our analysis on sentiments and their effects. We conclude in Section 7. We describe our prediction model in a general context in the Appendix.

II. RELATED WORK

Although Twitter has been very popular as a web service, there has not been considerable published research on it. Huberman and others [2] studied the social interactions on Twitter to reveal that the driving process for usage is a sparse hidden network underlying the friends and followers, while most of the links represent meaningless interactions. Java et al [7] investigated community structure and isolated different types of user intentions on Twitter. Jansen and others [3] have examined Twitter as a mechanism for word-of-mouth advertising, and considered particular brands and products while examining the structure of the postings and the change in sentiments. However the authors do not perform any analysis on the predictive aspect of Twitter.

There has been some prior work on analyzing the correlation between blog and review mentions and performance. Gruhl and others [9] showed how to generate automated queries for mining blogs in order to predict spikes in book sales. And while there has been research on predicting movie sales, almost all of it has used meta-data information on the movies themselves to perform the forecasting, such as the movie’s genre, MPAA rating, running time, release date, the number of screens on which the movie debuted, and the presence of particular actors or actresses in the cast. Joshi and others [10] use linear regression from text and metadata features to predict earnings for movies. Sharda and Delen [8] have treated the prediction problem as a classification problem and used neural networks to classify movies into categories ranging from ’flop’ to ’blockbuster’. Apart from the fact that they are predicting ranges rather than actual numbers, the best accuracy that their model can achieve is fairly low. Zhang and Skiena [6] have used a news aggregation model along with IMDB data to predict movie box-office numbers. We have shown how our model can generate better results when compared to their method.

III. TWITTER

Launched on July 13, 2006, Twitter² is an extremely popular online microblogging service. It has a very large user base, consisting of several millions of users (23M unique users in Jan³). It can be considered a directed social network, where each user has a set of subscribers known as followers. Each user submits periodic status updates, known as tweets, that consist of short messages of maximum size 140 characters. These updates typically consist of personal information about the users, news or links to content such as images, video and articles. The posts made by a user are displayed on the user’s profile page, as well as shown to his/her followers. It is also possible to send a direct message to another user. Such messages are preceded by @userid indicating the intended destination.

² http://www.twitter.com

A retweet is a post originally made by one user that is forwarded by another user. These retweets are a popular means of propagating interesting posts and links through the Twitter community.

Twitter has attracted lots of attention from corporations for the immense potential it provides for viral marketing. Due to its huge reach, Twitter is increasingly used by news organizations to filter news updates through the community. A number of businesses and organizations are using Twitter or similar micro-blogging services to advertise products and disseminate information to stakeholders.

IV. DATASET CHARACTERISTICS

The dataset that we used was obtained by crawling hourly feed data from Twitter.com. To ensure that we obtained all tweets referring to a movie, we used keywords present in the movie title as search arguments. We extracted tweets over frequent intervals using the Twitter Search API⁴, thereby ensuring we had the timestamp, author and tweet text for our analysis. We extracted 2.89 million tweets referring to 24 different movies released over a period of three months.

Movies are typically released on Fridays, with the exception of a few which are released on Wednesday. Since an average of 2 new movies are released each week, we collected data over a time period of 3 months from November to February to have sufficient data to measure predictive behavior. For consistency, we only considered the movies released on a Friday and only those in wide release. For movies that were initially in limited release, we began collecting data from the time it became wide. For each movie, we define the critical period as the time from the week before it is released, when the promotional campaigns are in full swing, to two weeks after release, when its initial popularity fades and opinions from people have been disseminated.
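The critical period defined above can be written down as a small date calculation. This is only a sketch under stated assumptions: it reads "the week before" as 7 days before release and "two weeks after" as 14 days after release, and the function names `critical_period` and `in_critical_period` are made up for illustration. The example date is Avatar's release date as listed in Table I.

```python
from datetime import date, timedelta


def critical_period(release):
    """Return (start, end) of the critical period for a release date.

    Assumption: 7 days before release through 14 days after release.
    """
    return release - timedelta(days=7), release + timedelta(days=14)


def in_critical_period(day, release):
    """True if `day` falls inside the critical period of `release`."""
    start, end = critical_period(release)
    return start <= day <= end


start, end = critical_period(date(2009, 12, 18))  # Avatar (Table I)
print(start, end)  # 2009-12-11 2010-01-01
```

Tweets could then be kept or dropped by calling `in_critical_period` on each tweet's date.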

³ http://blog.compete.com/2010/02/24/compete-ranks-top-sites-for-january-2010/

⁴ http://search.twitter.com/api/

Movie                              Release Date
Armored                            2009-12-04
Avatar                             2009-12-18
The Blind Side                     2009-11-20
The Book of Eli                    2010-01-15
Daybreakers                        2010-01-08
Dear John                          2010-02-05
Did You Hear About The Morgans     2009-12-18
Edge Of Darkness                   2010-01-29
Extraordinary Measures             2010-01-22
From Paris With Love               2010-02-05
The Imaginarium of Dr Parnassus    2010-01-08
Invictus                           2009-12-11
Leap Year                          2010-01-08
Legion                             2010-01-22
Twilight : New Moon                2009-11-20
Pirate Radio                       2009-11-13
Princess And The Frog              2009-12-11
Sherlock Holmes                    2009-12-25
Spy Next Door                      2010-01-15
The Crazies                        2010-02-26
Tooth Fairy                        2010-01-22
Transylmania                       2009-12-04
When In Rome                       2010-01-29
Youth In Revolt                    2010-01-08

TABLE I. NAMES AND RELEASE DATES FOR THE MOVIES WE CONSIDERED IN OUR ANALYSIS.

Some details on the movies chosen and their release dates are provided in Table 1. Note that some movies that were released during the period considered were not used in this study, simply because it was difficult to correctly identify tweets that were relevant to those movies. For instance, for the movie 2012, it was impractical to segregate tweets talking about the movie from those referring to the year. We have taken care to ensure that the data we have used was disambiguated and clean by choosing appropriate keywords and performing sanity checks.
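The keyword matching described above, and the ambiguity that ruled out the movie 2012, can be illustrated with a small sketch. This is not the authors' crawler, and the paper does not say how their keywords were matched. The functions `title_keywords` and `mentions_title` and the sample tweets are hypothetical, and the rule used here (every word of the title must appear in the tweet) is an assumption.

```python
import re


def title_keywords(title):
    """Lowercased alphanumeric words of a movie title."""
    return re.findall(r"[a-z0-9]+", title.lower())


def mentions_title(tweet, title):
    """True if every title keyword appears as a whole word in the tweet."""
    words = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    return all(keyword in words for keyword in title_keywords(title))


# A tweet about the movie matches its title keywords.
print(mentions_title("Just saw The Blind Side, great film", "The Blind Side"))  # True
# Unrelated text that lacks one keyword does not match.
print(mentions_title("Blind luck on my side today", "The Blind Side"))  # False
# A tweet about the year also matches the title "2012": a false positive.
print(mentions_title("My resolutions for 2012", "2012"))  # True
```

The last call shows why a title such as 2012 is hard to use: keyword matching alone cannot tell the movie from the year.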

Fig. 1. Time-series of tweets over the critical period for different movies. [Figure not reproduced; plot annotations: release weekend, weekend 2.]

The total data over the critical period for the 24 movies we considered includes 2.89 million tweets from 1.2 million users.

Fig 1 shows the timeseries trend in the number of tweets for movies over the critical period. We can observe that the busiest time for a movie is around the time it is released, following which the chatter invariably fades. The box-office revenue follows a similar trend with the opening weekend generally providing the most revenue for a movie.
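A daily time series like the one described for Fig 1 can be built by counting tweets per calendar day. The sketch below assumes each tweet carries a timestamp, as the dataset description says. The function name `tweets_per_day` and the sample timestamps are invented for illustration and are not data from the paper.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(timestamps):
    """Count tweets per calendar day from an iterable of datetimes."""
    counts = Counter(ts.date() for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical timestamps around a 2009-12-18 release.
sample = [
    datetime(2009, 12, 17, 9, 30),
    datetime(2009, 12, 18, 20, 5),
    datetime(2009, 12, 18, 22, 45),
    datetime(2009, 12, 19, 11, 0),
]
series = tweets_per_day(sample)
busiest = max(series, key=series.get)  # day with the most tweets
print(busiest, series[busiest])  # 2009-12-18 2
```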

Fig 2 shows how the number of tweets per unique author changes over time. We find that this ratio remains fairly consistent with a value between 1 and 1.5 across the critical period. Fig 3 displays the distribution of tweets by different authors over the critical period. The X-axis shows the number of tweets in the log scale, while the Y-axis represents the corresponding frequency of authors in the log scale. We can observe that it is close to a Zipfian distribution, with a few authors generating a large number of tweets. This is consistent with observed behavior from other networks [12]. Next, we examine the distribution of authors over different movies. Fig 4 shows the distribution of authors and the number of movies they comment on. Once again we find a power-law curve, with a majority of the authors talking about only a few movies.

Fig. 2. Number of tweets per unique authors for different movies. [Figure not reproduced; axes: Days, Tweets per authors; annotation: Release weekend.]

Fig. 3. Log distribution of authors and tweets. [Figure not reproduced; axes: log(tweets), log(frequency).]
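The two quantities discussed here, tweets per unique author and the log-log distribution of tweets per author, can be computed from a list holding one author name per tweet. This is a rough sketch: the function names and the sample data are hypothetical, and the natural logarithm is an assumption, since the base used in the figure is not stated.

```python
import math
from collections import Counter


def tweets_per_author(authors):
    """Ratio of total tweets to unique authors; one list entry per tweet."""
    authors = list(authors)
    return len(authors) / len(set(authors))


def log_log_distribution(authors):
    """Sorted (log(tweets), log(number of authors)) points."""
    per_author = Counter(authors)        # author -> number of tweets
    freq = Counter(per_author.values())  # number of tweets -> number of authors
    return [(math.log(k), math.log(n)) for k, n in sorted(freq.items())]


# Hypothetical sample: one heavy author, one moderate, four with a single tweet.
sample = ["a", "a", "a", "a", "b", "b", "c", "d", "e", "f"]
print(tweets_per_author(sample))      # 10 tweets / 6 authors
print(log_log_distribution(sample))   # points for 1, 2 and 4 tweets
```

On a distribution close to Zipfian, the points returned by `log_log_distribution` would fall roughly on a straight line with negative slope.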

V. ATTENTION AND POPULARITY

We are interested in studying how attention and popularity are generated for movies on Twitter, and the effects of this attention on the real-world performance of the movies considered.

A. Pre-release Attention:

Prior to the release of a movie, media companies and producers generate promotional information in the form of trailer videos, news, blogs and photos. We expect the tweets for movies before the time of their release to consist primarily of such promotional campaigns, geared to promote word-of-mouth cascades. On Twitter, this can be characterized by tweets referring to particular urls (photos, trailers and other promotional material) as well as retweets, which involve users forwarding tweet posts to everyone in their friend-list. Both these forms of tweets are important to disseminate information regarding movies being released.

Fig. 4. Distribution of total authors and the movies they comment on. [Figure not reproduced; axes: Number of Movies, Authors.]

Features    Week 0    Week 1    Week 2
url         39.5      25.5      22.5
retweet     12.1      12.1      11.66

TABLE II. URL AND RETWEET PERCENTAGES FOR CRITICAL WEEK
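Percentages of tweets that contain urls or are retweets could be computed along the following lines. The paper does not say how retweets were detected, so this sketch assumes the "RT @user" convention. The regular expressions, the function name `feature_percentages` and the sample tweets are all illustrative.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Assumption: a retweet is marked with the "RT @user" convention.
RETWEET_RE = re.compile(r"(^|\s)RT\s+@\w+")


def feature_percentages(tweets):
    """Percentage of tweets containing a url and percentage that are retweets."""
    tweets = list(tweets)
    if not tweets:
        return {"url": 0.0, "retweet": 0.0}
    with_url = sum(1 for t in tweets if URL_RE.search(t))
    retweets = sum(1 for t in tweets if RETWEET_RE.search(t))
    return {
        "url": 100.0 * with_url / len(tweets),
        "retweet": 100.0 * retweets / len(tweets),
    }


# Hypothetical tweets for one movie in one week.
sample = [
    "Watch the new trailer http://example.com/trailer",
    "RT @studio: tickets on sale now http://example.com/tix",
    "Can't wait to see this movie on Friday",
    "RT @friend: saw it last night, loved it",
]
print(feature_percentages(sample))  # {'url': 50.0, 'retweet': 50.0}
```

Running this once per week of the critical period would give one row of percentages per feature.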

First, we examine the distribution of such tweets for different movies, following which we examine their correlation with the performance of the movies.

Fig. 5. Percentages of urls in tweets for different movies. [Figure not reproduced; axes: Movies, Tweets with urls (percentage); series: Week 0, Week 1, Week 2.]

Table 2 shows the percentages of urls and retweets in the tweets over the critical period for movies. We can observe that there is a greater percentage of tweets containing urls in the week prior to release than afterwards. This is consistent with our expectation. In the case of retweets, we find the values to be similar across the 3 weeks considered. In all, we found the retweets to be a significant minority of the tweets on movies. One reason for this could be that people tend to describe their own expectations and experiences, which are not necessarily propaganda.

Features    Correlation    R²
url         0.64           0.39
retweet     0.5            0.20

TABLE III. CORRELATION AND R² VALUES FOR URLS AND RETWEETS BEFORE RELEASE.

Features                         Adjusted R²    p-value
Avg Tweet-rate                   0.80           3.65e-09
Tweet-rate timeseries            0.93           5.279e-09
Tweet-rate timeseries + thcnt    0.973          9.14e-12
HSX timeseries + thcnt           0.965          1.030e-10

TABLE IV. COEFFICIENT OF DETERMINATION (R²) VALUES USING DIFFERENT PREDICTORS FOR MOVIE BOX-OFFICE REVENUE FOR THE FIRST WEEKEND.

We want to determine whether movies that have greater publicity, in terms of linked urls on Twitter, perform better in the box office. When we examined the correlation between the urls and retweets with the box-office performance, we found the correlation to be moderately positive, as shown in Table 3. However, the adjusted R² value is quite low in both cases, indicating that these features are not very predictive of the relative performance of movies. This result is quite surprising since we would expect promotional material to contribute significantly to a movie’s box-office income.
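The link between a moderate correlation and a low adjusted R² can be checked with a short calculation. The sketch below assumes a simple linear regression with a single predictor, so that R² equals the squared correlation r², and a sample of n = 24 movies. It uses the standard formula adjusted R² = 1 - (1 - r²)(n - 1)/(n - p - 1), where p is the number of predictors. The authors' exact procedure is not given here, so this is an approximation and not a reproduction of Table 3.

```python
def adjusted_r2(r, n, p=1):
    """Adjusted R^2 from a correlation r, sample size n and p predictors.

    Assumption: simple linear regression, so R^2 = r * r.
    """
    r2 = r * r
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Correlations as reported in Table 3; n = 24 movies in the dataset.
for name, r in [("url", 0.64), ("retweet", 0.5)]:
    print(name, round(adjusted_r2(r, n=24), 2))
# url 0.38
# retweet 0.22
```

These values land close to the 0.39 and 0.20 shown in Table 3. The small differences are what one would expect if the reported correlations were rounded to two digits.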

B. Prediction of first weekend Box-office revenues

Next, we investigate the power of social media in predicting real-world outcomes. Our goal is to observe if the knowledge that can be extracted from the tweets can lead to reasonably accurate prediction of future outcomes in the real world.

The problem that we wish to tackle can be framed as follows. Using the tweets referring to movies prior to their release, can we accurately predict the box-office revenue generated by the movie in its opening weekend?

Fig. 6. Predicted vs Actual box office scores using tweet-rate and HSX predictors. [Figure not reproduced; axes: Predicted Box-office Revenue, Actual revenue; series: Tweet-rate, HSX.]

To use a quantifiable measure on the tweets, we define the tweet-rate, as the number of tweets referring to a particular [...]

While in this study we focused on the problem of predicting box office revenues of movies for the sake of having a clear metric of comparison with other methods, this method can be extended to a large panoply of topics, ranging from the future rating of products to agenda setting and election outcomes. At a deeper level, this work shows how social media expresses a collective wisdom which, when properly tapped, can yield an extremely powerful and accurate indicator of future outcomes.

VIII. APPENDIX: GENERAL PREDICTION MODEL FOR SOCIAL MEDIA

Although we focused on movie revenue prediction in this paper, the method that we advocate can be extended to other products of consumer interest.

We can generalize our model for predicting the revenue of a product using social media as follows. We begin with data collected regarding the product over time, in the form of reviews, user comments and blogs. Collecting the data over time is important as it can measure the rate of chatter effectively. The data can then be used to fit a linear regression model using least squares. The parameters of the model include:

• A: rate of attention seeking

• P: polarity of sentiments and reviews

• D: distribution parameter

Let $y$ denote the revenue to be predicted and $\epsilon$ the error. The linear regression model can be expressed as:

$$ y = \beta_a * A + \beta_p * P + \beta_d * D + \epsilon \qquad (4) $$

where the $\beta$ values correspond to the regression coefficients. The attention parameter captures the buzz around the product in social media. In this article, we showed how the rate of tweets on Twitter can capture attention on movies accurately. We found this coefficient to be the most significant in our experiments. The polarity parameter relates to the opinions and views that are disseminated in social media. We observed that this gains importance after the movie has been released and adds to the accuracy of the predictions. In the case of movies, the distribution parameter is the number of theaters a particular movie is released in. In the case of other products, it can reflect their availability in the market.
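A least-squares fit of the three-parameter model in Eq. (4) can be sketched in plain Python. This is an illustration and not the authors' implementation. It solves the normal equations directly and fits no intercept, following the form of Eq. (4). The observations and the coefficients in `true_beta` are made up, and the revenues are generated without noise so that the fit can be checked against known values.

```python
def fit_least_squares(X, y):
    """Least-squares coefficients for y ~ X, with no intercept (as in Eq. 4).

    Solves the normal equations (X^T X) beta = X^T y by Gauss-Jordan
    elimination with partial pivoting.
    """
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y], size k x (k + 1).
    aug = []
    for i in range(k):
        row = [sum(obs[i] * obs[j] for obs in X) for j in range(k)]
        row.append(sum(obs[i] * target for obs, target in zip(X, y)))
        aug.append(row)
    for col in range(k):
        # Pick the largest remaining pivot in this column for stability.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Made-up observations, one row per product: (A, P, D).
X = [
    (1.0, 0.2, 1.0),
    (2.0, 0.5, 1.5),
    (3.0, 0.1, 2.5),
    (4.0, 0.9, 2.0),
    (5.0, 0.4, 3.5),
]
true_beta = (2.0, 0.5, 3.0)  # hypothetical coefficients, not from the paper
# Noise-free revenues generated from the model itself.
y = [sum(b * v for b, v in zip(true_beta, row)) for row in X]

beta_a, beta_p, beta_d = fit_least_squares(X, y)
print(beta_a, beta_p, beta_d)  # close to the values in true_beta
```

With real data the revenues would contain noise, and a numerical library would be a more robust choice than solving the normal equations by hand.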

IX. ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under Grant #0937060 to the Computing Research Association for the CIFellows Project.

REFERENCES

[1] Jure Leskovec, Lada A. Adamic and Bernardo A. Huberman. The dynamics of viral marketing. In Proceedings of the 7th ACM Conference on Electronic Commerce, 2006.

[2] Bernardo A. Huberman, Daniel M. Romero, and Fang Wu. Social networks that matter: Twitter under the microscope. First Monday, 14(1), Jan 2009.

[3] B. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 2009.

[4] D. M. Pennock, S. Lawrence, C. L. Giles, and F. Å. Nielsen. The real power of artificial markets. Science, 291(5506):987–988, Jan 2001.

[5] Kay-Yut Chen, Leslie R. Fine and Bernardo A. Huberman. Predicting the Future. Information Systems Frontiers, 5(1):47–61, 2003.

[6] W. Zhang and S. Skiena. Improving movie gross prediction through news analysis. In Web Intelligence, pages 301–304, 2009.

[7] Akshay Java, Xiaodan Song, Tim Finin and Belle Tseng. Why we twitter: understanding microblogging usage and communities. Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 56–65, 2007.

[8] Ramesh Sharda and Dursun Delen. Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, vol 30, pp 243–254, 2006.

[9] Daniel Gruhl, R. Guha, Ravi Kumar, Jasmine Novak and Andrew Tomkins. The predictive power of online chatter. SIGKDD Conference on Knowledge Discovery and Data Mining, 2005.

[10] Mahesh Joshi, Dipanjan Das, Kevin Gimpel and Noah A. Smith. Movie Reviews and Revenues: An Experiment in Text Regression. NAACL-HLT, 2010.

[11] Rion Snow, Brendan O’Connor, Daniel Jurafsky and Andrew Y. Ng. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. Proceedings of EMNLP, 2008.

[12] Fang Wu, Dennis Wilkinson and Bernardo A. Huberman. Feedback Loops of Attention in Peer Production. Proceedings of SocialCom-09: The 2009 International Conference on Social Computing, 2009.

[13] Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2), pp. 1–135, 2008.

[14] Namrata Godbole, Manjunath Srinivasaiah and Steven Skiena. Large-Scale Sentiment Analysis for News and Blogs. Proc. Int. Conf. Weblogs and Social Media (ICWSM), 2007.

end quote from:

http://arxiv.org/pdf/1003.5699.pdf
