The growth of the blogosphere offers an unprecedented opportunity to study language and how people use it on a large scale. We present an analysis of over 140 million words of English text drawn from the blogosphere, exploring if and how age and gender affect writing style and topic. Our primary result is that a number of stylistic and content-based indicators are significantly affected by both age and gender, and that the main difference between older and younger bloggers, and between male and female bloggers, lies in the extent to which their discourse is outer- or inner-directed. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age.
A great deal of research has been carried out over the last few decades on how different groups of people use language differently (see, e.g., Labov, 1972; Biber and Finegan, 1994; Schneider, 2002). This research has often been constrained, however, by the time and expense needed to collect and annotate data. Studies therefore often have had to make do with comparatively small sample sizes, which makes it tricky to determine how general any results actually are.
The growth of the blogosphere, however, provides an interesting way out of this conundrum. Anyone can write a blog, and blogs are written about anything the blogger wishes and in whatever style they wish, typically with no editorial control. Moreover, blogs are electronically available for downloading, so that data collection is greatly eased. Since there are many millions of such blogs, the blogosphere offers an unprecedented opportunity to study, in a natural context and over a vast scale, how different groups of people write.
We report here our analysis of a large corpus of blog postings to see if and how writing topic and style vary with age and gender of the blogger. There has been much research interest in possible differences between male and female language use (Coates, 1986; Labov, 1990; Holmes, 1997; Bergvall, 1999), some of which has raised great interest in the popular literature (e.g., Tannen, 2001). It has also recently been shown that writing topic and style are useful indicators of age-linked psychological developments in personality, interests, and feelings (Pennebaker, et al., 2003; Pennebaker and Stone, 2003). As we have noted, however, previous studies have generally been limited by the difficulty of data gathering, and so have relied on relatively small amounts of text (cf. Bailey and Dyer, 1992; Biber, 1993; Schneider, 2002), often gathered in artificial laboratory settings.
Our corpus comprises over 140 million words of naturally occurring text from randomly selected blogs by men and women from their teens into their forties. By applying factor analysis and machine learning techniques, we demonstrate here clear and consistent patterns of age- and gender-linked variation in writing topic and style. We find that older bloggers tend to write about externally focused topics, while younger bloggers tend to write about more personally focused topics; changes in writing style with age are closely related. Perhaps surprisingly, similar patterns also characterize gender-linked differences in language style. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age. Our results thus confirm and generalize earlier results on age-linked (Pennebaker, et al., 2003; Pennebaker and Stone, 2003; Burger and Henderson, 2006) and gender-linked (Mulac and Lundell, 1994; Biber, 1994; Argamon, et al., 2003; Newman, et al., in press) variation in language use. We suggest that our results are best explained by positing a single factor distinguishing internal from external psychological focus that underlies both age- and gender-linked variation in language use. Preliminary results along these lines were previously presented in Schler, et al. (2006).
Automated blog analysis
To properly situate our current study, we note the growing literature on automated blog analysis, as exemplified by the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs and the annual Workshop on the Weblogging Ecosystem. Automated techniques related to our own have been applied to extracting and tracking feelings and opinions in the blogosphere (Ku, et al., 2006; Mishne and de Rijke, 2006; Mihalcea and Liu, 2006), social network and related analyses (Gruhl, et al., 2004; Hsu, et al., 2006; Lin, et al., 2006), analyzing weblog comments (Mishne and Glance, 2006), finding hot stories and trends in the blogosphere (Glance, et al., 2004; Wu and Tseng, 2006), and identifying spam blogs (or splogs), artificially created to boost search engine ratings or attract commercial traffic (Kolari, et al., 2006; Rubin and Liddy, 2006).
Previous work on gender and age effects on the blogosphere has generally been of comparatively small scale. Herring, et al. (2004) have considered several blog genres, particularly the distinction between personal journal type blogs and filter type blogs (which collect and filter information and links). They have noted that most filter blogs are written by male bloggers and by older bloggers. Similarly, Nowson, et al. (2005) found a strong effect of author sex on blog language, finding that female-authored blogs were more contextualized (as measured by Heylighen and Dewaele's (2002) F measure) than male-authored blogs. In this vein, Huffaker and Calvert (2005) found that teen bloggers are particularly likely to use blogging as a forum for exploring personal issues such as sexual identity. With a few exceptions (e.g., Herring, et al., 2004; Burger and Henderson, 2006), there has been little work on age in the blogosphere.
Some work on computer-mediated communication (CMC) other than blogs (i.e., discussion groups and email) has applied discourse and content analysis to relevant issues of gender-linked language. Thus, for example, it has been found that male-dominated discussion groups had more statements of fact and fewer self-disclosures (Savicki, et al., 1996), that women had higher rates of using emoticons in their messages (Witmer, 1996), and that email messages about vacations written by females mentioned more about social aspects and shopping while males focused more impersonally on the location, the journey, and local people (Colley and Todd, 2002).
Our research extends this previous work to the automated analysis of a much larger corpus of texts than those previously analyzed for such sociolinguistic variation (compare, e.g., Labov, 1990; Bailey and Dyer, 1992; Mulac and Lundell, 1994; Herring and Paolillo, 2006). This has the effect of minimizing possible sample biases, which is critical when dealing with over a thousand textual variables, as we do here. Moreover, as far as we are aware, our current study is the first to examine the relationship between how language use varies by age and how it varies by gender.
We gathered a collection of blogs from the Web site blogger.com in August 2004. We collected all blogs on the site which (a) contained at least 500 total words including at least 200 occurrences of common English words, and (b) had author-provided indication of both gender and age. We then randomly selected 10 percent of the documents as a holdout set (for purposes described below). This left an initial collection of 46,947 blogs, summarized in Table 1 (our unit of analysis throughout this paper is each blogger's collected writing from inception until harvest; we do not distinguish between different posts by a given blogger). Note that over 60 percent of bloggers age 17 and below are females, while over 60 percent of bloggers older than 17 are males.
Table 1: Distribution of blogs in our initial collection by age and gender.

Age             Female     Male    Total
13-17             6949     4120    11069
18-22             7393     7690    15083
23-27             4043     6062    10105
28-32             1686     3057     4743
33-37              860     1827     2687
38-42              374      819     1193
43-47              263      584      847
48 and older       314      906     1220
Total            21882    25065    46947
For purposes of analysis, formatting and non-English text were automatically removed from each blog. To enable reliable age categorization (since a blog can span several years of writing), all blogs for boundary ages (ages 18-22 and 28-32) were removed. Each blogger was categorized by age at time of harvest: 10s (ages 13-17), 20s (ages 23-27) and 30+ (ages 33-47), and also by gender: male and female. The number of blogs of each gender within each age category was equalized by randomly deleting surplus blogs from the larger gender category. The final corpus thus contained 19,320 blogs (8,240 in 10s, 8,086 in 20s, and 2,994 in 30+), comprising a total of 681,288 posts and over 140 million words. There were, on average, approximately 35 posts and 7300 words in each blog in the corpus.
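The balancing step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the record layout (dictionaries with hypothetical `age_group` and `gender` keys), the function name `equalize_genders`, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def equalize_genders(blogs, seed=0):
    """Randomly drop surplus blogs so that each age group keeps equally
    many blogs of each gender. `blogs` is a list of dicts with the
    hypothetical keys 'age_group' and 'gender'."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    groups = defaultdict(lambda: defaultdict(list))
    for blog in blogs:
        groups[blog["age_group"]][blog["gender"]].append(blog)

    balanced = []
    for by_gender in groups.values():
        # Size of the smaller gender category within this age group
        # (assumes both genders occur in the group).
        target = min(len(members) for members in by_gender.values())
        for members in by_gender.values():
            balanced.extend(rng.sample(members, target))
    return balanced
```

With the Table 1 counts for ages 13-17 (6949 female, 4120 male), such a procedure would keep 4120 blogs of each gender, matching the 8,240 blogs reported for the 10s category.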
We begin by considering the 1000 most frequent words in the corpus. These comprise 323 different function words and 677 different content words, accounting for 59.4 percent and 21.7 percent, respectively, of all word occurrences. We performed an automated factor analysis on the rate of use of each of the 677 content words, to find groups of related words that tend to occur in similar documents. This process, referred to as a meaning extraction method (Chung and Pennebaker, 2007), yielded twenty coherent factors that depict clear and distinct themes, mostly topic-related. Word lists for the twenty factors, along with suggestive headings (for reference), are given in Table 2. In addition, we divided the function words into several categories according to their parts of speech (pronouns, auxiliary verbs, etc.).
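As an illustration of the input to such a factor analysis, the sketch below computes the rate of use of selected words in one blog, expressed as occurrences per 100 tokens. The tokenizer (a simple lowercase regular expression), the per-100 scaling, and the function name `usage_rates` are assumptions for the example; the text does not specify these details, and the factor analysis itself is not shown.

```python
import re
from collections import Counter


def usage_rates(text, vocabulary):
    """Return {word: occurrences per 100 tokens} for each word in `vocabulary`."""
    tokens = re.findall(r"[a-z']+", text.lower())  # assumed tokenizer
    if not tokens:
        return {word: 0.0 for word in vocabulary}
    counts = Counter(tokens)
    return {word: 100.0 * counts[word] / len(tokens) for word in vocabulary}
```

Stacking one such rate vector per blog gives a blogs-by-words matrix, which is the kind of input a factor analysis over content words would take.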
Table 2: Words in each factor.

Factor: Words
Conversation: know, people, think, person, tell, feel, friends, talk, new, talking, mean, ask, understand, feelings, care, thinking, friend, relationship, realize, question, answer, saying
AtHome: woke, home, sleep, today, eat, tired, wake, watch, watched, dinner, ate, bed, day, house, tv, early, boring, yesterday, watching, sit
Family: years, family, mother, children, father, kids, parents, old, year, child, son, married, sister, dad, brother, moved, age, young, months, three, wife, living, college, four, high, five, died, six, baby, boy, spend, christmas
Time: friday, saturday, weekend, week, sunday, night, monday, tuesday, thursday, Wednesday, morning, tomorrow, tonight, evening, days, afternoon, weeks, hours, july, busy, meeting, hour, month, june
Work: work, working, job, trying, right, met, figure, meet, start, better, starting, try, worked, idea
PastActions: said, asked, told, looked, walked, called, talked, wanted, kept, took, sat, gave, knew, felt, turned, stopped, saw, ran, tried, picked, left, ended
Games: game, games, team, win, play, played, playing, won, season, beat, final, two, hit, first, video, second, run, star, third, shot, table, round, ten, chance, club, big, straight
Internet: site, email, page, please, website, web, post, link, check, blog, mail, information, free, send, comments, comment, using, internet, online, name, service, list, computer, add, thanks, update, message
Location: street, place, town, road, city, walking, trip, headed, front, car, beer, apartment, bus, area, park, building, walk, small, places, ride, driving, looking, local, sitting, drive, bar, bad, standing, floor, weather, beach, view
Fun: fun, im, cool, mom, summer, awesome, lol, stuff, pretty, ill, mad, funny, weird
Food/Clothes: food, eating, weight, lunch, water, hair, life, white, wearing, color, ice, red, fat, body, black, clothes, hot, drink, wear, blue, minutes, shirt, green, coffee, total, store, shopping
Poetic: eyes, heart, soul, pain, light, deep, smile, dreams, dark, hold, hands, head, hand, alone, sun, dream, mind, cold, fall, air, voice, touch, blood, feet, words, hear, rain, mouth
Books/Movies: book, read, reading, books, story, writing, written, movie, stories, movies, film, write, character, fact, thoughts, title, short, take, wrote
Religion: god, jesus, lord, church, earth, world, word, lives, power, human, believe, given, truth, thank, death, evil, own, peace, speak, bring, truly
Romance: forget, forever, remember, gone, true, face, spent, times, love, cry, hurt, wish, loved
Swearing: shit, fuck, fucking, ass, bitch, damn, hell, sucks, stupid, hate, drunk, crap, kill, guy, gay, kid, sex, crazy
Politics: bush, president, Iraq, kerry, war, american, political, states, america, country, government, john, national, news, state, support, issues, article, michael, bill, report, public, issue, history, party, york, law, major, act, fight, poor
Music: music, songs, song, band, cd, rock, listening, listen, show, favorite, radio, sound, heard, shows, sounds, amazing, dance
School: school, teacher, class, study, test, finish, english, students, period, paper, pass
Business: system, based, process, business, control, example, personal, experience, general
Table 3 shows the average usage frequency of each factor in each age and gender class, as well as the same data for function words grouped by part of speech.
Table 3: Mean frequencies of factor and part-of-speech usage by age and gender.

Factor              10s     20s     30+    Male  Female  Overall
Conversation       1.74    1.55    1.33    1.47    1.72    1.59
AtHome             1.11     .80     .75     .86     .98     .92
Family              .65     .75     .94     .69     .79     .74
Time                .65     .74     .68     .65     .73     .69
PastActions         .74     .62     .63     .62     .73     .68
Work                .61     .75     .70     .67     .69     .68
Games               .67     .66     .66     .76     .57     .67
Internet            .61     .63     .68     .74     .52     .63
Location            .52     .65     .63     .60     .58     .59
Fun                 .88     .36     .28     .50     .64     .57
Food/Clothes        .53     .55     .55     .49     .60     .54
Poetic              .52     .53     .52     .48     .57     .53
Books/Movies        .51     .54     .54     .54     .51     .53
Religion            .44     .50     .55     .50     .46     .48
Romance             .54     .44     .38     .39     .55     .47
Swearing            .54     .35     .25     .41     .42     .41
Politics            .27     .41     .56     .47     .28     .37
Music               .36     .29     .26     .34     .29     .32
School              .35     .19     .17     .26     .25     .26
Business            .07     .13     .16     .13     .08     .11
Articles           5.10    6.46    6.97    6.46    5.45    5.96
PersonalPronouns  11.72   10.44    9.88    9.84   11.97   10.96
AuxiliaryVerbs     9.04    8.90    8.83    8.76    9.14    8.95
Conjunctions       2.89    2.59    2.48    2.63    2.76    2.70
Prepositions      11.83   13.04   13.30   12.76   12.36   12.56
First of all, these results indicate clear differences in both preferred topic and preferred style between bloggers of different ages (see note 3). Usage of words associated with Family, Religion, Politics, Business, and Internet increases with age, while usage of words associated with Conversation, AtHome, Fun, Romance, Music, School, and Swearing decreases significantly with age. (All effects mentioned are statistically significant with p < 0.001.) None of the other factors varies directly with age in a statistically significant fashion. In addition to these topic-related differences in blogs with blogger age, we also see clear differences in style, as measured by frequencies of grammatical parts of speech. Usage of PersonalPronouns, Conjunctions, and AuxiliaryVerbs decreases significantly with age, while usage of Articles and Prepositions increases significantly with age.
In fact, such variations in word frequency can be exploited to effectively predict the age of a blog's writer. To show this, we computed, for each blog, a vector containing the frequencies in the blog of the above-mentioned 323 function words as well as the 1000 most informative words for age (see note 4). Two different machine-learning algorithms, Bayesian multinomial logistic regression (BMR: Madigan, et al., 2005) and multiclass balanced real-valued Winnow (WIN: Littlestone, 1988; Dagan, et al., 1997), were applied to these frequency vectors to construct classification models for author age. Tenfold cross-validation (see note 5) was used to estimate generalization accuracy. The results show automatic classification of an unseen document into the correct age interval (10s, 20s, or 30+) with an accuracy of 77.4 percent (using BMR) and 75.0 percent (using WIN). Examination of the confusion matrix shows that 10s are distinguishable from 30+ with over 96 percent accuracy, whereas distinguishing 20s from either of the other two classes is more difficult. Using only function words gives accuracies of 69.4 percent (BMR) and 67.7 percent (WIN), while using just the high information-gain words gives accuracies of 76.2 percent (BMR) and 75.9 percent (WIN). Thus, as we might have expected, topic preference is most related to blogger age, although there is definitely a marked effect on writing style as well.
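The information gain used to select informative words can be illustrated with a simplified sketch. It assumes each word is reduced to a binary feature (present or absent in a blog). That is a simplification: the study measures the information conveyed by a word's frequency, and how frequencies were discretized is not stated. The function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(has_word, labels):
    """Reduction in class entropy from knowing whether a word occurs.

    `has_word` is a list of booleans (one per blog) and `labels` holds the
    matching class labels (e.g. '10s', '20s', '30+')."""
    with_word = [y for present, y in zip(has_word, labels) if present]
    without = [y for present, y in zip(has_word, labels) if not present]
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder
```

A word that occurs only in blogs of one class scores high under this measure, while a word spread evenly across classes scores near zero; ranking words by the score and keeping the top 1000 gives a feature set of the kind described above.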
Regarding blogger gender, we see (Table 3) that Articles and Prepositions are used significantly more by male bloggers, while PersonalPronouns, Conjunctions, and AuxiliaryVerbs are used significantly more by female bloggers. These are the same features that we previously found to indicate male and female writing styles in published fiction and non-fiction works (Argamon, et al., 2003). In content-based features, we see the factors Religion, Politics, Business, and Internet used more frequently by male bloggers, while the factors Conversation, AtHome, Fun, Romance, and Swearing are more often used by female bloggers. (All effects mentioned are statistically significant with p < 0.001.) Prediction of author gender (as above) from function words and the 1000 words with highest information gain for gender gave accuracies of 79.3 percent (BMR) and 80.5 percent (WIN). These results are consistent with classification studies on author gender in other types of texts (Argamon, et al., 2003; de Vel, et al., 2002; Hota, et al., 2006).
It should be noted that style and content effects are highly correlated: use of multiple regressions indicates that controlling for style effects essentially eliminates content effects and vice versa. Thus, it may be that choice of content determines particular style preferences, or both content and style may be influenced by a single underlying variable such as genre preference (Herring, et al., 2004). It is highly probable, though, that a more general sociolinguistic variable underlies this phenomenon, for as we have noted, the results of the current study on gender-linked style are virtually identical to those found in studies of vastly differing genres, including published fiction and non-fiction (Argamon, et al., 2003).
Correlating age and gender
It has not escaped our attention that with few exceptions, the factors and parts of speech that are used significantly more by younger (older) bloggers are also used significantly more by female (male) bloggers. Thus, Articles, Prepositions, Religion, Politics, Business, and Internet are used more by male bloggers as well as older bloggers, while PersonalPronouns, Conjunctions, AuxiliaryVerbs, Conversation, AtHome, Fun, Romance, and Swearing are used more by female bloggers as well as younger bloggers. There are only three exceptions to this pattern: Family, used more by older bloggers and by females; Music, used more by younger bloggers and by males; and School, for which there is no significant difference between male and female usage.
The force of this observation is highlighted when examining those individual words that evince both strong age-linked and gender-linked effects. We consider the 316 words that are among both the 1000 words with highest information gain for age and the 1000 words with highest information gain for gender (as computed on the holdout set). The scatterplot in Figure 1 plots log(w(male)/w(female)) against log(w(30+)/w(10s)), where w(A) is the average frequency of word w in documents of class A. Note that every word but one (husband) lies in the first (male and 30+) or third (female and 10s) quadrants. That is, with just the one exception, every word we considered that is used more by females is used more by younger bloggers and vice versa. The Pearson correlation between the male/female and 30+/10s log-ratios is 0.71.
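The quantities plotted in Figure 1 can be computed along the following lines. This is a sketch under assumptions: the use of natural logarithms and the names `log_ratio`, `pearson`, and `same_sign_quadrant` are illustrative choices, not details given in the text.

```python
import math


def log_ratio(freq_a, freq_b):
    """log(w(A)/w(B)) for a word's mean frequencies in classes A and B."""
    return math.log(freq_a / freq_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def same_sign_quadrant(gender_lr, age_lr):
    """True if a word falls in the first or third quadrant of the plot."""
    return gender_lr * age_lr > 0
```

For each shared word, `log_ratio` applied to its male and female mean frequencies gives one coordinate and `log_ratio` applied to its 30+ and 10s mean frequencies gives the other; `pearson` over the two resulting lists gives the correlation between the gender and age log-ratios.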
Figure 1: Scatterplot showing log(w(male)/w(female)) on the x-axis plotted against log(w(30+)/w(10s)) on the y-axis. Points shown represent the words with highest information gain for both age and gender as described in the text.
The significance of these results is twofold. First is the fact that, in contradistinction to many previous similar studies, we have analyzed many millions of words of naturally occurring text. This fact lends credence to the conclusion that significant variation in our data reflects real variation in the world (or at least, the world of those likely to write English-language blogs), and is not a mere artifact of our experimental procedure.
Perhaps more significantly, however, our findings serve to link together earlier observations regarding age-linked and gender-linked writing variation that have not previously been connected. Previous studies investigating gender and language have shown gender-linked differences along dimensions of involvedness (Biber, 1995; Argamon, et al., 2003) or contextualization (Heylighen and Dewaele, 2002). Other studies have found age-linked differences in the immediacy and informality of writing (Pennebaker, et al., 2003). The current study suggests that these two sets of results are closely related. Indeed, they likely both reflect a single underlying distinction between inner- and outer-oriented communication that may explain both gender-linked and age-linked variation in language use.
About the authors
Shlomo Argamon is Associate Professor in the Department of Computer Science at the Illinois Institute of Technology in Chicago.
Email: argamon [at] iit [dot] edu
Moshe Koppel is Professor in the Department of Computer Science at Bar-Ilan University (Ramat Gan 52900, Israel).
Email: moishk [at] gmail [dot] com
James W. Pennebaker is Professor and Chair of the Department of Psychology at the University of Texas in Austin.
Email: pennebaker [at] mail [dot] utexas [dot] edu
Dr. Jonathan Schler, Department of Computer Science, Bar-Ilan University (Ramat Gan 52900, Israel).
Email: schler [at] gmail [dot] com
3. We must, of course, keep in mind that since this study is synchronic, we cannot separate generational effects from age effects. Moreover, since older bloggers are somewhat less common, they might represent an atypical demographic as early adopters of technology.
4. The informativeness of words for a particular text class (age or gender) was measured by the information gain measure (Quinlan, 1986), an information-theoretic formula estimating how much information about the class of a text is conveyed by knowing the frequency of a particular word in the text.
5. Tenfold cross-validation is a standard technique for estimating the generalization accuracy of a machine learning method (see Mitchell, 1997). The data is randomly divided into ten equally sized segments, and the system repeatedly trains on nine of them and tests on the remaining one; the average of these accuracies is the reported result. Thus we avoid testing on examples that were used in training.
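The cross-validation procedure in note 5 can be sketched as below. This is a generic illustration rather than the authors' experimental code: `train_fn` and `predict_fn` are hypothetical callables standing in for any learner (such as BMR or WIN), and the seed and function names are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumes n_items >= k so that no fold is empty."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal segments
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validated_accuracy(items, labels, train_fn, predict_fn, k=10):
    """Average test accuracy over k folds.

    `train_fn(xs, ys)` returns a model and `predict_fn(model, x)` returns
    a predicted label; both are supplied by the caller."""
    accuracies = []
    for train, test in k_fold_indices(len(items), k):
        model = train_fn([items[j] for j in train], [labels[j] for j in train])
        correct = sum(predict_fn(model, items[j]) == labels[j] for j in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Because each item appears in exactly one test fold and never in the training data for that fold, the averaged accuracy is an estimate of performance on unseen documents.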
S. Argamon, M. Koppel, J. Fine, and A.R. Shimoni, 2003. Gender, genre, and writing style in formal written texts, Text, volume 23, number 3, pp. 321-346; also at http://www.cs.biu.ac.il/~koppel/papers/male-female-text-final.pdf, accessed 21 August 2007.
G. Bailey and M. Dyer, 1992. An approach to sampling in dialectology, American Speech, volume 67, number 1, pp. 3-20.
V.L. Bergvall, 1999. Toward a comprehensive theory of language and gender, Language in Society, volume 28, pp. 273-293.
D. Biber, 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
D. Biber, 1994. An analytical framework for register studies, In: D. Biber and E. Finegan (editors). Sociolinguistic perspectives on register. New York: Oxford University Press, pp. 31-56.
D. Biber, 1993. Using register-diversified corpora for general language studies, Computational Linguistics, volume 19, number 2, pp. 219-241.
D. Biber and E. Finegan (editors), 1994. Sociolinguistic perspectives on register. New York: Oxford University Press.
J.D. Burger and J.C. Henderson, 2006. An exploration of observable features related to blogger age, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 15-20.
C.K. Chung and J.W. Pennebaker, 2007, in press. Revealing people's thinking in natural language: Using an automated meaning extraction method in open-ended self-descriptions, Journal of Research in Personality.
J. Coates, 1986. Women, men, and language: A sociolinguistic account of sex differences in language. London: Longman.
A. Colley and Z. Todd, 2002. Gender-linked differences in the style and content of e-mails to friends, Journal of Language and Social Psychology, volume 21, number 4, pp. 380-392.
I. Dagan, Y. Karov, and D. Roth, 1997. Mistake-driven learning in text categorization, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-97), pp. 55-63.
O. de Vel, M. Corney, A. Anderson, and G. Mohay, 2002. Language and gender author cohort analysis of e-mail for computer forensics, Proceedings of the Second Digital Forensic Research Workshop, at http://dfrws.org/2002/papers/Papers/Olivier_DeVel.pdf, accessed 21 August 2007.
N. Glance, M. Hurst, and T. Tomokiyo, 2004. BlogPulse: Automated trend discovery for weblogs, Proceedings of the First Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, at http://www.blogpulse.com/papers/www2004glance.pdf, accessed 21 August 2007.
D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins, 2004. Information diffusion through blogspace, Proceedings of the 13th International Conference on World Wide Web (New York, N.Y.), pp. 491-501, and at http://theory.lcs.mit.edu/~dln/papers/blogs/idib.pdf, accessed 21 August 2007.
S. Herring and J. Paolillo, 2006. Gender and genre variation in weblogs, Journal of Sociolinguistics, volume 10, number 4, pp. 439-459.
S.C. Herring, L.A. Scheidt, S. Bonus, and E. Wright, 2004. Bridging the gap: A genre analysis of weblogs, Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS '04). Los Alamitos, Calif.: IEEE Press; also at http://www.ics.uci.edu/~jpd/classes/ics234cw04/herring.pdf, accessed 21 August 2007.
F. Heylighen and J.-M. Dewaele, 2002. Variation in the contextuality of language: An empirical measure, Foundations of Science, volume 7, pp. 293-340.
J. Holmes, 1997. Women, language, and identity, Journal of Sociolinguistics, volume 1, pp. 195-224.
S. Hota, S. Argamon, M. Koppel, and I. Zigdon, 2006. Performing gender: Automatic stylistic analysis of Shakespeare's characters, Proceedings of the Digital Humanities Conference (Association for Computers in Humanities and the Association for Literary and Linguistic Computing), at http://lingcog.iit.edu/doc/hota_allc2006.pdf, accessed 21 August 2007.
W.H. Hsu, T. Weninger, T. Pydmarri, and M.S.R. Paradesi, 2006. Collaborative and structural recommendation of friends using Weblog-based social network analysis, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 55-60.
D.A. Huffaker and S.L. Calvert, 2005. Gender, identity, and language use in teenage blogs, Journal of Computer-Mediated Communication, volume 10, number 2, at http://jcmc.indiana.edu/vol10/issue2/huffaker.html, accessed 21 August 2007.
P. Kolari, A. Java, and T. Finin, 2006. Characterizing the Splogosphere, Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, at http://www.blogpulse.com/www2006-workshop/papers/splogosphere.pdf, accessed 21 August 2007.
L.-W. Ku, Y.-T. Liang, and H.-H. Chen, 2006. Opinion extraction, summarization and tracking in news and blog corpora, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 100-107, and at http://nlg18.csie.ntu.edu.tw:8080/opinion/SS0603KuLW.pdf, accessed 21 August 2007.
W. Labov, 1990. The intersection of sex and social class in the course of linguistic change, Language Variation and Change, volume 2, pp. 205-254.
W. Labov, 1972. Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.
Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. Tseng, 2006. Discovery of blog communities based on mutual awareness, Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, at http://www.blogpulse.com/www2006-workshop/papers/wwe2006-discovery-lin-final.pdf, accessed 21 August 2007.
N. Littlestone, 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, Machine Learning, volume 2, issue 4, pp. 285-318.
D. Madigan, A. Genkin, D.D. Lewis, and D. Fradkin, 2005. Bayesian multinomial logistic regression for author identification, 25th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (AIP Conference Proceedings), volume 803, pp. 509-516, and at http://www.stat.rutgers.edu/~madigan/mms/authorID-me05-fixed.pdf, accessed 21 August 2007.
R. Mihalcea and H. Liu, 2006. A corpus-based approach to finding happiness, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 139-144, and at http://www.cs.unt.edu/~rada/papers/mihalcea.aaai06ss.pdf, accessed 21 August 2007.
G. Mishne and M. de Rijke, 2006. Capturing global mood levels using blog posts, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 145-152, and at http://staff.science.uva.nl/~gilad/pubs/aaai06-blogmoods.pdf, accessed 21 August 2007.
G. Mishne and N. Glance, 2006. Leave a reply: An analysis of Weblog comments, Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, at http://www.blogpulse.com/www2006-workshop/papers/wwe2006-blogcomments.pdf, accessed 21 August 2007.
T.M. Mitchell, 1997. Machine learning. New York: McGraw-Hill.
A. Mulac and T.L. Lundell, 1994. Effects of gender-linked language differences in adults' written discourse: Multivariate tests of language effects, Language and Communication, volume 14, number 3, pp. 299-309.
M.L. Newman, C.J. Groom, L.D. Handelman, and J.W. Pennebaker, in press. Gender differences in language use: An analysis of 14,000 text samples, Discourse Processes.
S. Nowson, J. Oberlander, and A.J. Gill, 2005. Weblogs, genres, and individual differences, Proceedings of the 27th Annual Conference of the Cognitive Science Society (Stresa, Italy), pp. 1666-1671, and at http://www.ics.mq.edu.au/~snowson/papers/nowson-cogsci.pdf, accessed 21 August 2007.
J.W. Pennebaker and L.D. Stone, 2003. Words of wisdom: Language use over the lifespan, Journal of Personality and Social Psychology, volume 85, pp. 291-301.
J.W. Pennebaker, M.R. Mehl, and K. Niederhoffer, 2003. Psychological aspects of natural language use: Our words, ourselves, Annual Review of Psychology, volume 54, pp. 547-577.
J.R. Quinlan, 1986. Induction of decision trees, Machine Learning, volume 1, number 1, pp. 81-106.
V.L. Rubin and E.D. Liddy, 2006. Assessing credibility of Weblogs, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 187-190.
J. Schler, M. Koppel, S. Argamon, and J. Pennebaker, 2006. Effects of age and gender on blogging, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 199-205, and at http://lingcog.iit.edu/doc/springsymp-blogs-final.pdf, accessed 21 August 2007.
E.W. Schneider, 2002. Investigating variation and change in written documents, Chapter 3 of J.K. Chambers, P. Trudgill, and N. Schilling-Estes (editors). Handbook of language variation and change. Malden, Mass.: Blackwell Publishing.
D. Tannen, 2001. You just don't understand: Women and men in conversation. New York: Quill.
Y. Wu and B.L. Tseng, 2006. Important Weblog identification and hot story summarization, In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 221-227.
Paper received 28 July 2007; accepted 19 August 2007.
Copyright ©2007, First Monday.
Copyright ©2007, Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler.
Mining the Blogosphere: Age, gender and the varieties of self-expression by Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler
First Monday, volume 12, number 9 (September 2007),