"cell_type": "markdown",
"metadata": {},
"source": [
- "# Gender roles and pronouns in some 19th century novels by women\n",
+ "# Gender roles and pronouns in some 19th century novels by women<a name=\"top\"></a>\n",
"\n",
"This is essentially a replication of the text analysis by Julia Silge on [Gender roles with text mining and _n_-grams](https://juliasilge.com/blog/gender-pronouns/), which in turn was an attempt at a similar study to that contained in [Understanding Gender and Character Agency in the 19th Century Novel](http://culturalanalytics.org/2016/12/understanding-gender-and-character-agency-in-the-19th-century-novel/) by Matthew Jockers and Gabi Kirilloff. \n",
"\n",
"The idea is to get an insight into gender roles and activity in novels by looking at the verbs which are associated with men and women. The Jockers and Kirilloff study used the Stanford CoreNLP engine for detailed parsing of the text; Silge used simple word bigram analysis to find words that follow gendered pronouns.\n",
"\n",
- "This notebook does the same analysis as Silge, but using the tools available to TM351 students.\n",
+ "This notebook does the same analysis as Silge, but using the tools available to [TM351](http://www.open.ac.uk/courses/qualifications/details/tm351) students.\n",
"\n",
- "The books were downloaded from [Project Gutenberg](http://onlinebooks.library.upenn.edu/webbin/gutbook/author?name=Austen%2C%20Jane%2C%201775-1817)"
+ "The books were downloaded from [Project Gutenberg](http://onlinebooks.library.upenn.edu/webbin/gutbook/author?name=Austen%2C%20Jane%2C%201775-1817)\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "1. [Read the books and get the bigrams](#readbooks)\n",
+ "1. [Find the most skewed gendered bigrams](#skewedbigrams)\n",
+ "1. [Odds ratio: a quantification of the skew](#oddsratio)\n",
+ "1. [Plot the most skewed words](#plotting)\n",
+ "\n",
+ "### Other authors\n",
+ "* [George Eliot](#eliot)\n",
+ "* [Charlotte Brontë](#bronte)\n",
+ "* [Oscar Wilde](#wilde)\n",
+ "* [Charles Dickens](#dickens)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Read the books and get the bigrams\n",
+ "## Read the books and get the bigrams<a name=\"readbooks\"></a>\n",
"\n",
"First, define the books and the files they're in. This assumes you've already downloaded the books and stored them in the same directory as this notebook.\n",
"\n",
- "In the books I used, I removed the Gutenberg-specific introduction and licence text at the start and end of the files. "
+ "In the books I used, I removed the Gutenberg-specific introduction and licence text at the start and end of the files. \n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Find the most skewed gendered bigrams\n",
- "Gendered bigrams are those with 'he' or 'she' in the first position. I use a Pandas Series to store the gendered bigrams, then apply `value_counts()` to count how many of each there are."
+ "# Find the most skewed gendered bigrams<a name=\"skewedbigrams\"></a>\n",
+ "Gendered bigrams are those with 'he' or 'she' in the first position. I use a Pandas Series to store the gendered bigrams, then apply `value_counts()` to count how many of each there are.\n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Odds ratio: a quantification of the skew\n",
+ "## Odds ratio: a quantification of the skew<a name=\"oddsratio\"></a>\n",
"Now find the odds ratio, which is the ratio of the probability of each word being preceded by 'she' to the probability of it being preceded by 'he'. \n",
"\n",
"To keep the numbers in a sensible range, take the log of the ratio.\n",
"\n",
- "Because not every work appears for both genders, we apply some 'smoothing' to avoid things blowing up. The smoothing in the original blog post was to add one to the number of occurrences of each genered bigram, and add one to the total number of bigrams for that gender. A slightly less bad version is to assume we've seen each possible bigram some small number of times (e.g. 0.1) and adust all the scores accordingly."
+ "Because not every word appears for both genders, we apply some 'smoothing' to stop zero counts making the log ratio blow up. The smoothing in the original blog post was to add one to the number of occurrences of each gendered bigram, and add one to the total number of bigrams for that gender. A slightly less bad version is to assume we've seen each possible bigram some small number of times (e.g. 0.1) and adjust all the scores accordingly.\n",
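+ "\n",
+ "As a rough sketch of the add-one smoothing described above (the symbols are my own shorthand, not notation from the original post), write $n_{she}(w)$ and $n_{he}(w)$ for the number of times word $w$ follows each pronoun, and $N_{she}$ and $N_{he}$ for the total number of bigrams starting with each pronoun. The base of the log is left open here, as it only rescales the scores:\n",
+ "\n",
+ "$$\\text{log ratio}(w) = \\log \\frac{(n_{she}(w) + 1) / (N_{she} + 1)}{(n_{he}(w) + 1) / (N_{he} + 1)}$$\n",
+ "\n",
+ "A word that never follows 'she' then has a numerator of $1 / (N_{she} + 1)$ rather than zero, so the log stays finite.\n",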
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Plotting the most skewed words\n",
+ "# Plot the most skewed words<a name=\"plotting\"></a>\n",
"\n",
"Extract the words with the greatest skew, put them in a new DataFrame, and give each one a number so we can get back to it in the plotting.\n",
"\n",
- "The window says how many from each end of the list of gendered words, such as the 15 most female and the 15 most make.\n",
+ "The window says how many words to take from each end of the list of gendered words, such as the 15 most female and the 15 most male.\n",
"\n",
- "If we want to exclude words from plotting, pass in a list of stopwords."
+ "If we want to exclude words from plotting, pass in a list of stopwords.\n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# George Eliot\n",
- "Do the same with the books of George Eliot, another 19th century female novelist, reusing the functions defined above."
+ "# George Eliot<a name=\"eliot\"></a>\n",
+ "Do the same with the books of George Eliot, another 19th century female novelist, reusing the functions defined above.\n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Charlotte Brontë's book, _Jane Eyre_\n",
- "This was called out in the original paper fitting the pattern less well, with women taking more active roles and men having more emotions. Does this pan out?"
+ "# Charlotte Brontë's book, _Jane Eyre_<a name=\"bronte\"></a>\n",
+ "This was called out in the original paper as fitting the pattern less well, with women taking more active roles and men having more emotions. Does this pan out?\n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Now with Oscar Wilde's book, _The Picture of Dorian Grey_\n",
+ "# Now with Oscar Wilde's book, _The Picture of Dorian Gray_<a name=\"wilde\"></a>\n",
"\n",
- "Not in Silge's article, but in the original paper, _The Picture of Dorian Grey_ was called out as another gender-swap, with women being active and men being emotional."
+ "Not in Silge's article, but in the original paper, _The Picture of Dorian Gray_ was called out as another gender-swap, with women being active and men being emotional.\n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Charles Dickens's books\n",
- "A typically male 19th century author, with many books with male protagonists."
+ "# Charles Dickens's books<a name=\"dickens\"></a>\n",
+ "A typically male 19th century author, with many books with male protagonists.\n",
+ "\n",
+ "* [Back to top](#top)"
]
},
{