You have to be logged in to leave a comment.

jupytext	kernelspec
[{formats ipynb,md:myst} {text_representation [{extension .md} {format_name myst} {format_version 0.13} {jupytext_version 1.11.5}]}]	[{display_name Python [conda env:text-data-class]} {language python} {name conda-env-text-data-class-py}]

Homework: State Your Assumptions

POLONIUSWhat do you read, my lord?

HAMLETWords, words, words

-- Hamlet, Act 2, Scene 2

This homework deals with the assumptions made when taking text from its original "raw" form into something more computable.

Assumptions about the shape of text (e.g. how to break a corpus into documents)
Assumptions about what makes a token, an entity, etc.
Assumptions about what interesting or important content looks like, and how that informs our analyses.

There are three parts:

Splitting Lines from Shakespeare
Tokenizing and Aligning lines into plays
Assessing and comparing characters from within each play

NBThis file is merely a template, with instructions; do not feel constrained to using it directly if you do not wish to.

+++

Get the Data

Since the class uses dvc, it is possible to get this dataset either using the command line (e.g. dvc import https://github.com/TLP-COI/text-data-course resources/data/shakespeare/shakespeare.txt), or using the python api (if you wish to use python)

from dvc.api import read,get_url
import pandas as pd

txt = read('resources/data/shakespeare/shakespeare.txt', 
           repo='https://github.com/TLP-COI/text-data-course')

print(txt[:250])

Make sure this works before you continue! Either way, it would likely be beneficial to have the data downloaded locally to keep from needing to re-dowload it every time.

+++

Part 1

Split the text file into a table, such that

each row is a single line of dialogue
there are columns for
1. the speaker
2. the line number
3. the line dialogue (the text)

Hint: you will need to use RegEx to do this rapidly. See the in-class "markdown" example!

Question(s):

What assumptions have you made about the text that allowed you to do this?

+++

Part 2

You have likely noticed that the lines are not all from the same play! Now, we will add some useful metadata to our table:

Determine a likely source title for each line
add the title as a 'play' column in the data table.
make sure to document your decisions, assumptions, external data sources, etc.

This is fairly open-ended, and you are not being judged completely on accuracy. Instead, think outside the box a bit as to how you might accomplish this, and attempt to justify whatever approximations or assumptions you felt were appropriate.

+++

Part 3

Pick one or more of the techniques described in this chapter:

keyword frequency
entity relationships
markov language model
bag-of-words, TF-IDF
semantic embedding

make a case for a technique to measure how important or interesting a speaker is. The measure does not have to be both important and interesting, and you are welcome to come up with another term that represents "useful content", or tells a story (happiest speaker, worst speaker, etc.)

Whatever you choose, you must

document how your technique was applied
describe why you believe the technique is a valid approximation or exploration of how important, interesting, etc., a speaker is.
list some possible weaknesses of your method, or ways you expect your assumptions could be violated within the text.

This is mostly about learning to transparently document your decisions, and iterate on a method for operationalizing useful analyses on text. Your explanations should be understandable; homeworks will be peer-reviewed by your fellow students.

Tip!

Press p or to see the previous file or, n or to see the next file

readme.md 3.7 KB

Permalink History Raw

Homework: State Your Assumptions

Get the Data

Part 1

Part 2

Part 3

Comments

Use Google Cloud Storage!

Specify your Google Storage bucket

Service Account Key

Congratulations!

Use AWS S3 as storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

Use any S3 compatible storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

Use Azure Cloud Storage!

Specify your Azure Storage bucket

Access key (If needed)

Congratulations!

rtbs-dev / text-data-spr22 connected to https://github.com/tbsexton/text-data-spr22.git

readme.md 3.7 KB Permalink History Raw

Homework: State Your Assumptions

Get the Data

Part 1

Part 2

Part 3

Comments

Use Google Cloud Storage!

Specify your Google Storage bucket

Service Account Key

Congratulations!

Use AWS S3 as storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

Use any S3 compatible storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

Use Azure Cloud Storage!

Specify your Azure Storage bucket

Access key (If needed)

Congratulations!

rtbs-dev
/
text-data-spr22
connected to https://github.com/tbsexton/text-data-spr22.git

readme.md 3.7 KB

Permalink History Raw