---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.5
kernelspec:
  display_name: Python [conda env:text-data-class]
  language: python
  name: conda-env-text-data-class-py
---
POLONIUS: What do you read, my lord?
HAMLET: Words, words, words.
-- Hamlet, Act 2, Scene 2
This homework deals with the assumptions made when taking text from its original "raw" form into something more computable.
There are three parts:
NB: This file is merely a template with instructions; do not feel constrained to use it directly if you do not wish to.
+++
Since the class uses `dvc`, it is possible to get this dataset either using the command line (e.g. `dvc import https://github.com/TLP-COI/text-data-course resources/data/shakespeare/shakespeare.txt`), or using the Python API (if you wish to use Python):
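from dvc.api import read, get_url
import pandas as pd

# Read the file straight from the course repository (requires network access).
txt = read('resources/data/shakespeare/shakespeare.txt',
           repo='https://github.com/TLP-COI/text-data-course')
print(txt[:250])  # preview the first 250 characters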
Make sure this works before you continue! Either way, it would likely be beneficial to have the data downloaded locally, so that you do not need to re-download it every time.
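One way to keep a local copy is a small caching helper. The sketch below is only an illustration: `load_cached_text`, `fetch_from_dvc`, and the local path `data/shakespeare.txt` are hypothetical names, not part of the course materials. The only `dvc` call it relies on is the same `dvc.api.read` shown above.

```python
from pathlib import Path


def load_cached_text(local_path, fetch):
    """Return the text stored at local_path, calling fetch() only if the file is missing."""
    local_path = Path(local_path)
    if not local_path.exists():
        # First run: download once and save a local copy.
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_text(fetch(), encoding="utf-8")
    return local_path.read_text(encoding="utf-8")


def fetch_from_dvc():
    """Download the Shakespeare text from the course repository (needs network access)."""
    # Imported lazily so the caching helper itself works without dvc installed.
    from dvc.api import read
    return read('resources/data/shakespeare/shakespeare.txt',
                repo='https://github.com/TLP-COI/text-data-course')


# Usage (hypothetical local path):
# txt = load_cached_text("data/shakespeare.txt", fetch_from_dvc)
```

Passing the download step in as a function keeps the cache logic independent of `dvc`, so the same helper also works if you fetched the file with `dvc import` on the command line and simply point it at that path.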
+++
Split the text file into a table, such that
Hint: you will need to use RegEx to do this rapidly. See the in-class "markdown" example!
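As a rough illustration of the RegEx hint, the sketch below assumes a layout that this assignment does not confirm: each speech starts with a speaker name followed by a colon on its own line, then one or more lines of dialogue, with a blank line between speeches. The pattern, the function name `speeches_to_rows`, and the column names `speaker` and `line` are all illustrative choices; check them against the real file before relying on them.

```python
import re

# ASSUMED layout (verify against the actual file):
#   SPEAKER NAME:
#   one or more lines of dialogue
#   <blank line>
SPEECH = re.compile(
    r"^([A-Z][^\n:]*):\n(.+?)(?=\n\n|\Z)",
    re.MULTILINE | re.DOTALL,
)


def speeches_to_rows(text):
    """Return one dict per line of dialogue, tagged with the speaker of its speech."""
    rows = []
    for match in SPEECH.finditer(text):
        speaker, speech = match.group(1), match.group(2)
        for line in speech.strip().splitlines():
            rows.append({"speaker": speaker, "line": line.strip()})
    return rows


sample = (
    "POLONIUS:\nWhat do you read, my lord?\n\n"
    "HAMLET:\nWords, words, words.\n\n"
    "HAMLET:\nTo be, or not to be: that is the question:\n"
    "Whether 'tis nobler in the mind to suffer\n"
)
rows = speeches_to_rows(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)` to get a table. Note that a dialogue line ending in a colon (as in the last speech) is not mistaken for a speaker, because the pattern only looks for a speaker at the start of a block.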
Question(s):
+++
You have likely noticed that the lines are not all from the same play! Now, we will add some useful metadata to our table:
This is fairly open-ended, and you are not being judged completely on accuracy. Instead, think outside the box a bit as to how you might accomplish this, and attempt to justify whatever approximations or assumptions you felt were appropriate.
+++
Pick one or more of the techniques described in this chapter:
Make a case for a technique to measure how important or interesting a speaker is. The measure does not have to capture both importance and interest, and you are welcome to come up with another term that represents "useful content", or that tells a story (happiest speaker, worst speaker, etc.).
Whatever you choose, you must
This is mostly about learning to transparently document your decisions, and iterate on a method for operationalizing useful analyses on text. Your explanations should be understandable; homeworks will be peer-reviewed by your fellow students.
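To make "operationalizing" a measure concrete, here is a deliberately naive baseline: rank speakers by how many words they speak. This is not one of the chapter's techniques and not a recommended answer; the function name `words_per_speaker`, the `speaker`/`line` row format, and the tokenizing pattern are assumptions made for illustration. Its value is mainly as a reference point whose assumptions (for example, that talking more means mattering more) you can state and then criticize.

```python
import re
from collections import Counter


def words_per_speaker(rows):
    """Count the words spoken by each speaker.

    rows is assumed to be an iterable of dicts with 'speaker' and 'line' keys.
    """
    counts = Counter()
    for row in rows:
        # Crude tokenizer: runs of letters and apostrophes count as words.
        counts[row["speaker"]] += len(re.findall(r"[A-Za-z']+", row["line"]))
    return counts


rows = [
    {"speaker": "POLONIUS", "line": "What do you read, my lord?"},
    {"speaker": "HAMLET", "line": "Words, words, words."},
    {"speaker": "HAMLET", "line": "To be, or not to be: that is the question:"},
]
counts = words_per_speaker(rows)
ranking = counts.most_common()  # speakers sorted from most to fewest words
print(ranking)
```

A fuller treatment would document why word count is or is not a fair proxy, and what happens to the ranking when the tokenizer or the unit of counting (words, lines, speeches) changes.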