---
jupytext:
  formats: ipynb,md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.5
kernelspec:
  display_name: Python [conda env:text-data-class]
  language: python
  name: conda-env-text-data-class-py
---

Homework: State Your Assumptions

POLONIUS: What do you read, my lord?

HAMLET: Words, words, words.

-- Hamlet, Act 2, Scene 2

This homework deals with the assumptions made when taking text from its original "raw" form into something more computable.

  • Assumptions about the shape of text (e.g. how to break a corpus into documents)
  • Assumptions about what makes a token, an entity, etc.
  • Assumptions about what interesting or important content looks like, and how that informs our analyses.

There are three parts:

  1. Splitting Lines from Shakespeare
  2. Tokenizing and aligning lines into plays
  3. Assessing and comparing characters from within each play

NB: This file is merely a template with instructions; do not feel constrained to use it directly if you do not wish to.

+++

Get the Data

Since the class uses dvc, you can get this dataset either from the command line (e.g. dvc import https://github.com/TLP-COI/text-data-course resources/data/shakespeare/shakespeare.txt) or, if you prefer Python, through the Python API:

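from dvc.api import read, get_url
import pandas as pd  # handy for building the table in Part 1

# Read the corpus directly from the course repository (needs network access)
txt = read('resources/data/shakespeare/shakespeare.txt',
           repo='https://github.com/TLP-COI/text-data-course')

# Sanity check: show the first 250 characters
print(txt[:250])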

Make sure this works before you continue! Either way, it is likely worth keeping a local copy of the data so that you do not need to re-download it every time.
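One way to avoid repeated downloads is to cache the text in a local file on first use. The following is a minimal sketch, not part of the course materials: the helper names `load_text` and `fetch_from_dvc` and the local path in the usage note are hypothetical, and only `dvc.api.read` with the path and repo shown above comes from this assignment.

```python
from pathlib import Path


def fetch_from_dvc():
    """Download the corpus with the same dvc.api.read call used above."""
    # Imported lazily so that dvc is only required when the cache is empty.
    from dvc.api import read
    return read('resources/data/shakespeare/shakespeare.txt',
                repo='https://github.com/TLP-COI/text-data-course')


def load_text(local_path, fetch=fetch_from_dvc):
    """Return the corpus, reading the local copy if one already exists.

    `local_path` and `fetch` are illustrative parameters (assumptions):
    `fetch` is any zero-argument callable that returns the text.
    """
    path = Path(local_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    # Create the parent folder if needed, then save the text for next time.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```

With this in place, something like `txt = load_text("data/shakespeare.txt")` would hit the network only on the first call; the path is an arbitrary choice.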

+++

Part 1

Split the text file into a table, such that

  • each row is a single line of dialogue
  • there are columns for
    1. the speaker
    2. the line number
    3. the line dialogue (the text)

Hint: you will need to use RegEx to do this rapidly. See the in-class "markdown" example!
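As a starting point for the RegEx hint, here is a hedged sketch that is not the in-class example. It assumes a layout that you should check against the real file: each speech starts with a speaker name followed by a colon on its own line, and speeches are separated by a blank line. It also assumes that a "line of dialogue" means one speaker's whole turn. The sample text is typed by hand in that assumed layout, and `split_speeches` is a hypothetical helper name.

```python
import re

# ASSUMPTION: a speech looks like "Speaker Name:\n<text>\n\n".
# Verify this against txt[:250] before relying on it.
SPEECH = re.compile(
    r"^(?P<speaker>[^\n:]+):\n(?P<text>.+?)(?=\n\n|\Z)",
    flags=re.MULTILINE | re.DOTALL,
)


def split_speeches(raw):
    """Return one dict per speech: speaker, running line number, and text."""
    rows = []
    for number, match in enumerate(SPEECH.finditer(raw), start=1):
        rows.append({
            "speaker": match["speaker"].strip(),
            "line_number": number,
            # Collapse internal newlines so each speech is a single string.
            "text": " ".join(match["text"].split()),
        })
    return rows


# Hand-typed sample in the assumed layout (not read from the course file).
sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
    "\n"
    "First Citizen:\n"
    "You are all resolved rather to die than to famish?\n"
)

rows = split_speeches(sample)
```

Passing `rows` to `pd.DataFrame(rows)` gives a table with the three requested columns. Each assumption written in the comments is exactly the kind of thing the question below asks you to state.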

Question(s):

  • What assumptions have you made about the text that allowed you to do this?

+++

Part 2

You have likely noticed that the lines are not all from the same play! Now, we will add some useful metadata to our table:

  • Determine a likely source title for each line
  • Add the title as a 'play' column in the data table.
  • Make sure to document your decisions, assumptions, external data sources, etc.

This is fairly open-ended, and you are not being judged completely on accuracy. Instead, think outside the box a bit as to how you might accomplish this, and attempt to justify whatever approximations or assumptions you felt were appropriate.
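One possible approach, sketched under stated assumptions: look up each speaker in a cast list, assign a play when the speaker belongs to exactly one, and let ambiguous or unknown speakers inherit the play of their nearest labelled neighbour. The `CAST` table below is a tiny hand-made illustration. A real one would have to come from an external source that you document. `guess_plays` is a hypothetical helper, and the neighbour rule assumes that lines from the same play sit next to each other in the file.

```python
# ASSUMPTION: a hand-built speaker -> plays lookup; extend it from a
# documented external source for real use.
CAST = {
    "HAMLET": {"Hamlet"},
    "POLONIUS": {"Hamlet"},
    "ROMEO": {"Romeo and Juliet"},
    "JULIET": {"Romeo and Juliet"},
}


def guess_plays(speakers, cast=CAST):
    """Guess a play title for each speaker, in file order."""
    guesses = []
    for speaker in speakers:
        plays = cast.get(speaker.upper(), set())
        # Only trust speakers that appear in exactly one play.
        guesses.append(next(iter(plays)) if len(plays) == 1 else None)

    # Forward fill: unknown speakers inherit the previous known play.
    last = None
    for i, guess in enumerate(guesses):
        if guess is None:
            guesses[i] = last
        else:
            last = guess

    # Backward fill: unknown speakers at the very start take the next play.
    upcoming = None
    for i in range(len(guesses) - 1, -1, -1):
        if guesses[i] is None:
            guesses[i] = upcoming
        else:
            upcoming = guesses[i]
    return guesses


plays = guess_plays(["Messenger", "Hamlet", "Polonius", "Servant", "Romeo", "Juliet"])
```

The weak points are worth documenting: generic speakers such as messengers or servants carry no signal of their own, and the fill rule will mislabel lines near the boundary between two plays.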

+++

Part 3

Pick one or more of the techniques described in this chapter:

  • keyword frequency
  • entity relationships
  • Markov language model
  • bag-of-words, TF-IDF
  • semantic embedding

Then make a case for that technique as a way to measure how important or interesting a speaker is. The measure does not have to capture both importance and interest, and you are welcome to come up with another term that represents "useful content" or tells a story (happiest speaker, worst speaker, etc.).

Whatever you choose, you must

  1. document how your technique was applied
  2. describe why you believe the technique is a valid approximation or exploration of how important, interesting, etc., a speaker is.
  3. list some possible weaknesses of your method, or ways you expect your assumptions could be violated within the text.
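To make the three requirements concrete, here is a hedged sketch of one of the listed options, TF-IDF, with each speaker's combined dialogue treated as one document. It uses the plain formulation tf = count / total tokens and idf = log(N / document frequency). Library implementations often add smoothing, so their numbers will differ. The tokenizer, the function names, and the toy input are all illustrative assumptions and not the course's reference solution.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_by_speaker(speeches):
    """Map speaker -> {word: tf-idf}, given speaker -> all of their dialogue."""
    counts = {speaker: Counter(tokenize(text)) for speaker, text in speeches.items()}
    n_docs = len(counts)

    # Document frequency: in how many speakers' vocabularies a word appears.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    scores = {}
    for speaker, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[speaker] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in word_counts.items()
        }
    return scores


# Toy input, invented for illustration only (not taken from the corpus).
toy = {"A": "the king is dead", "B": "the queen is here"}
scores = tfidf_by_speaker(toy)
```

Words shared by every speaker score zero, so a speaker's top-scoring words are the ones most distinctive to them. Whether "distinctive vocabulary" is a fair proxy for "interesting" is the argument you would need to make, and a speaker with very few lines can score highly by accident, which is one weakness to list.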

This is mostly about learning to document your decisions transparently and to iterate on a method for operationalizing useful analyses of text. Your explanations should be understandable; homework will be peer-reviewed by your fellow students.

