

UNIKUD: Hebrew nikud with transformers

If you are accessing this repo via GitHub, please see the project page on DAGSHub for data files, pipelines and more.

Description

We provide a short description of UNIKUD here. For more information, please see the article: (TBD)

[Figure: how data is used to train UNIKUD]

Hebrew text with vowels supplies UNIKUD's training data: the text with vowels removed is the model's input, and the original text with vowels is the target (what the model tries to predict).
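The input/target pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not UNIKUD's actual preprocessing code: the names `strip_nikud` and `make_training_pair`, and the exact set of Unicode code points treated as nikud, are assumptions.

```python
import re

# Assumed set of nikud code points: Hebrew points U+05B0..U+05BC,
# plus shin dot (U+05C1), sin dot (U+05C2) and qamats qatan (U+05C7).
# UNIKUD's own preprocessing may use a different set.
NIKUD_PATTERN = re.compile("[\u05b0-\u05bc\u05c1\u05c2\u05c7]")


def strip_nikud(text: str) -> str:
    """Remove vowel marks, leaving only the bare letters."""
    return NIKUD_PATTERN.sub("", text)


def make_training_pair(vocalized: str) -> tuple[str, str]:
    """Return (model input, target): the stripped text and the original."""
    return strip_nikud(vocalized), vocalized


# "shalom" with nikud: shin + shin dot + qamats, lamed, vav + holam, final mem
vocalized = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
model_input, target = make_training_pair(vocalized)
print(model_input)  # the same word with the vowel marks removed
print(target)       # the original vocalized word
```

Because the target is derived from the input text itself, any corpus of vocalized Hebrew can serve as training data without manual labeling.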

[Figure: illustration of full spellings in Hebrew]

"Ktiv male" (full spelling): the letter shown in red is written only when the text has no vowel marks.

[Figure: illustration of one-hot encoded target]

Hebrew vocalization as multilabel classification: each Hebrew letter may be decorated with multiple "nikud" marks, which can be represented as a one-hot (binary) vector. UNIKUD uses this one-hot encoding as its target. The figure is condensed for clarity, but UNIKUD's one-hot targets actually contain 15 entries.
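As a rough sketch of this encoding, the snippet below turns a vocalized word into its bare characters plus one binary vector per character. The six-mark label set `MARKS` and the function `encode_targets` are hypothetical and chosen only for illustration; UNIKUD's real targets have 15 entries, which are not reproduced here.

```python
# Hypothetical, shortened label set for illustration only:
# shva, hiriq, qamats, holam, dagesh, shin dot.
MARKS = ["\u05b0", "\u05b4", "\u05b8", "\u05b9", "\u05bc", "\u05c1"]
MARK_INDEX = {mark: i for i, mark in enumerate(MARKS)}


def encode_targets(vocalized):
    """Split vocalized text into bare characters and per-character binary label vectors."""
    chars, labels = [], []
    for ch in vocalized:
        if ch in MARK_INDEX:
            if labels:  # attach the mark to the preceding character
                labels[-1][MARK_INDEX[ch]] = 1
        else:
            # Anything outside MARKS is treated as a base character.
            chars.append(ch)
            labels.append([0] * len(MARKS))
    return "".join(chars), labels


# "shalom" with nikud: shin + shin dot + qamats, lamed, vav + holam, final mem
chars, labels = encode_targets("\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd")
for ch, row in zip(chars, labels):
    print(ch, row)
```

The first letter carries two marks (shin dot and qamats), so its vector has two entries set. That is why the task is framed as multilabel rather than single-label classification.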

Requirements

First install:

  • Conda
  • Rust compiler:
    • curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    • Reopen shell or run source $HOME/.cargo/env

Then create and activate the UNIKUD environment with:

  • conda env create -f environment.yml
  • conda activate unikud

You may then download the required data files using DVC:

  • dvc remote add origin https://dagshub.com/morrisalp/unikud.dvc
  • dvc pull -r origin

Data

Sources of data:

To preprocess data, run:

  • dvc repro preprocessing

Training

To reproduce the training pipeline, perform the following steps:

  1. Preprocess data:
     • dvc repro preprocessing
  2. Train ktiv male model:
     • dvc repro train-ktiv-male
  3. Add ktiv male to data file:
     • dvc repro add-ktiv-male
  4. Train UNIKUD model:
     • dvc repro train-unikud
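If you prefer to drive these stages from a script, the steps above can be sketched as follows. The stage names come from the list above; the helper names `stage_commands` and `run_pipeline` are hypothetical, and the sketch assumes `dvc` is on your PATH and that you run it from the repository root with the unikud environment active.

```python
import subprocess

# Stage names are taken from the steps above, in the order they are listed.
STAGES = ["preprocessing", "train-ktiv-male", "add-ktiv-male", "train-unikud"]


def stage_commands(stages=STAGES):
    """Build one `dvc repro <stage>` command per stage."""
    return [["dvc", "repro", stage] for stage in stages]


def run_pipeline(stages=STAGES):
    """Run the stages in order, stopping at the first failure."""
    for cmd in stage_commands(stages):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if a stage exits with an error.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` runs all four stages in sequence; passing a shorter list, for example `run_pipeline(["preprocessing"])`, runs only the named stages.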

Training steps automatically log to MLflow (via the Hugging Face Trainer object) if the following environment variables are set: MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME, MLFLOW_TRACKING_PASSWORD.
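Before starting a long training run, it can help to confirm that all three variables are actually set. The check below is a small sketch; the helper name `missing_mlflow_vars` is hypothetical and not part of this repository.

```python
import os

# The three variables named above.
MLFLOW_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_mlflow_vars(env=None):
    """Return the names of the MLflow variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in MLFLOW_VARS if not env.get(name)]


missing = missing_mlflow_vars()
if missing:
    print("MLflow logging variables not set:", ", ".join(missing))
else:
    print("All MLflow logging variables are set.")
```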

Scripts will automatically use GPU when available. If you want to run on CPU, set the environment variable CUDA_VISIBLE_DEVICES to be empty (export CUDA_VISIBLE_DEVICES=).
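When launching a stage from Python rather than from the shell, the same effect can be had by passing a modified environment to the child process. This is a sketch only; the helper name `cpu_only_env` is hypothetical.

```python
import os


def cpu_only_env(base=None):
    """Copy an environment mapping and set CUDA_VISIBLE_DEVICES to empty."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


# Example usage (not executed here):
#   import subprocess
#   subprocess.run(["dvc", "repro", "train-unikud"], env=cpu_only_env(), check=True)
```

Working on a copy leaves the parent process's environment untouched, so later runs in the same session can still use the GPU.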

Inference

To add nikud to unvocalized Hebrew text:

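# CanineTokenizer ships with Hugging Face transformers.
from transformers import CanineTokenizer
# UnikudModel and NikudTask are defined in this repo's src/ code; the module path here is an assumption.
from src.models import UnikudModel, NikudTask

tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
model = UnikudModel.from_pretrained("models/unikud/latest")
task = NikudTask(tokenizer, model)

text = 'זאת דוגמא של טקסט לא מנוקד בעברית'
print(task.add_nikud(text))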

Other Links


About

UNIKUD is an open-source tool for adding vowel signs (nikud) to Hebrew text with deep learning, using absolutely no rule-based logic.
