
UNIKUD: Hebrew nikud with transformers

If you are accessing this repo via GitHub, please see the project page on DagsHub for data files, pipelines and more.

Description

We provide a short description of UNIKUD here. For more information, please see the article: UNIKUD: Adding Vowels to Hebrew Text with Deep Learning.

UNIKUD is an open-source tool for adding vowel signs (nikud) to Hebrew text with deep learning, using absolutely no rule-based logic. UNIKUD uses Google's CANINE transformer model as its backbone, and treats text vocalization as a character token multilabel classification problem.

How Hebrew text with vowels is used to train UNIKUD. The text with vowels removed is used as the model's input, and the original text with vowels is used as the target (what we are trying to predict).
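The input/target construction described above can be sketched in a few lines. This is a minimal illustration, not UNIKUD's own preprocessing code: the function names are hypothetical, and the set of code points treated as nikud is an assumption chosen for the example.

```python
import re

# Assumed set of marks for this sketch: U+05B0-U+05BC (shva through dagesh),
# U+05C1/U+05C2 (shin and sin dots) and U+05C7 (qamats qatan).
NIKUD_PATTERN = re.compile("[\u05B0-\u05BC\u05C1\u05C2\u05C7]")


def strip_nikud(text):
    """Return the text with vowel marks removed (the model input)."""
    return NIKUD_PATTERN.sub("", text)


def make_training_pair(vocalized):
    """Pair the stripped text (input) with the original text (target)."""
    return strip_nikud(vocalized), vocalized


# "shalom" with vowels: shin + qamats + shin dot, lamed, vav + holam, final mem
vocalized = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
model_input, target = make_training_pair(vocalized)
print(model_input)  # the same word with the three marks removed
```

Note that stripping marks alone does not produce "full spelling"; that is handled separately, as described next.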

UNIKUD's training data requires preprocessing, because texts in Hebrew without vowel marks may be written using "full spelling" (כתיב מלא) where extra letters are occasionally added to words:

"Ktiv male" (full spelling): The red letter is only used without vowels.

The core UNIKUD model uses a multilabel classification head as shown below:

Hebrew vocalization as multilabel classification: Each Hebrew letter may be decorated with multiple nikud, which can be represented as a binary vector. UNIKUD uses this label encoding as its target. The figure is condensed for clarity but UNIKUD's binary targets actually contain 15 entries.
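To make the label encoding concrete, here is a small sketch that turns vocalized text into per-letter binary vectors and back. The label list below is a condensed, hypothetical ordering with 7 entries chosen for illustration; it is not UNIKUD's actual 15-entry layout, and the helper names are invented for this example.

```python
# Hypothetical, condensed label set (NOT UNIKUD's real 15-entry layout).
LABELS = [
    "\u05B0",  # shva
    "\u05B4",  # hiriq
    "\u05B7",  # patah
    "\u05B8",  # qamats
    "\u05B9",  # holam
    "\u05BC",  # dagesh
    "\u05C1",  # shin dot
]
LABEL_INDEX = {mark: i for i, mark in enumerate(LABELS)}


def encode(vocalized):
    """Split text into (base character, binary label vector) pairs."""
    pairs = []
    for ch in vocalized:
        if ch in LABEL_INDEX:
            if pairs:  # attach the mark to the preceding base character
                pairs[-1][1][LABEL_INDEX[ch]] = 1
        else:
            pairs.append((ch, [0] * len(LABELS)))
    return pairs


def decode(pairs):
    """Rebuild text; marks are emitted in LABELS order after each character."""
    out = []
    for ch, vec in pairs:
        out.append(ch)
        out.extend(mark for mark, bit in zip(LABELS, vec) if bit)
    return "".join(out)


# "shalom" with vowels: shin + qamats + shin dot, lamed, vav + holam, final mem
word = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
encoded = encode(word)
```

A letter can carry several marks at once (here the first letter has both qamats and the shin dot), which is why each position needs a multilabel vector rather than a single class.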

See the "Experiments" tab on the UNIKUD DagsHub repository page for training and evaluation metrics.

Requirements

Inference only

Install the UNIKUD framework PyPI package via pip:

pip install unikud

For training

First install:

Rust compiler:
- curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Reopen shell or run source $HOME/.cargo/env

Then install requirements for UNIKUD and activate its environment with either of:

Conda (recommended):
- conda env create -f environment.yml
- conda activate unikud
Pip:
- pip install -r requirements.txt

You may then download the required data files using DVC:

dvc remote add origin https://dagshub.com/morrisalp/unikud.dvc
dvc pull -r origin

Data

Sources of data:

Public-domain works from the Ben-Yehuda Project
Wikimedia sources:
- Hebrew Wikipedia
- Hebrew Wikisource (ויקיטקסט)
- Hebrew Wiktionary (ויקימילון)

Training

To reproduce the training pipeline, perform the following steps:

Preprocess data:

dvc repro preprocessing

Train ktiv male model:

dvc repro train-ktiv-male

Add ktiv male to data file:

dvc repro add-ktiv-male

Train UNIKUD model:

dvc repro train-unikud

Training steps will automatically log to MLflow (via the Hugging Face Trainer object) if the following environment variables are set: MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME, MLFLOW_TRACKING_PASSWORD.
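Before starting a long training run, it can be useful to confirm that all three variables are present. The helper below is a hypothetical convenience check and not part of UNIKUD; it only inspects the environment and does not contact MLflow.

```python
import os

REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_mlflow_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_mlflow_vars()
if missing:
    print("MLflow logging variables not set:", ", ".join(missing))
else:
    print("All MLflow logging variables are set.")
```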

Scripts will automatically use GPU when available. If you want to run on CPU, set the environment variable CUDA_VISIBLE_DEVICES to be empty (export CUDA_VISIBLE_DEVICES=).

Inference

If you installed UNIKUD via pip, you may add nikud to Hebrew text as follows:

from unikud.framework import Unikud

u = Unikud() # installs required files

print(u('שלום חברים'))

Note: Unikud() takes an optional keyword argument device= for CPU/GPU inference. Unikud.__call__ takes optional keyword arguments that adjust decoding hyperparameters.
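For multi-line input, one option is a small wrapper that applies the model line by line. The function below is a hypothetical helper, not part of the unikud package; it accepts any callable that maps a string to a string, so it makes no assumptions about the model beyond the call shown above.

```python
def vocalize_lines(vocalizer, lines):
    """Apply a callable vocalizer to each non-blank line; keep blank lines as-is."""
    return [vocalizer(line) if line.strip() else line for line in lines]
```

With the package installed, this could be used as vocalize_lines(Unikud(device='cpu'), lines). The value 'cpu' is an assumption about the format that device= accepts; check the package for the supported values.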

Contributing

Please file an issue on this project's DagsHub or GitHub repo pages to report bugs or suggest improvements.
