If you are accessing this repo via GitHub, please see the project page on DagsHub for data files, pipelines and more.
We provide a short description of UNIKUD here. For more information, please see the article: (TBD)
How Hebrew text with vowels is used to train UNIKUD. The text with vowels removed is used as the model's input, and the original text with vowels is used as the target (what we are trying to predict).
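The input/target construction described above can be sketched in a few lines of Python. This is only an illustration and not UNIKUD's actual preprocessing code: the names `NIKUD`, `strip_nikud` and `make_training_pair` are hypothetical, the set of Unicode code points treated as nikud is an assumption, and the sketch ignores the "ktiv male" handling.

```python
# Assumed set of Hebrew vowel points: U+05B0..U+05BC, the shin/sin dots,
# and qamats qatan. UNIKUD's own list of marks may differ.
NIKUD = {chr(c) for c in range(0x05B0, 0x05BD)} | {"\u05C1", "\u05C2", "\u05C7"}


def strip_nikud(text: str) -> str:
    """Return the text with all assumed nikud marks removed."""
    return "".join(ch for ch in text if ch not in NIKUD)


def make_training_pair(vocalized: str) -> tuple:
    """Build (model input, target): unvocalized text and the original text."""
    return strip_nikud(vocalized), vocalized


if __name__ == "__main__":
    # Vocalized form of the word "shalom", written with escapes for clarity.
    source, target = make_training_pair("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD")
    print(source, "->", target)
```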
"Ktiv male" (full spelling): The letter shown in red is written only in text without vowels.
Hebrew vocalization as multilabel classification: Each Hebrew letter may be decorated with multiple "nikud", which can be represented as a one-hot (binary) vector. UNIKUD uses this one-hot encoding as its target. The figure is condensed for clarity but UNIKUD's one-hot targets actually contain 15 entries.
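To make the multilabel framing concrete, here is a minimal sketch that turns vocalized text into one binary vector per base character, with an entry set for each mark attached to that character. The label set and its ordering (`NIKUD_MARKS`) and the function name `encode_targets` are assumptions made for illustration; they are not claimed to match the 15 entries that UNIKUD actually uses.

```python
from typing import List, Tuple

# Assumed label set: one position per mark. The real UNIKUD targets may use
# a different set and order.
NIKUD_MARKS: List[str] = [chr(c) for c in range(0x05B0, 0x05BD)] + ["\u05C1", "\u05C2", "\u05C7"]
MARK_INDEX = {mark: i for i, mark in enumerate(NIKUD_MARKS)}


def encode_targets(vocalized: str) -> Tuple[str, List[List[int]]]:
    """Split vocalized text into base characters and per-character binary labels."""
    base_chars: List[str] = []
    labels: List[List[int]] = []
    for ch in vocalized:
        if ch in MARK_INDEX:
            if labels:  # skip a stray mark that has no preceding character
                labels[-1][MARK_INDEX[ch]] = 1
        else:
            base_chars.append(ch)
            labels.append([0] * len(NIKUD_MARKS))
    return "".join(base_chars), labels


if __name__ == "__main__":
    plain, vectors = encode_targets("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD")
    for ch, vec in zip(plain, vectors):
        print(ch, vec)
```

A character with several marks (for example a shin with both a shin dot and a vowel) gets several entries set in the same vector, which is why the task is multilabel rather than single-class classification.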
Install the UNIKUD framework PyPI package via pip:
pip install unikud
You may then add nikud to Hebrew text as follows:
from unikud.framework import Unikud
u = Unikud() # installs required files
print(u('שלום חברים'))
Note: Unikud() takes the optional keyword argument device= for CPU/GPU inference. Unikud.__call__ takes optional keyword arguments to adjust decoding hyperparameters.
First install Rust (via rustup):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
Then create and activate the UNIKUD environment with:
conda env create -f environment.yml
conda activate unikud
You may then download the required data files using DVC:
dvc remote add origin https://dagshub.com/morrisalp/unikud.dvc
dvc pull -r origin
Sources of data:
To preprocess data, run:
To reproduce the training pipeline, perform the following steps:
dvc repro preprocessing
dvc repro train-ktiv-male
dvc repro add-ktiv-male
dvc repro train-unikud
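If you prefer to drive the four stages from a script, the following sketch runs them in the order listed above and stops at the first failure. The helper names `repro_commands` and `run_pipeline` are hypothetical and not part of this repository; the sketch assumes the `dvc` executable is on your PATH and defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Sequence

# Stage names copied from the steps above, in execution order.
STAGES = ("preprocessing", "train-ktiv-male", "add-ktiv-male", "train-unikud")


def repro_commands(stages: Sequence[str] = STAGES) -> List[List[str]]:
    """Build one `dvc repro <stage>` command per stage."""
    return [["dvc", "repro", stage] for stage in stages]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in repro_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a stage fails,
            # so later stages are not run on stale outputs.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```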
Training steps will automatically log to MLflow (via the Huggingface Trainer object) if the following environment variables are set: MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME, MLFLOW_TRACKING_PASSWORD.
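Before starting a long training run, it can be useful to check that these variables are actually set. The sketch below is one way to do that; the helper name `missing_mlflow_vars` is hypothetical and not part of this repository, and treating an empty value as "not set" is an assumption.

```python
import os
from typing import List, Mapping

MLFLOW_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_mlflow_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of MLflow variables that are unset or empty."""
    return [name for name in MLFLOW_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_mlflow_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All MLflow tracking variables are set.")
```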
Scripts will automatically use GPU when available. If you want to run on CPU, set the environment variable CUDA_VISIBLE_DEVICES to be empty (export CUDA_VISIBLE_DEVICES=).
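The same setting can be applied from inside a Python script. This is a minimal sketch with a hypothetical helper name (`force_cpu`); it needs to run before any library initializes CUDA, so in practice it should be called before importing the deep learning framework.

```python
import os


def force_cpu() -> None:
    """Hide all CUDA devices from libraries that read CUDA_VISIBLE_DEVICES."""
    # An empty value means no GPU is visible to the current process.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


if __name__ == "__main__":
    force_cpu()
    print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```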
To add nikud to unvocalized Hebrew text:
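from transformers import CanineTokenizer
# UnikudModel and NikudTask are defined in this repository's code; import them before running.

tokenizer = CanineTokenizer.from_pretrained("google/canine-c")  # character-level tokenizer
model = UnikudModel.from_pretrained('models/unikud/latest')  # trained UNIKUD weights
task = NikudTask(tokenizer, model)
text = 'זאת דוגמא של טקסט לא מנוקד בעברית'
print(task.add_nikud(text))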
Please file an issue on this project's DagsHub or GitHub repo pages to report bugs or suggest improvements.