UNIKUD: Hebrew nikud with transformers

If you are accessing this repo via GitHub, please see the project page on DagsHub for data files, pipelines, and more.

Description

We provide a short description of UNIKUD here. For more information, please see the article: (TBD)

How Hebrew text with vowels is used to train UNIKUD. The text with vowels removed is used as the model's input, and the original text with vowels is used as the target (what we are trying to predict).
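The input/target pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's preprocessing code: the function names `strip_nikud` and `make_training_pair` are hypothetical, and the sketch ignores ktiv male, which changes the letters themselves rather than only the marks.

```python
import unicodedata
from typing import Tuple


def strip_nikud(text: str) -> str:
    """Drop Hebrew combining marks (vowel points and cantillation).

    Letters, spaces, and punctuation are kept unchanged.
    """
    return "".join(
        ch
        for ch in text
        if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
    )


def make_training_pair(vocalized: str) -> Tuple[str, str]:
    """Return (model input, target): unvocalized text and the original text."""
    return strip_nikud(vocalized), vocalized
```

Calling `make_training_pair` on a vocalized sentence returns the same sentence without marks as the input and the untouched sentence as the target.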

"Ktiv male" (full spelling): The red letter is only used without vowels.

Hebrew vocalization as multilabel classification: Each Hebrew letter may be decorated with multiple "nikud" marks, which can be represented together as a binary (multi-hot) vector. UNIKUD uses this binary encoding as its target. The figure is condensed for clarity; UNIKUD's actual target vectors contain 15 entries.
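To make the binary target encoding concrete, here is a small sketch that turns vocalized text into one binary vector per letter. The five-mark label list and its order are assumptions chosen for illustration, and `encode_targets` is a hypothetical helper; UNIKUD's real 15-entry layout is not reproduced here.

```python
# Hypothetical label order, for illustration only.
# UNIKUD's actual targets have 15 entries and may be ordered differently.
LABELS = [
    "\u05B0",  # sheva
    "\u05B4",  # hiriq
    "\u05B7",  # patah
    "\u05B8",  # qamats
    "\u05BC",  # dagesh
]


def encode_targets(vocalized):
    """Split vocalized text into base characters and per-character binary vectors."""
    letters = []
    targets = []
    for ch in vocalized:
        if ch in LABELS:
            if targets:
                # A mark belongs to the base character that precedes it.
                targets[-1][LABELS.index(ch)] = 1
        else:
            letters.append(ch)
            targets.append([0] * len(LABELS))
    return "".join(letters), targets
```

A letter carrying both a dagesh and a patah gets two ones in its vector, which is why the task is multilabel rather than single-label classification.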

Requirements

First install:

- Conda
- Rust compiler:
  - curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  - Reopen shell or run source $HOME/.cargo/env

Then create and activate the UNIKUD environment with:

conda env create -f environment.yml
conda activate unikud

You may then download the required data files using DVC:

dvc remote add origin https://dagshub.com/morrisalp/unikud.dvc
dvc pull -r origin

Data

Sources of data:

- Public-domain works from the Ben-Yehuda Project
- Wikimedia sources:
  - Hebrew Wikipedia
  - Hebrew Wikisource (ויקיטקסט)
  - Hebrew Wiktionary (ויקימילון)

To preprocess data, run:

dvc repro preprocessing

Training

To reproduce the training pipeline, perform the following steps:

Preprocess data:

dvc repro preprocessing

Train ktiv male model:

dvc repro train-ktiv-male

Add ktiv male to data file:

dvc repro add-ktiv-male

Train UNIKUD model:

dvc repro train-unikud
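The four steps above can also be driven from a script. The following is a sketch under the assumption that `dvc` is on the PATH and that the stage names match those listed above; `build_commands` and `run_pipeline` are hypothetical helpers, not part of this repository.

```python
import subprocess

# Stage names copied from the steps above, in the order they should run.
STAGES = ["preprocessing", "train-ktiv-male", "add-ktiv-male", "train-unikud"]


def build_commands(stages=STAGES):
    """Return one `dvc repro <stage>` argument list per stage."""
    return [["dvc", "repro", stage] for stage in stages]


def run_pipeline(stages=STAGES, runner=subprocess.run):
    """Run each stage in order, stopping at the first failure."""
    for cmd in build_commands(stages):
        # check=True raises CalledProcessError if the command exits non-zero.
        runner(cmd, check=True)
```

Calling `run_pipeline()` from the repository root runs the stages sequentially. The `runner` parameter exists only so the command list can be inspected or tested without invoking DVC.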

Training steps will automatically log to MLflow (via the Hugging Face Trainer object) if the following environment variables are set: MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME, MLFLOW_TRACKING_PASSWORD.
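Before starting a long training run, it can help to confirm that all three variables are present. The check below is a minimal sketch; `missing_mlflow_vars` is a hypothetical helper and is not part of this repository.

```python
import os

REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)


def missing_mlflow_vars(env=None):
    """Return the names of required MLflow variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list from `missing_mlflow_vars()` means all three variables are set in the current environment.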

Scripts will automatically use GPU when available. If you want to run on CPU, set the environment variable CUDA_VISIBLE_DEVICES to be empty (export CUDA_VISIBLE_DEVICES=).
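When launching a stage from Python rather than a shell, the same effect can be had by passing a modified environment to the child process. This is a sketch; `cpu_only_env` is a hypothetical helper name.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all GPUs hidden from CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value makes no CUDA devices visible to the child process.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env
```

For example, `subprocess.run(["dvc", "repro", "train-unikud"], env=cpu_only_env(), check=True)` would run the final training stage on CPU without changing the parent process's environment.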

Inference

To add nikud to unvocalized Hebrew text:

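from transformers import CanineTokenizer
# UnikudModel and NikudTask are assumed to be importable from this
# repository's src package; import them before running this snippet.

# Character-level tokenizer from the pretrained CANINE-c checkpoint
tokenizer = CanineTokenizer.from_pretrained("google/canine-c")
# Trained UNIKUD weights (produced by training or pulled with DVC)
model = UnikudModel.from_pretrained("models/unikud/latest")
task = NikudTask(tokenizer, model)

text = "זאת דוגמא של טקסט לא מנוקד בעברית"
print(task.add_nikud(text))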

morrisalp / unikud connected to https://github.com/morrisalp/unikud.git
