- BOHR (Big Old Heuristic Repository)
- ----------------------------------
- BOHR is a **repository of heuristics** for categorizing software engineering artifacts, such as commits and bug reports.
- Categorization of artifacts is often required to create ground-truth datasets to train machine learning models on. For example, to train a model that classifies commits as "feature", "bugfix", or "refactoring", one needs to have a dataset of commits with these labels assigned.
- Since creating a large dataset manually is expensive, an alternative is to write "heuristics": short programs that automatically assign noisy labels to artifacts. Implementing **a large number of such heuristics** and **combining their outputs** "smartly" is the idea behind `snorkel <https://www.snorkel.org/>`_, the state-of-the-art `weak supervision <http://ai.stanford.edu/blog/weak-supervision/>`_ tool.
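As a rough illustration of what combining noisy labels means, the sketch below merges the outputs of a few toy heuristics with a plain majority vote. This is a deliberately simplified stand-in and not snorkel's label model, which weighs heuristics by their estimated accuracy. All names here (``mentions_fix``, ``majority_vote``, the string labels) are hypothetical and are not part of BOHR or snorkel.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical stand-ins: an artifact is a commit message, a label is a string.
Heuristic = Callable[[str], Optional[str]]


def mentions_fix(message: str) -> Optional[str]:
    # Vote "bugfix" if the message contains "fix", otherwise abstain.
    return "bugfix" if "fix" in message.lower() else None


def mentions_bug(message: str) -> Optional[str]:
    # Vote "bugfix" if the message contains "bug", otherwise abstain.
    return "bugfix" if "bug" in message.lower() else None


def mentions_refactor(message: str) -> Optional[str]:
    # Vote "refactoring" if the message contains "refactor", otherwise abstain.
    return "refactoring" if "refactor" in message.lower() else None


HEURISTICS: List[Heuristic] = [mentions_fix, mentions_bug, mentions_refactor]


def majority_vote(message: str, heuristics: List[Heuristic]) -> Optional[str]:
    """Return the most frequent non-abstaining vote, or None if all abstain."""
    votes = [h(message) for h in heuristics]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(votes).most_common(1)[0][0]


print(majority_vote("Fix bug in parser after refactor", HEURISTICS))
print(majority_vote("Update README", HEURISTICS))
```

Each heuristic alone is noisy, but a label supported by several independent heuristics is more likely to be right; a learned label model refines this idea instead of counting every vote equally.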
- BOHR is a wrapper around snorkel which:
- * **Simplifies** the process of **adding new heuristics** and **evaluating their effectiveness**;
- * **Labels the datasets** registered with BOHR and **automatically updates the labels** once heuristics are added;
- * Keeps track of the heuristics used for each version of the generated datasets and models and, more generally, makes sure they are **reproducible** and **easily accessible** by using `DVC <https://dvc.org>`_.
- .. contents:: **Contents**
-     :backlinks: none
-
- What do heuristics look like?
- ===================================
-
- .. code-block:: python
-
-     # other imports
-     ...
-     from bohr.core import Heuristic
-     from bohr.collection.artifacts import Commit
-     from bohr.labels import CommitLabel
-
-     @Heuristic(Commit)
-     def bugless_if_many_files_changes(commit: Commit) -> Optional[Labels]:
-         # A commit that touches many files is labeled as a non-bugfix.
-         if len(commit.files) > 6:
-             return CommitLabel.NonBugFix
-         else:
-             # Abstain: the heuristic assigns no label to this commit.
-             return None
-
- Important things to note:
- #. A heuristic is marked with the ``Heuristic`` decorator, and the artifact type to which it is applied is passed to it as a parameter;
- #. The artifact instance is exposed to the heuristic as a function parameter; the properties of the artifact object can be used to implement the logic;
- #. For the label to be assigned to the artifact, it has to be returned from the function; the heuristic must assign one of the labels defined in the BOHR label hierarchy or ``None`` if it abstains on the data point.
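The three points above can be tried out without installing BOHR. The following is a minimal sketch under stated assumptions: the ``heuristic`` decorator, the ``Commit`` dataclass, and the ``CommitLabel`` enum defined here are hypothetical stand-ins written for this example, not BOHR's actual classes, and the real ``Heuristic`` decorator may do considerably more.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class CommitLabel(Enum):
    # Hypothetical two-label hierarchy used only in this sketch.
    BugFix = "BugFix"
    NonBugFix = "NonBugFix"


@dataclass
class Commit:
    # Hypothetical artifact type exposing the properties a heuristic reads.
    message: str
    files: List[str] = field(default_factory=list)


def heuristic(artifact_type: type) -> Callable:
    """Stand-in decorator: records which artifact type the function applies to."""
    def decorator(func: Callable) -> Callable:
        func.artifact_type = artifact_type
        return func
    return decorator


@heuristic(Commit)
def bugless_if_many_files_changes(commit: Commit) -> Optional[CommitLabel]:
    # Return a label to vote, or None to abstain on this commit.
    if len(commit.files) > 6:
        return CommitLabel.NonBugFix
    return None


small = Commit("Fix typo", files=["README.rst"])
large = Commit("Rename package", files=[f"src/module_{i}.py" for i in range(10)])

print(bugless_if_many_files_changes(small))
print(bugless_if_many_files_changes(large))
```

Storing the artifact type on the function is one simple way a framework could later select only the heuristics that apply to a given dataset; it is shown here purely as an illustration of why the type is passed to the decorator.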
- BOHR usage scenarios
- ===================================
- 1. Only using existing heuristics
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- 2. Implementing new heuristics for existing tasks
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- 3. Implementing heuristics for new tasks and artifacts
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Overview of BOHR abstractions
- ================================
- .. raw:: html
-
-     <img src="doc/bohr_abstractions.png" width="600px">
- Quick Start
- ============
- Installation
- ~~~~~~~~~~~~~
- Python >= 3.8 is required; using a virtual environment is strongly recommended.
- #. Run ``git clone https://github.com/giganticode/bohr && cd bohr``
- #. Install the BOHR framework library: ``bin/setup-bohr.sh``. This will install `bohr-framework <https://github.com/giganticode/bohr-framework>`_, its dependencies, and the tools needed to run heuristics.
- Important commands
- ~~~~~~~~~~~~~~~~~~~
- +-----------------------------------+-------------------------------------------------------------------+
- | Goal                              | Command                                                           |
- +===================================+===================================================================+
- | Pull existing labeled dataset     | | ``$ bohr pull bugginess 200k-commits``                          |
- +-----------------------------------+-------------------------------------------------------------------+
- | Label your dataset                | | ``$ bohr dataset add ~/new_commit_dataset.csv -t commit``       |
- |                                   | | ``$ bohr task add-dataset bugginess new_commit_dataset --repro``|
- +-----------------------------------+-------------------------------------------------------------------+
- | Add heuristic(s), re-train        | | ``$ vi heuristics/commit_files.py``                             |
- | label model, and update labels    | | ``$ bohr repro bugginess``                                      |
- +-----------------------------------+-------------------------------------------------------------------+
- | Add a new task                    | | ``$ bohr task add tangled-commits \``                           |
- |                                   | | ``... -l TangledCommit.NonTangled,TangledCommit.Tangled \``     |
- |                                   | | ``... --repro``                                                 |
- +-----------------------------------+-------------------------------------------------------------------+
- Contribute to the framework
- =============================
- To contribute to the framework, please refer to the documentation in the `bohr-framework <https://github.com/giganticode/bohr-framework>`_ repo.
- Pre-prints and publications
- =============================
- .. code-block::
-
-     @misc{babii2021mining,
-         title={Mining Software Repositories with a Collaborative Heuristic Repository},
-         author={Hlib Babii and Julian Aron Prenner and Laurin Stricker and Anjan Karmakar and Andrea Janes and Romain Robbes},
-         year={2021},
-         eprint={2103.01722},
-         archivePrefix={arXiv},
-         primaryClass={cs.SE}
-     }