BOHR (Big Old Heuristic Repository)
-----------------------------------

BOHR is a repository of heuristics for categorizing software engineering artifacts, such as commits and bug reports. Categorizing artifacts is often required to create labeled datasets for training machine learning models. Since manual labeling is expensive, researchers come up with imprecise heuristics that can assign labels to artifacts. The goal of BOHR is to let researchers contribute a large number of heuristics, which are "smartly" combined by `snorkel <https://www.snorkel.org/>`_, the state-of-the-art `weak supervision <http://ai.stanford.edu/blog/weak-supervision/>`_ tool.

BOHR is a wrapper around snorkel which:

* Simplifies the process of adding new heuristics and evaluating their effectiveness;
* Labels the datasets registered with BOHR and automatically updates the labels once heuristics are added;
* Keeps track of the heuristics used for each version of a generated dataset, and in general makes sure the datasets are reproducible and easily accessible by using `DVC <https://dvc.org>`_.

.. contents:: **Contents**
  :backlinks: none

Installation
===========================================

Python >= 3.8 is required; using a virtual environment is strongly recommended.

#. Run ``git clone https://github.com/giganticode/bohr && cd bohr``
#. Install the BOHR framework library: ``bin/setup-bohr.sh``. This will install `bohr-framework <https://github.com/giganticode/bohr-framework>`_, its dependencies, and the tools needed to run heuristics.

Scenarios of using BOHR
===================================

.. raw:: html

    <img src="doc/reuse_levels.gif" width="600px">

1. Using labeled datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the ``bohr pull`` command. For example, to download the ``200k-commits`` dataset labeled by the ``bugginess`` task, run:

``bohr pull bugginess 200k-commits``

BOHR extensively uses `DVC (Data Version Control) <https://dvc.org/>`_ to ensure the integrity and reproducibility of the datasets and models.

2. Label your own dataset with an existing model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TBD

3. Adding heuristics for existing task
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Heuristics should be defined in ``.py`` files in the ``heuristics`` directory as methods marked with the ``@Heuristic`` decorator. Below you can see a heuristic which marks a commit as non-bug-fixing if it modifies more than 6 files:

.. code-block:: python

    from typing import Optional

    # Heuristic, Commit, CommitLabel and Labels are provided by BOHR (imports omitted here).
    @Heuristic(Commit)
    def bugless_if_many_files_changes(commit: Commit) -> Optional[Labels]:
        if len(commit.files) > 6:
            return CommitLabel.NonBugFix
        else:
            # Abstain: the heuristic has no opinion on this commit.
            return None

Important things to note:

#. The artifact type is passed to the heuristic decorator as a parameter; an object of the same type is exposed as a parameter to the function;
#. The method name can be arbitrary as long as it is unique and descriptive;
#. The method should return the label with which the current commit is to be labeled, or ``None`` if the labeling function should abstain on the datapoint. The label can be one of the objects defined in ``labels.py``. See ... for more details on *label hierarchy*.

Please refer to the `documentation <https://giganticode.github.io/bohr/Heuristics.html>`_ for more information on heuristics and special heuristic types.

To train a new label model and relabel the datasets with improved labels after adding new heuristics, run ``bohr repro``.

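
The idea of several heuristics voting on an artifact, each free to abstain, can be tried out without installing BOHR. The following is a minimal, self-contained sketch under loud assumptions: ``Commit``, ``CommitLabel``, the ``heuristic`` decorator and the ``bug_if_fix_keyword`` heuristic are simplified stand-ins written for this example, not the classes from bohr-framework, and the votes are combined by a plain majority vote, which is far simpler than the label model that snorkel trains.

.. code-block:: python

    from collections import Counter
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import Callable, List, Optional


    class CommitLabel(Enum):
        # Stand-in for the labels generated into labels.py.
        BugFix = "BugFix"
        NonBugFix = "NonBugFix"


    @dataclass
    class Commit:
        # Stand-in artifact with only the fields used below.
        message: str
        files: List[str] = field(default_factory=list)


    # Registry filled by the hypothetical decorator below.
    HEURISTICS: List[Callable[[Commit], Optional[CommitLabel]]] = []


    def heuristic(func):
        """Hypothetical decorator: registers a function as a heuristic."""
        HEURISTICS.append(func)
        return func


    @heuristic
    def bugless_if_many_files_changed(commit: Commit) -> Optional[CommitLabel]:
        return CommitLabel.NonBugFix if len(commit.files) > 6 else None


    @heuristic
    def bug_if_fix_keyword(commit: Commit) -> Optional[CommitLabel]:
        return CommitLabel.BugFix if "fix" in commit.message.lower() else None


    def majority_label(commit: Commit) -> Optional[CommitLabel]:
        """Apply every heuristic, drop abstentions, return the most common label.

        Returns None if all heuristics abstain. Ties go to the label that
        was voted for first, i.e. by the heuristic registered earlier.
        """
        votes = [h(commit) for h in HEURISTICS]
        votes = [label for label in votes if label is not None]
        if not votes:
            return None
        return Counter(votes).most_common(1)[0][0]

For instance, ``majority_label(Commit("Fix null pointer", ["a.py"]))`` yields ``CommitLabel.BugFix`` here, while a commit on which both heuristics abstain yields ``None``.
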
4. Adding a new task
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add a new task, run the ``bohr task add`` command. For example, for the task of classifying commits as tangled or not:

.. code-block::

    # -d  description
    # -t  artifact to be classified
    # -l  comma-separated label list for the classifier to choose from
    # -c  column with ground-truth labels
    # --force             rewrite if the task with the same name already exists
    # --use-all-datasets  use all the datasets found in BOHR that contain the artifact being classified
    # --repro             apply compatible heuristics right away, generate a label model and label the datasets
    bohr task add tangled-commits \
        -d "Task to classify commits into tangled and non-tangled" \
        -t commit \
        -l TangledCommit.NonTangled,TangledCommit.Tangled \
        -c tangled \
        --force \
        --use-all-datasets \
        --repro

Overview of BOHR abstractions
====================================

.. raw:: html

    <img src="doc/bohr_abstractions.png" width="600px">

Each task is a dictionary entry: the name of the task is the key, and the value is an object with the following fields:

#. **Top artifact** - the artifact to be categorized. In the case of the "bugginess" task, commits are classified, therefore the top artifact is ``bohr.artifacts.commit.Commit``;
#. **Label categories** - the categories the artifact can be classified as; for the "bugginess" task these are *CommitLabel.BugFix* and *CommitLabel.NonBugFix*. Values have to be taken from the ``labels.py`` file. See section `3. Labels:`_ for more information about labels in BOHR and how to extend the label hierarchy;
#. **Training sets** - datasets used to train a label model;
#. **Test sets** - datasets to calculate metrics on.

3. Labels:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Labels that are used to label artifacts in BOHR are pre-defined and can be reused across multiple tasks. For example, the ``Commit.Refactoring`` label can be used in heuristics for the task of detecting refactoring, but also in the task of detecting bug-fixing commits. Moreover, labels are organized in a hierarchy, e.g. ``Commit.FileRenaming`` can be a child of ``Commit.Refactoring``. Formally speaking, there is a binary relation IS-A defined on the set of labels, which defines their partial order, e.g. ``IS-A(Commit.FileRenaming, Commit.Refactoring)``.

Labels are defined in text files in the ``bohr/labels`` directory. Each row has the format ``<parent>: <list of children>``. Running ``bohr parse-labels`` generates the ``labels.py`` file in the root of the repository. Thus, to extend the hierarchy of labels, it is sufficient to change a text file; ``labels.py`` will be regenerated once the PR is received.

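
To make the IS-A relation concrete, here is a small illustrative sketch that reads rows in the ``<parent>: <list of children>`` format and answers IS-A queries by walking up the parent chain. It is not the code behind ``bohr parse-labels``. It assumes that children are separated by commas, that every label has at most one parent, and that the hierarchy has no cycles; ``Commit.Change`` and ``Commit.BugFix`` are label names made up for the example.

.. code-block:: python

    from typing import Dict, Optional


    def parse_label_hierarchy(text: str) -> Dict[str, Optional[str]]:
        """Map every label to its parent; root labels map to None."""
        parent_of: Dict[str, Optional[str]] = {}
        for row in text.splitlines():
            row = row.strip()
            if not row:
                continue
            parent, _, children = row.partition(":")
            parent = parent.strip()
            # Keep an already known parent if this label appeared as a child earlier.
            parent_of.setdefault(parent, None)
            # Assumption: children are comma-separated.
            for child in children.split(","):
                child = child.strip()
                if child:
                    parent_of[child] = parent
        return parent_of


    def is_a(label: str, ancestor: str, parent_of: Dict[str, Optional[str]]) -> bool:
        """Reflexive and transitive IS-A check (assumes an acyclic hierarchy)."""
        current: Optional[str] = label
        while current is not None:
            if current == ancestor:
                return True
            current = parent_of.get(current)
        return False


    # Hypothetical label file content.
    HIERARCHY = """
    Commit.Change: Commit.Refactoring, Commit.BugFix
    Commit.Refactoring: Commit.FileRenaming
    """

    PARENT_OF = parse_label_hierarchy(HIERARCHY)

With this input, ``is_a("Commit.FileRenaming", "Commit.Refactoring", PARENT_OF)`` holds directly, ``is_a("Commit.FileRenaming", "Commit.Change", PARENT_OF)`` holds by transitivity, and the reverse direction ``is_a("Commit.Refactoring", "Commit.FileRenaming", PARENT_OF)`` does not hold.
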
5 Artifact definitions
~~~~~~~~~~~~~~~~~~~~~~~~

``bohr.templates.artifacts`` also defines some pre-defined artifacts.

Contribute to the framework:
=============================

To contribute to the framework, please refer to the documentation in the `bohr-framework <https://github.com/giganticode/bohr-framework>`_ repo.

Pre-prints and publications
===========================================

.. code-block::

    @misc{babii2021mining,
        title={Mining Software Repositories with a Collaborative Heuristic Repository},
        author={Hlib Babii and Julian Aron Prenner and Laurin Stricker and Anjan Karmakar and Andrea Janes and Romain Robbes},
        year={2021},
        eprint={2103.01722},
        archivePrefix={arXiv},
        primaryClass={cs.SE}
    }