BOHR (Big Old Heuristic Repository)
-----------------------------------

BOHR is a **repository of heuristics** for the categorization of software engineering artifacts, e.g. commits, bug reports, etc.

Categorization of artifacts is often required to create ground-truth datasets for training machine learning models. For example, to train a model that classifies commits as "feature", "bugfix", or "refactoring", one needs a dataset of commits with these labels assigned.

Since creating a large dataset manually is expensive, the alternative is to come up with "heuristics": short programs that assign noisy labels to artifacts automatically. Implementing **a large number of such heuristics** and **combining their outputs** "smartly" is the idea behind `snorkel <https://www.snorkel.org/>`_, the state-of-the-art `weak supervision <http://ai.stanford.edu/blog/weak-supervision/>`_ tool.

BOHR is a wrapper around snorkel which:

* **Simplifies** the process of **adding new heuristics** and **evaluating their effectiveness**;
* **Labels the datasets** registered with BOHR and **automatically updates the labels** once heuristics are added;
* Keeps track of the heuristics used for each version of the generated datasets and models and, in general, makes sure they are **reproducible** and **easily accessible** by using `DVC <https://dvc.org>`_.

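
To make "combining their outputs" concrete, here is a minimal sketch that merges noisy votes with a plain majority vote. This is a deliberately simplified stand-in: snorkel's label model estimates how accurate each heuristic is instead of counting votes equally, and BOHR delegates that step to snorkel. The function name, the string labels, and the example commits are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_vote(votes: List[Optional[str]]) -> Optional[str]:
    """Combine noisy labels by simple majority.

    ``None`` means a heuristic abstained. The result is ``None`` when
    every heuristic abstained or when the top two labels are tied.
    """
    counts = Counter(vote for vote in votes if vote is not None)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent labels
    return ranked[0][0]


# Hypothetical votes of three heuristics on four commits.
votes_per_commit: Dict[str, List[Optional[str]]] = {
    "fix null pointer in parser": ["bugfix", "bugfix", None],
    "add export button": ["feature", None, None],
    "cleanup": [None, None, None],
    "rework login": ["bugfix", "refactoring", None],
}

combined = {msg: majority_vote(votes) for msg, votes in votes_per_commit.items()}
```

With more heuristics voting on each artifact, fewer artifacts end up unlabeled or tied, which is one reason a large shared collection of heuristics is useful.
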
.. contents:: **Contents**
  :backlinks: none

What do heuristics look like?
===================================

.. code-block:: python

    # other imports (e.g. the ``Labels`` type used in the return annotation)
    ...
    from typing import Optional

    from bohr.core import Heuristic
    from bohr.collection.artifacts import Commit
    from bohr.labels import CommitLabel

    @Heuristic(Commit)
    def bugless_if_many_files_changes(commit: Commit) -> Optional[Labels]:
        # A commit that touches many files is assumed not to be a bug fix.
        if len(commit.files) > 6:
            return CommitLabel.NonBugFix
        else:
            # Abstain: this heuristic has no opinion about small commits.
            return None

Important things to note:

#. A heuristic is marked with the ``Heuristic`` decorator, and the artifact type to which it applies is passed to the decorator as a parameter;
#. The artifact instance is exposed to the heuristic as a function parameter; the properties of the artifact object can be used to implement the logic;
#. For a label to be assigned to the artifact, it has to be returned from the function; the heuristic must return one of the labels defined in the BOHR label hierarchy, or ``None`` if it abstains on the data point.

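
The three points above can be sketched without installing BOHR. The block below is a self-contained illustration of the pattern, not the bohr-framework implementation: the ``heuristic`` decorator, the ``REGISTRY`` dictionary, ``apply_heuristics``, the local ``Commit`` and ``CommitLabel`` stand-ins, and the second heuristic are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class CommitLabel(Enum):
    """Local stand-in for a small label hierarchy."""
    BugFix = "BugFix"
    NonBugFix = "NonBugFix"


@dataclass
class Commit:
    """Local stand-in for an artifact type."""
    message: str
    files: List[str] = field(default_factory=list)


# Hypothetical registry that maps an artifact type to its heuristics.
REGISTRY: Dict[type, List[Callable]] = {}


def heuristic(artifact_type: type):
    """Register the decorated function as a heuristic for ``artifact_type``."""
    def decorator(func: Callable) -> Callable:
        REGISTRY.setdefault(artifact_type, []).append(func)
        return func
    return decorator


@heuristic(Commit)
def bugless_if_many_files_changes(commit: Commit) -> Optional[CommitLabel]:
    if len(commit.files) > 6:
        return CommitLabel.NonBugFix
    return None  # abstain


@heuristic(Commit)
def bugfix_if_message_mentions_fix(commit: Commit) -> Optional[CommitLabel]:
    # Hypothetical second heuristic, added only to show several votes.
    if "fix" in commit.message.lower():
        return CommitLabel.BugFix
    return None  # abstain


def apply_heuristics(artifact) -> List[Optional[CommitLabel]]:
    """Run every heuristic registered for the artifact's type, in order."""
    return [h(artifact) for h in REGISTRY.get(type(artifact), [])]


small_fix = Commit("Fix crash on empty input", ["parser.py"])
big_change = Commit("Reformat sources", [f"module_{i}.py" for i in range(7)])
small_fix_votes = apply_heuristics(small_fix)
big_change_votes = apply_heuristics(big_change)
```

Each heuristic either votes for a label or abstains, so an artifact ends up with one possibly empty vote per heuristic; these votes are the noisy labels that the label model later combines.
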
BOHR usage scenarios
===================================

1. Only using existing heuristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

2. Implementing new heuristics for existing tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3. Implementing heuristics for new tasks and artifacts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Overview of BOHR abstractions
================================

.. raw:: html

    <img src="doc/bohr_abstractions.png" width="600px">

Quick Start
============

Installation
~~~~~~~~~~~~~

Python >= 3.8 is required; using a virtual environment is strongly recommended.

#. Run ``git clone https://github.com/giganticode/bohr && cd bohr``
#. Install the BOHR framework library: ``bin/setup-bohr.sh``. This will install `bohr-framework <https://github.com/giganticode/bohr-framework>`_, its dependencies, and the tools needed to run heuristics.

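
Before running the setup script, the two prerequisites above can be checked from Python. This is a small optional sketch and not part of BOHR; the helper names ``meets_minimum`` and ``in_virtualenv`` are made up for this example, and the virtual environment check relies on ``sys.prefix`` differing from ``sys.base_prefix``, which holds for environments created with the standard ``venv`` module.

```python
import sys
from typing import Tuple

MIN_PYTHON: Tuple[int, int] = (3, 8)


def meets_minimum(version: Tuple[int, ...], minimum: Tuple[int, int] = MIN_PYTHON) -> bool:
    """Return True if ``version`` (major, minor, ...) is at least ``minimum``."""
    return tuple(version[:2]) >= minimum


def in_virtualenv() -> bool:
    """Return True if the interpreter appears to run inside a virtual environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if not meets_minimum(tuple(sys.version_info[:3])):
        raise SystemExit("Python >= 3.8 is required")
    if not in_virtualenv():
        print("Warning: no virtual environment detected")
```
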
Important commands
~~~~~~~~~~~~~~~~~~~

+-----------------------------------+-------------------------------------------------------------------+
| Goal                              | Command                                                           |
+===================================+===================================================================+
| Pull existing labeled dataset     | | ``$ bohr pull bugginess 200k-commits``                          |
+-----------------------------------+-------------------------------------------------------------------+
| Label your dataset                | | ``$ bohr dataset add ~/new_commit_dataset.csv -t commit``       |
|                                   | | ``$ bohr task add-dataset bugginess new_commit_dataset --repro``|
+-----------------------------------+-------------------------------------------------------------------+
| Add heuristic(s), re-train        | | ``$ vi heuristics/commit_files.py``                             |
| label model, and update labels    | | ``$ bohr repro bugginess``                                      |
+-----------------------------------+-------------------------------------------------------------------+
| Add a new task                    | | ``$ bohr task add tangled-commits \``                           |
|                                   | | ``... -l TangledCommit.NonTangled,TangledCommit.Tangled \``     |
|                                   | | ``... --repro``                                                 |
+-----------------------------------+-------------------------------------------------------------------+

Contribute to the framework
=============================

To contribute to the framework, please refer to the documentation in the `bohr-framework <https://github.com/giganticode/bohr-framework>`_ repo.

Pre-prints and publications
=============================

.. code-block::

    @misc{babii2021mining,
          title={Mining Software Repositories with a Collaborative Heuristic Repository},
          author={Hlib Babii and Julian Aron Prenner and Laurin Stricker and Anjan Karmakar and Andrea Janes and Romain Robbes},
          year={2021},
          eprint={2103.01722},
          archivePrefix={arXiv},
          primaryClass={cs.SE}
    }