
process_zinc_csv.sh 2.0 KB

#!/bin/bash
#
# Copyright (c) 2022, NVIDIA CORPORATION.
# SPDX-License-Identifier: Apache-2.0
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Process ZINC CSV data

# This script processes the ZINC dataset used to train MolBART
# so that the train, val, and test splits are in separate directories.
# It also removes the set column, since it is no longer needed, and
# creates a metadata file containing the number of molecules
# in each file.

# Example input data format:
# zinc_id,smiles,set
# ZINC000843130676,CCN1CCN(c2ccc(-c3nc(CCN)no3)cc2F)CC1,train
# ZINC000171110690,CC(C)(C)c1noc(CSCC(=O)NC2CC2)n1,train
# ZINC000848409174,CC(=NN[C@H]1CCCOC1)c1cncnc1C,train

SOURCE_DIR=./zinc_csv # location of original data
DEST_DIR=./zinc_csv_split # location of new data

######

METADATA_FILE=metadata.txt

for SPLIT in "train" "val" "test"; do
    echo "Processing $SPLIT.."
    mkdir -p "${DEST_DIR}/${SPLIT}"
    echo "file,size" > "${DEST_DIR}/${SPLIT}/${METADATA_FILE}" # metadata file with number of molecules

    for f in "${SOURCE_DIR}"/*.csv; do
        echo "Processing $f file.."
        BASE_FILENAME=$(basename "$f")
        DEST_FILE="${DEST_DIR}/${SPLIT}/${BASE_FILENAME}"

        # Destination data
        echo "zinc_id,smiles" > "${DEST_FILE}" # file header
        # Keep rows whose set column equals the split name exactly, then drop that column
        awk -F',' -v s="$SPLIT" '$3 == s { print $1 "," $2 }' "$f" >> "${DEST_FILE}"

        # Log number of molecules (line count minus the header line)
        NUM_MOL=$(wc -l < "${DEST_FILE}")
        NUM_MOL=$((NUM_MOL - 1))
        echo "$BASE_FILENAME,$NUM_MOL" >> "${DEST_DIR}/${SPLIT}/${METADATA_FILE}"
    done
done
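
The per-split filtering and counting step can be tried in isolation before running the script on the full dataset. The following is a minimal sketch, not part of the original script: the sample rows, the temporary directory, and the helper name `extract_split` are all hypothetical and chosen only for illustration. It keeps the rows whose `set` column equals the split name, drops that column, and counts the result.

```shell
#!/bin/bash
# Hypothetical three-row sample in a temporary directory (illustrative data only)
WORK_DIR=$(mktemp -d)
cat > "$WORK_DIR/sample.csv" <<'EOF'
zinc_id,smiles,set
ID_A,CCO,train
ID_B,CCN,train
ID_C,c1ccccc1,val
EOF

# Hypothetical helper: print zinc_id,smiles for rows whose third
# column is exactly the requested split name
extract_split() {
    local split_name=$1 csv_file=$2
    awk -F',' -v s="$split_name" '$3 == s { print $1 "," $2 }' "$csv_file"
}

# Count molecules per split; tr strips the padding some wc builds add
TRAIN_COUNT=$(extract_split train "$WORK_DIR/sample.csv" | wc -l | tr -d ' ')
VAL_COUNT=$(extract_split val "$WORK_DIR/sample.csv" | wc -l | tr -d ' ')
TEST_COUNT=$(extract_split test "$WORK_DIR/sample.csv" | wc -l | tr -d ' ')

echo "train=$TRAIN_COUNT val=$VAL_COUNT test=$TEST_COUNT"

rm -rf "$WORK_DIR"
```

Comparing the third field directly means the header row and any row that merely contains the split name elsewhere on the line are never selected. The sketch assumes plain comma-separated fields with no quoted commas and Unix line endings.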