weaviate_pdf.py 1.1 KB

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Weaviate
import weaviate

# Connect to the Weaviate instance listening on localhost:8081.
client = weaviate.Client(
    url="http://localhost:8081",
)

# Sentence-transformers embedding model, run on the CPU.
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

# Load the PDF; load_and_split() already applies the loader's default splitter.
loader = PyPDFLoader("Insurance_Handbook_20103.pdf")
pages = loader.load_and_split()

# Re-split into chunks of at most 1000 characters with a 50-character overlap.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_documents(pages)
full_texts = [doc.page_content for doc in texts]

# Embed the chunks with `hf` and store them in the 'BookOfInsurance' index,
# keeping the chunk text under the 'intro' property.
vector_db = Weaviate.from_texts(
    full_texts,
    hf,
    client=client,
    by_text=False,
    index_name="BookOfInsurance",
    text_key="intro",
)

# Test the similarity search query function
# print(vector_db.similarity_search("What is expense ratio?", k=3))

client.close()
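To make the `chunk_size=1000` and `chunk_overlap=50` settings above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified character sliding window, not the separator-aware algorithm that `RecursiveCharacterTextSplitter` actually uses, and `sliding_chunks` is a hypothetical helper name introduced only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Simplified illustration only: the real splitter prefers to break on
    separators such as paragraphs and spaces, which this sketch ignores.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    # Each chunk repeats the last character of the previous one.
    print(sliding_chunks(sample, chunk_size=4, chunk_overlap=1))
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears with some surrounding context in at least one chunk.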