ingest.helpers

Helper utilities for ingesting documents.

md5sum

 md5sum (filepath)

Perform an MD5 hash of a file.
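
As an illustration of what hashing a file involves, here is a minimal sketch using the standard-library `hashlib`. The name `md5sum_sketch` and the `chunk_size` parameter are assumptions made for this example and are not part of the documented API; the library's own implementation may differ.

```python
import hashlib


def md5sum_sketch(filepath, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large files, which matters when ingesting big PDFs or archives.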

date2iso

 date2iso (d)

iso2date

 iso2date (s)

extract_file_dates

 extract_file_dates (filepath)

Takes a file path and returns ISO datetime strings for the file's creation date and last-modified date.

Returns a tuple of the form (create-date, last-modify-date).
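
The tuple described above could be produced roughly as follows. This is a sketch under stated assumptions: it uses `os.stat` and `datetime.isoformat`, and the name `file_dates_sketch` is hypothetical. How the library itself determines the creation date is not stated in this page.

```python
import os
from datetime import datetime


def file_dates_sketch(filepath):
    """Return (create-date, last-modify-date) as ISO 8601 strings."""
    stat = os.stat(filepath)
    # Assumption: st_ctime stands in for the creation date. It is the creation
    # time on Windows but the last metadata change on most Unix systems.
    created = datetime.fromtimestamp(stat.st_ctime).isoformat()
    modified = datetime.fromtimestamp(stat.st_mtime).isoformat()
    return created, modified
```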

extract_extension

 extract_extension (filepath:str, include_dot=False)

Extracts the file extension from a file path. The leading dot is included only when include_dot is True.
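
A minimal sketch of how an `include_dot` flag might behave, built on `os.path.splitext`. The function name `extension_sketch` is hypothetical, and details such as case normalization in the real helper are not documented here, so none are assumed.

```python
import os


def extension_sketch(filepath, include_dot=False):
    """Return the extension of filepath, with or without the leading dot."""
    ext = os.path.splitext(filepath)[1]  # e.g. ".pdf", or "" if there is none
    return ext if include_dot else ext.lstrip(".")


print(extension_sketch("docs/report.pdf"))                    # pdf
print(extension_sketch("docs/report.pdf", include_dot=True))  # .pdf
```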

extract_files

 extract_files (source_dir:str, follow_links=False,
                extensions:Union[dict,list,NoneType]=None)

extract_tables

 extract_tables (filepath:Optional[str]=None,
                 docs:Optional[List[langchain_core.documents.base.Document
                 ]]=[])

Extracts tables from a PDF and appends them to the end of the supplied Document list. Accepts either a filepath or a list of LangChain Document objects that all come from a single file. If filepath is empty, the file path of interest is extracted from docs.

Returns an updated list of Document objects with the extracted tables appended.

includes_caption

 includes_caption (d:langchain_core.documents.base.Document)

Returns True if content of supplied Document includes a table caption

is_random_plaintext

 is_random_plaintext (extension, mimetype)

Check mimetype for plain text

extract_mimetype

 extract_mimetype (filepath)

Extract mimetype. Returns a tuple with extracted mimetype, type, subtype.
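
The three-part return value can be pictured with the sketch below. It guesses the MIME type from the file name with the standard-library `mimetypes` module, which is an assumption: the library may inspect file contents instead. The name `mimetype_sketch` and the all-`None` fallback are also hypothetical.

```python
import mimetypes


def mimetype_sketch(filepath):
    """Return (mimetype, type, subtype), or (None, None, None) if unknown."""
    mimetype, _encoding = mimetypes.guess_type(filepath)
    if mimetype is None:
        return None, None, None
    # "text/plain" splits into main type "text" and subtype "plain"
    maintype, subtype = mimetype.split("/", 1)
    return mimetype, maintype, subtype
```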

get_mimetype

 get_mimetype (filepath)

clean_text

 clean_text (text_s_or_b)

Converts the input (string or bytes) to a string and strips it.
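
One plausible reading of this behavior, sketched with plain Python. The UTF-8 decoding with replacement of invalid bytes is an assumption, as is the name `clean_text_sketch`; the real helper may decode differently.

```python
def clean_text_sketch(text_s_or_b):
    """Convert str or bytes input to str and strip surrounding whitespace."""
    if isinstance(text_s_or_b, bytes):
        # Assumption: bytes are UTF-8; undecodable bytes become U+FFFD
        text_s_or_b = text_s_or_b.decode("utf-8", errors="replace")
    return str(text_s_or_b).strip()
```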

ParagraphTextSplitter

 ParagraphTextSplitter (chunk_size:int=5000, chunk_overlap:int=0)

Interface for splitting text into chunks.
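
To make the idea of paragraph-based chunking concrete, here is a rough sketch that splits on blank lines and packs paragraphs into chunks of at most `chunk_size` characters. It is not the class's implementation: the function name `split_paragraphs_sketch` is hypothetical, `chunk_overlap` is ignored, and a single paragraph longer than `chunk_size` is kept whole.

```python
import re


def split_paragraphs_sketch(text, chunk_size=5000):
    """Greedily pack blank-line-separated paragraphs into chunks."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if not current or len(candidate) <= chunk_size:
            # Either the chunk is empty or the paragraph still fits
            current = candidate
        else:
            # Paragraph would overflow the chunk: close it and start a new one
            chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks
```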

extract_file_metadata

 extract_file_metadata (file_path:str, store_md5:bool=True,
                        store_mimetype:bool=True,
                        store_file_dates:bool=True,
                        file_callables:dict={})

Extract file metadata

set_metadata_defaults

 set_metadata_defaults (docs:List[langchain_core.documents.base.Document],
                        extra_keys:list=[])

Sets Document metadata defaults

create_document

 create_document (page_content:str, only_required_metadata:bool=True,
                  **kwargs)

Create document with required metadata keys from METADATA.

dict_from_doc

 dict_from_doc (doc, content_field='page_content')

Create dictionary from LangChain Document

doc_from_dict

 doc_from_dict (d:dict, content_field='page_content')

Create LangChain Document from dictionary
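
The two conversions are inverses of each other, which the sketch below illustrates. It assumes the dictionary is flat, with metadata keys sitting alongside the content field; the page does not confirm that layout. `SimpleDocument` is a hypothetical stand-in for a LangChain Document so the example runs without extra dependencies, and the `_sketch` function names are likewise illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dict_from_doc_sketch(doc, content_field="page_content"):
    """Flatten a document into one dict: metadata keys plus the content field."""
    d = dict(doc.metadata)
    d[content_field] = doc.page_content
    return d


def doc_from_dict_sketch(d, content_field="page_content"):
    """Rebuild a document: the content field becomes text, the rest metadata."""
    metadata = {k: v for k, v in d.items() if k != content_field}
    return SimpleDocument(page_content=d[content_field], metadata=metadata)


doc = SimpleDocument("hello", {"source": "a.txt", "page": 1})
assert doc_from_dict_sketch(dict_from_doc_sketch(doc)) == doc
```

With a flat layout, a metadata key that collides with `content_field` would be overwritten, so the content field name has to be chosen to avoid existing metadata keys.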