ingest.helpers

Helper utilities for ingesting documents.

source

md5sum

 md5sum (filepath)

Computes the MD5 hash of a file.
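
As an illustration of what a file-level MD5 hash involves, here is a minimal sketch using only the standard library. The name `md5sum_sketch` and the `chunk_size` parameter are hypothetical and not part of `ingest.helpers`; the actual `md5sum` may be implemented differently.

```python
import hashlib


def md5sum_sketch(filepath, chunk_size=65536):
    """Return the hex MD5 digest of a file (hypothetical stand-in for md5sum)."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat regardless of file size, which matters when hashing large PDFs during ingestion.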


source

date2iso

 date2iso (d)

source

iso2date

 iso2date (s)
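
Neither `date2iso` nor `iso2date` is documented here, so the following is only a guess at the intended round trip, assuming `d` is a `datetime` and `s` is an ISO 8601 string. The `_sketch` names are hypothetical.

```python
from datetime import datetime


def date2iso_sketch(d: datetime) -> str:
    """Format a datetime as an ISO 8601 string (assumed behavior)."""
    return d.isoformat()


def iso2date_sketch(s: str) -> datetime:
    """Parse an ISO 8601 string back into a datetime (assumed behavior)."""
    return datetime.fromisoformat(s)


stamp = date2iso_sketch(datetime(2024, 1, 2, 3, 4, 5))
print(stamp)                   # 2024-01-02T03:04:05
print(iso2date_sketch(stamp))  # 2024-01-02 03:04:05
```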

source

extract_file_dates

 extract_file_dates (filepath)

*Takes a file path and returns the file's creation date and last-modified date as ISO datetime strings.

Returns a tuple of the form (create-date, last-modify-date).*
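
A minimal sketch of how such a tuple could be produced with `os.stat`. The name `file_dates_sketch` is hypothetical, and the real function may read dates differently. Note that on most Unix systems `st_ctime` is the time of the last metadata change rather than the true creation time, so the first element is only an approximation there.

```python
import os
from datetime import datetime


def file_dates_sketch(filepath):
    """Return (create-date, last-modify-date) as ISO strings (illustrative only)."""
    st = os.stat(filepath)
    # st_ctime: creation time on Windows, metadata-change time on most Unix systems.
    created = datetime.fromtimestamp(st.st_ctime).isoformat()
    modified = datetime.fromtimestamp(st.st_mtime).isoformat()
    return created, modified
```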


source

extract_extension

 extract_extension (filepath:str, include_dot=False)

Extracts the file extension from a file path. The leading dot is included only when `include_dot` is True.
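
The effect of the `include_dot` flag can be sketched with `os.path.splitext`. The function `extension_sketch` is a hypothetical stand-in; details such as case handling or multi-part suffixes in the real `extract_extension` are not shown in this page and may differ.

```python
import os


def extension_sketch(filepath: str, include_dot=False):
    """Return the final extension of a path, with or without the leading dot."""
    ext = os.path.splitext(filepath)[1]  # e.g. ".pdf", or "" when there is none
    return ext if include_dot else ext.lstrip(".")


print(extension_sketch("docs/report.pdf"))                    # pdf
print(extension_sketch("docs/report.pdf", include_dot=True))  # .pdf
```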


source

extract_files

 extract_files (source_dir:str, follow_links=False,
                extensions:Union[list,dict,NoneType]=None)
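
`extract_files` has no description here. Judging only from its parameter names, it plausibly walks `source_dir` and yields files whose extension is in `extensions`. The sketch below illustrates that reading with `os.walk`; the name `walk_files_sketch`, the assumption that extensions include the leading dot, and the case-insensitive match are all hypothetical.

```python
import os


def walk_files_sketch(source_dir: str, follow_links=False, extensions=None):
    """Yield file paths under source_dir, optionally filtered by extension."""
    # Iterating a dict yields its keys, so a list or a dict both work here.
    wanted = None if extensions is None else {e.lower() for e in extensions}
    for root, _, filenames in os.walk(source_dir, followlinks=follow_links):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if wanted is None or ext in wanted:
                yield os.path.join(root, name)
```

A generator keeps memory use low on large directory trees; wrap the call in `list(...)` if the full set of paths is needed at once.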

source

extract_tables

 extract_tables (filepath:Optional[str]=None,
                 docs:Optional[List[langchain_core.documents.base.Document
                 ]]=[])

*Extract tables from PDF and append to end of supplied Document list. Accepts either a filepath or a list of LangChain Document objects all from a single file. If filepath is empty, the file path of interest is extracted from docs.

Returns an updated list of Document objects appended with extracted tables.*


source

includes_caption

 includes_caption (d:langchain_core.documents.base.Document)

Returns True if content of supplied Document includes a table caption


source

is_random_plaintext

 is_random_plaintext (extension, mimetype)

Check mimetype for plain text


source

extract_mimetype

 extract_mimetype (filepath)

Extract mimetype. Returns a tuple with extracted mimetype, type, subtype.
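
The shape of the returned tuple can be illustrated with the standard-library `mimetypes` module, which guesses from the file extension only. This is a swapped-in technique for illustration: the real `extract_mimetype` may inspect file contents instead, and `mimetype_sketch` is a hypothetical name.

```python
import mimetypes


def mimetype_sketch(filepath):
    """Return (mimetype, type, subtype), or three Nones when no guess is possible."""
    mimetype, _encoding = mimetypes.guess_type(filepath)
    if mimetype is None:
        return None, None, None
    maintype, subtype = mimetype.split("/", 1)
    return mimetype, maintype, subtype


print(mimetype_sketch("paper.pdf"))  # ('application/pdf', 'application', 'pdf')
```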


source

get_mimetype

 get_mimetype (filepath)

source

clean_text

 clean_text (text_s_or_b)

Converts the input (string or bytes) to a string and strips surrounding whitespace.
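
A minimal sketch of that behavior, assuming the parameter name `text_s_or_b` means "string or bytes" and that bytes are UTF-8 encoded. The name `clean_text_sketch` and the decoding choices are assumptions, not the library's confirmed implementation.

```python
def clean_text_sketch(text_s_or_b):
    """Coerce str or bytes input to a stripped str (illustrative only)."""
    if isinstance(text_s_or_b, bytes):
        # Assumption: bytes are UTF-8; undecodable bytes are replaced, not raised.
        text_s_or_b = text_s_or_b.decode("utf-8", errors="replace")
    return str(text_s_or_b).strip()


print(clean_text_sketch(b"  hello \n"))  # hello
```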


source

extract_file_metadata

 extract_file_metadata (file_path:str, store_md5:bool=True,
                        store_mimetype:bool=True,
                        store_file_dates:bool=True,
                        file_callables:dict={})

Extract file metadata


source

set_metadata_defaults

 set_metadata_defaults (docs:List[langchain_core.documents.base.Document],
                        extra_keys:list=[])

Sets Document metadata defaults


source

create_document

 create_document (page_content:str, only_required_metadata:bool=True,
                  **kwargs)

Create document with required metadata keys from METADATA.


source

doc_from_dict

 doc_from_dict (d:dict)

Create a LangChain Document from a dictionary.