ingest.helpers
md5sum
md5sum (filepath)
Perform an MD5 hash of a file.
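As a rough illustration of what hashing a file with MD5 involves, here is a minimal sketch using only the standard library. The name `md5sum_sketch` and the `chunk_size` parameter are hypothetical and not part of `ingest.helpers`; the actual implementation may differ.

```python
import hashlib


def md5sum_sketch(filepath, chunk_size=65536):
    """Return the hex MD5 digest of a file (hypothetical stand-in for md5sum)."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # Read in fixed-size chunks so large files are never fully loaded in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use constant regardless of file size.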
date2iso
date2iso (d)
iso2date
iso2date (s)
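No description is given for date2iso and iso2date. Judging by the names and signatures alone, they appear to convert between datetime objects and ISO 8601 strings. The sketch below shows that round trip under this assumption; the `_sketch` names are hypothetical and the real functions may behave differently.

```python
from datetime import datetime


def date2iso_sketch(d: datetime) -> str:
    """Assumed behavior: format a datetime as an ISO 8601 string."""
    return d.isoformat()


def iso2date_sketch(s: str) -> datetime:
    """Assumed behavior: parse an ISO 8601 string back into a datetime."""
    return datetime.fromisoformat(s)
```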
extract_file_dates
extract_file_dates (filepath)
Takes a file path and returns the file's creation and last-modified dates as ISO datetime strings.
Returns a tuple of the form (create-date, last-modify-date).
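A minimal sketch of how such a (create-date, last-modify-date) tuple could be produced from `os.stat`. The function name is hypothetical, and the use of `st_ctime` as the creation date is an assumption: on Unix it is the last metadata change time, not the true creation time.

```python
import os
from datetime import datetime


def extract_file_dates_sketch(filepath):
    """Return (create-date, last-modify-date) as ISO strings (hypothetical stand-in)."""
    st = os.stat(filepath)
    # Assumption: st_ctime approximates the creation date. This holds on Windows;
    # on Unix it is the time of the last metadata change.
    created = datetime.fromtimestamp(st.st_ctime).isoformat()
    modified = datetime.fromtimestamp(st.st_mtime).isoformat()
    return created, modified
```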
extract_extension
extract_extension (filepath:str, include_dot=False)
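Extracts the file extension from a file path. The signature defaults to include_dot=False, so the dot is presumably included only when include_dot is True.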
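The effect of the include_dot flag can be illustrated with `os.path.splitext`. This is a sketch under the assumption that the flag simply toggles the leading dot; the `_sketch` name is hypothetical and details such as case handling in the real function are not known from this page.

```python
import os


def extract_extension_sketch(filepath: str, include_dot: bool = False) -> str:
    """Return the extension of filepath, with or without the leading dot."""
    ext = os.path.splitext(filepath)[1]  # e.g. ".pdf", or "" when there is none
    return ext if include_dot else ext[1:]
```

For example, `extract_extension_sketch("report.pdf")` gives `"pdf"`, and passing `include_dot=True` gives `".pdf"`.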
extract_files
extract_files (source_dir:str, follow_links=False, extensions:Union[list,dict,NoneType]=None)
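No description is given for extract_files. From the signature, it appears to walk source_dir and return files, optionally filtered by extension. The sketch below illustrates that reading with `os.walk`. The `_sketch` name is hypothetical, and treating extensions as an iterable of extension strings (a dict would contribute its keys) is an assumption.

```python
import os
from typing import Iterable, Iterator, Optional


def extract_files_sketch(
    source_dir: str,
    follow_links: bool = False,
    extensions: Optional[Iterable[str]] = None,
) -> Iterator[str]:
    """Yield file paths under source_dir, optionally filtered by extension."""
    wanted = None
    if extensions is not None:
        # Assumption: accept extensions with or without a leading dot, case-insensitively.
        wanted = {e.lower().lstrip(".") for e in extensions}
    for root, _dirs, files in os.walk(source_dir, followlinks=follow_links):
        for name in files:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if wanted is None or ext in wanted:
                yield os.path.join(root, name)
```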
extract_tables
extract_tables (filepath:Optional[str]=None, docs:Optional[List[langchain_core.documents.base.Document]]=[])
Extracts tables from a PDF and appends them to the end of the supplied Document list. Accepts either a filepath or a list of LangChain Document objects, all from a single file. If filepath is empty, the file path of interest is extracted from docs.
Returns an updated list of Document objects with the extracted tables appended.
includes_caption
includes_caption (d:langchain_core.documents.base.Document)
Returns True if content of supplied Document includes a table caption
is_random_plaintext
is_random_plaintext (extension, mimetype)
Check mimetype for plain text
extract_mimetype
extract_mimetype (filepath)
Extract mimetype. Returns a tuple with extracted mimetype, type, subtype.
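The shape of the returned tuple (mimetype, type, subtype) can be illustrated as follows. This sketch guesses the mimetype from the file extension with the standard-library `mimetypes` module, which is a swapped-in technique: how extract_mimetype actually detects the type (for instance, by inspecting file contents) is not stated here. The `_sketch` name is hypothetical.

```python
import mimetypes


def extract_mimetype_sketch(filepath):
    """Return (mimetype, type, subtype), or (None, None, None) if unknown."""
    # Extension-based guess only; the file is never opened.
    mimetype, _encoding = mimetypes.guess_type(filepath)
    if mimetype is None:
        return None, None, None
    maintype, subtype = mimetype.split("/", 1)
    return mimetype, maintype, subtype
```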
get_mimetype
get_mimetype (filepath)
clean_text
clean_text (text_s_or_b)
Converts the input to a string and strips it.
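The parameter name text_s_or_b suggests the input may be either a str or bytes. A minimal sketch under that assumption follows; the `_sketch` name, the `encoding` parameter, and the choice to strip whitespace are all hypothetical.

```python
def clean_text_sketch(text_s_or_b, encoding="utf-8"):
    """Convert str or bytes input to a str and strip surrounding whitespace."""
    if isinstance(text_s_or_b, bytes):
        # Assumption: bytes are decoded as UTF-8, replacing undecodable bytes.
        text_s_or_b = text_s_or_b.decode(encoding, errors="replace")
    return str(text_s_or_b).strip()
```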
extract_file_metadata
extract_file_metadata (file_path:str, store_md5:bool=True, store_mimetype:bool=True, store_file_dates:bool=True, file_callables:dict={})
Extract file metadata
set_metadata_defaults
set_metadata_defaults (docs:List[langchain_core.documents.base.Document], extra_keys:list=[])
Sets Document metadata defaults
create_document
create_document (page_content:str, only_required_metadata:bool=True, **kwargs)
Create document with required metadata keys from METADATA.
doc_from_dict
doc_from_dict (d:dict)
Create LangChain Document from dictionary