As tomorrow is a holiday, I needed to wrap things up by EOD, which made it difficult to do anything outside of work. I'll tackle the TODOs tomorrow.
LlamaParse is the world’s first genAI-native document parsing platform - built with LLMs and for LLM use cases.
Your LLM application performance is only as good as your data. The main goal of LlamaParse is to parse and clean your data, ensuring that it’s good quality before passing to any downstream LLM use case such as advanced RAG.
You get 1k free pages a day. If you sign up for the paid plan, you get 7k free pages a week, and then $0.003 for each page.
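As a back-of-the-envelope check (my own arithmetic, assuming the quoted pricing is current): parsing 10,000 pages in one week on the paid plan would cost (10,000 − 7,000) × $0.003 = $9.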
Processing documents by parsing/cleaning data is a very difficult job. For example, I need to extract text and tables from PDFs, chunk the data, and convert it into embedding records for RAG. The service seems to be a good choice for simple RAG use cases, but I'm not sure it's suitable for B2B software. I'd need to talk with the team directly, but its ability to comply with the EU Data Act is not mentioned. Also, as far as I checked, it does not provide options to fine-tune the results, so it might not be the best option when accuracy on a specific document type matters and fine-tuning is needed. (I sketched how the parse/chunk/embed flow would look right after the class fields below.)
llama_parse/llama_parse/base.py
class LlamaParse(BasePydanticReader):
    """A smart-parser for files."""

    api_key: str = ...
    base_url: str = ...
    result_type: ResultType = ...
    num_workers: int = ...
    check_interval: int = ...
    max_timeout: int = ...
    verbose: bool = ...
    show_progress: bool = ...
    language: Language = ...
    parsing_instruction: Optional[str] = ...
    skip_diagonal_text: Optional[bool] = ...
    invalidate_cache: Optional[bool] = ...
    do_not_cache: Optional[bool] = ...
    fast_mode: Optional[bool] = ...
    do_not_unroll_columns: Optional[bool] = ...
    page_separator: Optional[str] = ...
    gpt4o_mode: bool = Field(
        default=False,
        description="Whether to use gpt-4o extract text from documents.",
    )
    gpt4o_api_key: Optional[str] = Field(
        default=None,
        description="The API key for the GPT-4o API. Lowers the cost of parsing.",
    )
    ignore_errors: bool = ...
    split_by_page: bool = ...
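As a sanity check on how it would slot into the pipeline above, here is my own minimal sketch of parse -> chunk -> embed using llama_parse together with llama_index; the API key, file name, chunk sizes, and question are placeholders, and the exact import paths may differ across versions.

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# 1. Parse the PDF into clean markdown via the LlamaParse API
parser = LlamaParse(
    api_key="llx-...",        # placeholder key
    result_type="markdown",   # "markdown" or "text"
    num_workers=4,
    language="en",
    verbose=True,
)
documents = parser.load_data("./sample_contract.pdf")  # placeholder file

# 2. Chunk the parsed documents into nodes
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# 3. Embed and index the chunks (uses the default embedding model, e.g. OpenAI)
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
print(query_engine.query("What does the termination clause say?"))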
Also, I'm reluctant to depend on specific AI services, since the landscape is changing quickly and there's no guarantee that these services will still exist a few years from now. If the startup goes bust, we'd need to rewrite the logic around the integration.
In addition to the documentation and the GitHub repository, I checked their blog post about the feature launch.
Launching the first GenAI-native document parsing platform - LlamaIndex
Since we first released LlamaParse it has featured industry-leading table extraction capabilities. Under the hood, this has been using LLM intelligence since the start. It seamlessly integrates with the advanced indexing/retrieval capabilities that the open-source framework offers, enabling users to build state-of-the-art document RAG.
It mentions PyPDF, PyMuPDF, Textract, and PdfMiner… so does their algorithm use an LLM on top of these open-source projects to extract text??
The service seems to be at a very early stage, and I cannot find any other resources about its algorithm.
It was hard to navigate the website to understand what Apache Airflow offers, but the DAG (Directed Acyclic Graph) seems like the concept I can start digesting.
A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run.
So, the main use case of Apache Airflow is to define task dependencies as a DAG and schedule the workflow.
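To make that concrete, here is a toy DAG sketch of my own (assuming a recent Airflow 2.x where the schedule argument is available; the DAG id and the task callables are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def parse_documents():
    print("parse PDFs here")     # placeholder for a parsing step

def build_index():
    print("chunk + embed here")  # placeholder for an indexing step

with DAG(
    dag_id="rag_ingestion",          # made-up DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once a day
    catchup=False,                   # don't backfill past runs
) as dag:
    parse = PythonOperator(task_id="parse_documents", python_callable=parse_documents)
    index = PythonOperator(task_id="build_index", python_callable=build_index)

    parse >> index  # ">>" declares the dependency: parse must finish before build_index runs

Airflow then runs this DAG on the schedule and executes the tasks in dependency order.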
In this YouTube video https://youtu.be/SWZvtL1uxJE?si=pPLxJlMXQ-O-bM-h, they say their initial idea was to automate this process (? the video is long and I didn't check the details).
My impression is that the service is for people on the business side (Sales, Customer Support), not for building systems on top of it, though I haven't checked the use cases page and might have misunderstood.
I don't hear about this service on a daily basis, so I might forget about it for now.
Protein shake: 300 kcal
Yogurt: 200 kcal
Avocado: 200 kcal
Eggs: 600 kcal
Total: 1300 kcal
TODO: