# DavisAI - Monorepo

This repo contains the code for the DavisAI AI Engineering Assessment.
## High-level overview

This application does the following:

- Ingests reference documents related to urban planning regulations
- Parses, transforms, and stores them in a synthetic file system
- Spawns a deep agent with bound filesystem tools and a backend to run the full agentic flow over the documents
- Builds a 'Buildable Envelope' report that is delivered to the user
## Tech Stack

- Frontend: React + Vite + TypeScript + TailwindCSS
  - OpenAPI client generated automatically from the backend specs
  - Vitest for testing
  - Zod for runtime validation
  - LangChain for parsing and formatting messages on the frontend
  - Deployed on a personal VPS through Dokploy + Nixpacks
- Backend: FastAPI + Python 3.12
  - Fully containerized (Debian trixie base image), with uv by Astral for environment management
  - `openapi.json` generated automatically from the FastAPI specs (see the sketch after this list)
  - deepagents for agent orchestration and management
  - Pydantic for data validation
  - Pytest for testing
  - LlamaCloud and OpenAI for document parsing and LLM access
  - Langfuse for observability and tracing -> see the scoring section for details
- CI/CD: Forgejo
  - Self-hosted Forgejo (Gitea fork) instance (git.olekhnovitch.fr)
  - Automatic workflows for:
    - Python sanity checks, linting, formatting, and testing
    - Frontend sanity checks, linting, formatting, and testing
    - Docker image build and push to Docker Hub on release
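As a concrete illustration of the auto-generated `openapi.json` mentioned in the backend bullets, here is a minimal sketch of what such a generation step can look like. The module path, app location, and output path are assumptions; in this repo the step is exposed through `./run.sh generate`.

```python
# Hypothetical openapi.json generation helper -- the real project wires this
# through ./run.sh generate; only FastAPI's documented app.openapi() is used.
import json
from pathlib import Path

# Assumption: the FastAPI instance lives in a module like python_services.main
from python_services.main import app  # hypothetical import path


def dump_openapi(out: Path) -> None:
    """Serialize the auto-generated OpenAPI schema so the frontend client can be generated from it."""
    out.write_text(json.dumps(app.openapi(), indent=2))


if __name__ == "__main__":
    dump_openapi(Path("openapi.json"))
```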
## Architecture

```
├── apps
│   ├── frontend
│   │   ├── index.html
│   │   ├── node_modules
│   │   ├── openapi.json        # Auto-generated OpenAPI client
│   │   ├── package.json
│   │   ├── package-lock.json
│   │   ├── src
│   │   ├── vite.config.ts
│   │   └── vitest.config.ts
│   └── python-services
│       ├── docker-compose.yml
│       ├── Dockerfile
│       ├── pyproject.toml
│       ├── README.md
│       ├── run.sh              # Main entrypoint for running commands in the uv environment
│       ├── scripts
│       ├── src
│       ├── tests
│       └── uv.lock
├── LICENCE
└── README.md
```
## How to run

- Backend:
  - `cd apps/python-services`
  - `cp .env.example .env` and fill in the environment variables
  - `uv sync`
  - `./run.sh generate` to generate the `openapi.json`
  - `./run.sh run`
- Frontend:
  - `cd apps/frontend`
  - `cp .env.example .env`
  - `npm ci` to install dependencies
  - `npm run generate` to generate the OpenAPI client
  - `npm run dev` to start the development server
## Agentic Flow

This implementation relies on the deepagents library for agent orchestration and management. The main agentic flow is implemented in `AgentService` in `apps/python-services/src/python_services/agent/service.py`. The deep agent architecture is essentially an LLM wrapped in a harness made of:

- A file system backend
- A set of planning tools (write todos, read files, etc.)
- A set of additional tools
- The ability to spawn sub-agents with specific roles and contexts on the fly
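As a rough illustration of what wiring such a harness looks like with deepagents, here is a minimal sketch based on the library's README. The tool, prompt, and sub-agent below are hypothetical placeholders rather than the actual `AgentService` wiring, and exact parameter names may differ between deepagents versions.

```python
# Minimal deepagents harness sketch; the domain tool, instructions, and sub-agent
# are placeholders, not the project's real AgentService configuration.
from deepagents import create_deep_agent


def lookup_regulation(article: str) -> str:
    """Hypothetical domain tool: return the text of a PLU article."""
    return f"Contents of article {article}..."


# deepagents ships the planning (todo) and virtual file-system tools by default;
# we only add extra domain tools and optional sub-agents on top of them.
agent = create_deep_agent(
    tools=[lookup_regulation],
    instructions=(
        "You are an urban-planning assistant. Analyse the uploaded documents "
        "and produce a 'Buildable Envelope' report."
    ),
    subagents=[
        {
            "name": "regulation-reader",
            "description": "Reads and summarises regulation documents from the file system",
            "prompt": "You read regulation files and extract the constraints that apply.",
        }
    ],
)

result = agent.invoke({"messages": [{"role": "user", "content": "Analyse parcel X"}]})
```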
The LLM used in production is gpt-5-codex from OpenAI, but it could be swapped for any capable reasoning model, including open-source ones, with minimal changes to the codebase thanks to the LLMProvider abstraction.
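Purely as an illustration of that swap point, a provider abstraction could look like the sketch below; the names and structure are assumptions, not the project's actual LLMProvider code.

```python
# Hypothetical provider abstraction -- not the project's real LLMProvider,
# just an illustration of how the underlying model stays swappable.
from typing import Protocol

from langchain.chat_models import init_chat_model
from langchain_core.language_models import BaseChatModel


class LLMProvider(Protocol):
    def get_model(self) -> BaseChatModel: ...


class OpenAIProvider:
    def __init__(self, model_name: str = "gpt-5-codex") -> None:
        self.model_name = model_name

    def get_model(self) -> BaseChatModel:
        # init_chat_model resolves "provider:model" strings into a chat model instance
        return init_chat_model(f"openai:{self.model_name}")


# Swapping to another (possibly open-source) model only touches the provider;
# the agent construction keeps receiving provider.get_model().
```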
## Benchmarking and scoring

I am using Langfuse for tracing and observability of every step of the agent's reasoning and execution, from the moment the user uploads the documents to the moment the final report is generated and sent back to the user.

Langfuse is self-hosted on a separate instance and lets me log:

- LLM provider metrics:
  - TTFT (Time To First Token)
  - Latency
  - Token usage (prompt, completion, total)
  - Costs: ~0.35€ per full run with gpt-5-codex, ~0.01€ with gpt-4o
- Agentic metrics:
  - Total execution time
  - Time spent in each tool
  - Number of calls to each tool
  - Final output (the call to `generate_report`)
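As a hedged sketch of how these traces can be produced, the standard Langfuse callback integration for LangChain/LangGraph looks roughly like this; the import path depends on the Langfuse SDK version (v2 shown), and the actual wiring in the backend may differ.

```python
# Attach Langfuse's LangChain callback so every LLM call and tool call in the
# agentic run is traced (latency, TTFT, token usage, cost). Assumes the
# self-hosted instance's credentials are set via LANGFUSE_* environment variables.
from langfuse.callback import CallbackHandler  # lives in langfuse.langchain in SDK v3

langfuse_handler = CallbackHandler()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Analyse parcel X"}]},
    config={"callbacks": [langfuse_handler]},
)
```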
### Dashboarding

There is still a lot of room for richer dashboards on top of these traces.
### Benchmarking + nuggets extraction

With all of the above metrics logged, I can easily benchmark different versions of the agent and extract 'success nuggets'. Here is how I proceed:

1. Run a batch of a fixed mission (for example the assessment mission) x times.
2. Define a list of 10-20 'success nuggets' to extract from the runs, for example:
   - "The agent identified the UMG area of the project"
   - "The agent identified the PLU regulations applicable to the project, which are precisely ..."
   - "The agent correctly used the `generate_report` tool with the correct format and content"
   - "The agent provided the 3D buildable envelope of exactly X m3"
   - "The agent said that the project is feasible in terms of urban planning regulations"
3. Grade each run on the presence or absence (1 or 0) of these nuggets in the agent's reasoning and final output, based on the Langfuse traces and the final report content (this can be automated with an LLM and structured output; see the sketch below).
4. Average the scores of the x runs to get a final score.
This is the "nugger extraction" method I use to benchmark the agent's performance and progress across different versions and iterations.
The score can then be broken down into multiple dimensions:

- "nuggets score"
- "cost score" (based on the total cost of the run)
- "latency score" (based on the total execution time)
- "overall score" (a weighted average of the above scores)





