# DavisAI - Monorepo

This repo contains the code for the DavisAI AI Engineering Assessment.
## High-level overview

This application does the following:

- Ingests reference documents related to urban planning regulations
- Parses, transforms, and stores them in a synthetic file system
- Spawns a deep agent with bound filesystem tools and a backend to run the full agentic flow over the documents
- Builds a 'Buildable Envelope' report that is delivered to the user
## Tech Stack

- Frontend: React + Vite + TypeScript + TailwindCSS
  - OpenAPI client generated automatically from the backend specs
  - Vitest for testing
  - Zod for runtime validation
  - LangChain for parsing and formatting messages on the frontend
  - Deployed on a personal VPS through Dokploy + Nixpacks
- Backend: FastAPI + Python 3.12
  - Fully containerized (Debian trixie base image), with uv by Astral for environment management
  - `openapi.json` generated automatically from the FastAPI specs (see the sketch after this list)
  - deepagents for agent orchestration and management
  - Pydantic for data validation
  - Pytest for testing
  - LlamaCloud and OpenAI for document parsing and LLM access
  - Langfuse for observability and tracing -> see the scoring section for details
- CI/CD: Forgejo
  - Self-hosted Forgejo (Gitea fork) instance (git.olekhnovitch.fr)
  - Automatic workflows for:
    - Python sanity checks, linting, formatting, and testing
    - Frontend sanity checks, linting, formatting, and testing
    - Docker image build and push to Docker Hub on release
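As a concrete illustration of the auto-generated `openapi.json` mentioned in the backend bullets, here is a minimal sketch of what such a generation step can look like. The module path, app location, and output path are assumptions; in this repo the step is exposed through `./run.sh generate`.

```python
# Hypothetical openapi.json generation helper -- the real project wires this
# through ./run.sh generate; only FastAPI's documented app.openapi() is used.
import json
from pathlib import Path

# Assumption: the FastAPI instance lives in a module like python_services.main
from python_services.main import app  # hypothetical import path


def dump_openapi(out: Path) -> None:
    """Serialize the auto-generated OpenAPI schema so the frontend client can be generated from it."""
    out.write_text(json.dumps(app.openapi(), indent=2))


if __name__ == "__main__":
    dump_openapi(Path("openapi.json"))
```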
## Architecture

```
├── apps
│   ├── frontend
│   │   ├── index.html
│   │   ├── node_modules
│   │   ├── openapi.json        # Auto-generated OpenAPI client
│   │   ├── package.json
│   │   ├── package-lock.json
│   │   ├── src
│   │   ├── vite.config.ts
│   │   └── vitest.config.ts
│   └── python-services
│       ├── docker-compose.yml
│       ├── Dockerfile
│       ├── pyproject.toml
│       ├── README.md
│       ├── run.sh              # Main entrypoint for running commands in the uv environment
│       ├── scripts
│       ├── src
│       ├── tests
│       └── uv.lock
├── LICENCE
└── README.md
```
## How to run

- Backend:
  - `cd apps/python-services`
  - `cp .env.example .env` and fill in the environment variables
  - `uv sync`
  - `./run.sh generate` to generate the `openapi.json`
  - `./run.sh run`
- Frontend:
  - `cd apps/frontend`
  - `cp .env.example .env`
  - `npm ci` to install dependencies
  - `npm run generate` to generate the OpenAPI client
  - `npm run dev` to start the development server
## Agentic Flow

This implementation relies on the deepagents library for agent orchestration and management. The main agentic flow is implemented in `AgentService` in `apps/python-services/src/python_services/agent/service.py`. The deep agent architecture is essentially an LLM wrapped in a harness made of:

- A file system backend
- A set of planning tools (write todos, read files, etc.)
- A set of additional tools
- The ability to spawn sub-agents with specific roles and contexts on the fly
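As a rough illustration of what wiring such a harness looks like with deepagents, here is a minimal sketch based on the library's README. The tool, prompt, and sub-agent below are hypothetical placeholders rather than the actual `AgentService` wiring, and exact parameter names may differ between deepagents versions.

```python
# Minimal deepagents harness sketch; the domain tool, instructions, and sub-agent
# are placeholders, not the project's real AgentService configuration.
from deepagents import create_deep_agent


def lookup_regulation(article: str) -> str:
    """Hypothetical domain tool: return the text of a PLU article."""
    return f"Contents of article {article}..."


# deepagents ships the planning (todo) and virtual file-system tools by default;
# we only add extra domain tools and optional sub-agents on top of them.
agent = create_deep_agent(
    tools=[lookup_regulation],
    instructions=(
        "You are an urban-planning assistant. Analyse the uploaded documents "
        "and produce a 'Buildable Envelope' report."
    ),
    subagents=[
        {
            "name": "regulation-reader",
            "description": "Reads and summarises regulation documents from the file system",
            "prompt": "You read regulation files and extract the constraints that apply.",
        }
    ],
)

result = agent.invoke({"messages": [{"role": "user", "content": "Analyse parcel X"}]})
```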
The LLM used in production is gpt-5-codex from OpenAI, but it could be swapped for any capable reasoning model, including open-source ones, with minimal changes to the codebase thanks to the LLMProvider abstraction.
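Purely as an illustration of that swap point, a provider abstraction could look like the sketch below; the names and structure are assumptions, not the project's actual LLMProvider code.

```python
# Hypothetical provider abstraction -- not the project's real LLMProvider,
# just an illustration of how the underlying model stays swappable.
from typing import Protocol

from langchain.chat_models import init_chat_model
from langchain_core.language_models import BaseChatModel


class LLMProvider(Protocol):
    def get_model(self) -> BaseChatModel: ...


class OpenAIProvider:
    def __init__(self, model_name: str = "gpt-5-codex") -> None:
        self.model_name = model_name

    def get_model(self) -> BaseChatModel:
        # init_chat_model resolves "provider:model" strings into a chat model instance
        return init_chat_model(f"openai:{self.model_name}")


# Swapping to another (possibly open-source) model only touches the provider;
# the agent construction keeps receiving provider.get_model().
```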
## Benchmarking and scoring

I am using Langfuse for tracing and observability of every step of the agent's reasoning and execution, from the moment the user uploads the documents to the moment the final report is generated and sent back to the user.

Langfuse is self-hosted on a separate instance and lets me log:

- LLM provider metrics:
  - TTFT (Time To First Token)
  - Latency
  - Token usage (prompt, completion, total)
  - Costs: ~0.35€ per full run with gpt-5-codex, ~0.01€ with gpt-4o
- Agentic metrics:
  - Total execution time
  - Time spent in each tool
  - Number of calls to each tool
  - Final output (the call to `generate_report`)
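As a hedged sketch of how these traces can be produced, the standard Langfuse callback integration for LangChain/LangGraph looks roughly like this; the import path depends on the Langfuse SDK version (v2 shown), and the actual wiring in the backend may differ.

```python
# Attach Langfuse's LangChain callback so every LLM call and tool call in the
# agentic run is traced (latency, TTFT, token usage, cost). Assumes the
# self-hosted instance's credentials are set via LANGFUSE_* environment variables.
from langfuse.callback import CallbackHandler  # lives in langfuse.langchain in SDK v3

langfuse_handler = CallbackHandler()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Analyse parcel X"}]},
    config={"callbacks": [langfuse_handler]},
)
```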
### Dashboarding

There is still a lot of room for richer dashboards on top of these traces.
### Benchmarking + nuggets extraction

With all of the above metrics logged, I can easily benchmark different versions of the agent and extract 'success nuggets'. Here is how I proceed:

1. Run a batch of a fixed mission (for example the assessment mission) x times.
2. Define a list of 10-20 'success nuggets' to extract from the runs, for example:
   - "The agent identified the UMG area of the project"
   - "The agent identified the PLU regulations applicable to the project, which are precisely ..."
   - "The agent correctly used the `generate_report` tool with the correct format and content"
   - "The agent provided the 3D buildable envelope of exactly X m3"
   - "The agent said that the project is feasible in terms of urban planning regulations"
3. Grade each run on the presence or absence (1 or 0) of these nuggets in the agent's reasoning and final output, based on the Langfuse traces and the final report content (this can be automated with an LLM and structured output; see the sketch below).
4. Average the scores of the x runs to get a final score.
This is the "nugger extraction" method I use to benchmark the agent's performance and progress across different versions and iterations.
The score can then be broken down into multiple dimensions:

- "nuggets score"
- "cost score" (based on the total cost of the run)
- "latency score" (based on the total execution time)
- "overall score" (a weighted average of the above scores)





