
DavisAI - Monorepo

This repo contains the code for the DavisAI AI Engineering Assessment.


High-level overview

This application does the following:

  • Ingests reference documents related to urban planning regulations
  • Parses, transforms and stores them in a synthetic file-system
  • Spawns a deep agent with bound filesystem tools and a backend to deliver a full agentic flow over the documents
  • Builds a 'Buildable Envelope' report provided to the user
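The ingestion steps above can be sketched as a minimal, library-free example. All names here (`SyntheticFS`, `ingest_document`) are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of the ingestion pipeline: parse a reference document
# and store it in an in-memory "synthetic file system" the agent's tools use.
from dataclasses import dataclass, field


@dataclass
class SyntheticFS:
    """In-memory file system that the agent's filesystem tools operate on."""
    files: dict[str, str] = field(default_factory=dict)

    def write(self, path: str, content: str) -> None:
        self.files[path] = content

    def read(self, path: str) -> str:
        return self.files[path]


def ingest_document(fs: SyntheticFS, name: str, raw_text: str) -> str:
    """Parse (here: just normalize) a document and store it under /docs/."""
    path = f"/docs/{name}.txt"
    fs.write(path, raw_text.strip())
    return path


fs = SyntheticFS()
path = ingest_document(fs, "plu_zone_umg", "  Article UMG-10: max height 12 m  ")
print(path)           # /docs/plu_zone_umg.txt
print(fs.read(path))  # Article UMG-10: max height 12 m
```

In the real pipeline the parsing step is done by LlamaCloud rather than a `strip()` call; the sketch only shows the shape of the flow.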

Tech Stack

  • Frontend: React (Vite) + TypeScript + TailwindCSS
    • OpenAPI client generated automatically from backend specs
    • Vitest for testing
    • Zod for runtime validation
    • LangChain for frontend message parsing and formatting

-> Deployed on a personal VPS via Dokploy + Nixpacks

  • Backend: FastAPI + Python 3.12

    • Fully containerized on a Debian trixie base image, with uv (by Astral) for environment management
    • openapi.json generated automatically from FastAPI specs
    • deepagents for agent orchestration and management
    • Pydantic for data validation
    • Pytest for testing
    • LlamaCloud and OpenAI for LLM access/parsing
    • Langfuse for observability and tracing -> see the scoring section for details
  • CI/CD: Forgejo

    • Self-hosted Forgejo (gitea fork) instance (git.olekhnovitch.fr)
    • Automated workflows for:
      • Python sanity checks, linting, formatting and testing
      • Frontend sanity checks, linting, formatting and testing
      • Docker image build and push to DockerHub on release

Architecture

├── apps
│   ├── frontend
│   │   ├── index.html
│   │   ├── node_modules
│   │   ├── openapi.json # Auto-generated OpenAPI client
│   │   ├── package.json
│   │   ├── package-lock.json
│   │   ├── src
│   │   ├── vite.config.ts
│   │   └── vitest.config.ts
│   └── python-services
│       ├── docker-compose.yml
│       ├── Dockerfile
│       ├── pyproject.toml
│       ├── README.md
│       ├── run.sh  # Main entrypoint for running commands in the uv environment
│       ├── scripts
│       ├── src
│       ├── tests # Tests
│       └── uv.lock
├── LICENCE
└── README.md

How to run

  • Backend:

    • cd apps/python-services
    • cp .env.example .env and fill in the environment variables
    • uv sync
    • ./run.sh generate # To generate the openapi.json
    • ./run.sh run
  • Frontend:

    • cd apps/frontend
    • cp .env.example .env
    • npm ci to install dependencies
    • npm run generate to generate the OpenAPI client
    • npm run dev to start the development server

Agentic Flow

This implementation relies on the deepagents library for agent orchestration and management. The main agentic flow is implemented in AgentService in apps/python-services/src/python_services/agent/service.py. The deep agent architecture is essentially an LLM wrapped in a harness composed of:

  • A file system backend
  • A set of planning tools (write todos, read files, etc.)
  • A set of additional tools
  • An ability to spawn sub-agents with specific roles and contexts on-the-fly
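The harness above can be sketched, library-free, as a tool registry bound to a shared synthetic file system. The names below are illustrative, not the deepagents API; in the real service, deepagents wires these tools to the LLM:

```python
# Schematic sketch of the deep-agent harness: planning + filesystem tools
# exposed to the LLM as a name -> callable registry.
from typing import Callable

# Synthetic file-system backend and plan state shared by all tools.
fs: dict[str, str] = {}
todos: list[str] = []


def write_todos(items: list[str]) -> str:
    """Planning tool: record the agent's plan."""
    todos.extend(items)
    return f"{len(todos)} todos recorded"


def read_file(path: str) -> str:
    """Filesystem tool backed by the synthetic FS."""
    return fs.get(path, "<not found>")


def write_file(path: str, content: str) -> str:
    fs[path] = content
    return f"wrote {path}"


# Sub-agents spawned on-the-fly would receive their own
# (possibly restricted) copy of this registry.
TOOLS: dict[str, Callable] = {
    "write_todos": write_todos,
    "read_file": read_file,
    "write_file": write_file,
}

print(TOOLS["write_file"]("/docs/plu.txt", "Article UMG-10"))
print(TOOLS["read_file"]("/docs/plu.txt"))
print(TOOLS["write_todos"](["read regulations", "compute envelope"]))
```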

(Diagram: deep agent architecture)

The LLM used in production is gpt-5-codex from OpenAI, but it could be swapped for any capable reasoning model, including open-source ones, with minimal changes to the codebase thanks to the LLMProvider abstraction.
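The swap-point can be illustrated with a small sketch. This is a hypothetical shape for the LLMProvider abstraction, not the repo's actual interface:

```python
# Hypothetical LLMProvider abstraction: the agent depends only on the
# interface, so the underlying model/vendor is swappable.
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """One subclass per vendor (OpenAI, local open-source model, ...)."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class OpenAIProvider(LLMProvider):
    def __init__(self, model: str = "gpt-5-codex") -> None:
        self.model = model

    def complete(self, prompt: str) -> str:
        # Real code would call the OpenAI API here; stubbed for the sketch.
        return f"[{self.model}] answer to: {prompt}"


def run_agent(llm: LLMProvider) -> str:
    # Agent code never mentions a concrete vendor.
    return llm.complete("Is the project feasible?")


print(run_agent(OpenAIProvider()))
# [gpt-5-codex] answer to: Is the project feasible?
```

Swapping models then means adding one subclass and changing one constructor call.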

Benchmarking and scoring

I use Langfuse for tracing and observability of every step of the agent's reasoning and execution, from the moment the user uploads the documents to the moment the final report is generated and returned.

(Screenshot: Langfuse trace view)

Langfuse is self-hosted on a separate instance and lets me log:

  • LLM provider metrics:

    • TTFT (Time To First Token)
    • Latency
    • Token usage (prompt, completion, total)
    • Costs: ~€0.35 per full run with gpt-5-codex, ~€0.01 with gpt-4o
  • Agentic metrics:

    • Total execution time
    • Time spent in each tool
    • Number of calls to each tool
    • Final output (call to generate_report)

Dashboarding:

(Screenshot: cost dashboard)

There is still plenty of room for richer dashboards here.

Benchmarking + nuggets extraction

With all the above metrics logged, I can easily benchmark different versions of the agent and extract 'success nuggets'.

Here is how I proceed:

  1. Run a batch of x runs of a given mission (for example, the assessment mission)

  2. Define a list of 10-20 'success nuggets' to extract from the runs, for example:

    • "The agent identified the UMG area of the project"
    • "The agent identified the PLU regulations applicable to the project, which are precisely ..."
    • "The agent correctly used the generate_report tool with the correct format and content"
    • "The agent provided the 3D buildable envelope of exactly X m3"
    • "The agent said that the project is feasible in terms of urban planning regulations"
  3. Grade each run on the presence or absence (1 or 0) of these nuggets in the agent's reasoning and final output, using the Langfuse traces and the final report content (this can be automated with an LLM and structured output)

  4. Average the scores of the x runs to get a final score.
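Steps 3 and 4 can be sketched as follows. A naive substring check stands in for the LLM-based structured grading; the nuggets and trace texts are illustrative:

```python
# Naive nugget grading: presence (1) or absence (0) of each nugget in a
# run's trace/report text, then the batch average. A real setup would use
# an LLM with structured output instead of substring matching.
NUGGETS = [
    "UMG area",
    "generate_report",
    "feasible",
]


def grade_run(trace_text: str, nuggets: list[str]) -> float:
    """Score one run as the fraction of nuggets present."""
    hits = sum(1 for n in nuggets if n.lower() in trace_text.lower())
    return hits / len(nuggets)


runs = [
    "Agent identified the UMG area and called generate_report; project feasible.",
    "Agent called generate_report only.",
]
scores = [grade_run(t, NUGGETS) for t in runs]
print(scores)                     # per-run nugget scores
print(sum(scores) / len(scores))  # batch average (step 4)
```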

This is the 'nugget extraction' method I use to benchmark the agent's performance and progress across versions and iterations.

The score can then be split into multiple dimensions:

  • "nuggets score"
  • "cost score" (based on the total cost of the run)
  • "latency score" (based on the total execution time)
  • "overall score" (based on a weighted average of the above scores)

Full Screen Demo

(Screenshots: document upload, end of chat, buildable envelope)