Monorepos and large repo management
Google stores billions of lines of code in a single repository. Most teams do not operate at that scale, but the monorepo pattern is increasingly popular even for mid-sized projects. The trade-offs between monorepos and polyrepos shape how you organize code, manage dependencies, and configure CI.
Monorepo vs polyrepo
A monorepo stores multiple projects, services, or packages in one repository. A polyrepo gives each project its own repository.
| Aspect | Monorepo | Polyrepo |
|---|---|---|
| Code sharing | Import directly, always up to date | Publish packages, manage versions |
| Atomic changes | One commit can update API + client | Coordinated PRs across repos |
| CI complexity | Must detect what changed | Each repo has focused CI |
| Clone size | Grows with every project | Small per repo |
| Access control | Coarse (repo-level) or needs CODEOWNERS | Fine-grained per repo |
| Dependency management | Single lockfile, shared versions | Independent version per repo |
Neither is universally better. The choice depends on your team’s workflow.
When monorepos shine
- Tightly coupled services that change together frequently.
- Shared libraries consumed by multiple services.
- Small to medium teams that want simplified dependency management.
- Organizations that value atomic cross-project changes.
When polyrepos make sense
- Independent teams with separate release cycles.
- Open-source projects where each package has its own contributors.
- Projects with fundamentally different toolchains.
- Strict access control requirements.
Monorepo structure
A typical monorepo layout:
my-monorepo/
  packages/
    shared-utils/
      package.json
      src/
    web-app/
      package.json
      src/
    api-server/
      package.json
      src/
  package.json        # root workspace config
  nx.json             # or turbo.json
  .github/
    workflows/
      ci.yml
Workspaces (npm, Yarn, pnpm) handle the package relationships. A build orchestrator handles task execution.
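For example, a root package.json for npm or Yarn workspaces might look like the following sketch (the package names match the layout above; "private": true prevents accidentally publishing the root):

```json
{
  "name": "my-monorepo",
  "private": true,
  "workspaces": [
    "packages/*"
  ]
}
```

With this in place, a single npm install at the root links packages/shared-utils into the other packages, so web-app can import it directly without publishing.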
Scaling Git for large repos
As a monorepo grows, standard Git operations slow down. Several features address this.
Sparse checkout
Sparse checkout lets you check out only the directories you need. The rest of the repository exists in the object store but does not appear in your working directory.
# Enable sparse checkout
git sparse-checkout init --cone
# Check out only specific directories
git sparse-checkout set packages/web-app packages/shared-utils
# List current sparse checkout paths
git sparse-checkout list
# Disable sparse checkout
git sparse-checkout disable
The --cone mode restricts patterns to directory-based matching, which is significantly faster than arbitrary patterns.
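The effect is easy to see in a throwaway repository (the directory and file names below are made up for the demo):

```shell
# Build a small repo with two package directories
git init --quiet sparse-demo && cd sparse-demo
mkdir -p packages/web-app packages/api-server
echo "web" > packages/web-app/index.js
echo "api" > packages/api-server/index.js
git add .
git -c user.name=demo -c user.email=demo@example.com commit --quiet -m "init"

# Restrict the working tree to a single directory
git sparse-checkout init --cone
git sparse-checkout set packages/web-app

# packages/api-server is gone from the working tree,
# but its objects are still in .git
ls packages
```

The checked-out tree now contains only packages/web-app; the other package reappears instantly if you widen the sparse-checkout set, because no network fetch is involved.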
Shallow clones
A shallow clone fetches only recent history, dramatically reducing clone time and disk usage.
# Clone with only the last commit
git clone --depth 1 https://github.com/org/monorepo.git
# Clone with the last 10 commits
git clone --depth 10 https://github.com/org/monorepo.git
# Deepen later if needed
git fetch --deepen=50
# Convert to a full clone
git fetch --unshallow
Shallow clones are ideal for CI environments where you only need the current code, not the full history.
Partial clones
Partial clones go further. They skip downloading blob objects until they are needed (on-demand fetching).
# Blobless clone: skip blobs, fetch on demand
git clone --filter=blob:none https://github.com/org/monorepo.git
# Treeless clone: skip blobs and trees
git clone --filter=tree:0 https://github.com/org/monorepo.git
Blobless clones are the sweet spot for most developers. You get full history (for log and blame) without downloading every file version upfront.
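You can observe the mechanics locally (the repository names are made up; the uploadpack.allowFilter setting is needed because serving filtered clones is opt-in on the server side):

```shell
# Create a source repo with one committed file
git init --quiet src-repo
echo "hello" > src-repo/readme.txt
git -C src-repo add .
git -C src-repo -c user.name=demo -c user.email=demo@example.com commit --quiet -m "init"

# Allow this repo to serve filtered (partial) clones
git -C src-repo config uploadpack.allowFilter true

# Blobless clone: history comes down immediately,
# file contents are fetched on demand during checkout
git clone --quiet --filter=blob:none "file://$PWD/src-repo" partial-clone
cat partial-clone/readme.txt
```

In a real monorepo the difference is dramatic: the clone transfers commits and trees up front, then pulls only the blobs your checkout actually touches.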
Git LFS
Git Large File Storage replaces large files with text pointers inside Git, storing the actual content on a separate server.
Why LFS exists
Git stores every version of every file. Binary files (images, videos, compiled assets, datasets) do not diff well and bloat the repository. A 50 MB model file with 20 versions means 1 GB of history.
Setup
# Install Git LFS
git lfs install
# Track file patterns
git lfs track "*.psd"
git lfs track "*.zip"
git lfs track "datasets/**"
# This updates .gitattributes
cat .gitattributes
# *.psd filter=lfs diff=lfs merge=lfs -text
# *.zip filter=lfs diff=lfs merge=lfs -text
How it works
graph LR
  DEV["Developer"] -->|git add large.psd| IDX["Staging area"]
  IDX -->|git commit| REPO["Local repo<br/>(stores pointer)"]
  REPO -->|git push| REMOTE["Git remote<br/>(stores pointer)"]
  REPO -->|git lfs push| LFS["LFS server<br/>(stores actual file)"]
  LFS -->|git lfs pull| DEV
Git LFS flow. The Git repository stores small pointer files. The actual binary content lives on the LFS server.
When you push, Git LFS uploads the binary to the LFS server and commits only a pointer file. When you clone or pull, LFS downloads the binaries you need.
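A pointer file is just a few lines of text (the hash and size below are placeholders, not real values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:0000000000000000000000000000000000000000000000000000000000000000
size 52428800
```

This is what Git actually stores and diffs, which is why LFS-tracked files add almost nothing to repository history.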
LFS commands
# See tracked patterns
git lfs track
# See LFS files in the repo
git lfs ls-files
# Pull LFS files (if not auto-fetched)
git lfs pull
# Migrate existing files to LFS
git lfs migrate import --include="*.psd" --everything
LFS considerations
- Hosting support: GitHub, GitLab, and Bitbucket all support LFS. Self-hosted Git needs a separate LFS server.
- Bandwidth: LFS downloads count against hosting quotas.
- CI: CI runners need Git LFS installed. Set GIT_LFS_SKIP_SMUDGE=1 to skip LFS downloads when only source code matters.
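On GitHub Actions, for example, the checkout action can fetch LFS content as part of the clone:

```yaml
steps:
  - uses: actions/checkout@v4
    with:
      lfs: true   # download LFS files during checkout
```

Leave lfs off (the default) for jobs that only lint or compile source, and you avoid spending LFS bandwidth quota on them.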
Build orchestration
A monorepo without build intelligence rebuilds everything on every change. That does not scale.
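The core idea can be hand-rolled with plain Git: diff against a base commit and map changed files to their package directories. This is a crude stand-in for what the tools below do, with repo and package names invented for the demo:

```shell
# Set up a repo where one of two packages changes in the latest commit
git init --quiet affected-demo && cd affected-demo
mkdir -p packages/web-app packages/api-server
echo "a" > packages/web-app/a.js
echo "b" > packages/api-server/b.js
git add .
git -c user.name=ci -c user.email=ci@example.com commit --quiet -m "init"
echo "a2" > packages/web-app/a.js
git add .
git -c user.name=ci -c user.email=ci@example.com commit --quiet -m "change web-app"

# List the package directories touched since the previous commit
git diff --name-only HEAD~1 HEAD | cut -d/ -f1-2 | sort -u
# → packages/web-app
```

Real orchestrators go much further: they also follow the dependency graph, so a change to shared-utils marks every package that imports it as affected.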
Nx
Nx analyzes the dependency graph between projects and only builds/tests what is affected by a change.
# Run tests only for affected projects
npx nx affected --target=test
# Build only what changed
npx nx affected --target=build
# Visualize the dependency graph
npx nx graph
Nx caches results locally and remotely. If a project has not changed, its cached test results are reused.
Turborepo
Turborepo takes a similar approach with a focus on simplicity. Tasks are declared in turbo.json (shown here with the 1.x pipeline key; Turborepo 2.x renames it to tasks).
{
  "pipeline": {
    "build": {
      "dependsOn": ["^build"],
      "outputs": ["dist/**"]
    },
    "test": {
      "dependsOn": ["build"]
    }
  }
}
Tasks declare their dependencies. Turborepo figures out the execution order and parallelism.
Bazel
Bazel is Google’s build system. It handles multi-language monorepos at massive scale. The learning curve is steep, but it provides hermetic builds, remote caching, and remote execution.
The choice between these tools depends on your ecosystem. Nx and Turborepo excel in JavaScript/TypeScript. Bazel is language-agnostic but complex.
CI strategies for monorepos
Path-based filtering
Run CI jobs only when relevant files change.
# GitHub Actions example
on:
  push:
    paths:
      - 'packages/web-app/**'
      - 'packages/shared-utils/**'

jobs:
  test-web-app:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test --workspace=packages/web-app
Affected-based CI
Use the build orchestrator’s affected detection in CI.
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - run: npx nx affected --target=test --base=origin/main
fetch-depth: 0 is important. Nx needs full history to determine what changed since the base branch.
Caching in CI
Cache dependencies and build outputs between runs.
- uses: actions/cache@v4
  with:
    path: |
      node_modules/
      .nx/cache/
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
Remote caching (Nx Cloud, Turborepo remote cache) shares cache across CI runs and developer machines.
CODEOWNERS
In a monorepo, different teams own different directories. CODEOWNERS enforces review requirements.
# .github/CODEOWNERS
packages/web-app/ @frontend-team
packages/api-server/ @backend-team
packages/shared-utils/ @platform-team
infrastructure/ @devops-team
Pull requests that touch a directory automatically request reviews from the owning team.
What comes next
With monorepo management covered, the final article in this series explores GitOps. GitOps takes the concept of Git as a source of truth and applies it to infrastructure and deployments, using the same branching and merging workflows you have learned throughout this series.