capstone · advanced · 1 week

Concurrent Web Crawler

concepts
thread pool · std::future · lock-free queue · structured concurrency · std::atomic
builds on
Threads, `std::async`, and Futures · Atomics + the Memory Model · RAII + Smart Pointers

The brief

Build a polite, bounded-concurrency web crawler. Given a seed URL and a depth limit, fetch HTML, extract links, queue them up, and repeat — across a pool of N worker threads — until either the queue drains or a time budget elapses.

The deliverable is a CLI: crawler <seed-url> --depth 3 --workers 8 --max-pages 1000. Output is a CSV of (url, status_code, bytes, fetch_ms, outlinks) rows.

This capstone exercises the entire concurrency phase end-to-end. By the end you’ll have written a thread-safe queue, a worker pool, a graceful-shutdown protocol, and a politeness layer (per-host rate limiting). It’s also the kind of project that’s bigger than it looks — budget conservatively, build the core first.

Scope (1 weekend)

The deliverable has six parts:

  1. HTTP client. GET a URL, return body + status + bytes + elapsed. libcurl is the obvious choice; cpp-httplib for a simpler header-only option.
  2. Link extractor. A small HTML scanner that pulls href URLs from <a> tags. Lenient — production crawlers use real HTML parsers, but a regex over href="..." covers ~95% of links and is fine for this scope.
  3. URL normalizer. Resolve relative URLs against the page’s base. Lowercase the host. Strip fragments. Use std::filesystem::path or a small URL class.
  4. Work queue. Thread-safe FIFO of pending URLs, with a “no more work coming” signal so workers know to exit. A std::queue + std::mutex + std::condition_variable is sufficient; a lock-free queue is a stretch goal.
  5. Worker pool. N std::jthreads, each looping: pop URL, fetch, parse links, enqueue new URLs, record result. Respect a stop_token for graceful shutdown.
  6. Output. As fetches complete, append a CSV row. A single writer thread receiving results via a std::mpsc-shaped channel avoids contended file I/O.

The shape

The work queue, with the shutdown signal as a first-class member:

#include <queue>
#include <mutex>
#include <condition_variable>
#include <optional>
#include <string>

class WorkQueue {
  std::queue<std::string> q_;
  std::mutex              m_;
  std::condition_variable cv_;
  bool                    done_ = false;

 public:
  void push(std::string url) {
    {
      std::lock_guard lock(m_);
      q_.push(std::move(url));
    }
    cv_.notify_one();
  }

  // Returns nullopt when the queue is empty AND done_ is set.
  std::optional<std::string> pop_or_done() {
    std::unique_lock lock(m_);
    cv_.wait(lock, [this] { return !q_.empty() || done_; });
    if (q_.empty()) return std::nullopt;     // done == true
    auto url = std::move(q_.front());
    q_.pop();
    return url;
  }

  void close() {
    {
      std::lock_guard lock(m_);
      done_ = true;
    }
    cv_.notify_all();
  }
};

Note the careful pattern: lock guards scope to the minimum, predicate wait handles spurious wakeups, close() notifies all so every worker sees the shutdown.

The crawl loop

A worker’s main loop:

void worker(std::stop_token stop, WorkQueue& work, ResultChannel& results,
            HttpClient& http) {
  while (!stop.stop_requested()) {
    auto url_opt = work.pop_or_done();
    if (!url_opt) break;                 // queue drained
    auto& url = *url_opt;

    auto response = http.get(url);       // returns std::expected<Response, Error>
    if (!response) {
      results.send({url, 0, 0, 0, {}});
      continue;
    }

    auto links = extract_links(response->body, url);
    for (const auto& link : links) {
      if (should_follow(link)) work.push(link);
    }

    results.send({url, response->status, response->bytes,
                  response->elapsed_ms, std::move(links)});
  }
}

The interaction between stop_token and the pop_or_done() is the tricky part. If stop_token fires while a worker is waiting on the condition variable, you also need to call work.close() from the stopper so the wait wakes up. The std::stop_callback is the right hook for that.

When the queue is “done”

A subtle question. The crawl is finished when:

A simple atomic counter does it:

std::atomic<int> in_flight{0};

// Before pushing or starting work:
in_flight.fetch_add(1, std::memory_order_relaxed);

// After finishing the work that came from a pop:
if (in_flight.fetch_sub(1, std::memory_order_acq_rel) == 1) {
  // Last in-flight just finished. If the queue is also empty, we're done.
  work.maybe_close_if_empty();
}

The exact protocol is finicky — it’s a worthwhile design exercise to get right. The acq_rel ordering is needed because the “is queue empty” check has to see all pushes that happened before the fetch_sub.

Politeness

A real crawler must not hammer servers. At minimum:

Skipping politeness for a learning project on a single laptop is fine; running it against the public web without these is not.

Implementation rubric

Your code should hit these targets:

  1. No data races. All cross-thread state is either std::atomic or protected by a mutex. Run with ThreadSanitizer (-fsanitize=thread) and verify zero warnings.
  2. Graceful shutdown. Pressing Ctrl-C drains in-flight work, flushes the results channel, and exits cleanly. No std::terminate on thread destructors.
  3. Bounded memory. The work queue has a max size; pushers block if it’s full. Otherwise a wide BFS over the public web exhausts memory.
  4. Tests. Mock the HTTP client (return canned responses). Test the crawler logic end-to-end with a fixture site of 5-10 pages.
  5. Output is deterministic given the fixture. Same seed + max-pages → same set of crawled URLs (the order may differ because of threading; just check the set).

Stretch goals

In order of difficulty:

  1. Async I/O instead of threads. Use boost::asio or cpp-httplib’s async mode. One thread, N concurrent in-flight requests. Compare throughput against the thread-pool version on a small fixture.
  2. Lock-free queue. Replace std::queue + mutex with moodycamel::ConcurrentQueue (or implement one). Compare contention under load.
  3. Distributed. Split the URL space by hash and have N processes each crawl their share. Coordinate via Redis or a flat file.
  4. A real HTML parser. Use gumbo-parser or lexbor. Get genuinely accurate link extraction. Discover edge cases.

What you’ll have learned

By the end:

Reference reading: Anthony Williams’s C++ Concurrency in Action (the single best book for this material); the crawler example in the Boost.Asio docs; Herb Sutter’s “Atomic Weapons” talks for the memory model in motion.