The Sentiment Engine: Deploying Local LLMs (LLaMA 3) for Real-Time Macroeconomic and News Sentiment Analysis in Algorithmic Trading

The Latency Tax of Cloud-Hosted LLM APIs

The analytical surface constructed across this series has been deliberately quantitative. Order flow toxicity, volume profile acceptance, microstructure imbalance, columnar tick storage. Each pillar has dealt exclusively with numerical signals derived from market data itself. There is, however, an entirely separate information channel that moves the same markets and that has, until recently, been computationally inaccessible to anyone without a Bloomberg terminal subscription. That channel is unstructured text: central bank statements, regulatory filings, geopolitical wire reports, exchange announcements, and the constant ambient flow of financial news. The market reacts to these inputs in seconds. Most retail systems never see them at all.

The instinct, when LLM-based sentiment analysis became a tractable engineering problem, was to reach for a cloud API. OpenAI, Anthropic, or any of the hosted inference providers will accept a news headline and return a structured sentiment classification within a second or two. The economics seem reasonable on paper, and the code footprint is minimal. For a research notebook this is acceptable. For a production trading system, it is structurally unsound.

The problem is threefold. First, latency is non-deterministic. A round-trip to a cloud inference endpoint typically lands somewhere between 800 milliseconds and 4 seconds depending on the model, the region, and the current load on the provider’s infrastructure. In a market where Federal Reserve statement reactions complete in under 200 milliseconds, that latency band is the difference between trading the news and trading the consequence of someone else having traded the news. Second, cost compounds. At a few dollars per million input tokens, ingesting every Reuters and Bloomberg headline plus every central bank publication runs into hundreds of dollars per month for a single instrument coverage universe. Scale that to the multi-asset, multi-region coverage required for serious macro work and the bill becomes structurally untenable. Third, and most decisively, your alpha is now passing through a third party’s logging infrastructure. Every prompt, every response, every signal is observable to the provider. For any strategy with edge, this is not a privacy concern. It is an information leak.

The structural answer is to remove the cloud entirely. Run the language model on hardware you control, in the same physical region as your execution stack, with deterministic latency bounds and zero per-query cost. The release of capable open-weight models, particularly the LLaMA 3 family from Meta, made this not just possible but practical on commodity GPU hardware. What follows is the architecture of a local sentiment engine that ingests news in real time, scores it with a quantized LLaMA 3 deployment, and emits structured sentiment vectors directly into the same ClickHouse warehouse that already holds tick data and microstructure features.

Model Selection: The Quantization Tradeoff

The first architectural decision is which model to run and at what precision. The LLaMA 3 8B Instruct variant is the structural sweet spot for sentiment work. It is large enough to handle nuanced financial language, including the deliberate ambiguity of central bank communication, while remaining small enough to fit comfortably on a single consumer or workstation GPU after quantization. The 70B variant is tempting on capability grounds but introduces hardware requirements that do not justify the marginal accuracy gain for a structured classification task.

Quantization is the lever that makes this hardware-tractable. The native FP16 weights of an 8B model occupy roughly 16 gigabytes. Quantized to 4-bit using GPTQ, AWQ, or the GGUF Q4_K_M format, the same model collapses to approximately 4.8 gigabytes, with an accuracy degradation typically reported as small (on the order of a percent) on standard benchmarks, although the exact figure depends on the quantization method and the task. The practical implication is that the entire model fits inside the VRAM of a 12GB RTX 3060 or any RTX 4070-class card with substantial headroom for the KV cache. For a sentiment classification task with bounded output length, the generation throughput on this hardware lands in the range of 60 to 120 tokens per second, which translates to a complete sentiment analysis of a typical 200-token news article in under three seconds end-to-end when the output length is left unconstrained. In practice, with prompt engineering that constrains output to a structured JSON envelope of ten to thirty tokens, the per-article latency drops below 800 milliseconds.
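The arithmetic behind these figures can be checked with a short back-of-envelope sketch. The round 8 billion parameter count, the effective 4.8 bits per weight for the 4-bit format, and the 30-token output envelope are illustrative assumptions chosen to match the numbers quoted above, not measured values.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def decode_latency_ms(output_tokens: int, tokens_per_sec: float) -> float:
    """Time spent generating output tokens, ignoring prompt prefill and queueing."""
    return output_tokens / tokens_per_sec * 1000


N_PARAMS = 8e9  # assumption: round 8B parameter count

fp16_gb = weight_memory_gb(N_PARAMS, 16)    # full-precision weights
q4_gb = weight_memory_gb(N_PARAMS, 4.8)     # assumption: ~4.8 effective bits per weight

# Decode time for a 30-token JSON envelope at both ends of the quoted throughput range.
slow_ms = decode_latency_ms(30, 60)
fast_ms = decode_latency_ms(30, 120)

print(f"FP16 weights: {fp16_gb:.1f} GB, 4-bit weights: {q4_gb:.1f} GB")
print(f"30-token decode: {fast_ms:.0f} to {slow_ms:.0f} ms")
```

The decode estimate excludes prompt processing and network overhead on the loopback interface, so it is a lower bound on the per-article latency rather than a prediction of it.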

vLLM Versus llama.cpp: The Serving Layer Decision

Two serving runtimes dominate the local LLM ecosystem, and they optimize for different operational profiles. vLLM is the throughput-optimized choice. Its PagedAttention implementation manages the KV cache as a paged virtual memory system, allowing dozens of concurrent inference requests to share GPU memory efficiently. For a sentiment engine that processes hundreds of news articles per minute during market open or major event windows, vLLM is the structurally correct answer. llama.cpp, in contrast, optimizes for single-stream latency and CPU-GPU hybrid inference. It is the right choice when GPU resources are constrained or when the workload is a single request stream by design.

For the architecture documented here, vLLM is the default. It exposes an OpenAI-compatible HTTP API, which means the application code can be developed against the OpenAI Python client and seamlessly redirected to the local endpoint. This compatibility is not cosmetic. It means the same client code can fall back to a cloud provider during local hardware maintenance windows without rewriting any application logic.
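The fallback idea can be sketched as a small endpoint selector. This is a minimal illustration, not a prescribed design: the `LLMEndpoint` type, the `local_is_healthy` and `select_endpoint` helpers, and the hosted URL and model name are all hypothetical, and the health probe assumes the local server answers on the OpenAI-compatible model listing route without authentication.

```python
import os
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMEndpoint:
    base_url: str
    api_key: str
    model: str


LOCAL = LLMEndpoint(
    base_url="http://127.0.0.1:8000/v1",
    api_key="local-not-required",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
)
# Assumption: placeholder values for whichever hosted provider is used as fallback.
CLOUD = LLMEndpoint(
    base_url="https://api.openai.com/v1",
    api_key=os.environ.get("OPENAI_API_KEY", ""),
    model="gpt-4o-mini",
)


def local_is_healthy(endpoint: LLMEndpoint = LOCAL, timeout_sec: float = 1.0) -> bool:
    """Probe the model listing route; any error or non-200 counts as unhealthy."""
    try:
        url = f"{endpoint.base_url}/models"
        with urllib.request.urlopen(url, timeout=timeout_sec) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False


def select_endpoint(local_healthy: bool) -> LLMEndpoint:
    """Prefer the local deployment; fall back to the hosted one only when it is down."""
    return LOCAL if local_healthy else CLOUD


# The chosen endpoint would then be handed to the OpenAI client, for example:
#   ep = select_endpoint(local_is_healthy())
#   client = AsyncOpenAI(base_url=ep.base_url, api_key=ep.api_key)
```

Keeping the selection in one function means the fallback decision, and the information exposure that comes with it, stays an explicit operational choice rather than an implicit retry.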

The Ingestion Layer: News in Real Time

A sentiment engine without a fast ingestion pipeline is theoretical. The data sources fall into three structural tiers. The fastest tier is the direct exchange announcement feed, including Binance announcements, Bybit notices, and CME advisory bulletins, accessed via authenticated WebSocket or polled REST endpoints with sub-second freshness. The second tier is financial newswire feeds, accessed via either licensed APIs or RSS-derived aggregators, which carry mainstream financial reporting with a typical latency of five to thirty seconds from event to wire. The third tier is social and alternative data, primarily Twitter and curated Telegram channels, which often surface signals minutes before they hit mainstream wires but require aggressive noise filtering.

The ingestion service is best implemented as a set of independent asynchronous workers, each responsible for a single source, all writing into a common message queue that the inference layer consumes. Redis Streams or a lightweight RabbitMQ instance both work. The decoupling matters because source-specific failures must not block the rest of the pipeline. If the Twitter feed disconnects, the Reuters wire must keep flowing.

import asyncio
import logging
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import AsyncIterator, List, Optional
import json

import aiohttp
import redis.asyncio as aioredis

logger = logging.getLogger("nql.ingest.news")
logger.setLevel(logging.INFO)

@dataclass(slots=True)
class NewsItem:
    source: str
    article_id: str
    headline: str
    body: str
    published_at_ms: int
    ingested_at_ms: int
    symbols_mentioned: List[str]
    url: Optional[str] = None

async def fetch_rss_feed(
    session: aiohttp.ClientSession,
    feed_url: str,
    source_name: str,
) -> AsyncIterator[NewsItem]:
    try:
        async with session.get(feed_url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            if resp.status != 200:
                logger.warning("rss fetch failed source=%s status=%d", source_name, resp.status)
                return
            raw = await resp.text()
    except Exception as exc:
        logger.warning("rss fetch error source=%s err=%s", source_name, exc)
        return

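    # feedparser is a third-party dependency, imported here so the module loads without it.
    import feedparser
    # parse() is synchronous; for very large feeds it briefly occupies the event loop.
    parsed = feedparser.parse(raw)
    # One ingestion timestamp is shared by every entry in this fetch.
    now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)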

    for entry in parsed.entries:
        published_struct = entry.get("published_parsed") or entry.get("updated_parsed")
        if published_struct is None:
            continue
        published_ms = int(datetime(*published_struct[:6], tzinfo=timezone.utc).timestamp() * 1000)
        yield NewsItem(
            source=source_name,
            article_id=entry.get("id", entry.get("link", "")),
            headline=entry.get("title", "").strip(),
            body=entry.get("summary", "").strip(),
            published_at_ms=published_ms,
            ingested_at_ms=now_ms,
            symbols_mentioned=[],
            url=entry.get("link"),
        )

async def ingestion_worker(
    feeds: List[tuple],
    redis_client: aioredis.Redis,
    poll_interval_sec: int = 30,
) -> None:
    # Insertion-ordered dict used as a set, so compaction can keep the newest ids.
    seen_ids: dict = {}
    async with aiohttp.ClientSession() as session:
        while True:
            for source_name, feed_url in feeds:
                async for item in fetch_rss_feed(session, feed_url, source_name):
                    if item.article_id in seen_ids:
                        continue
                    seen_ids[item.article_id] = None
                    if len(seen_ids) > 50_000:
                        # Keep the most recent half; a plain set has no order to slice by.
                        seen_ids = dict.fromkeys(list(seen_ids)[-25_000:])
                    payload = json.dumps(asdict(item))
                    await redis_client.xadd("news_stream", {"data": payload}, maxlen=100_000)
                    logger.info("ingested source=%s headline=%s",
                                item.source, item.headline[:80])
            await asyncio.sleep(poll_interval_sec)

Several design choices warrant explicit note. The deduplication set is bounded with a periodic compaction step because the naive growing set would consume unbounded memory across long-running services. The Redis stream is configured with a maxlen cap, which bounds the backlog by entry count rather than by time: once the cap is reached the oldest entries are trimmed, so the inference layer must consume fast enough that unread items are not dropped during a news spike. The error handling does not crash the worker on individual feed failures, which is structurally critical because the failure mode of news aggregation is partial, not total.
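One way to make the compaction step keep the most recently seen identifiers is an insertion-ordered container. The sketch below is an illustrative alternative, not the worker's own code: the `BoundedSeenSet` class and its parameter names are hypothetical, with the 50,000 and 25,000 thresholds taken from the listing above.

```python
from collections import OrderedDict


class BoundedSeenSet:
    """Remembers recently seen keys and evicts the oldest once a cap is exceeded."""

    def __init__(self, max_size: int = 50_000, keep: int = 25_000) -> None:
        if not 0 < keep <= max_size:
            raise ValueError("keep must be in (0, max_size]")
        self._max_size = max_size
        self._keep = keep
        self._items = OrderedDict()  # key -> None, ordered by first insertion

    def add(self, key: str) -> bool:
        """Return True if the key is new, False if it was already seen."""
        if key in self._items:
            return False
        self._items[key] = None
        if len(self._items) > self._max_size:
            # Compact in one pass: drop oldest entries until only `keep` remain.
            while len(self._items) > self._keep:
                self._items.popitem(last=False)
        return True

    def __contains__(self, key: str) -> bool:
        return key in self._items

    def __len__(self) -> int:
        return len(self._items)


seen = BoundedSeenSet(max_size=4, keep=2)
for article_id in ["a", "b", "c", "d", "e"]:
    seen.add(article_id)
print(len(seen), "a" in seen, "e" in seen)
```

Because eviction follows insertion order, an article that reappears in a feed long after it was first seen can be ingested again; whether that matters depends on how far back each feed reaches.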

Prompt Engineering for Structured Sentiment Output

The single largest determinant of pipeline reliability is the prompt. Free-form sentiment output is not usable downstream. The model must be constrained to emit a strict JSON envelope with bounded fields, predictable types, and explicit handling of ambiguity. Anything else collapses under production load when a single malformed response stalls the consumer queue.

The prompt below is the result of substantial iteration. It enforces three orthogonal output dimensions: a sentiment score on a continuous scale rather than a discrete classification, a confidence score that allows the downstream layer to filter low-information outputs, and a market-impact estimate that captures the difference between informational news and actionable news. A central bank rate decision and a routine earnings preview may carry similar headline sentiment but radically different market impact, and the strategy layer needs both signals separately.

SENTIMENT_SYSTEM_PROMPT = """You are a financial sentiment analysis engine. You read news articles and emit a single JSON object with exactly these fields:

- sentiment: float in [-1.0, 1.0], where -1.0 is maximally bearish and 1.0 is maximally bullish for the asset class mentioned.
- confidence: float in [0.0, 1.0], representing your certainty in the sentiment classification.
- market_impact: float in [0.0, 1.0], where 0.0 is informational only and 1.0 is a regime-changing event likely to move price by more than two standard deviations.
- asset_classes: array of strings from this fixed set: ["crypto", "equities", "fx", "rates", "commodities", "general"].
- key_entities: array of strings naming the specific instruments, central banks, regulators, or companies referenced.
- reasoning: one sentence, under 30 words, justifying the scores.

Output only the JSON object. No prose, no preamble, no markdown fences."""

def build_user_prompt(headline: str, body: str) -> str:
    truncated_body = body[:1500] if body else ""
    return f"HEADLINE: {headline}\n\nBODY: {truncated_body}\n\nAnalyze and return the JSON object."

The body truncation at 1500 characters is a deliberate engineering choice. Longer prompts increase prefill time and KV cache pressure, and therefore inference latency, without meaningfully improving sentiment classification accuracy on financial news, where the relevant signal is overwhelmingly concentrated in the headline and lead paragraph. The cut is made on characters rather than tokens, so the final sentence may be clipped mid-word, which is harmless for this task. This is the kind of structural optimization that compounds across millions of inferences.

The Inference Worker

The inference worker consumes from the Redis stream populated by the ingestion layer, sends each news item to the local vLLM endpoint, validates the structured response, and writes the resulting sentiment vector to ClickHouse. The implementation uses the OpenAI Python client pointed at the local endpoint, which keeps the code identical to a cloud-deployed equivalent and preserves the option to fall back to a hosted provider during local maintenance.

import asyncio
import json
import logging
from dataclasses import dataclass
from typing import Optional

import redis.asyncio as aioredis
from openai import AsyncOpenAI
import clickhouse_connect

logger = logging.getLogger("nql.inference.sentiment")

LOCAL_LLM_BASE_URL = "http://127.0.0.1:8000/v1"
LOCAL_LLM_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

llm_client = AsyncOpenAI(base_url=LOCAL_LLM_BASE_URL, api_key="local-not-required")

@dataclass(slots=True)
class SentimentResult:
    article_id: str
    source: str
    sentiment: float
    confidence: float
    market_impact: float
    asset_classes: list
    key_entities: list
    reasoning: str
    inference_latency_ms: int
    scored_at_ms: int

async def score_news_item(item: dict) -> Optional[SentimentResult]:
    import time
    start = time.perf_counter()
    try:
        response = await llm_client.chat.completions.create(
            model=LOCAL_LLM_MODEL,
            messages=[
                {"role": "system", "content": SENTIMENT_SYSTEM_PROMPT},
                {"role": "user", "content": build_user_prompt(
                    item["headline"], item.get("body", ""),
                )},
            ],
            temperature=0.1,
            max_tokens=300,
            response_format={"type": "json_object"},
            timeout=15.0,
        )
        raw = response.choices[0].message.content
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.warning("malformed json article=%s err=%s", item.get("article_id"), exc)
        return None
    except Exception as exc:
        logger.warning("inference failed article=%s err=%s", item.get("article_id"), exc)
        return None

    latency_ms = int((time.perf_counter() - start) * 1000)
    try:
        return SentimentResult(
            article_id=item["article_id"],
            source=item["source"],
            sentiment=float(parsed["sentiment"]),
            confidence=float(parsed["confidence"]),
            market_impact=float(parsed["market_impact"]),
            asset_classes=list(parsed.get("asset_classes", [])),
            key_entities=list(parsed.get("key_entities", [])),
            reasoning=str(parsed.get("reasoning", ""))[:500],
            inference_latency_ms=latency_ms,
            scored_at_ms=int(time.time() * 1000),
        )
    except (KeyError, ValueError, TypeError) as exc:
        logger.warning("schema validation failed article=%s err=%s",
                       item.get("article_id"), exc)
        return None
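The listing above stops at building a SentimentResult; the ClickHouse write described at the start of this section is not shown. A minimal sketch of that step follows. The table name `sentiment_scores` and the helper names `to_rows` and `write_batch` are assumptions, and the `client` argument is expected to be a clickhouse_connect client (as returned by `clickhouse_connect.get_client`), whose `insert` method accepts a table name, a list of rows, and `column_names`.

```python
from dataclasses import asdict, is_dataclass
from typing import Any, Iterable, List, Sequence

SENTIMENT_TABLE = "sentiment_scores"  # hypothetical table name

# Column order mirrors the SentimentResult fields.
SENTIMENT_COLUMNS = (
    "article_id", "source", "sentiment", "confidence", "market_impact",
    "asset_classes", "key_entities", "reasoning",
    "inference_latency_ms", "scored_at_ms",
)


def to_rows(results: Iterable[Any]) -> List[list]:
    """Flatten dataclass instances (or plain dicts) into row lists in column order."""
    rows = []
    for result in results:
        record = asdict(result) if is_dataclass(result) else dict(result)
        rows.append([record[name] for name in SENTIMENT_COLUMNS])
    return rows


def write_batch(client: Any, results: Sequence[Any]) -> int:
    """Insert a batch of scored items and return the number of rows written."""
    rows = to_rows(results)
    if rows:
        client.insert(SENTIMENT_TABLE, rows, column_names=list(SENTIMENT_COLUMNS))
    return len(rows)
```

The insert call is synchronous, so inside the asyncio worker it would typically be dispatched with `asyncio.to_thread(write_batch, client, batch)` and fed batches rather than single rows, since ClickHouse favors fewer, larger inserts.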

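The temperature parameter is set deliberately low at 0.1. Sentiment classification is a discrimination task, not a creative task, and high temperature introduces non-determinism that destroys reproducibility across backtests. A temperature of 0.1 still samples, so it narrows run-to-run variation rather than removing it; where strict reproducibility matters, the usual approach is greedy decoding (temperature 0) combined with archiving the scored outputs instead of re-running inference. The response_format parameter, supported by recent vLLM builds, constrains the model’s output to valid JSON via grammar-constrained decoding, which removes most malformed responses. Output cut off by max_tokens can still fail to parse, which is why the JSONDecodeError handler remains. The schema validation block catches the residual cases where the model produces valid JSON but with missing fields or wrong types. It does not check that the numeric fields fall inside the ranges the prompt specifies.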
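A range check can be layered on top of the type coercion, since a syntactically valid response may still carry a sentiment of 1.7 or an asset class outside the fixed set. The sketch below is one possible validator under the prompt's stated ranges; the function names `_bounded` and `validate_payload` are hypothetical, and rejecting out-of-range values (rather than clamping them) is a design assumption.

```python
from typing import Any, Optional

ALLOWED_ASSET_CLASSES = {"crypto", "equities", "fx", "rates", "commodities", "general"}


def _bounded(value: Any, lo: float, hi: float) -> Optional[float]:
    """Coerce to float and return it only if it lies in [lo, hi]; NaN fails the comparison."""
    try:
        x = float(value)
    except (TypeError, ValueError):
        return None
    return x if lo <= x <= hi else None


def validate_payload(parsed: Any) -> Optional[dict]:
    """Return a cleaned payload, or None if any required score is missing or out of range."""
    if not isinstance(parsed, dict):
        return None
    sentiment = _bounded(parsed.get("sentiment"), -1.0, 1.0)
    confidence = _bounded(parsed.get("confidence"), 0.0, 1.0)
    impact = _bounded(parsed.get("market_impact"), 0.0, 1.0)
    if sentiment is None or confidence is None or impact is None:
        return None
    raw_classes = parsed.get("asset_classes", [])
    if not isinstance(raw_classes, list):
        return None
    # Unknown labels are dropped instead of failing the whole item.
    classes = [c for c in raw_classes if isinstance(c, str) and c in ALLOWED_ASSET_CLASSES]
    return {
        "sentiment": sentiment,
        "confidence": confidence,
        "market_impact": impact,
        "asset_classes": classes,
    }


ok = validate_payload({"sentiment": 0.4, "confidence": 0.9,
                       "market_impact": 0.2, "asset_classes": ["fx", "bonds"]})
bad = validate_payload({"sentiment": 1.7, "confidence": 0.9, "market_impact": 0.2})
print(ok, bad)
```

Rejecting is the conservative choice: a clamped 1.7 would enter the warehouse as a maximally bullish reading that the model never actually produced within its contract.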

From Sentiment Vectors to Trading Signals

Raw sentiment scores are not a trading signal. They are a feature. The transformation from feature to signal requires three additional analytical layers, each of which is a structural pillar in its own right.

The first layer is aggregation across time. A single bullish headline carries little weight. A sustained sequence of bullish headlines from independent sources, weighted by the confidence and market-impact scores, is a structurally meaningful signal. The mathematical construct is a confidence-weighted exponentially decaying moving average of sentiment, computed independently per asset class and per time horizon. Short horizons capture event-driven reactions. Long horizons capture macro regime shifts.
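The aggregation can be sketched as a streaming accumulator. This is one reasonable reading of a confidence-weighted, exponentially decaying average, not necessarily the exact construct used in the series: the `DecayedSentiment` class, the half-life parameterization, and the choice to weight each item by confidence times market impact are assumptions.

```python
from typing import Optional


class DecayedSentiment:
    """Weighted mean of sentiment in which each item's weight halves every half_life_sec."""

    def __init__(self, half_life_sec: float) -> None:
        if half_life_sec <= 0:
            raise ValueError("half_life_sec must be positive")
        self.half_life_sec = half_life_sec
        self._num = 0.0   # decayed sum of weight * sentiment
        self._den = 0.0   # decayed sum of weights (the remaining evidence mass)
        self._last_ts: Optional[float] = None

    def _decay_to(self, ts_sec: float) -> None:
        # Late-arriving items (ts earlier than the last one) are applied without extra decay.
        if self._last_ts is not None and ts_sec > self._last_ts:
            factor = 0.5 ** ((ts_sec - self._last_ts) / self.half_life_sec)
            self._num *= factor
            self._den *= factor
        if self._last_ts is None or ts_sec > self._last_ts:
            self._last_ts = ts_sec

    def update(self, ts_sec: float, sentiment: float,
               confidence: float, market_impact: float = 1.0) -> float:
        self._decay_to(ts_sec)
        weight = confidence * market_impact
        self._num += weight * sentiment
        self._den += weight
        return self.value()

    def value(self) -> float:
        return self._num / self._den if self._den > 0 else 0.0

    def evidence(self) -> float:
        return self._den


# One accumulator per (asset class, horizon); the half-lives here are illustrative.
trackers = {("rates", h): DecayedSentiment(h) for h in (300.0, 86_400.0)}
for tracker in trackers.values():
    tracker.update(0.0, sentiment=0.8, confidence=0.9, market_impact=0.7)
    tracker.update(600.0, sentiment=-0.4, confidence=0.6, market_impact=0.3)
print({key: round(t.value(), 3) for key, t in trackers.items()})
```

Because numerator and denominator decay together, the average itself does not fade during a quiet period; the `evidence()` value does, and can be used downstream to discount a stale reading.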

The second layer is cross-source corroboration. A headline that appears on one wire is noise. The same headline propagating across Reuters, Bloomberg, and the Financial Times within minutes is signal. The propagation speed itself carries information: a story that breaks simultaneously across all major wires is typically a coordinated official announcement, while a story that originates on a single wire and propagates over thirty minutes is more often an exclusive that may or may not be subsequently confirmed.
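Corroboration and propagation speed can be measured once sightings of the same story have been grouped. The sketch below assumes that grouping has already happened (matching headlines across wires is a separate problem not addressed here); the `corroboration` function, its return keys, and the thirty-minute default window are illustrative.

```python
from typing import Dict, List, Tuple

THIRTY_MINUTES_MS = 30 * 60 * 1000


def corroboration(sightings: List[Tuple[str, int]],
                  window_ms: int = THIRTY_MINUTES_MS) -> Dict[str, int]:
    """Summarize how widely and how quickly one story spread.

    sightings: (source, published_at_ms) pairs that all refer to the same story.
    Returns the number of distinct sources that carried it within window_ms of the
    first appearance, and the time from the first source to the last of those.
    """
    if not sightings:
        return {"sources": 0, "spread_ms": 0}

    # Earliest appearance per source; repeats from the same wire add no corroboration.
    first_seen: Dict[str, int] = {}
    for source, ts in sightings:
        if source not in first_seen or ts < first_seen[source]:
            first_seen[source] = ts

    t0 = min(first_seen.values())
    in_window = [ts for ts in first_seen.values() if ts - t0 <= window_ms]
    return {"sources": len(in_window), "spread_ms": max(in_window) - t0}


story = [("reuters", 0), ("bloomberg", 2_000), ("ft", 5_000), ("reuters", 9_000)]
print(corroboration(story))
```

A high source count with a spread of a few seconds matches the coordinated-announcement pattern described above, while a slow spread from a single origin matches the exclusive pattern.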

The third layer is orthogonality to price. The most valuable sentiment signal is the one that has not yet been reflected in price. The integration into the ML pipeline previously constructed in this series is straightforward: the sentiment vector becomes one more feature column alongside microstructure features, technical features, and order flow features. The LightGBM ensemble or LSTM sequential model learns the joint conditional distribution and assigns the sentiment feature whatever weight the cross-validated performance justifies. This is the structurally correct integration. Sentiment is not a primary signal. It is one input to a calibrated multi-feature model.
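Attaching sentiment as a feature column amounts to an as-of join: each price bar receives the latest sentiment value that was available at or before the bar's timestamp. The sketch below is a minimal stdlib version of that join; the `asof_feature` helper is hypothetical, and keying on the time the score became available (scored_at_ms rather than published_at_ms) is an assumption made to avoid lookahead in backtests.

```python
from bisect import bisect_right
from typing import List, Sequence


def asof_feature(bar_ts: Sequence[int],
                 feature_ts: Sequence[int],
                 feature_values: Sequence[float],
                 default: float = 0.0) -> List[float]:
    """For each bar timestamp, pick the most recent feature value at or before it.

    feature_ts must be sorted ascending and aligned with feature_values.
    Bars that precede the first feature observation receive `default`.
    """
    if len(feature_ts) != len(feature_values):
        raise ValueError("feature_ts and feature_values must have equal length")
    column = []
    for ts in bar_ts:
        idx = bisect_right(feature_ts, ts) - 1
        column.append(feature_values[idx] if idx >= 0 else default)
    return column


bars = [1_000, 2_000, 3_000, 4_000]
scored_at = [1_500, 3_000, 3_200]
sentiment = [0.2, -0.1, 0.6]
print(asof_feature(bars, scored_at, sentiment))
```

The resulting list has one value per bar and can sit next to the microstructure and order flow columns in the training frame, leaving the model to decide how much weight it deserves.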

Operational Hardening and Failure Modes

The non-obvious operational risk in a local LLM deployment is GPU thermal throttling and memory fragmentation under sustained load. A model that benchmarks at 100 tokens per second in fresh-boot conditions can degrade to 40 tokens per second after twenty hours of continuous operation if the cooling envelope is undersized or if the KV cache fragments. Production deployments require GPU temperature monitoring, scheduled vLLM restarts during low-volume windows, and explicit memory headroom that prevents the cache from saturating during news event spikes.
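GPU monitoring can start from the CSV query mode of nvidia-smi. The sketch below is a minimal poller under assumptions: it requires an NVIDIA driver with nvidia-smi on the PATH, and the 80 degree and 92 percent thresholds, along with the `GpuStatus`, `parse_gpu_status`, `read_gpu_status`, and `needs_attention` names, are illustrative rather than recommended values.

```python
import subprocess
from dataclasses import dataclass
from typing import List

QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


@dataclass(frozen=True)
class GpuStatus:
    temperature_c: int
    memory_used_mib: int
    memory_total_mib: int

    @property
    def memory_fraction(self) -> float:
        return self.memory_used_mib / self.memory_total_mib


def parse_gpu_status(csv_text: str) -> List[GpuStatus]:
    """Parse one 'temp, used, total' line per GPU, as emitted with noheader,nounits."""
    statuses = []
    for line in csv_text.strip().splitlines():
        temp, used, total = (int(part.strip()) for part in line.split(","))
        statuses.append(GpuStatus(temp, used, total))
    return statuses


def read_gpu_status() -> List[GpuStatus]:
    """Run nvidia-smi once; raises if the tool is missing or exits with an error."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True, timeout=5)
    return parse_gpu_status(result.stdout)


def needs_attention(status: GpuStatus,
                    max_temp_c: int = 80,
                    max_memory_fraction: float = 0.92) -> bool:
    """Flag a GPU that is running hot or has little memory headroom left."""
    return (status.temperature_c >= max_temp_c
            or status.memory_fraction >= max_memory_fraction)


sample = parse_gpu_status("71, 9032, 12288\n")
print(sample[0], needs_attention(sample[0]))
```

Polled every few seconds and written to the same warehouse, these readings can be lined up against inference_latency_ms to confirm whether a throughput drop coincides with temperature or memory pressure.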

The second operational concern is model drift relative to live market language. LLaMA 3 was trained on data with a fixed cutoff. New financial terminology, new central bank policy frameworks, new instrument classes appear continuously. The pragmatic mitigation is a periodic evaluation harness that scores model outputs against human-labeled ground truth on a rolling sample of recent news, with alerting when classification accuracy drifts beyond a defined threshold. Retraining or model version updates are then a scheduled operational event rather than a reactive emergency.
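The evaluation harness can be reduced to a rolling agreement rate between model output and human labels. The sketch below is one simple form of it: the `DriftMonitor` class, the three-way labeling with a neutral band of 0.2, and the default window and threshold values are all assumptions to be tuned against real labeled data.

```python
from collections import deque
from typing import Optional


def to_label(sentiment: float, neutral_band: float = 0.2) -> int:
    """Map a continuous score to -1 (bearish), 0 (neutral), or 1 (bullish)."""
    if sentiment > neutral_band:
        return 1
    if sentiment < -neutral_band:
        return -1
    return 0


class DriftMonitor:
    """Tracks agreement with human labels over the most recent `window` samples."""

    def __init__(self, window: int = 500, min_samples: int = 50,
                 min_accuracy: float = 0.8) -> None:
        self.min_samples = min_samples
        self.min_accuracy = min_accuracy
        self._hits = deque(maxlen=window)  # True where model and human agree

    def record(self, model_sentiment: float, human_label: int) -> None:
        self._hits.append(to_label(model_sentiment) == human_label)

    def accuracy(self) -> Optional[float]:
        """Rolling accuracy, or None until enough labeled samples have accumulated."""
        if len(self._hits) < self.min_samples:
            return None
        return sum(self._hits) / len(self._hits)

    def drifted(self) -> bool:
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = DriftMonitor(window=4, min_samples=4, min_accuracy=0.75)
for score, label in [(0.8, 1), (-0.6, -1), (0.05, 0), (0.5, -1), (0.9, -1)]:
    monitor.record(score, label)
print(monitor.accuracy(), monitor.drifted())
```

The alert fires only after the minimum sample count is reached, which keeps a handful of early disagreements from triggering a model review.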

The Verdict

The case for local LLM deployment is not ideological. It is structural. Latency is bounded by hardware you control rather than network conditions you do not. Cost is fixed at the hardware amortization rate rather than scaling linearly with query volume. Privacy is absolute because no token leaves the perimeter. These three properties are precisely the properties required for any infrastructure component that sits in the alpha generation path. A cloud LLM API is acceptable for research, for prototyping, for offline analysis. It is structurally inappropriate for production trading.

The deeper observation is that the entire premise of competing with institutional desks on the basis of price-action analysis was always going to be a losing structural position. The moment you accept that the modern market is a multi-modal information environment, integrating numerical time series, order book microstructure, on-chain flows, and unstructured text simultaneously, the architectural requirement becomes clear. Every input channel must be observable, processable, and integrable on infrastructure the operator controls. The sentiment engine documented here is one such channel. The ClickHouse warehouse, the microstructure analytics, and the ML pipeline are the others. Together they constitute the observable surface of the modern adversarial market. Without all of them, the operator is partially blind. With all of them, the asymmetry against institutional capital, while never eliminated, becomes a contest of execution rather than a contest of access.