Knowledge Compression: Zipping Information for LLMs

Master the art of 'Semantic Zipping'. Learn how to turn 1,000 words of context into 100 tokens of high-density facts using multi-model pipelines.

Information is like a gas: it expands to fill the space it is given. If you give an LLM 128k tokens of context, it will consume 128k tokens. But often, the actual information contained in those 128k tokens could be summarized on a single index card.

Knowledge Compression is the process of using a small, efficient model to "Pre-process" large amounts of data into a dense, high-signal format before sending it to a large, intelligent reasoning model.

In this lesson, we explore "Semantic Zipping," "Entity Extraction as Compression," and how to build a multi-model pipeline for extreme context efficiency.


1. What is Semantic Zipping?

Standard GZIP compression (used for files) doesn't work for LLMs because the model still has to read the data as text. Semantic Zipping is different: it reduces the "Verbiage" while preserving the "Meaning."

Input (excerpt from a ~300-word announcement):

"We are very pleased to announce that after careful consideration of all the logistical and financial factors, the executive board has decided to relocate our primary corporate headquarters from New York City to Austin, Texas starting in the third quarter of 2025."

Semantically Zipped (12 Words):

"HQ Relocation: NYC to Austin. Timing: Q3 2025. Status: Approved by Board."

The Compression Ratio

Zipping the full ~300-word announcement down to a dozen words gives a compression ratio on the order of 20:1. If you have 200 documents in your RAG system, "Zipping" them allows you to fit them all into a single context window that would previously only hold 10 documents.
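You can verify ratios like this by counting tokens directly. Here is a minimal sketch using the tiktoken library; the cl100k_base encoding is an assumption, so swap in whichever encoding matches your target model.

import tiktoken

# Assumption: cl100k_base, the encoding used by many recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

original = (
    "We are very pleased to announce that after careful consideration of all "
    "the logistical and financial factors, the executive board has decided to "
    "relocate our primary corporate headquarters from New York City to Austin, "
    "Texas starting in the third quarter of 2025."
)
zipped = "HQ Relocation: NYC to Austin. Timing: Q3 2025. Status: Approved by Board."

original_tokens = len(enc.encode(original))
zipped_tokens = len(enc.encode(zipped))
print(f"{original_tokens} -> {zipped_tokens} tokens "
      f"({original_tokens / zipped_tokens:.1f}:1 compression)")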


2. Techniques for Automated Compression

To build a production system, you cannot compress data manually. You use an "Extraction Pass."

  1. The Extraction Prompt: "Extract only entities, numbers, and key decisions from this text. Use a 'Key: Value' format."
  2. The Formatting Step: Convert the results into YAML or a Markdown list (Module 4.4).
  3. The Result: A high-density "Knowledge Profile" of the document (see the sketch below).
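For example, running that extraction prompt over the HQ announcement from Section 1 might return a Knowledge Profile along these lines (the exact keys are illustrative, not prescribed):

# Hypothetical Knowledge Profile for the HQ announcement from Section 1,
# in the "Key: Value" format the extraction prompt asks for.
knowledge_profile = (
    "Decision: Relocate corporate HQ\n"
    "From: New York City\n"
    "To: Austin, Texas\n"
    "Timing: Q3 2025\n"
    "Approved_By: Executive board\n"
)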
Pipeline overview (Mermaid diagram):

graph TD
    A[Raw Document: 5k tokens] --> B[Cheap Model: Llama 3 8B]
    B -->|Compress| C[Extracted Facts: 200 tokens]
    C --> D[Premium Model: Claude 3.5 Sonnet]
    D --> E[Final High-Level Insight]

    style B fill:#f96
    style D fill:#69f

3. Implementation: The Compression Pipeline (Python)

Python Code: Multi-Model Compression

from openai import OpenAI

# Minimal helper (assumed): the call_llm used below, implemented with the
# OpenAI SDK. Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

def call_llm(model: str, user: str, system: str = "") -> str:
    """Thin wrapper around the Chat Completions API."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

def compress_document(raw_text: str) -> str:
    """
    Pass 1: Use a small model to 'Digest' the text.
    """
    compression_prompt = (
        "Identity: Technical Summarizer. "
        "Task: Summarize the following as a dense set of facts. "
        "Constraint: 0% conversational fluff. No adjectives."
    )

    # We use a cheaper model (e.g. GPT-4o mini) for the compression pass
    # to save money on the large raw input.
    return call_llm(
        model="gpt-4o-mini",
        system=compression_prompt,
        user=raw_text,
    )

def final_query(query: str, large_doc_pool: list) -> str:
    # Instead of sending all the raw documents (very expensive!),
    # we send the pre-compressed versions to the premium model.
    compressed_pool = [compress_document(d) for d in large_doc_pool]
    context = "\n\n".join(compressed_pool)  # join into text, not a Python list repr

    return call_llm(
        model="gpt-4o",
        user=f"Context: {context}\nQuery: {query}",
    )
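
A hypothetical call, where doc_texts would normally come from your RAG retrieval step:

# Hypothetical usage: doc_texts stands in for the raw documents returned by retrieval.
doc_texts = [
    "We are very pleased to announce that after careful consideration the "
    "executive board has decided to relocate our headquarters from New York "
    "City to Austin, Texas starting in the third quarter of 2025.",
]
print(final_query("When does the HQ move to Austin?", doc_texts))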

4. The "Symbolic Compression" Strategy

Sometimes you can replace whole concepts with "Codes."

  • Original: "This situation is a high-priority security risk involving data leakage." (12 tokens)
  • Compressed: CODE: RED_LEAK (3 tokens)

If your system (both the compressor and the reasoner) is instructed to use these codes, you roughly quadruple the "Intelligence per Token": 12 tokens of prose become 3 tokens of code.
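A minimal sketch of how a shared code lexicon might be injected into both models' system prompts; the codes and the build_system_prompt helper are illustrative, not a standard:

# Hypothetical shared lexicon: both the compressor and the reasoner receive
# the same code table, so "CODE: RED_LEAK" means the same thing to both models.
LEXICON = {
    "RED_LEAK": "High-priority security risk involving data leakage",
    "GREEN_OK": "Reviewed and approved, no action required",
}

def build_system_prompt(base_instructions: str) -> str:
    """Append the code table so the model can read and emit the codes."""
    code_table = "\n".join(f"{code} = {meaning}" for code, meaning in LEXICON.items())
    return f"{base_instructions}\n\nShared codes:\n{code_table}"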


5. Risks of Compression: Hallucination and Loss

Every time you compress, you lose Nuance.

  • If the "Nuance" is just "Linguistic Fluff," then it's fine.
  • If the "Nuance" is a subtle legal distinction (e.g. "Maybe" vs "Shall"), compression can be dangerous.

The Hybrid Strategy: Keep the "Groomed Markdown" (Module 2.4) for most data, and reserve "Extreme Compression" for background context that isn't central to the query.


6. Summary and Key Takeaways

  1. Extract, don't Summarize: Extraction (Key:Value) is more token-efficient than narrative summary.
  2. Tiered Processing: Use small, fast models for the compression phase to save costs.
  3. Signal over Style: AI doesn't need "Elegance"; it needs "Facts."
  4. Custom Lexicon: Use abbreviations and symbolic codes for common domain concepts.

In the next lesson, Topic-Based Context Isolation, we learn how to "Segregate" your AI's brain to prevent cross-contamination of unrelated facts.


Exercise: The Zipping Challenge

  1. Take a 2-page research paper.
  2. Use an LLM to "Semantically Zip" it into exactly 10 bullet points of < 10 words each.
  3. Now, ask a different LLM a complex question about the paper, but only give it the 10 bullet points as context.
  4. Does it get the answer right?
  • If yes, you just saved 98% of your token cost for that query.
  • If no, which piece of "Nuance" was deleted that was necessary?
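
A minimal sketch of how you might automate this exercise, reusing the call_llm helper from Section 3; the model names are assumptions, so use whichever two models you have access to:

def zipping_challenge(paper_text: str, question: str) -> str:
    # Step 1: compress the paper into exactly 10 short bullet points.
    bullets = call_llm(
        model="gpt-4o-mini",
        system=(
            "Compress the following paper into exactly 10 bullet points "
            "of fewer than 10 words each. Facts only, no commentary."
        ),
        user=paper_text,
    )
    # Step 2: ask a different model the question, with only the bullets as context.
    return call_llm(
        model="gpt-4o",
        user=f"Context:\n{bullets}\n\nQuestion: {question}",
    )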

Congratulations on completing Module 6 Lesson 3! You are now an information architect.
