Abstract: Recent large language models (LLMs) face increasing inference latency as input context length and model size grow. Retrieval-augmented generation (RAG) exacerbates this by significantly ...