{"id":13753,"date":"2025-10-17T01:52:17","date_gmt":"2025-10-17T05:52:17","guid":{"rendered":"https:\/\/spinor.info\/weblog\/?p=13753"},"modified":"2025-10-17T01:52:17","modified_gmt":"2025-10-17T05:52:17","slug":"retrieval-augmented-generation-rag","status":"publish","type":"post","link":"https:\/\/spinor.info\/weblog\/?p=13753","title":{"rendered":"Retrieval Augmented Generation (RAG)"},"content":{"rendered":"<p>I&#8217;ve been reading about this topic a lot lately: Retrieval Augmented Generation, the next best thing that should make large language models (LLMs) more useful, respond more accurately in specific use cases. It was time for me to dig a bit deeper and see if I can make good sense of the subject and understand its implementation.<\/p>\n<p>The main purpose of RAG is to enable a language model to respond using, as context, a set of relevant documents drawn from a documentation library. Preferably, relevance itself is established using machine intelligence, so it&#8217;s not just some simple keyword search but semantic analysis that helps pick the right subset.<\/p>\n<p>One particular method is to represent documents in an abstract vector space of many dimensions. A query, then, can be represented in the same abstract vector space. The most relevant documents are found using a &#8220;cosine similarity search&#8221;, which is to say, by measuring the &#8220;angle&#8221; between the query and the documents in the library. The smaller the angle (the closer the cosine is to 1) the more likely the document is a match.<\/p>\n<p>The abstract vector space in which representations of documents &#8220;live&#8221; is itself generated by a specialized language model (an embedding model.) Once the right documents are found, they are fed, together with the user&#8217;s query, to a generative language model, which then produces the answer.<\/p>\n<p>As it turns out, I just had the perfect example corpus for a test, technology demo implementation: My more than 11,000 Quora answers, mostly about physics.<\/p>\n<p>Long story short, I <a href=\"https:\/\/www.vttoth.com\/CMS\/ai-and-machine-learning-notes\/436\">now have this<\/a>:<\/p>\n<p><a href=\"https:\/\/www.vttoth.com\/CMS\/ai-and-machine-learning-notes\/436\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-13754\" src=\"https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/10\/quora-rag.png\" alt=\"\" width=\"422\" height=\"649\" srcset=\"https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/10\/quora-rag.png 633w, https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/10\/quora-rag-195x300.png 195w, https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/10\/quora-rag-97x150.png 97w\" sizes=\"(max-width: 422px) 100vw, 422px\" \/><\/a><\/p>\n<p>The nicest part: This RAG solution &#8220;lives&#8221; entirely on my local hardware. The main language model is Google&#8217;s Gemma with 12 billion parameters. At 4-bit quantization, it fits comfortably within the VRAM of a 16 GB consumer-grade GPU, leaving enough room for the cosine similarity search. 
As it turns out, I just had the perfect example corpus for a test, technology-demo implementation: my more than 11,000 Quora answers, mostly about physics.

Long story short, I now have this: https://www.vttoth.com/CMS/ai-and-machine-learning-notes/436

[Screenshot of the answer page produced by the RAG system: https://spinor.info/weblog/wp-content/uploads/2025/10/quora-rag.png]

The nicest part: this RAG solution "lives" entirely on my local hardware. The main language model is Google's Gemma with 12 billion parameters. At 4-bit quantization, it fits comfortably within the VRAM of a 16 GB consumer-grade GPU, leaving enough room for the cosine similarity search. Consequently, the model responds to queries in record time: the answer page shown in this example was generated in under 30 seconds.
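For completeness, here is a sketch of the generation step as well, assuming the local model is served through something like Ollama; the model tag and the prompt format are illustrative, not a description of my actual setup.

```python
# A sketch of the generation step, assuming the local model is served by
# Ollama (default endpoint http://localhost:11434). The retrieved
# documents are pasted into the prompt as context ahead of the question.
import json
import urllib.request

def answer(query: str, context_docs: list[str]) -> str:
    prompt = (
        "Answer the question using the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_docs) +
        f"\n\nQuestion: {query}\nAnswer:"
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",       # default Ollama endpoint
        data=json.dumps({"model": "gemma3:12b",      # illustrative model tag
                         "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# In practice the context would come from the retrieve() sketch above.
docs = ["Black holes evaporate slowly through Hawking radiation."]
print(answer("Do black holes lose mass over time?", docs))
```

The prompt here simply pastes the retrieved documents ahead of the question; more elaborate prompt templates are possible, but this is the essence of the "augmented" part of RAG.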