{"id":13740,"date":"2025-09-30T02:21:15","date_gmt":"2025-09-30T06:21:15","guid":{"rendered":"https:\/\/spinor.info\/weblog\/?p=13740"},"modified":"2025-09-30T02:21:15","modified_gmt":"2025-09-30T06:21:15","slug":"hacking-the-llama","status":"publish","type":"post","link":"https:\/\/spinor.info\/weblog\/?p=13740","title":{"rendered":"Hacking the Llama"},"content":{"rendered":"<p>There is a wonderful tool out there that works with many of the published large language models and multimodal models: Llama.cpp, a pure C++ implementation of the inference engine to run models like Meta&#8217;s Llama or Google&#8217;s Gemma.<\/p>\n<p>The C++ implementation is powerful. It allows a 12-billion-parameter model to run at speed even without GPU acceleration, emitting 3-4 tokens per second in the generation phase. That is seriously impressive.<\/p>\n<p>There is one catch. Multimodal operation with images requires encoding each image into embeddings, which is often the most time-consuming part. A single image may take 45-60 seconds to encode. And in a multi-turn conversation, the image(s) are repeatedly re-encoded, slowing down the conversation at every turn.<\/p>\n<p>An obvious solution is to preserve the embeddings in a cache and avoid re-embedding images that are already cached. Well, this looked like a perfect opportunity to deep-dive into the Llama.cpp code base and make a surgical change. A perfect opportunity also to practice my (supposedly considerable) C++ skills, which I use less and less these days.<\/p>\n<p>Well, what can I say? 
I did it and it works.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-13741\" src=\"https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/09\/hack-llama.png\" alt=\"\" width=\"810\" height=\"1016\" srcset=\"https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/09\/hack-llama.png 1013w, https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/09\/hack-llama-239x300.png 239w, https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/09\/hack-llama-816x1024.png 816w, https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/09\/hack-llama-120x150.png 120w, https:\/\/spinor.info\/weblog\/wp-content\/uploads\/2025\/09\/hack-llama-768x964.png 768w\" sizes=\"(max-width: 810px) 100vw, 810px\" \/><\/p>\n<p>I can now converse with Gemma, even with image content, and it feels much snappier.<\/p>","protected":false},"excerpt":{"rendered":"<p>There is a wonderful tool out there that works with many of the published large language models and multimodal models: Llama.cpp, a pure C++ implementation of the inference engine to run models like Meta&#8217;s Llama or Google&#8217;s Gemma. The C++ implementation is powerful. 
It allows a 12-billion parameter model to run at speed even without <a href='https:\/\/spinor.info\/weblog\/?p=13740' class='excerpt-more'>[&#8230;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[58,35,36],"tags":[],"class_list":["post-13740","post","type-post","status-publish","format-standard","hentry","category-cybernetics","category-personal","category-programming"],"_links":{"self":[{"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=\/wp\/v2\/posts\/13740","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13740"}],"version-history":[{"count":2,"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=\/wp\/v2\/posts\/13740\/revisions"}],"predecessor-version":[{"id":13743,"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=\/wp\/v2\/posts\/13740\/revisions\/13743"}],"wp:attachment":[{"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13740"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13740"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/spinor.info\/weblog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13740"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}