Half-Elf on Tech

Thoughts From a Professional Lesbian

    No More Ollama Drama: A Private AI

    In the running of LezWatch.TV, I’ve often thought about how one might use AI with it. At a certain point, an AI acts as a fancy faceted search when you’re looking for a TV show. But at the same time, asking something like “What was that awesome show that took place in the 18th century and had sex workers?” (they probably mean Harlots) is beyond what a search can do.

    I asked myself… just because I can remember that doesn’t mean everyone can, so could I possibly make my own AI tool for people to talk to?

    (The answer is yes.)

    Start Simple: Turn Human to Data

    First and foremost… how do you want to do this? I chose hosting my own LLM (Large Language Model) on my dedicated server. It’s got enough cores and disk space, but may be a bit shy on memory. Self-hosting means I can lock it down and control the guardrails. I picked Ollama.

    Installing Ollama is pretty much following their directions. Then, once you have a model file, you kickstart it:

    sudo HOME=/home/ollama-data ollama create lezwatch-bot -f /home/ollama/agents/LezWatchBot
    

    The nice thing about the model file is that if I ever have to re-create the server, I just run that command and Ollama knows what my bot is!

    After I installed the LLM, I had to think about how to talk to it. I decided to start as a Semantic Translator: the LLM takes “messy human talk” and maps it to my “strict database reality.”

    Now this goes by another name. Prompt Engineering + RAG (Retrieval-Augmented Generation).

    That’s a mouthful ain’t it?

    What about this… Most of the work for this is done in making a REST API endpoint that takes in simplified data and uses it to build a WP_Query.

    That sounds pretty straightforward doesn’t it? This is something that works well specifically because LWTV has very structured and organized data. All the work we put in with tropes and categories and ratings allow me to make some pretty clear assumptions that are rooted in data.

    All I have to do is teach my LLM that when I say “I want a really good show from Canada that is still on air,” it knows I mean country:canada,on_air:yes,score:80,worthit:yes

    Because that’s easy.
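    On the receiving side, turning that comma-separated output into an actual request is plain string work. Here’s a minimal sketch; the /lwtv/v1/search endpoint path is made up for illustration and isn’t the real API:

```python
# Sketch: map the LLM's key:value output onto REST query args.
# The /lwtv/v1/search endpoint path is hypothetical.
from urllib.parse import urlencode

def intent_to_query(intent: str) -> str:
    """Parse 'country:canada,on_air:yes,...' into a search URL."""
    params = {}
    for pair in intent.split(","):
        key, _, value = pair.strip().partition(":")
        if key and value:
            params[key] = value
    return "https://lezwatchtv.com/wp-json/lwtv/v1/search?" + urlencode(params)
```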

    Teach It to Talk

    The heart of your LLM is the model.

    I made a custom model called LezWatchBot, and I spent three days working on the model file. In Ollama, modelfiles basically let you design a brain.

    I keep a copy of the LezBot on our GitHub repo because it lets me rebuild things in a heartbeat, and honestly it was originally craaaaazy long. I was telling it what all our tropes and such are. But then I explained that its job is to be a curator and to use some of our custom data (the worth it explanation) to help write an explanation as to why a show fits or not.

    The model now has the basic ‘documentation’ on how to be, and the ‘logic’ part (what a trope is, what EU means) was shifted out. I’ll explain this in a bit.

    Next I use WordPress to send a message to the bot, which returns the specific search parameters to find the show. It’s a fun back and forth with APIs talking to each other, but I added caching on both ends to save my bacon.
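    The caching on the bot side can be sketched like this; ask_ollama() is a stub standing in for the real HTTP call to Ollama, and the bare dict is a stand-in for whatever cache layer you actually use:

```python
# Sketch of the bot-side cache, assuming prompts that differ only in
# case or whitespace should hit the same cached answer.
import hashlib

_cache: dict[str, str] = {}
calls = {"n": 0}  # counter just to show the cache working

def ask_ollama(prompt: str) -> str:
    # Stand-in for the real POST to Ollama's /api/generate endpoint.
    calls["n"] += 1
    return "country:canada,on_air:yes"

def ask_bot(prompt: str) -> str:
    """Return cached search params for repeat questions."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = ask_ollama(prompt)
    return _cache[key]
```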

    Again, we are still at the basic level here, where my LLM is just a search agent.

    Make it Think

    Now it’s fun time!

    Currently, the AI finds a show and spits out the template I gave it in the modelfile. But it doesn’t explain why it chose that show based on the user’s specific mood.

    The AI should say: “Since you’re looking for a slow-burn from Canada, I recommend ‘Workin’ Moms’ because the tension between the leads perfectly matches that vibe, and it’s currently on air.”

    To get there, my AI needed to hold onto the user’s “vibe” while it looks at the JSON data the PHP returns. You do this by updating the Final Response Formatting section in the Modelfile to include a Reasoning Sentence.
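    I’m not reproducing my actual Modelfile here, but the shape of that instruction looks roughly like this (illustrative wording, not the real file):

```
SYSTEM """
You are LezWatchBot, a curator for LezWatch.TV.
After recommending a show, end your reply with one Reasoning Sentence
that ties the user's stated vibe to the show's data, e.g.:
"Since you're looking for a slow-burn from Canada, I recommend X because..."
"""
```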

    With that, it would do a decent job of telling me why it picked a certain show. That’s when I added in things like the worthit explanations and any additional data (episodes to watch etc) we had.

    On to being smarter!

    This boy is in Yanque town, waiting for travellers who want to take a picture with his llama. Meanwhile, the llama is distracted with his chullo. (Photo by Mariel Gonzales)

    Smarter, Smaller, More Indexed

    Training an LLM is interesting. Most of the time, people think about JSONL files, which are used to fine-tune an LLM. If you fine-tune with JSONL, the model “bakes” that info into its brain. If a show’s score changes from 80 to 20, the model will still think it’s 80 until you re-train.

    Obviously that won’t work for me. Show scores change based on the number of characters, dead and alive, and so on, which can change on a dime. Instead, I went with a Local Vector Database, also called a Semantic Index.

    I built out a system where the AI has a long-term memory of the entire catalog without needing to query the WordPress SQL database for every little thought. For that, I made a new JSON API endpoint on WordPress that would spit out all the shows and the critical data the AI would need.

    Instead of hammering the WordPress API for every search, I sync the data once a day into a local JSON ‘brain.’ The Python bot just keeps that file open, scanning 1,000+ shows in milliseconds, which is way faster than any SQL query we could throw at it.

    Then I made a Python script that runs in a cron job:

    1. Fetch: Hits WP API for any show modified in the last 24 hours.
    2. Transform: Converts that show data into a “Semantic Chunk.”
      Example: “Show: Wynonna Earp. Tropes: slow-burn, law-enforcement. Score: 92. Status: Ended.” (note: it actually has a lot more than that, but this is fine for the example)
    3. Load: Saves this into a local text file or a small SQLite database that Ollama can “read” faster than a WordPress query.
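    The three steps above can be sketched out like so. The endpoint path, its modified_after param, and the field names are all assumptions for illustration, since the real API isn’t shown here:

```python
# Fetch -> Transform -> Load sketch for the nightly sync cron.
# The endpoint and its modified_after param are hypothetical.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

API = "https://lezwatchtv.com/wp-json/lwtv/v1/shows"
BRAIN = "/home/ollama-data/brain.json"

def fetch_recent() -> list:
    """Step 1: shows modified in the last 24 hours."""
    since = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
    with urllib.request.urlopen(f"{API}?modified_after={since}") as resp:
        return json.load(resp)

def to_chunk(show: dict) -> str:
    """Step 2: flatten one show into a semantic chunk."""
    return (f"Show: {show['title']}. Tropes: {', '.join(show['tropes'])}. "
            f"Score: {show['score']}. Status: {show['status']}.")

def sync() -> None:
    """Step 3: merge fresh chunks into the local 'brain' file."""
    try:
        with open(BRAIN) as f:
            brain = json.load(f)
    except FileNotFoundError:
        brain = {}
    for show in fetch_recent():
        brain[show["title"]] = to_chunk(show)
    with open(BRAIN, "w") as f:
        json.dump(brain, f)
```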

    Well. Except for one small problem.

    It was crazy slow.

    Make it Faster, Make it Snappier, Make it Gay

    The problem is that while I have a 12-core server, my memory isn’t huge and I don’t have VRAM on it. My options became:

    1. Move to a more AI-based server
    2. Set limits – aka trim the fat
    3. Pick a different AI model

    I went with 2 and 3.

    To trim the fat, I had to address the fact that Ollama unloads the model from memory after 5 minutes of inactivity to save RAM. I changed that setting (OLLAMA_KEEP_ALIVE) to 24 hours. Then I added some params to my model file in order to throttle usage. After all, I don’t need a huge context window for this:

    PARAMETER num_ctx 4096
    PARAMETER num_thread 8

    Then I added a ‘warm up’ to the start of my Python scripts.
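    A warm up is just a prompt-less generate request, which Ollama treats as “load this model into memory now.” A sketch, assuming Ollama’s default local port and the model name from earlier:

```python
# Warm-up sketch: a prompt-less /api/generate call makes Ollama load
# the model and keep it resident for the keep_alive window.
import json
import urllib.request

def warmup_payload(model: str = "lezwatch-bot") -> bytes:
    # keep_alive here mirrors the 24-hour setting mentioned above.
    return json.dumps({"model": model, "keep_alive": "24h"}).encode()

def warm_up(host: str = "http://localhost:11434") -> None:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=warmup_payload(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```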

    But when I ran a trial:

    -> Extracting intent for: 'Find me three underrated dramas from Europe.'
       [Extraction took 83.14s]
    

    83 seconds is nuttier than squirrel poo!

    This is where I decided the model was too big. I went from Llama 3.1 8B to phi3.5, and it got worse!

    -> Extracting intent for: 'Find me 4 tv shows from the US that are underrated drams'
       [AI took 198.36s]
    

    This led me to llama3.2:3b, which is fast but stupid. Well, not stupid, but it begets a caveman joke.

    A screenshot of a man dressed as a caveman, with a caption "So easy, a caveman could do it."

    In truth, you have to make your model file so simple a caveman could use it to make fire. The problem I faced was that 3B can’t be a philosopher and a librarian at the same time. So I stripped its job back: Don’t talk to the user yet. Just look at the sentence and spit out SEARCH_ACTION: country:uk, genre:comedy.

    I turned the AI into a regex generator, and suddenly, the speed was there. That means a lot of the heavy lifting now has to be done on the WP end and/or in some other script. The resulting model file is incredibly small by comparison, with only one step.
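    Parsing that stripped-down output back out is plain string work rather than AI. Something like this, with illustrative field names:

```python
# Sketch: strict parsing of the model's SEARCH_ACTION line, treating
# the LLM as a regex generator. Field names are illustrative.
import re

ACTION_RE = re.compile(r"SEARCH_ACTION:\s*(.+)")

def extract_action(reply: str) -> dict:
    """Pull key:value pairs out of a 'SEARCH_ACTION: ...' line."""
    match = ACTION_RE.search(reply)
    if not match:
        return {}
    pairs = (p.partition(":") for p in match.group(1).split(","))
    return {k.strip(): v.strip() for k, _, v in pairs if v}
```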

    On Beyond CLI

    I didn’t want this to live in the command line forever. I do plan on making this fully integrated with WordPress, but at this point I wanted to make sure I was getting accurate information with the parameters.

    Since we have a LezWatch Slack, I built a Python bridge using Flask and Gunicorn, which sits behind Nginx, and then I have a Slack App that calls the bridge.
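    The bridge itself doesn’t need much. Here’s a minimal sketch with a stubbed ask_bot() in place of the real Ollama call and a made-up route name (Slack slash commands really do POST a form with a text field):

```python
# Minimal Flask bridge sketch. Gunicorn would serve `app`, with Nginx
# proxying to it; the /slack/find-show route name is invented.
from flask import Flask, jsonify, request

app = Flask(__name__)

def ask_bot(prompt: str) -> str:
    # Stub standing in for the real call to the Ollama model.
    return f"Looking for: {prompt}. Try 'Wynonna Earp'."

@app.route("/slack/find-show", methods=["POST"])
def find_show():
    prompt = request.form.get("text", "")
    # Slack expects JSON back; in_channel makes the reply visible to all.
    return jsonify({"response_type": "in_channel", "text": ask_bot(prompt)})
```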

    Example of the Slack bot that finds shows. (Screenshot)

    Obviously there are some kinks to iron out, but this is starting to work as we want.

    Now how did I make my Slack Bot smart? Well that will be another post.