🎞️ Videos → AI in the Browser and on the Edge
Description
Open source libraries like Transformers.js allow you to run machine learning workloads right within your browser. We can perform multi-language speech recognition and translation, text-to-speech, and even RAG, fully offline and in-browser! In this session we take a look at some examples of what's possible, as well as a small look behind the scenes at Microsoft's open source ONNX Runtime which makes this possible. Hopefully this session can inspire you to sprinkle some AI into your client applications!
Slides: https://go.thor.bio/jsbkk-slides
Thor on X: https://x.com/thorwebdev
In-Browser Semantic Search with Transformers.js and PGlite + pgvector: https://github.com/thorwebdev/browser-vector-search
Babelfish AI: https://github.com/supabase-community/babelfish.ai
Chapters
- Intro and Fun Demo with Google Gemini and ElevenLabs 0:00
- Transitioning to Local AI and LLMs 2:29
- Running LLMs Locally: Ollama, Llamafile, and Cortex 3:57
- Demo: Local Semantic Search with Hugging Face Models 5:57
- Local LLMs on Beefy Machines and Small Token Windows 7:49
- Demo: Semantic Search with Local Postgres Database (PGlite) 8:04
- Explanation of Semantic Search and Embeddings 9:07
- How In-Browser LLMs Work: ONNX Runtime and Wasm 12:40
- Hugging Face, Transformers.js, and Model Conversion 14:23
- Demo: Babelfish AI - Realtime Transcription and Translation in the Browser 15:48
- Supabase Edge Runtime and ONNX with Rust 21:58
Transcript
These community-maintained transcripts may contain inaccuracies. Please submit any corrections on GitHub.
Intro and Fun Demo with Google Gemini and ElevenLabs 0:00
ขอบคุณครับ สวัสดีครับ [Thai: Thank you. Hello.] Thank you.
Can you all hear me okay in the back? That's great. I see some thumbs up there. I have a little demo that actually is not super related to my talk, but I was just working on it yesterday and I thought it was pretty cool. So I'd love to try it out with you. And I'm going to film you while we do it. So we're doing a little exercise. So basically, what this does is it just captures the video. So I'm just going to capture you. And then I'm going to use Google Gemini to basically give me a description of what's happening in the video. And then I'm just using ElevenLabs to give me text-to-speech to just say what's happening. So what I'd love for you to do is if you just clap
or put your arms up in the air or maybe stand up, just do some unpredictable stuff and let's see what Google Gemini thinks is happening. Okay, so let's go in three, two, one.
Okay, applause, some waving, stand up, jump.
Okay, let's see what's happening there. And hopefully, I pray to the demo gods.
Okay, I don't think we have audio. Do we have audio? Sorry, I probably should have told them.
It's muted, microphone. It's muted? Okay. Yeah, I didn't tell them that I want audio. So I can just read it out. The camera pans up to show an auditorium full of people sitting in chairs. Many of them are wearing face masks. Is that true? That doesn't seem true. The people are waving and cheering. Cheering? That's very good.
Let me close the microphone.
Okay. I put my audio through the HDMI. Can you use that? No?
You got the aux. I think it's okay. Yeah, it's fine. We'll just use my voice.
Transitioning to Local AI and LLMs 2:29
So yeah, I think it kind of got the gist of that, right? Like waving and cheering. That's great. So anyway, that was just a little excursion before we dive in. So that demo was actually using APIs.
So that was using the Gemini API and the ElevenLabs API.
But what we want to talk about today is local AI: actually using LLMs and running inference in the browser and in edge runtimes. So this is me. My name is Thor, or if you're speaking Mandarin, Leishen. I gave myself that name. Since no one was laughing, I guess there are no Mandarin speakers in the room. No? Okay. My Thai unfortunately is not so great. But yeah, if you want to see fun little demos, you can follow me on Twitter; you can just scan that here. And this QR code is just a link to my slides, because there's a bunch of demos and resources in the slides. So you can keep that for the future and check out the demos later on.
Cool.
Running LLMs Locally: Ollama, Llamafile, and Cortex 3:57
So before we dive into running large language models
in the browser, let's look at running them locally
on your machine. Who here has used things like Ollama before? Okay. I think that's probably the most popular one. So there's Ollama; another project, which Mozilla is working on, is called Llamafile. Have people heard of Llamafile here? No? Yeah. Okay. And then one that's actually built in Singapore
is called Cortex. So cortex.so, people might have heard of jan.ai, so JAN, which is kind of a local ChatGPT client,
that is powered by Cortex as the underlying API engine.
And what's really cool is that, for example, all of these are powered by llama.cpp. So llama.cpp is basically the underlying implementation of the model inference underneath. It's an open source project, really neat, and it's what is powering a lot of the local model stuff. The cpp is just C++: it's basically an inference engine written in C++ that can run on a variety of hardware, including your CPU. So if you're using it locally on your machine and don't have a GPU, it's basically running the workloads on your CPU.
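As a rough sketch of what talking to one of these local engines looks like from JavaScript: Ollama, for example, exposes a small HTTP API on localhost. The model name below is just an example, and you would need Ollama installed and the model pulled first (for instance with "ollama pull tinyllama").

```js
// Minimal sketch: calling a locally running Ollama server from JavaScript.
// Assumes Ollama is installed and the model has already been pulled.
async function askLocalLlm(prompt) {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "tinyllama", // any model you have pulled locally
      prompt,
      stream: false, // return a single JSON object instead of a token stream
    }),
  });
  const data = await res.json();
  return data.response; // the generated text
}

console.log(await askLocalLlm("Tell me about Bangkok."));
```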
Demo: Local Semantic Search with Hugging Face Models 5:57
Okay. Now, let me quickly show you Llamafile, which is a really neat project as well. So basically what they are doing is bundling llama.cpp, the engine, together with the model in a single file. So what you can do then is download the llamafile and run it locally on your machine. So we can do that here. Basically, I'm just using TinyLlama here, running it optimized for CPU usage
in this case. And then you actually get like a localhost client. It doesn't look as nice, but you can say "Who are you?"
And then you can chat with it and it basically says, "I'm llama, nice to meet you too." And then we can say maybe, "Tell me about Bangkok."
"The city has so much to offer, one of the most vibrant and exciting cities in Southeast Asia." There we go. I think Llamafile knows that I'm in Bangkok and it's trying to please.
Yes, so these are some of the projects that are exciting for running large language models locally.
So if you have a beefy machine, you can probably go up to some of the larger models as well. TinyLlama, I think, just has a fairly small token window included, but it works quite well
Local LLMs on Beefy Machines and Small Token Windows 7:49
on the machine. Now, before we dive into AI in the browser and actually running your inference directly in the browser, let's do a little demo here.
Demo: Semantic Search with Local Postgres Database (PGlite) 8:04
So what this does is actually similar to what Piti was talking about just before with vector search in MongoDB. What this is doing is basically a semantic search, or vector similarity search, but fully locally in the browser. So the idea behind semantic search is that instead of
text search where you're looking for words and similarities in the words, you're looking for semantic similarity, right? So you're looking, okay, what is fun? And based on human knowledge and context here, the model thinks, yeah, driving a car can be fun, playing with a dog can be fun, sleeping can be great fun as well.
Explanation of Semantic Search and Embeddings 9:07
And the great thing is, the first time when I hit the button, it was loading for a bit because it was actually downloading the model from Hugging Face. Hugging Face is kind of like the GitHub for large language models. So it was pulling that model down locally into the browser. And so now what I can do is, when I turn off my Wi-Fi, so I now have the model loaded locally
into the browser, I can, yeah.
All good?
[announcement about tax invoice in Thai]
Sorry. Next. I have no idea what he said. But no one is leaving, so I hope it wasn't too bad.
I hope I didn't say anything bad. All good? Do we need to translate for the English speakers or? No? No, no, no. Okay. So this is information that's only relevant to Thai people. Very good. So, okay. We had fun, but maybe what we can do as well is we can search for furniture.
Right? If you had a normal search where you're doing text search,
you wouldn't find any of this, right? It would look for the word sort of furniture or misspellings of furniture. But what we want here is semantically, without having to categorize the information,
we actually want to encode this information, all the human knowledge, the semantics, the context into this here. And so this is where the embeddings come in. So the large language model basically turns the entire context of this information into just an array of numbers. And then what we can do is we can search for things that are similar. And so what we do is we also create an embedding for the furniture, and then we just perform a similarity search. We just look, okay, what is similar to this? And we see, okay, desk, bed, chair is furniture. Now what we can do as well here is food for example. So tomato, banana, a hot dog.
Okay, this is a hot dog. Yes, no. So those are food items, right, that we can eat. Or fruit. So we know a tomato is a fruit, right?
But we wouldn't put it in a fruit salad ideally. But yeah, banana, apple, tomato. But now if we do electronics.
Yes. Someone said Apple. Correct, right? As humans, we know this thing is an Apple electronic device. If I were to put that into a text search, you wouldn't find that. But so this is really cool. My Wi-Fi is still off. So this is running completely locally in my browser.
Including actually a full Postgres database.
So this is a really cool open source project. It's called PGlite, which actually basically runs Postgres in your browser
via Wasm, which is really cool. So if you have some time, check out that demo. That's really neat.
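A rough sketch of how those pieces could fit together is below. This is not the exact code from the browser-vector-search repo; the model name, table layout, and package imports are illustrative, so check the linked repo and the PGlite docs for the real thing.

```js
// Sketch: in-browser semantic search with Transformers.js embeddings stored in
// PGlite (Postgres compiled to Wasm) using the pgvector extension.
import { pipeline } from "@huggingface/transformers";
import { PGlite } from "@electric-sql/pglite";
import { vector } from "@electric-sql/pglite/vector";

const db = new PGlite({ extensions: { vector } });
await db.exec(`
  CREATE EXTENSION IF NOT EXISTS vector;
  CREATE TABLE IF NOT EXISTS items (content text, embedding vector(384));
`);

// Small sentence-embedding model; downloaded from Hugging Face once, then cached.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function toVector(text) {
  const output = await embed(text, { pooling: "mean", normalize: true });
  return JSON.stringify(Array.from(output.data)); // pgvector accepts '[1,2,...]' literals
}

async function addItem(content) {
  await db.query("INSERT INTO items (content, embedding) VALUES ($1, $2)", [
    content,
    await toVector(content),
  ]);
}

async function search(query, limit = 3) {
  // "<=>" is pgvector's cosine-distance operator: smaller means more similar.
  const { rows } = await db.query(
    "SELECT content FROM items ORDER BY embedding <=> $1 LIMIT $2",
    [await toVector(query), limit],
  );
  return rows;
}

await addItem("desk");
await addItem("banana");
console.log(await search("furniture")); // should rank "desk" above "banana"
```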
How In-Browser LLMs Work: ONNX Runtime and Wasm 12:40
But now let's maybe dig into how that works under the hood.
So how this works in the browser is that it's using the ONNX Runtime, ONNX being the Open Neural Network Exchange. It was originally an open source project by a team at Microsoft, and now more and more companies are contributing to it as well. But the idea is to provide this ONNX format
as well as a runtime where we can train our models and output them in a format that we can then run on this ONNX runtime, TensorRT, Apache MXNet. So it's kind of a compatibility layer for different large language models to allow us to run these models on various hardware, various different devices. And so what this actually means is, when we're using the ONNX runtime, we can deploy that in the cloud,
edge devices, mobile applications actually. So there's an ONNX runtime that can run locally on your phone, but then also browser. And so for the web browser, we have an ONNX runtime that runs in Wasm.
So basically, using Wasm with the ONNX runtime, we can run these language models that are in the ONNX format directly in the browser, which is really neat.
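Just to make that layer concrete, using onnxruntime-web directly looks roughly like the sketch below; the model path, tensor shape, and input wiring are placeholders for whatever ONNX model you export. In practice you would usually let Transformers.js, covered next, handle all of this for you.

```js
// Sketch: running an ONNX model in the browser with onnxruntime-web (Wasm backend).
// "./model.onnx" and the tensor shape are placeholders for your own exported model.
import * as ort from "onnxruntime-web";

// Load the model; the Wasm execution provider runs it on the CPU in the browser.
const session = await ort.InferenceSession.create("./model.onnx", {
  executionProviders: ["wasm"],
});

// Build an input tensor matching the shape the model expects.
const input = new ort.Tensor("float32", new Float32Array(1 * 384), [1, 384]);

// Feeds are keyed by the model's input names (see session.inputNames).
const results = await session.run({ [session.inputNames[0]]: input });
console.log(results);
```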
Hugging Face, Transformers.js, and Model Conversion 14:23
So if you look, for example, on Hugging Face, you can just search for ONNX and find all the models that are in this ONNX format, which you can then use in the browser. And how do you actually use them in practice in the browser?
The answer is a great open source project by Hugging Face called Transformers.js.
So again, if you go into the slides, you can find all this documentation there; everything is linked out. Transformers.js is the JavaScript interface
to this ONNX runtime that is running in Wasm in your browser. So here you can see Transformers.js uses the ONNX runtime in the browser, and the best part about it is that you can easily convert your pre-trained PyTorch or TensorFlow models. So if you're actually training models yourself, you can fairly easily export them to the ONNX format and then use them with Transformers.js in the browser.
And there's a tool as well called Optimum, apparently, which you can use. Yeah, you can click through to the documentation to learn a little bit more about this.
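In practice it looks something like the snippet below: a one-line pipeline, much like the Python library. The model shown is one of the ready-made ONNX checkpoints on the Hugging Face Hub and is just an example.

```js
// Minimal Transformers.js usage in the browser: pipeline() downloads an ONNX
// model from the Hugging Face Hub and runs it via the ONNX runtime in Wasm.
import { pipeline } from "@huggingface/transformers";

const classifier = await pipeline(
  "sentiment-analysis",
  "Xenova/distilbert-base-uncased-finetuned-sst-2-english",
);

const result = await classifier("Running models in the browser is great!");
console.log(result); // e.g. [{ label: "POSITIVE", score: 0.99... }]
```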
Demo: Babelfish AI - Realtime Transcription and Translation in the Browser 15:48
Now this first demo was fun as well,
but I also have another demo. You can scan that here as well, or you can click through in the slides. I call it Babelfish AI. Oh. And I guess I forgot to link the slides there.
That's not good. Okay. So we'll type that in. Babelfish.
So it's on the Supabase community there. Let's open this up.
Do you know OpenAI Whisper? It's this model that allows you to transcribe speech in various different languages. The demo we're doing here is we're loading a smaller version of the open source OpenAI Whisper model, and we're using that to transcribe my audio
directly in the browser. So this is running locally on my machine in the browser, transcribing whatever I'm saying. And then, I thought, okay, that's fun, that's pretty cool, but then maybe we can do another thing. We're just using in this case here, Supabase Realtime
to basically stream.
Let me share here.
So if you scan this, it won't work on your phone, because the model is pretty big. The translation model is actually an open source model by Meta, which does text-to-text translation between 200 languages, which means it's pretty big. So it will probably crash your phone; sorry if you scanned this already. But if you scan it with your phone and then open it on your laptop, you can follow along live if you want to.
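The two models behind this demo map onto two Transformers.js pipelines: Whisper for speech recognition and Meta's NLLB-200 for text-to-text translation. A rough sketch is below; the checkpoints and language codes are just examples, and the microphone capture and Supabase Realtime streaming from the actual app are left out.

```js
// Sketch of the two in-browser pipelines behind a Babelfish-style demo:
// Whisper for transcription, NLLB-200 for translation. Microphone capture and
// Supabase Realtime streaming are omitted; model names are examples.
import { pipeline } from "@huggingface/transformers";

// Small Whisper checkpoint; device: "webgpu" uses WebGPU where available.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny",
  { device: "webgpu" },
);

// NLLB-200 distilled checkpoint: translation across 200 languages.
const translator = await pipeline("translation", "Xenova/nllb-200-distilled-600M");

// `audio` is a Float32Array of 16 kHz mono samples from the microphone.
async function transcribeAndTranslate(audio) {
  const { text } = await transcriber(audio);
  const [result] = await translator(text, {
    src_lang: "eng_Latn", // FLORES-200 language codes
    tgt_lang: "tha_Thai",
  });
  return result.translation_text;
}
```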
So here we're transcribing what I'm saying in English,
but now, maybe you in the audience, you don't speak English, and I'm sorry if that's the case. But then what we can do is we can just say, okay, let's just translate that into Thai. Now obviously, I don't read Thai, so I don't know if this is accurate or not.
However, what we can do is anyone read Thai here? Does that look reasonable?
Someone says, "Yeah." I don't know if he actually speaks Thai, but I'm going to trust him.
Let me go back into- there's Danish as well, there's Dutch. I know here, John is Dutch, for example. So we can translate that into Dutch for him.
Does that look good, John?
Okay, maybe we'll go the other round. I know, you know, English. So now the cooler thing as well is
Zum Beispiel, wir haben hier Tobias, der spricht Deutsch. Aber ihr alle sprecht kein Deutsch, also können wir das dann auf Englisch übersetzen. [German: For example, we have Tobias here, he speaks German. But you all don't speak German, so we can then translate that into English.]
Well, so it actually works quite well for sort of
the Latin languages. But now what we could do as well is
大家好，我是住在新加坡的德国人。[Mandarin: Hello everyone, I'm a German living in Singapore.]
German. Well, that's not what I said.
In Mandarin, we don't have any punctuation, so it's a bit wonky, but you know what I'm trying to convey here. Or maybe French. Bonjour, un mot français, c'est très mauvais. [French: Hello, a French word, it's very bad.]
And the French word is very bad. Do you know what I said? My French is very bad.
But really what I'm trying to convey here is that
we're running this locally in the browser. Where's English? Come on. English. My MacBook is pretty beefy, so it actually works quite well with OpenAI Whisper, especially when we're using WebGPU. Transformers.js can tap into WebGPU as well to
accelerate that. So really what I'm trying to convey to you is that
this is very exciting. There's a lot of stuff happening in the open source space, as well as the local AI space.
And I think that's really cool. So I hope this can inspire you, maybe add some AI to
your client-side applications and tap into that. Because in the future, you're going to be running large language models locally on your phone, on your fridge, on your toaster, probably everywhere. So I think that's really exciting there.
Supabase Edge Runtime and ONNX with Rust 21:58
I don't have much time to get into the Supabase Edge Runtime. It's also using the ONNX runtime, just with another cool open source project called ort, which is a machine learning inference library for Rust.
Wasm was too slow in that edge runtime, so we're just using the Rust layer there. And with that, ขอบคุณครับ [Thai: thank you]. Thanks so much.