Patrick O'Shaughnessy

Transformers.js: Building Next-Generation WebAI Applications

Learn how to create stunning AI-powered web applications with Transformers.js, an innovative JavaScript library for running state-of-the-art machine learning models 100% locally in your browser. In this talk, Joshua Lochner, Open Source ML Engineer at Hugging Face, explores how to leverage emerging web technologies like WebGPU and WebNN to create interactive, privacy-preserving, and scalable web experiences.

Published Nov 21, 2025
Uploaded Jun 13, 2026
File type: YouTube

Full transcript

AI-generated transcript with timestamped sections.

0:05-1:38

[00:05] - Hi everyone, my name is Joshua, and I'm a web machine learning engineer at Hugging Face. And today I'm excited to talk about how you can build next generation web AI applications using Transformers.js. [00:19] So first, a quick introduction. What is Transformers.js and maybe what is Hugging Face? Well, Hugging Face is the platform where the machine learning community collaborates on models, datasets and applications, otherwise known as Spaces. [00:36] For models, we host over 2.1 million AI models built by the community on the Hugging Face Hub. You're able to search for them, filter by various library tags or tasks, and [00:51] find the model which best suits your needs. [00:55] We also host a large collection of datasets, just over half a million. Some of these are megabytes in size, some of them are petabytes in size, and you can even query them in your browser, visualizing certain data and maybe understanding data a bit better to see how you will be training a model. [01:15] Next, we have Spaces, also known as our application or our AI app store. Currently to date, we have over 1 million AI apps built by the community. And we're really excited for WebAI apps, as you maybe saw a couple moments ago, to be deployed as a Hugging Face Space. And we also support semantic search across these Spaces.

1:45-3:20

[01:45] using our semantic search here, for example, "change the lighting", and it'll show you a Space that has been created by the community that can do your task. [01:53] So, we also maintain a large collection of open source libraries. Some of these you may be familiar with, like the Transformers library, Diffusers, Safetensors, just to name a few. But the one we'll be talking about today is Transformers.js. [02:08] So what is Transformers.js? Well, Transformers.js is a JavaScript library that allows you to run AI models 100% locally in the browser. Since everything runs 100% locally, all your data is kept safe and secure and you're also able to achieve extremely low latency and effortless scalability. [02:25] We also take advantage of many browser APIs that are now at our disposal, like WebNN, as you just heard of, WebGPU, and WebAssembly. [02:35] Some benefits of in-browser inference include, number one, security and privacy. So let's say you're recording a video of your face or microphone input. [02:45] Maybe you don't want that data to be sent over to a cloud. Maybe you're processing very sensitive documents that are related to your company. You want those to be parsed locally. [02:55] Next up is real-time applications. Thanks to the lack of a server, you don't need to bounce between servers or make requests. This is especially important in areas where internet connectivity is not as strong. And you also don't need to send extremely large files over the internet. Everything happens 100% locally.

3:20-5:05

[03:20] then this serves to highlight both for the developer as well as the user of the application. As a developer, [03:27] when you're creating, or maybe showcasing, a model of yours, and you want to show users how to play around with the model, you don't have to [03:37] host it on a dedicated GPU. You can rather distribute that compute to the users. [03:43] So now we've spoken about what we see for the developers, the improvements, and also for the user: [03:49] there are no API keys being exchanged, and you're not paying per token. [03:53] Everything runs on your device, therefore you pay for the compute by simply using your device. [03:59] And then, thanks to the web, distribution is as simple as posting a link. Someone goes to the website and there you go. There's your AI application. There's no worries about, you know, finding PyTorch and Python dependencies and worrying about whether this model will run well on Mac or Linux or Windows. Everything is just bundled into the website. You view the website and everything is good to go. [04:27] Some steps to maybe think about when optimizing for in-browser inference. [04:33] Number one is quantization. This basically involves reducing the computational and memory costs [04:38] by simply using lower precision data types like 8-bit integers or 16-bit floating point. We're able to reduce the size of models up to 8x without seeing major quality degradation. Of course this changes depending on the models you're trying to run, especially for smaller models, which may be more sensitive to quantization. So it's very model specific but we try to provide a wide range of various quantizations

5:08-6:41

[05:08] using Transformers.js. [05:10] The next thing is to take advantage of browser APIs like WebGPU and WebNN to really [05:18] take advantage of the native [05:22] hardware that the user has, but also in a highly efficient, optimized manner. [05:27] And then finally is ensuring that when taking our model from maybe a Python-based ecosystem to the web, taking into account how to bring this model and export it in a way that ensures [05:44] high optimization levels. So this may include fused kernels, custom operations, and so on. And in this case, for a very simple BERT embedding model, we were able to achieve a 4x performance boost, [05:59] just by changing the way the model is exported. [06:02] We also benefit greatly from the versatility of JavaScript. So yes, you're able to run these models and the library in the browser, but that's not where JavaScript stops. You may know that there are various JavaScript runtimes: Node.js, Bun, Deno. WebGPU support is currently being worked on in a few of them, and in some it's even working already quite well. Therefore, Transformers.js is able to take advantage of these browser APIs that are now [06:30] bundled into native executions. We also work well with various libraries and frameworks, like React, Svelte, Angular, Vue, et cetera.
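
To make the quantization step mentioned around 04:33 concrete, here is a minimal sketch of symmetric 8-bit quantization in plain JavaScript. It is an illustration only: the function names `quantizeInt8` and `dequantizeInt8` are hypothetical, and this is not claimed to be how Transformers.js or its model-conversion tooling quantizes weights. It only shows why storing 8-bit integers instead of 32-bit floats shrinks the data by 4x at the cost of a small rounding error.

```javascript
// Illustrative symmetric int8 quantization (hypothetical helper names).
// Not the scheme used by any particular library.

// Map float32 weights to int8 values plus one shared scale factor.
function quantizeInt8(weights) {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  // Avoid dividing by zero when every weight is 0.
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const v = Math.round(weights[i] / scale);
    q[i] = Math.max(-127, Math.min(127, v)); // clamp to the int8 range used
  }
  return { q, scale };
}

// Recover approximate float32 weights from the quantized form.
function dequantizeInt8({ q, scale }) {
  return Float32Array.from(q, (v) => v * scale);
}

const weights = new Float32Array([0.5, -1.0, 0.25, 0.0]);
const packed = quantizeInt8(weights);
const restored = dequantizeInt8(packed);
console.log(weights.byteLength, 'bytes ->', packed.q.byteLength, 'bytes');
console.log('restored:', Array.from(restored));
```

Each restored value differs from the original by at most half of `scale`, which is the trade-off between size and quality the talk describes as model specific.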

6:42-8:15

[06:42] And then various environments where you're able to deploy your applications. So yes, you're able to create a website that maybe would use a web worker, [06:49] but you're also able to use Transformers.js and these models as browser extensions, maybe serverless with Supabase Edge Functions, as well as desktop applications like Electron. [07:01] And then finally, we also integrate quite well with various build tools, Vite, Webpack, etc. [07:08] So you're able to bundle your application in a way that can be shipped to the users. [07:12] We're also working on mobile support via React Native, and we will be hopefully giving you a few more updates in the near future. [07:21] Currently, speaking of browser support, Google Chrome and other Chromium-based browsers have extremely good WebGPU support specifically, so all the demos I'll be showing are recorded in Chromium-based browsers, and they work extremely well and are able to take advantage of your hardware capabilities. [07:51] and a wide variety of others. WebGPU and WebNN are still experimental, but we hope to be getting this shipped really, really soon. And then finally, Safari. WebGPU actually just shipped in Safari 26, meaning you'll be able to use and run these WebAI applications in your browser, on macOS, iOS, iPadOS, and even visionOS, which is really exciting to see.
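
Because WebGPU availability varies by browser, as described above, an application typically checks for it before choosing where to run. The following is a hedged sketch: `hasWebGPU` and `pickDevice` are hypothetical helper names, and falling back to `'wasm'` is an assumed policy rather than something stated in the talk. The only browser API used is `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter exists.

```javascript
// Hypothetical helpers for choosing an execution device.

// True when the WebGPU entry point (navigator.gpu) is exposed at all.
function hasWebGPU(nav = globalThis.navigator) {
  return Boolean(nav && nav.gpu);
}

// navigator.gpu can exist without usable hardware, so also request an adapter.
// Assumed policy: fall back to WebAssembly when WebGPU is unavailable.
async function pickDevice(nav = globalThis.navigator) {
  if (!hasWebGPU(nav)) return 'wasm';
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? 'webgpu' : 'wasm';
}

// Example: log the chosen device in whatever environment this runs in.
pickDevice().then((device) => console.log('Running on:', device));
```

The `nav` parameter exists only so the logic can be exercised with a stand-in object outside a browser.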

8:15-9:53

[08:15] So, let's talk about usage and how Transformers.js has grown over time. [08:22] So starting off with NPM downloads, in just the last month we hit around 1.68 million NPM downloads. This is up 7% from the month before. [08:33] Unique monthly users of Transformers.js models, just also around 1.7 million, up 12% from the previous month. [08:40] And then CDN requests. So for those who don't want to maybe NPM install and run in various build tools, you can access it directly with the CDN link. And this is at nearly 11 million requests, which is up 13% from last month. [08:57] And maybe seeing this in context, Transformers.js version 1, when we released in March 2023, it was just a little side project, nothing major, [09:08] interested people maybe playing around with some things. Very low usage in the beginning. But as we kept iterating over time, version 2 hit around 5,000 unique monthly users. And then in the past year, [09:22] when version 3 released, which was actually announced at the WebAI Summit last year, hitting around 750,000 unique monthly users for version 3. [09:34] And now today we just hit over 1.7 million unique monthly users. [09:39] So it's almost a 2x increase, over a 2x increase since last year. [09:44] And this is all thanks to our amazing community from all over the world, building really amazing WebAI applications with Transformers.js.

9:53-11:33

[09:53] We just want to say a massive thank you to the community. [09:57] uh, [09:58] Transformers.js would be nothing without you all. So we thank you so much for building and creating and showcasing what you've built. [10:05] So how can you take and use Transformers.js in your WebAI applications? Well, I hope to show it's only as simple as three lines of code to get started. So the first line is to import the pipeline function from the Transformers.js library. [10:21] The second line is to create an instance of the pipeline. In this case we'll be performing a task known as sentiment analysis. And then the third line is running [10:31] your input, in this case text, so "I love transformers", and then using the pipeline that you just created to return a result, in this case positive, with high likelihood. We also support being able to specify your custom models, so a very similar thing to what you just saw, but now the task has changed, so we're now doing background removal, and we're also choosing a different model that has been created by the community for background removal. [11:01] in this case, a link to the image, and you'll be able to remove the background 100% locally in your browser. [11:06] We also support various loading and runtime parameters. So in this case, the first, loading parameters: when you create the pipeline instance in the beginning, you're able to specify the device, whether you want to run on GPU, maybe WebGPU, WebNN, CPU or WebAssembly, as well as the data type or the quantization. So in this case, 4-bit quantization with 16-bit activations.
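
The three steps described above (import, create the pipeline, run it) and the loading parameters can be sketched as follows. The slides themselves are not part of this transcript, so this is a reconstruction under assumptions: it assumes the library is installed as the npm package `@huggingface/transformers`, that the default sentiment-analysis model can be downloaded, and that `'q4f16'` is the data type string corresponding to "4-bit quantization with 16-bit activations". The helper names `loadingOptions` and `classify` are hypothetical, and whether a given model ships weights for a given `dtype` is model specific.

```javascript
// Sketch only; assumes `@huggingface/transformers` is installed.

// Loading parameters: which device to run on and which quantization to load.
// The defaults here are illustrative choices, not library defaults.
function loadingOptions({ device = 'wasm', dtype = 'q8' } = {}) {
  return { device, dtype };
}

async function classify(text, options = loadingOptions()) {
  // Step 1: import the pipeline function from the library.
  const { pipeline } = await import('@huggingface/transformers');
  // Step 2: create a pipeline instance for the task (null = default model).
  const classifier = await pipeline('sentiment-analysis', null, options);
  // Step 3: run the input through the pipeline.
  return classifier(text);
}

// Options matching the configuration described in the talk.
const webgpuOptions = loadingOptions({ device: 'webgpu', dtype: 'q4f16' });
console.log(webgpuOptions);

// Usage (downloads a model on first run, so it is left commented out):
// classify('I love transformers!').then(console.log);
```

A dynamic `import()` is used so the file loads even where the package is missing; in an application a static `import { pipeline } from '@huggingface/transformers'` at the top of the module would be the more usual form.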

11:36-13:08

[11:36] specify various parameters like how many tokens you want to generate, whether you want to do sampling, what's the temperature, those kinds of parameters you may be familiar with. [11:45] We also support a bit more advanced usage if you want to really take advantage of maybe lower level things, maybe trying to [11:54] integrate this into your application logic of maybe mouse movements specifically. So in this case, being able to do something like image segmentation with Segment Anything in a very similar way to how the Python Transformers library may work. So if you're familiar with that library, we hope that the translation over to the JavaScript runtime will not be too challenging. [12:15] So maybe taking a step back, like maybe asking yourself how it works. So first, we provide a large collection of pre-converted models, nearing around 2,500 on the Hugging Face Hub. And if you have your own custom model, we provide various scripts and libraries that allow you to convert your PyTorch, JAX, or TensorFlow model to a unifying standard known as ONNX. [12:40] ONNX stands for Open Neural Network Exchange. [12:43] And then what you do is you write your Transformers.js code. In this case, we're performing speech recognition using Whisper Tiny, and we're running on WebGPU. And then behind the scenes, we take advantage of ONNX Runtime Web, which allows you to run these models on WebAssembly, WebGPU, and even WebNN, allowing you to choose the device and then run on CPU, GPU, or NPU, depending on what the user has at their disposal.
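
The runtime parameters mentioned at the start of this section (number of tokens, sampling, temperature) control how each next token is chosen. The sketch below illustrates the idea in plain JavaScript; it is not the library's implementation. The option names `do_sample` and `temperature` follow the naming used by the Python Transformers library and are assumed to carry over, and the `random` option is a hypothetical addition that makes the behaviour reproducible.

```javascript
// Illustrative next-token selection (not the library's implementation).

// Convert raw scores (logits) to probabilities; lower temperature sharpens
// the distribution, higher temperature flattens it.
function softmax(logits, temperature = 1) {
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Greedy decoding when do_sample is false, otherwise sample from softmax.
function pickToken(logits, { do_sample = false, temperature = 1, random = Math.random } = {}) {
  if (!do_sample) return logits.indexOf(Math.max(...logits));
  const probs = softmax(logits, temperature);
  let r = random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r < 0) return i;
  }
  return probs.length - 1; // guard against floating-point leftovers
}

const logits = [1, 3, 2];
console.log('greedy:', pickToken(logits));
console.log('sampled:', pickToken(logits, { do_sample: true, temperature: 0.7 }));
```

With sampling disabled the same input always yields the same token, which is why sampling and temperature are usually adjusted together.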

13:09-14:44

[13:09] So let's see what it takes to actually build these [13:13] AI-powered web applications. It all starts with the idea. And it involves asking yourself the question, what is the problem you're trying to solve? [13:21] or what experience are you trying to create? [13:24] Then, maybe ask yourself, why do I want to run this model in the browser? What are the advantages I can take advantage of for running in the browser? Maybe the low latency is something that's important to you. Maybe the distribution is something that's important to you. Maybe security and privacy is important to you. And those are all benefits for running on device. [13:44] And then you ask yourself, is there a task that has already been created for me that I can use to solve this problem, whether it's sentiment analysis, computing embeddings, [13:55] Thank you. [13:56] And then once you've identified the task you're trying to use, finding the model which best suits your use case. So if you're maybe translating between various languages, you might want to find a model that is good for French translation, for example. Or maybe the model size is important to you and you'd rather prefer maybe a 10 to 20 megabyte background removal model versus maybe a couple hundred megabytes, depending on the real-time aspect you're looking for. [14:23] Yeah. [14:24] So if all those boxes are ticked, let's build it with Transformers.js. [14:27] And we also encourage you to take and to learn from the community and see the example applications we've put out. So I put a few links here that if you want to go visit and see what applications we've built, highly recommend it. And it's kind of showing you what's possible and then

14:44-16:14

[14:44] giving the power back to you to integrate into your own workflows, to take advantage of your own knowledge in your very specific domain. And we really encourage and like to see what you build with it. [14:58] Some factors to consider when building WebAI applications. So, of course, bandwidth. The user is going to need to download the model at least once. Once downloaded, the model is cached. [15:10] You won't have to be re-downloading things. So model sizes need to be taken into account. Accuracy versus speed: whether ensuring high accuracy is something that's very important, or you're trying to run in real time, there are some trade-offs to make. Some device features: what capabilities does the user have, whether they have access to browser APIs, WebGPU, microphone input, etc. [15:33] And then some target devices. What are you trying to run on? Are you trying to run on mobile? Are you trying to only run on desktop? This can all help [15:41] Uh, [15:42] help you make decisions as to what models, what creations you're going to be working on. [15:49] So, now that we've got that out of the way, let's see what developers have been able to build so far. [15:54] Starting off with the traditional chatbot experience. But what I want you to take a note of is specifically the speed of this model that is running. A 1.7B model running on my M4 Mac at over 160 tokens a second. So blink and you'll miss it. This is really amazing for real time applications, as I'll show a bit later.
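
The bandwidth consideration above can be estimated with simple arithmetic before choosing a model. The sketch below uses the 1.7B-parameter model from the talk as an example; the helper names are hypothetical, the 100 Mbit/s connection speed is an assumed figure, and the estimate ignores real-world details such as layers kept at higher precision, file metadata, and compression.

```javascript
// Back-of-the-envelope download estimate (hypothetical helpers).

// Approximate weight storage: parameters x bits per weight, in bytes.
function estimateModelBytes(parameterCount, bitsPerWeight) {
  return (parameterCount * bitsPerWeight) / 8;
}

// Time to download a given number of bytes at a given link speed.
function downloadSeconds(bytes, megabitsPerSecond) {
  return (bytes * 8) / (megabitsPerSecond * 1e6);
}

function formatGB(bytes) {
  return (bytes / 1e9).toFixed(2) + ' GB';
}

const params = 1.7e9; // the 1.7B model mentioned in the talk
for (const bits of [32, 16, 8, 4]) {
  const bytes = estimateModelBytes(params, bits);
  const seconds = downloadSeconds(bytes, 100); // assumed 100 Mbit/s link
  console.log(`${bits}-bit: ${formatGB(bytes)}, about ${Math.round(seconds)} s`);
}
```

Going from 32-bit to 4-bit weights is the 8x size reduction quoted earlier in the talk, and since the model is cached after the first download, this cost is paid once per user.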

16:14-17:37

[16:14] As models get smaller, you can even pump those numbers up even higher. It's very great for low latency experiences. We also support various reasoning models. So this is DeepSeek R1's distilled version, a 1.5B model for reasoning in the browser. This model, when released, actually outperformed GPT-4o and Claude 3.5 Sonnet on various math benchmarks, which is really amazing to see for a model that can actually run in your browser. [16:42] Thank you. [16:43] Next, we have a vision language model, which is able to take video input, either a video stream or camera input or maybe your screen recording, and is able to do live captioning depending on what it sees. So in this case, we're running a video and live captioning as fast as it can, every frame that it captures, [17:05] performing description, even able to recognize text. And I'll actually be showcasing this demo outside for those who would like to see it in real time. [17:15] Then we also support, this is a Gemma 3 270M model, and I think this is where the power of small on-device models comes into play, where for your very specific task, in this case a fun little bedtime story generator, finding a model which works well for you that can run in the browser is really great to see.

17:45-19:18

[17:45] muted for now but I will show it a little bit later, is while the text is being generated, while the story is being generated, it's actually being spoken out to you. And what this means is that you actually have very low latency from the time you click start [17:58] to the time you actually hear something back, which is really great. [18:01] Then, this is a really fun one, running a tool calling model in the browser. Language models are notoriously bad at maybe mathematics, so instead of asking it to hallucinate, we instead give that control to a function that we've defined in JavaScript actually. In this case, math evaluation for some input, even random number generation, and now it'll call the tool and then return a random number between 1 and 1,000. [18:31] hook this up to browser APIs like location and time APIs. So in this case, requesting the user's location, of course they have to accept, and then returning that back to the LLM where it can formulate a more informed response. [18:46] Next, this is something called the Semantic Galaxy using EmbeddingGemma. What happens here is that you select a bunch of documents and you click generate galaxy. And what happens is that the model embeds these in a higher dimensional space and we project that back to 3D, allowing you to search for these documents, [19:04] number one, in real time, and number two, in a more visual and interactive way. So in this case, we search for weather and all the weather related documents are displayed. We can also hop around the galaxy, and the numbers that are attached there indicate the similarity score.
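
The tool calling flow described around 18:01 (the model emits a request, JavaScript runs the function, the result goes back to the model) can be sketched as a small dispatcher. Everything here is an assumption made for illustration: the tool names, the JSON shape `{"name": ..., "arguments": ...}`, and the `runToolCall` helper are hypothetical and do not reflect the demo's actual code or any particular model's output format.

```javascript
// Hypothetical tool registry; each tool is an ordinary JavaScript function.
const tools = {
  // Random integer in [min, max]; `random` is injectable for repeatability.
  random_number: ({ min, max }, random = Math.random) =>
    Math.floor(random() * (max - min + 1)) + min,
  // Exact arithmetic the language model should not have to guess.
  multiply: ({ a, b }) => a * b,
};

// Parse the model's output as a tool call and execute the matching function.
// Assumed format: {"name": "<tool>", "arguments": { ... }}
function runToolCall(modelOutput, random = Math.random) {
  let call;
  try {
    call = JSON.parse(modelOutput);
  } catch {
    return { error: 'output is not a JSON tool call' };
  }
  if (!call || typeof call.name !== 'string') {
    return { error: 'tool call is missing a name' };
  }
  if (!Object.prototype.hasOwnProperty.call(tools, call.name)) {
    return { error: `unknown tool: ${call.name}` };
  }
  const result = tools[call.name](call.arguments ?? {}, random);
  // In a real loop this result would be appended to the conversation so the
  // model can formulate its final answer.
  return { name: call.name, result };
}

console.log(runToolCall('{"name":"multiply","arguments":{"a":12,"b":34}}'));
console.log(runToolCall('{"name":"random_number","arguments":{"min":1,"max":1000}}'));
```

Browser APIs such as geolocation could be registered the same way, with the difference that they are asynchronous and require the user's permission.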

19:18-20:59

[19:18] And as you can see, the [19:20] semantics behind the documents are taken into account for the clusters, which is quite a fun way to visualize various documents of yours. [19:29] Then Kokoro, which is a text-to-speech model, you maybe saw the demo earlier, and we'll show a demo in a few moments. This model is kind of groundbreaking in a way because at only 82 million parameters, you're able to produce high-quality, realistic text-to-speech. [19:49] This one is Whisper WebGPU, allowing you to perform real-time speech recognition in the browser using OpenAI's set of Whisper models. [20:00] And then this is a game we developed actually almost two years ago, maybe over two years ago, called Doodle Dash. It's based on Google's Quick, Draw! game. But in this case, what's happening is that as you're drawing it live, an image classification model is running behind the scenes, detecting whether you are correctly drawing the label. It also runs on mobile, so if you want to try it out, just search for Doodle Dash, and you'll be able to run it on your phone. [20:25] And then the next one I think is pretty interesting for educational purposes. The idea here is to show [20:31] people how vision transformers work. So patching up the image and then identifying and creating semantically relevant chunks for analysis. So in this case, what we do is we upload an image of a tiger. And as the model progressively begins to maybe understand what's going on, so in the first layer, this is the first attention head, it doesn't really know what's going on. As you progress down the neural network, as you increase the number of, past the number of layers,

21:01-22:34

[21:01] So that's the sixth layer. As you keep going until the last layer, the model seems to understand maybe what makes up a tiger. Maybe you see the eyes are an interesting part, maybe the fur, and it correctly predicts tiger. Then this is a new model released by Meta called DINOv3. It actually enables video tracking in your browser, which is really exciting, [21:31] educational purposes, trying to show you when highlighting over a query position, seeing what other features are highlighted. And as you can see in the video tracking example, being able to select a few important key regions, and then clicking play to [21:48] see where those points move over time, which is really, really great to see. Uh, funny enough, this model wasn't even trained for the specific task, but, uh, it is able to generalize [21:58] um, [21:59] using the rich features it's able to produce. [22:02] And then this is our real time speech demo. - Hey there, my name is Heart. How can I help you today? [22:07] Can you tell me a joke, please? [22:11] Sure, here's one for you. [22:12] Why don't scientists trust atoms? [22:15] Because they make up everything. [22:17] Now, can you tell me what the capital of France is? [22:21] Of course. [22:22] The capital of France [22:24] is Paris. [22:26] - [22:30] Now I want you to roleplay as Santa and I'll be a little boy asking you for gifts.

22:36-24:34

[22:36] Ho, ho, ho! [22:38] Merry Christmas, little buddy. [22:40] What can I bring you this year? [22:42] I was hoping maybe a lump of coal. [22:46] A lump of coal? [22:48] That's a bit harsh, isn't it? [22:50] How about a nice, warm, cozy blanket instead? [22:54] That'll work too. [22:57] Perfect! [22:58] I'll make sure to wrap it up nicely for you. [23:01] There we go. So what's amazing about that demo is it's actually using and incorporating a bunch of different models. So using a voice activity detection model to know when you're trying to speak in the beginning, then a speech recognition model, and then a language model backbone for the quote unquote brain, and then a text-to-speech model at the end. [23:31] the latency and increase performance, [23:33] which we'll be excited to show in the near future when we get it working. So what is the latest news? What are the current plans? And what are maybe the next steps? [23:41] So taking a step back, [23:43] where have we gone over time? So the idea in early 2023: just wanted to create a spam detection [23:51] version. It was really, really simple. Version 1 released a couple of weeks later on NPM, just a few architectures supported. [23:59] Version 2 involved a complete rewrite to ES modules, 19 supported architectures. [24:05] A year and a bit later, version 3 was released, introducing WebGPU and WebNN support, now 119 supported architectures. And then a year from that, today, we support around 170 architectures. But you might be asking yourself, what's next? Well, I'm excited to announce that we are currently in developer preview for Transformers.js version 4. So if you're looking forward to even faster execution, an even wider range of models being supported, we hope you try it out.

24:35-24:56

[24:35] a release candidate in a couple of weeks on NPM for you to try out. [24:40] And I'm excited to see what you build with that. So with that, hopefully this talk has inspired you to maybe consider new technologies for your applications and more specifically your Web AI applications. And I'm really excited to see what you build with it next. [24:55] Thanks so much.