Local LLMs on Android: Benchmarking Qwen 3.5 Offline

I was on a flight last Thursday, airplane mode engaged, and I needed to parse a massive block of messy JSON. Normally I’d just wait until I landed to hit an API. Instead, I pulled out my Pixel 8, fired up a local 2B-parameter AI model, and had it format the whole thing in about four seconds. Completely offline.

That little win says a lot: we are finally past the gimmick phase of mobile artificial intelligence. You don’t need a massive server rack anymore. If your Android handset has at least 4GB of RAM, you can run a highly capable assistant right in your pocket. I’ve been testing the Qwen 3.5 series over the last few weeks, specifically the 0.8B and 2B variants packaged as GGUF files. They load quickly, and they don’t chew through your battery nearly as violently as earlier mobile inference attempts did.

The Storage Permission Nightmare

Getting this working used to require compiling obscure C++ libraries via Termux. Gross. Now you just grab a dedicated open-source chat client. I prefer the lightweight loaders that let you drop any standard GGUF file into a designated folder. You download the weights—usually around 1.2GB for a quantized 2B model—point the application at the directory, and you’re off.
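
As a sanity check on that download size, you can estimate what a quantized GGUF should weigh. This is a rough back-of-envelope sketch; the ~4.5 bits per weight figure (typical of a 4-bit k-quant) and the 5% metadata overhead are assumptions, not measured values:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float, overhead: float = 1.05) -> float:
    """Rough GGUF file size: params * bits/8, plus ~5% for metadata and embeddings."""
    return n_params * bits_per_weight / 8 / 1e9 * overhead

# A 2B-parameter model at an assumed ~4.5 bits per weight:
size = gguf_size_gb(2e9, 4.5)
print(f"{size:.2f} GB")  # prints "1.18 GB", close to the ~1.2GB download
```

If the file you downloaded is wildly off from this estimate, you probably grabbed the wrong quantization level.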

I did run into a ridiculously frustrating issue on Android 14. The OS’s scoped storage permissions aggressively blocked my chat client from reading the .gguf file sitting right there in my Downloads folder. The app would just crash silently on startup. The fix is annoying but simple: you have to move the model files directly into the application’s specific Android/data/com.[appname]/files directory. The catch? The default Google Files app hides this folder entirely now. I had to resort to a third-party file manager and an ADB shell command just to copy the weights over. Once the file was in the sandbox, everything booted instantly.
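
If you want to script the move rather than hunt through file managers, the app-scoped path follows a fixed pattern. A minimal helper for building it; the package name shown is a placeholder, not a real chat client:

```python
from pathlib import PurePosixPath

def sandbox_model_dir(package: str) -> PurePosixPath:
    """Build the app-scoped storage path that the app can read without extra permissions."""
    return PurePosixPath("/storage/emulated/0/Android/data") / package / "files"

target = sandbox_model_dir("com.example.chatclient")  # placeholder package name
print(target)  # prints "/storage/emulated/0/Android/data/com.example.chatclient/files"
# From a host machine, the copy itself is a one-liner (model filename is hypothetical):
#   adb push model.gguf /storage/emulated/0/Android/data/<package>/files/
```

Substitute your loader’s actual package id; you can find it in the app’s Play Store URL.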

Actual Hardware Benchmarks

Let’s look at some real numbers. I benchmarked the Qwen 3.5 2B model on a Snapdragon 8 Gen 2 device. Using an MLC-compiled client optimized specifically for the neural processing unit (NPU), I hit a sustained 24 tokens per second. That is faster than most people read. The smaller 0.8B model? It absolutely flies.

I threw the 0.8B weights on an older backup Motorola with just 2GB of RAM to see what would happen, and it managed about 18 t/s using standard CPU inference. It isn’t going to write a complex Python script for you, but for quick offline summarization or drafting an angry email to your ISP, it works brilliantly. The memory footprint hovered right around 850MB during active generation.
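
To put those throughput numbers in context, here is the arithmetic behind “faster than most people read.” The ~0.75 words-per-token ratio and the ~250 wpm average reading speed are common rules of thumb, not measurements from this test:

```python
def tokens_per_sec_to_wpm(tps: float, words_per_token: float = 0.75) -> float:
    """Convert generation throughput to an equivalent words-per-minute rate."""
    return tps * words_per_token * 60

npu_rate = tokens_per_sec_to_wpm(24)  # Snapdragon 8 Gen 2, NPU path
cpu_rate = tokens_per_sec_to_wpm(18)  # older Motorola, CPU path
print(npu_rate, cpu_rate)  # prints "1080.0 810.0", both far above a ~250 wpm reading pace
```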

CPU vs NPU: The Battery Killer

You have two paths when setting this up. Standard GGUF loaders lean entirely on the CPU. They work on almost any Android device made in the last four years. The downside is they will turn your phone into a hand warmer after ten minutes of use.

The alternative is using something like MLC Chat, which compiles the architecture to directly target the phone’s GPU or NPU. The difference in power consumption is massive. I ran a stress test generating long-form text continuously. CPU inference dropped my battery by 14% in twenty minutes. The back glass was physically uncomfortable to hold. The NPU-optimized version only drained 4% in that same timeframe, and the device stayed completely cool. If you plan to use local generation heavily, you absolutely must seek out NPU-accelerated clients.
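
Those stress-test numbers translate directly into runtime. A quick sketch using the measured drains above, assuming drain stays roughly linear (it won’t exactly, but it is close enough for comparison):

```python
def hours_to_empty(percent_drained: float, minutes: float) -> float:
    """Extrapolate full-battery runtime from a partial drain measurement."""
    rate_per_hour = percent_drained / (minutes / 60)
    return 100 / rate_per_hour

cpu = hours_to_empty(14, 20)  # CPU inference: 14% drained in 20 minutes
npu = hours_to_empty(4, 20)   # NPU inference: 4% drained in 20 minutes
print(f"CPU: {cpu:.1f} h, NPU: {npu:.1f} h")  # prints "CPU: 2.4 h, NPU: 8.3 h"
```

Under two and a half hours of continuous generation on the CPU path versus over eight on the NPU path is the whole argument in two numbers.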

The Memory Management Problem

Hardware manufacturers are clearly noticing this shift toward local execution. By Q1 2027, I expect we’ll see Android natively exposing protected NPU memory pools specifically for user-sideloaded weights. Right now, RAM management is still a wild-west situation.

If you switch away from your active chat to check Chrome for a quick search, Android’s aggressive background task killer will often nuke the LLM process instantly to free up memory. When you switch back, you have to wait for the entire 1.2GB model to reload from storage into RAM. It breaks your entire workflow.
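
That reload penalty is easy to estimate. Assuming sequential read speeds of roughly 1 to 2 GB/s for modern UFS flash (an assumption about typical hardware, not a measurement on my device):

```python
def reload_seconds(model_gb: float, read_gb_per_sec: float) -> float:
    """Time to stream model weights from flash storage back into RAM."""
    return model_gb / read_gb_per_sec

for speed in (1.0, 2.0):  # plausible UFS sequential read rates, in GB/s
    print(f"{speed} GB/s -> {reload_seconds(1.2, speed):.1f} s, plus tokenizer and graph init")
```

A second or two per reload sounds tolerable until it happens after every app switch.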

My current workaround is pinning the chat app in Android’s recent apps menu, which hints to the OS that the process should be left alone. It works most of the time, provided you aren’t trying to open a heavy 3D game simultaneously.

Ditch the cloud dependency if you can spare the storage space. Grab a 2B model, load it up, and turn off your Wi-Fi. It is weirdly liberating to know your queries aren’t sitting in a datacenter log somewhere waiting to be analyzed.
