How to Deploy Qwen-3 on DigitalOcean GPU Droplets

The past half year has seen absolutely meteoric growth in LLMs in terms of popularity, capability, and scalability, growth that feels unprecedented compared to where we were even a few years ago. Driving these innovations are the open-source model creators who have consistently delivered newer and more innovative technologies to the masses. Players in this space include Meta, with the release of models like Llama 4 Maverick and Scout, and the Qwen Team from Alibaba Cloud, which recently released Qwen3 to the public.

In this article, we will look at Qwen3’s new release, discuss how it compares to previous Qwen releases and other SOTA open-source LLMs, and then show how to run Qwen3 on an NVIDIA GPU-powered DigitalOcean GPU Droplet.

Prerequisites

  • Intermediate Deep Learning skills: this article will cover intermediate-level topics like LLM architectures and deployment
  • Familiarity with LLMs: users with basic knowledge of LLMs will benefit more from this article
  • Beginner Python: this article uses beginner Python coding to generate outputs with the LLMs
  • Access to a GPU Droplet: this article requires access to a GPU Droplet to follow along

Qwen3 Overview

Qwen3 is an LLM based on the original Qwen architecture from Alibaba Cloud. This model has been iterated on consistently since its original release, and Qwen3 is just the latest version to be publicly available.

For now, there is no technical report for Qwen3, but we can make a few assumptions about the model based on what has been said and previously released. Since the original model architecture for Qwen v1 was reportedly similar to LLaMA, we can infer this has continued into Qwen3. The team used a Transformer-based decoder architecture for the dense models in Qwen2 and Qwen2.5, and we can assume this has evolved further in Qwen3. We can also assume they used a similar Mixture of Experts (MoE) technique for the MoE model variants, where standard feed-forward network (FFN) layers are replaced with specialized MoE layers, and where each layer is comprised of multiple FFN experts and a routing mechanism that dispatches tokens to the top-K experts. This likely has improved further in Qwen3 over Qwen2.5, but little is clear about how without the technical report.
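To make the routing idea above concrete, here is a minimal, illustrative PyTorch sketch of a top-K MoE layer. It is only a toy example of the general technique, not Qwen3’s actual implementation, and all sizes below are made up for demonstration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: route each token to its top-K experts and mix the outputs."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # each expert is an ordinary feed-forward network (FFN)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # the router scores every token against every expert
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = self.router(x)                      # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, -1)   # keep only the top-K experts per token
        weights = F.softmax(weights, dim=-1)         # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(10, 64)).shape)          # torch.Size([10, 64])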

Fortunately, we know a lot more about what the model can reportedly do. The following has been stated specifically about the model's capabilities in the official Qwen3 blog post.

  • "Unique support of seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
  • Significant enhancement in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
  • Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
  • Expertise in agent capabilities, enabling precise integration with external tools in both thinking and unthinking modes and achieving leading performance among open-source models in complex agent-based tasks.
  • Support of 100+ languages and dialects with strong capabilities for multilingual instruction following and translation."
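The first point, switching between thinking and non-thinking modes, is exposed through the chat template's enable_thinking flag, which we will use again in the Transformers walkthrough later in this article. As a quick, illustrative preview (it assumes the transformers package is installed and that you have network access to fetch the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True adds the reasoning prompt; False requests a direct answer
for thinking in (True, False):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
    print(f"--- enable_thinking={thinking} ---")
    print(text)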

Qwen3 compared with other SOTA LLMs

[Figure: benchmark comparison of Qwen3-235B-A22B against other leading open-source and closed-source models, released by the Qwen Team.]

As we can see from the technical comparison released by the Qwen Team for this release, Qwen3-235B-A22B (the largest version of the model) outperforms or competes closely with top open-source and closed-source models like DeepSeek-R1 and Gemini 2.5 Pro. As more information becomes available, we will update this section with our own benchmarks comparing these models' performance on DigitalOcean machines.

How to run Qwen3 on a GPU Droplet

There are multiple ways to run Qwen3 on a DigitalOcean GPU Droplet, namely with vLLM, SGLang, and Transformers. In this section of the article, we will show how to set up the Qwen3-30B-A3B model on our Droplet using each of them.

First, spin up and launch your DigitalOcean GPU Droplet. For more details on setting up your environment for AI/ML on a GPU Droplet, please refer to this tutorial. Once that is complete, SSH into your machine or use the Cloud Console to access the terminal of your remote machine.

Next, we are going to set up our environment. Paste the following code into your terminal:

# system packages: git-lfs for pulling model weights, pip for Python packages
apt-get install git-lfs python3-pip

# upgrade pip first, then install the serving and inference libraries
pip install --upgrade pip
pip install vllm transformers sgl_kernel orjson torchao
pip install uv
uv pip install "sglang[all]>=0.4.6.post2"

This will install everything needed for each of the three methods we could use to deploy the Qwen3 model.
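Optionally, before moving on, we can run a quick sanity check from Python to confirm that the GPU is visible and the libraries installed above are importable. This is a small illustrative snippet, not a required step:

import importlib.metadata as md
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

for pkg in ("vllm", "transformers", "sglang"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")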

Finally, we will download the model onto our machine so that we can easily access the model files in the directory of our choice.

git-lfs clone https://huggingface.co/Qwen/Qwen3-30B-A3B        

This will download the model files into the directory ./Qwen3-30B-A3B. Alternatively, we can use the Hugging Face CLI downloader; be sure to note the path where the model files are saved when downloading this way. With the model on disk, we can begin deploying it using one of our three methods, starting with vLLM.

huggingface-cli download Qwen/Qwen3-30B-A3B        
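If you would rather control exactly where the files land when using the Hugging Face tooling, the huggingface_hub library (installed as a dependency of transformers) can download the snapshot to a directory of your choice. This is an optional sketch, not a required step:

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Qwen/Qwen3-30B-A3B",
    local_dir="./Qwen3-30B-A3B",   # weights land here instead of the default HF cache
)
print("Model files saved to:", local_path)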

vLLM

vLLM is an open-source library designed to enhance the efficiency of large language model (LLM) inference and serving. Using vLLM, we can very quickly serve our model in a production-ready capacity. Since we have already installed all the required packages, we can get started immediately.

To launch the vLLM server for the Qwen3 model, paste the following command into the terminal.

vllm serve ./Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1        

Change the path to the HuggingFace cache location if you used the CLI.

Now that the server is up, we can query the model through vLLM's OpenAI-compatible completions endpoint. Use the cURL example below to generate a sample output from the model.

curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
        }'        

If everything is working correctly, we should get an output approximating the following:

{"id":"cmpl-b3f3e575ee034634a0691bc586fa98be","object":"text_completion","created":1746468551,"model":"./Qwen3-30B-A3B","choices":[{"index":0,"text":" city in the state of California,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7,"prompt_tokens_details":null}}
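Because vLLM exposes an OpenAI-compatible API, the same server can also be queried from Python. Here is a minimal sketch that assumes the openai client package has been installed separately (pip install openai); the api_key value is a placeholder, as the local server does not check it by default.

from openai import OpenAI

# the api_key is a placeholder; the local vLLM server does not check it by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="./Qwen3-30B-A3B",       # must match the path/name passed to `vllm serve`
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(response.choices[0].text)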

SGLang

Next, we will run the model using SGLang. Similar to vLLM, SGLang is an open-source serving framework for large language models and vision-language models. It is very effective for hosting LLMs at scale, for both testing and production-ready use cases. To launch Qwen3-30B-A3B on a GPU Droplet, all we need to do is paste the following command into the terminal.

python3 -m sglang.launch_server --model-path ./Qwen3-30B-A3B --reasoning-parser qwen3        

As with vLLM, since we have already downloaded the model, this should be a relatively quick process. Change the path to the Hugging Face cache location if you used the CLI. We can now query the model using cURL.

curl -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
        "text": "The capital of France is",
        "sampling_params": {
          "temperature": 0,
          "max_new_tokens": 32
        }
  }'

This will generate an output that is approximately like the following:

{"text":" Paris. The capital of the United Kingdom is London. The capital of Germany is Berlin. The capital of Spain is Madrid. The capital of Italy is Rome.","meta_info":{"id":"eb1e4282eddc4e5e93ef0186ae3f1a89","finish_reason":{"type":"length","length":32},"prompt_tokens":5,"completion_tokens":32,"cached_tokens":2,"e2e_latenc        

Transformers

Finally, we can use the Transformers library to query the model directly in Python. While the other methods can also be accessed from Python, Transformers is the native model integration we prefer for the greatest versatility. Since we have already loaded the model into storage, we can load the checkpoints with the following code while also setting up the Python environment.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"  # or the local path "./Qwen3-30B-A3B" from the git-lfs download

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto"
)        

Next, we can prepare the model input. Modify the following code to change the prompt and to toggle reasoning (thinking mode); it then applies the chat template and tokenizes the input.

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
        {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)        

Finally, we can generate the output:

# conduct text completion
generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
        # rindex finding 151668 (</think>)
        index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
        index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)        

This will generate an output using the reasoning capability of the model. Since the output is several paragraphs long, we have opted not to include it here. Nonetheless, it’s easy to see how we can integrate this code seamlessly into our own projects, such as Gradio applications.
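As an illustration of that kind of integration, here is a minimal Gradio sketch that wraps the model and tokenizer loaded above in a simple web UI. It assumes gradio has been installed separately (pip install gradio) and is meant as a starting point rather than a production deployment.

import gradio as gr

def chat(prompt):
    # reuse the `model` and `tokenizer` objects loaded in the snippets above
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    new_tokens = output_ids[0][len(inputs.input_ids[0]):]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

gr.Interface(fn=chat, inputs="text", outputs="text", title="Qwen3-30B-A3B demo").launch()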

Closing Thoughts

Qwen3 is a particularly exciting model thanks to its seamless switching between thinking and non-thinking modes and its innovative agentic capabilities. We look forward to seeing more models like Qwen3 released to push the development of LLMs even further forward.


Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products.


