tvcommercialad

luellamarron89/tvcommercialad

I ran a fast experiment examining how DeepSeek-R1 performs on agentic jobs, regardless of not supporting tool use natively, fishtanklive.wiki and akropolistravel.com I was quite impressed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, historydb.date where the model not only prepares the actions but also develops the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outshines Claude 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% right, and other models by an even bigger margin:

The experiment followed design usage standards from the DeepSeek-R1 paper and the design card: Don't utilize few-shot examples, prevent including a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was utilized). You can find more assessment details here.

Approach

DeepSeek-R1's strong coding capabilities allow it to act as a representative without being clearly trained for tool usage. By enabling the design to generate actions as Python code, it can flexibly engage with environments through code execution.

Tools are carried out as Python code that is consisted of straight in the timely. This can be an easy function meaning or a module of a larger package - any valid Python code. The model then produces code actions that call these tools.

Results from performing these actions feed back to the model as follow-up messages, driving the next actions until a final response is reached. The agent structure is a basic iterative coding loop that moderates the conversation in between the design and its environment.

Conversations

DeepSeek-R1 is utilized as chat design in my experiment, where the model autonomously pulls additional context from its environment by utilizing tools e.g. by utilizing an online search engine or bring data from websites. This drives the conversation with the environment that continues up until a final response is reached.

On the other hand, o1 models are known to perform improperly when utilized as chat models i.e. they do not try to pull context throughout a discussion. According to the connected post, o1 models perform best when they have the full context available, wiki.vst.hs-furtwangen.de with clear guidelines on what to do with it.

Initially, I likewise attempted a full context in a single timely technique at each action (with arise from previous actions consisted of), akropolistravel.com however this caused substantially lower ratings on the GAIA subset. Switching to the conversational method explained above, I had the ability to reach the reported 65.6% efficiency.

This raises an interesting question about the claim that o1 isn't a chat model - possibly this observation was more appropriate to older o1 designs that did not have tool usage abilities? After all, isn't tool use support an essential system for making it possible for models to pull additional context from their environment? This conversational method certainly seems reliable for DeepSeek-R1, though I still need to conduct comparable experiments with o1 designs.

Generalization

Although DeepSeek-R1 was mainly trained with RL on math and coding jobs, it is exceptional that to agentic jobs with tool use via code actions works so well. This ability to generalize to agentic jobs advises of current research study by DeepMind that shows that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated because work.

Despite its ability to generalize to tool use, DeepSeek-R1 frequently produces long reasoning traces at each step, compared to other designs in my experiments, restricting the usefulness of this design in a single-agent setup. Even easier tasks sometimes take a long period of time to finish. Further RL on agentic tool usage, be it by means of code actions or not, could be one choice to enhance performance.

Underthinking

I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model regularly switches in between different reasoning thoughts without adequately checking out appealing courses to reach an appropriate option. This was a significant factor for excessively long reasoning traces produced by DeepSeek-R1. This can be seen in the taped traces that are available for download.

Future experiments

Another common application of reasoning designs is to utilize them for planning only, while utilizing other models for creating code actions. This might be a potential new feature of freeact, if this separation of roles shows helpful for more complex jobs.

I'm likewise curious about how reasoning designs that currently support tool use (like o1, o3, ...) carry out in a single-agent setup, funsilo.date with and without generating code actions. Recent advancements like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also utilizes code actions, bytes-the-dust.com look intriguing.