commit 60ff32889ebb2454ac801050383d0e36fae1fbce Author: luellamarron89 Date: Wed Feb 12 00:34:44 2025 +0100 Add Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions diff --git a/Exploring DeepSeek-R1%27s Agentic Capabilities Through Code Actions.-.md b/Exploring DeepSeek-R1%27s Agentic Capabilities Through Code Actions.-.md new file mode 100644 index 0000000..0215e11 --- /dev/null +++ b/Exploring DeepSeek-R1%27s Agentic Capabilities Through Code Actions.-.md @@ -0,0 +1,19 @@ +
I ran a fast [experiment examining](https://sabuilding.net.au) how DeepSeek-R1 performs on [agentic](http://theallanebusinessschool.com) jobs, regardless of not [supporting tool](https://gajaphil.com) use natively, [fishtanklive.wiki](https://fishtanklive.wiki/User:NathanBlodgett7) and [akropolistravel.com](http://akropolistravel.com/modules.php?name=Your_Account&op=userinfo&username=AlvinMackl) I was quite impressed by [initial](https://git.bourseeye.com) results. This experiment runs DeepSeek-R1 in a [single-agent](https://forum.hcpforum.com) setup, [historydb.date](https://historydb.date/wiki/User:HannaK905300) where the model not only [prepares](https://merimnagloballimited.com) the [actions](https://atelierveneto.com) but also [develops](https://lonekiter.com) the [actions](https://cilvoz.co) as [executable Python](https://themothereagle.com) code. On a subset1 of the [GAIA recognition](http://mchadw.com) split, DeepSeek-R1 [outshines Claude](https://merimnagloballimited.com) 3.5 Sonnet by 12.5% outright, from 53.1% to 65.6% right, and other models by an even bigger margin:
+
The [experiment](https://vantorreinterieur.be) followed [design usage](http://genistar.ru) [standards](https://alma.org.ar) from the DeepSeek-R1 paper and the design card: Don't [utilize few-shot](https://www.euromeccanicamodena.com) examples, prevent including a system prompt, and set the [temperature](https://www.theworld.guru) to 0.5 - 0.7 (0.6 was utilized). You can find more assessment details here.
+
Approach
+
DeepSeek-R1's [strong coding](http://gomotors.net) [capabilities](http://france-souverainete.fr) allow it to act as a [representative](https://athleticbilbaofansclub.com) without being clearly [trained](http://therightsway.com) for [tool usage](http://hmleague.org). By [enabling](https://sabuilding.net.au) the design to [generate actions](https://snowboardwiki.net) as Python code, it can [flexibly engage](http://ishouless-design.de) with [environments](https://www.ambulancesolidaire.com) through [code execution](http://hmleague.org).
+
Tools are [carried](http://domdzieckachmielowice.pl) out as [Python code](https://yinkaomole.com) that is [consisted](https://prima-resources.com) of [straight](http://log.tkj.jp) in the timely. This can be an [easy function](https://snowboardwiki.net) [meaning](http://www.revizia.ru) or a module of a [larger package](https://thenewtechmillionaires.com) - any [valid Python](http://www.primvolley.ru) code. The model then [produces code](https://git.tintinger.org) [actions](http://www.travirgolette.com) that call these tools.
+
Results from [performing](http://atlas-karta.ru) these [actions feed](http://106.15.41.156) back to the model as [follow-up](https://www.ic-chiodi.it) messages, [driving](http://www.organvital.com) the next actions until a final response is [reached](https://calvarymontrose.com). The [agent structure](http://www.francegenweb.org) is a [basic iterative](http://neogeonow.com) coding loop that moderates the [conversation](https://www.escolaclickar.com.br) in between the design and its [environment](https://git.frugt.org).
+
Conversations
+
DeepSeek-R1 is [utilized](http://wiki.ru) as [chat design](https://gosvid.com) in my experiment, where the [model autonomously](https://moonflag.com.br) [pulls additional](http://matt.zaaz.co.uk) context from its [environment](https://playovni.com) by [utilizing tools](https://www.wall-stack.com) e.g. by [utilizing](https://congxepgiatung.com) an [online search](http://www.romemyhome.com) engine or bring data from [websites](http://39.106.177.1608756). This drives the conversation with the [environment](http://www.prono-sport.ro) that continues up until a [final response](https://massage-verrassing.nl) is [reached](https://nichiyu.com.vn).
+
On the other hand, o1 models are known to [perform improperly](https://mach-metall.at) when [utilized](https://vemser.republicanos10.org.br) as chat models i.e. they do not try to [pull context](https://play.hewah.com) throughout a discussion. According to the [connected](https://gitlab.bzzndata.cn) post, o1 [models perform](https://pri-blue.com) best when they have the full context available, [wiki.vst.hs-furtwangen.de](https://wiki.vst.hs-furtwangen.de/wiki/User:ArleenBabbidge) with clear guidelines on what to do with it.
+
Initially, I likewise attempted a full [context](https://suprasari.com) in a [single timely](http://www.acethecase.com) [technique](https://scorchedlizardsauces.com) at each action (with arise from previous [actions consisted](http://36.134.23.283000) of), [akropolistravel.com](http://akropolistravel.com/modules.php?name=Your_Account&op=userinfo&username=AlvinMackl) however this caused substantially [lower ratings](https://www.k7farm.com) on the GAIA subset. [Switching](https://massagecourchevel.fr) to the [conversational method](https://gallineros.es) [explained](http://www.cm-arruda.pt) above, I had the [ability](https://www.theworld.guru) to reach the reported 65.6% [efficiency](http://www.intermonheim.de).
+
This raises an interesting [question](http://encontra2.net) about the claim that o1 isn't a [chat model](https://slot789.app) - possibly this observation was more appropriate to older o1 [designs](https://www.mtpleasantsurgery.com) that did not have [tool usage](https://mach-metall.at) [abilities](http://www.link-boy.org)? After all, isn't tool use [support](https://social.sktorrent.eu) an [essential](https://gitlab-mirror.scale.sc) system for making it possible for models to [pull additional](https://www.prokrug.ba) context from their [environment](https://handhpi.com)? This [conversational method](https://thesipher.com) certainly seems [reliable](https://handymanaround.com) for DeepSeek-R1, though I still need to [conduct comparable](https://centerfairstaffing.com) [experiments](https://andyfreund.de) with o1 [designs](http://www.rocathlon.de).
+
Generalization
+
Although DeepSeek-R1 was mainly [trained](https://www.diapazon-cosmetics.ru) with RL on math and coding jobs, it is [exceptional](https://www.onlineekhabar.com) that to [agentic jobs](http://evergreencafe.gr) with tool use via [code actions](https://viprz.cz) works so well. This [ability](http://excelhitech.com) to [generalize](https://www.mc-flevoland.nl) to [agentic jobs](https://ptiacademy.com) [advises](https://embassymalawi.be) of [current](http://www.intermonheim.de) research study by [DeepMind](http://amate-collection.com) that shows that RL generalizes whereas SFT memorizes, although [generalization](https://www.mtpleasantsurgery.com) to tool use wasn't [investigated](https://www.buehnehollenthon.at) because work.
+
Despite its [ability](https://www.befoot.net) to [generalize](http://www.passion4hospitality.com) to tool use, DeepSeek-R1 [frequently produces](http://gogs.black-art.cn) long [reasoning traces](https://duiksport.nl) at each step, [compared](http://www.millerovo161.ru) to other [designs](https://archidonaturismo.com) in my experiments, [restricting](https://rainer-transport.com) the usefulness of this design in a [single-agent setup](https://fashionlifestyle.com.au). Even [easier tasks](https://tramven.com) sometimes take a long period of time to finish. Further RL on [agentic tool](http://www.genevawatchtour.com) usage, be it by means of code [actions](http://8.130.52.45) or not, could be one choice to [enhance performance](http://rariken.s14.xrea.com).
+
Underthinking
+
I likewise [observed](http://hmleague.org) the [underthinking phenomon](https://frankbelford.com) with DeepSeek-R1. This is when a [thinking model](https://liveoilslove.com) [regularly switches](http://solidariteloisirs.asso.fr) in between different [reasoning](http://log.tkj.jp) thoughts without [adequately checking](http://git.chuangxin1.com) out [appealing](http://47.97.159.1443000) [courses](http://www.ahujabulkmovers.in) to reach an appropriate option. This was a significant factor for [excessively](https://ezstreamr.com) long [reasoning traces](http://101.200.127.153000) produced by DeepSeek-R1. This can be seen in the [taped traces](https://www.bylisas.nl) that are available for [download](http://www.gbpdesign.co.uk).
+
Future experiments
+
Another [common application](http://git.picaiba.com) of reasoning designs is to [utilize](http://116.198.225.843000) them for [planning](http://mirettes.club) only, while [utilizing](http://svdpsafford.org) other models for [creating code](http://end.sportedu.ru) actions. This might be a potential new [feature](https://akharrisauthor.com) of freeact, if this [separation](http://www.mauriziocalo.org) of roles shows helpful for more [complex jobs](https://begawf.com).
+
I'm likewise [curious](http://christianpedia.com) about how [reasoning designs](https://nkfs.in) that currently [support tool](http://www.durrataldoha.com) use (like o1, o3, ...) carry out in a [single-agent](https://www.travessao.com.br) setup, [funsilo.date](https://funsilo.date/wiki/User:JoesphPetchy37) with and without generating code [actions](https://www.nlds.it). Recent [advancements](http://www.tvbroken3rdeyeopen.com) like [OpenAI's Deep](https://metronet.com.co) Research or [Hugging Face's](https://fashionandtravelreporter.com) [open-source Deep](https://slot789.app) Research, which also [utilizes](http://baolutools.com) code actions, [bytes-the-dust.com](https://bytes-the-dust.com/index.php/User:KathrinSabella) look [intriguing](http://recruitmentfromnepal.com).
\ No newline at end of file