# Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap to use, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
## The Essentials
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a tag before answering with a final summary.
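Concretely, R1's responses wrap the reasoning in `<think>...</think>` tags, with the final answer following the closing tag. Here's a minimal sketch of how you might split the two apart:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (chain_of_thought, final_answer).

    Assumes the reasoning is wrapped in <think>...</think> tags,
    with the final summary following the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()  # no reasoning block found
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The answer is 12."
cot, answer = split_reasoning(example)
print(cot)     # 2 + 2 is 4, and 4 * 3 is 12.
print(answer)  # The answer is 12.
```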
## R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
## Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual one:

- The usual training strategy: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → multistage training pipeline with multiple SFT and RL stages
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to make sure the RL process has a decent starting point. This provides a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is usually a larger model than the student.
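As a minimal sketch of the data-generation half of that process (the model name and prompts below are stand-ins of my own, not what DeepSeek used; in their case the teacher was R1 itself and the students were Qwen and Llama models):

```python
from transformers import pipeline

# Stand-in teacher model; swap in whatever large model plays the teacher role.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

prompts = ["What is 17 * 24? Reason step by step, then give the answer."]

distillation_data = []
for prompt in prompts:
    completion = teacher(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
    distillation_data.append({"prompt": prompt, "completion": completion})

# `distillation_data` is then used as ordinary supervised fine-tuning data
# for the smaller student model.
```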
## Group Relative Policy Optimization (GRPO)
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in conventional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the response matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
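To make this concrete, here's a minimal sketch of what such rule-based rewards could look like. DeepSeek didn't publish their checks as code, so the function names, the language heuristic, and the weights below are illustrative assumptions:

```python
import re

def format_reward(completion: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward responses whose final answer matches the reference (exact match here)."""
    answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    return 1.0 if answer == reference_answer.strip() else 0.0

def language_consistency_reward(completion: str, prompt: str) -> float:
    """Crude proxy: penalize CJK characters in the response to an English prompt."""
    prompt_is_english = not re.search(r"[\u4e00-\u9fff]", prompt)
    completion_has_cjk = bool(re.search(r"[\u4e00-\u9fff]", completion))
    return 0.0 if (prompt_is_english and completion_has_cjk) else 1.0

def total_reward(completion: str, prompt: str, reference_answer: str) -> float:
    # Illustrative weighting of the three signals.
    return (accuracy_reward(completion, reference_answer)
            + 0.5 * format_reward(completion)
            + 0.5 * language_consistency_reward(completion, prompt))
```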
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
1. For each input prompt, the model generates a group of different responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
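A stripped-down sketch of steps 2-3, the group-relative advantage computation, following the normalization used in the DeepSeekMath formulation (the clipped policy update itself is omitted):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each response's reward against its group (all samples for one prompt):

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rule-based reward scores for 4 responses sampled from the same prompt.
rewards = [2.0, 0.5, 1.0, 0.5]
print(group_relative_advantages(rewards))
# Above-average responses get positive advantages (and are reinforced),
# below-average ones get negative advantages (and are pushed down).
```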
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the thinking-tag syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the paper.
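For reference, a minimal sketch of what training with TRL's GRPOTrainer can look like. The model name, dataset, and reward function here are placeholders, and the API may have changed since writing, so check the TRL documentation for the exact signatures:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy rule-based reward: prefer completions that contain a <think> block.
def format_reward(completions, **kwargs):
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

# Placeholder dataset with a "prompt" column.
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(output_dir="qwen-grpo", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # small stand-in model
    reward_funcs=format_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```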
## Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

> These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
## Running DeepSeek-R1
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
## 671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp.

29 layers seemed to be the sweet spot given this configuration.
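I ran this through the llama.cpp CLI, but if you prefer Python, the llama-cpp-python bindings expose the same offloading knob. A rough sketch, with the GGUF file name and context size as placeholders (llama.cpp picks up the remaining shards of a split model from the first file; the 4-bit KV-cache option from my CLI run is omitted here):

```python
from llama_cpp import Llama

# Placeholder path to the first shard of Unsloth's 1.58-bit GGUF.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,  # partial offload: 29 layers on the GPU
    n_ctx=2048,       # modest context to keep memory in check
)

out = llm("What is 7 * 6? Think step by step.", max_tokens=256)
print(out["choices"][0]["text"])
```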
Performance:
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
## 70B via Ollama
70.6B params, 4-bit KM quantized DeepSeek-R1, running via Ollama.
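If you'd rather drive this from Python than the Ollama CLI, a minimal sketch using the `ollama` client package looks roughly like this (the `deepseek-r1:70b` tag is the one published on the Ollama registry; the prompt is just an example):

```python
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
)
# The response includes the <think>...</think> reasoning followed by the answer.
print(response["message"]["content"])
```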
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
## Resources
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandma - YouTube
## DeepSeek
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
## Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently discovered and used some of the core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.