# Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
What makes DeepSeek-R1 especially interesting is its transparency. Unlike the less open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
## The Essentials
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 builds on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a `<think>` tag before answering with a final summary.
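As a concrete illustration, here is a minimal sketch (my own, not from the paper) of how you might separate the thinking block from the final answer in an R1-style response, assuming the reasoning is wrapped in `<think>...</think>` tags:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (chain_of_thought, final_answer)."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()  # no thinking block found
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is basic arithmetic, so the answer is 4.</think>The answer is 4."
cot, answer = split_reasoning(example)
print(cot)     # 2 + 2 is basic arithmetic, so the answer is 4.
print(answer)  # The answer is 4.
```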
## R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
## Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual:

- The typical training approach: pretraining on a big dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: pretrained → RL
- R1: pretrained → multi-stage training pipeline with multiple SFT and RL stages
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to begin RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general abilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain distilled R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student. The teacher is typically a larger model than the student.
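To make the idea concrete, here is a minimal, hypothetical sketch of the data-generation half of distillation: a teacher model produces reasoning traces for a set of prompts, and the resulting (prompt, completion) pairs become supervised fine-tuning data for a smaller student. The model name and prompts are placeholders, not the ones DeepSeek used.

```python
from transformers import pipeline

# Hypothetical teacher; DeepSeek distilled from R1 itself.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

prompts = [
    "Solve step by step: what is 17 * 24?",
    "Explain why the sum of two even numbers is even.",
]

# Generate reasoning traces with the teacher.
distillation_data = []
for prompt in prompts:
    output = teacher(prompt, max_new_tokens=256)[0]["generated_text"]
    distillation_data.append({"prompt": prompt, "completion": output})

# These pairs would then be used as ordinary SFT data for the student model
# (e.g., a smaller Qwen or Llama checkpoint).
print(distillation_data[0]["completion"][:200])
```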
## Group Relative Policy Optimization (GRPO)
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In the paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on costly external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
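As an illustration, a rule-based reward along these lines could be as simple as the following toy sketch (my own version, not DeepSeek's actual implementation), combining a correctness check, a format check, and a crude language-consistency check:

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + formatting + language consistency."""
    reward = 0.0

    # 1. Correctness: does the text after the thinking block contain the reference answer?
    answer_part = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    if reference_answer.strip() in answer_part:
        reward += 1.0

    # 2. Formatting: is the chain-of-thought wrapped in <think>...</think> tags?
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that the response matches the prompt's script.
    if prompt.isascii() == response.isascii():
        reward += 0.5

    return reward

print(rule_based_reward(
    prompt="What is 2 + 2?",
    response="<think>Two plus two equals four.</think>The answer is 4.",
    reference_answer="4",
))  # 2.0
```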
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like correctness, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
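The group-relative advantage in step 3 is just a per-group normalization of the rewards. Here is a minimal sketch of that computation, using the mean/std normalization described in the DeepSeekMath paper:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its group: (r_i - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled responses to the same prompt, scored by a rule-based reward.
rewards = [2.0, 0.5, 1.5, 0.5]
print(group_relative_advantages(rewards))
# Responses above the group average get positive advantages, the rest negative.
```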
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
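If you want to experiment with GRPO via TRL, the rough shape of the API looked like the sketch below at the time of writing; the model name, dataset, and reward function are placeholders, so check the TRL documentation for the current interface:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: longer completions score higher (swap in a real rule-based reward).
def reward_len(completions, **kwargs):
    return [float(len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",   # small placeholder model
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```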
## Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

> These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

Put simply, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
## Running DeepSeek-R1
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to thoroughly benchmark the model's capabilities.
## 671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this setup.
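I ran llama.cpp directly; as a rough Python equivalent, the llama-cpp-python bindings expose the same partial-offload knob. A minimal sketch, with the GGUF path as a placeholder (and without the KV-cache quantization flags, which vary by version):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to the Unsloth 1.58-bit GGUF shards; adjust to your download.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=29,   # the partial-offload sweet spot found above
    n_ctx=4096,
)

out = llm("What is 2 + 2? Think step by step.", max_tokens=512)
print(out["choices"][0]["text"])
```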
Performance:
A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming rig. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
## 70B via Ollama
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
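For reference, here is a minimal sketch of querying the same model through the Ollama Python client, assuming the model was pulled under the tag `deepseek-r1:70b` (check `ollama list` for the exact tag on your machine):

```python
import ollama  # pip install ollama; requires a running Ollama server

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 2 + 2? Think step by step."}],
)
print(response["message"]["content"])
```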
## Resources
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandmother - YouTube
## DeepSeek
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
## Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.