# Understanding DeepSeek R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, and in several benchmarks surpass, OpenAI's o1 model, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly interesting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
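
For a concrete sense of the gap, here is a quick back-of-the-envelope calculation for a hypothetical workload of one million input and one million output tokens, using the prices quoted above (the $0.55 input tier for R1); the workload size is an arbitrary assumption:

```python
# Rough cost comparison for a hypothetical workload of 1M input + 1M output tokens,
# using the per-million-token prices quoted above (R1 input at the $0.55 tier).
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "DeepSeek-R1": (0.55, 2.19),
    "OpenAI o1": (15.00, 60.00),
}

input_millions, output_millions = 1.0, 1.0

for model, (inp, out) in PRICES.items():
    total = inp * input_millions + out * output_millions
    print(f"{model}: ${total:.2f}")

# DeepSeek-R1: $2.74
# OpenAI o1:   $75.00  -> roughly 27x more expensive for this workload
```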

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

## The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a dedicated tag before answering with a final summary.
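
As a rough illustration of what consuming such output looks like, here is a minimal sketch that splits a response into its reasoning and final answer, assuming a `<think>...</think>` tag convention; the tag name and the sample text are assumptions for illustration, not taken verbatim from the paper:

```python
import re

# A made-up example of a reasoning-model response: thinking inside <think> tags,
# followed by the final summary/answer (tag name assumed for illustration).
raw_output = (
    "<think>The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>"
    "17 multiplied by 24 is 408."
)

match = re.search(r"<think>(.*?)</think>\s*(.*)", raw_output, re.DOTALL)
if match:
    reasoning, answer = match.group(1), match.group(2)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```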

## R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.

## Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline deviates from the usual one:

- The usual training approach: Pretraining on a large dataset (training to predict the next word) to get a base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

The stages for R1 were:

1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by having the teacher generate training data for the student. The teacher is typically a larger model than the student.
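
A minimal sketch of the idea, not DeepSeek's actual distillation setup: sample completions from a teacher model and write them out as supervised fine-tuning data for a smaller student. The teacher model name, prompts, and file layout below are placeholders.

```python
import json
from transformers import pipeline

# Stand-in teacher for illustration; DeepSeek distilled from R1 itself, which is far larger.
teacher = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

prompts = [
    "What is 17 * 24? Think step by step.",
    "Is 97 a prime number? Think step by step.",
]

# Generate (prompt, completion) pairs; these become SFT data for the student.
with open("distill_sft.jsonl", "w") as f:
    for prompt in prompts:
        completion = teacher(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")

# The student (e.g., a smaller Qwen or Llama model) is then fine-tuned on
# distill_sft.jsonl with a standard supervised fine-tuning loop.
```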

## Group Relative Policy Optimization (GRPO)

The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
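
To make that concrete, here is a toy sketch of what such rule-based rewards could look like. The specific checks, weights, and `<think>` tag convention are my assumptions, not the paper's actual reward code:

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy reward: correctness + formatting + language consistency (all assumed heuristics)."""
    reward = 0.0

    # Formatting: did the model wrap its reasoning in thinking tags? (tag name assumed)
    format_ok = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    reward += 0.5 if format_ok else 0.0

    # Correctness: does the final answer (outside the thinking tags) contain the reference answer?
    final_part = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    reward += 1.0 if reference_answer in final_part else 0.0

    # Language consistency: crude check that an English (ASCII) prompt gets an ASCII answer.
    if prompt.isascii() and not final_part.isascii():
        reward -= 0.5

    return reward

print(rule_based_reward(
    prompt="What is 17 * 24?",
    response="<think>17 * 24 = 408</think>The answer is 408.",
    reference_answer="408",
))  # 1.5
```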

GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
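
A small numeric sketch of step 3, the "group-relative" part: rewards within one prompt's group are normalized into advantages, which is what removes the need for a separate critic. The reward values are made up, and the clipped policy update itself is only indicated in a comment:

```python
import numpy as np

# Made-up scalar rewards for G = 4 sampled responses to the same prompt.
rewards = np.array([1.5, 0.5, 1.0, 0.0])

# Group-relative advantage: how much better each response is than its group,
# A_i = (r_i - mean(r)) / std(r). No learned value model / critic is needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # approx. [ 1.34, -0.45,  0.45, -1.34]

# In the actual update, each response's token log-probability ratio
# new_policy / old_policy is multiplied by its advantage, clipped (PPO-style),
# and a KL penalty toward a reference policy keeps the update small.
```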

A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

## Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

> These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce significant performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
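
A toy illustration of that claim, with invented numbers: if RL mainly boosts the chance that an answer the base model could already reach ends up as the top sample, single-sample accuracy improves a lot while best-of-k accuracy barely moves:

```python
# Toy model of "sharpening" (numbers invented): p is the chance a single sample
# is correct; pass@k = 1 - (1 - p)^k is the chance at least one of k samples is.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

base_p, rl_p = 0.30, 0.70   # per-sample accuracy before / after RL fine-tuning
for label, p in [("base", base_p), ("after RL", rl_p)]:
    print(f"{label}: pass@1 = {pass_at_k(p, 1):.2f}, pass@64 = {pass_at_k(p, 64):.3f}")

# base: pass@1 = 0.30, pass@64 = 1.000
# after RL: pass@1 = 0.70, pass@64 = 1.000
# Best-of-64 was already saturated; RL's gain shows up almost entirely at pass@1.
```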

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

## Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

### 671B via llama.cpp

I ran the DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth via llama.cpp, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU). 29 layers seemed to be the sweet spot given this setup.
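
For reference, roughly what that setup looks like through the llama-cpp-python bindings rather than the llama.cpp CLI; the GGUF file path is a placeholder and the context size is an arbitrary choice, so treat it as a sketch rather than an exact reproduction of this run:

```python
from llama_cpp import Llama

# Partial GPU offloading: 29 of the model's layers on the GPU, the rest on CPU.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder path to Unsloth's 1.58-bit GGUF
    n_gpu_layers=29,
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```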

Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't really bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.

### 70B via Ollama

The 70.6B-parameter, 4-bit KM-quantized DeepSeek-R1 running via Ollama:

GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of the 671B model showcased above.
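
For anyone who wants to try a similar run, here is a minimal sketch using the `ollama` Python client; the `deepseek-r1:70b` tag is taken from Ollama's model library, but double-check it (and that the model has been pulled) before relying on this:

```python
import ollama  # assumes the Ollama server is running locally

# e.g. `ollama pull deepseek-r1:70b` beforehand (tag assumed from Ollama's model library)
response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Is 2027 a prime number? Think it through."}],
)
print(response["message"]["content"])
```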

## Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandma - YouTube

## DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

## Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1 (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.