HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88

An alarming number of people don't understand that LLMs work via purely stochastic processes, so I'm happy to see in-depth pieces like this. I'm looking for a job and maybe this is why it's so hard to get a callback these days: resumes are just dumped in some LLM black hole and no one really knows how it works. The author says:

> temperature 0.1 — low, supposedly nudging the model toward deterministic outputs

This is not correct (and is briefly touched on later in the piece when he sets temperature to 0). Temperature is not some kind of "deterministic" switch; it reshapes the sampling distribution, which becomes more "spiky" but is still very much a distribution.
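To make the "spiky but still a distribution" point concrete, here is a minimal sketch of temperature-scaled softmax. The logits are made-up numbers for three hypothetical candidate tokens, not anything taken from the tool or the article.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for this formula")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustrative only).
logits = [2.0, 1.8, 0.5]
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these made-up numbers the runner-up token still keeps roughly 12% of the probability mass at temperature 0.1, so two runs over the same input can easily diverge.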

And this + the tendency for AI to "prefer" AI-produced code + some other AI biases is why this is most likely highly illegal to use in the EU: it violates anti-discrimination laws in multiple ways.

To be clear:

- randomly filtering "too many" resumes is pretty much allowed (I think)

- but it must be actually random, independent of the resume (and can be applied in multiple layers, i.e. random filter > pre-select > random filter > select)

- this isn't the case for AI, as the random aspect is not independent of the actual resume evaluation

- in general you can't make sure the AI doesn't apply systematic biases, and there are strong indications that it does

- for humans you can train them and order them to ignore their biases. This won't work reliably either, _but now you have delegated the responsibility for illegal biases to the hiring personnel who violated the order_. With AI usage you are responsible no matter what you tell it. Lastly, you can technically show/prove that a specific AI is highly biased in a specific context, which for human employees is technically possible but not really practical. So this moves "specific, mostly deniable" cases into "systematic, proven bias" territory. In other words, legal risk goes from "limited/no issue" to "people can systematically f-you over if they know you use AI for hiring".

Breaking the steps down into more sub-tasks and using a loop could lead to a more deterministic output.

At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.

> I fail 65% of the time. Same exact resume, different luck.

As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.

35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.

The volume of applicants is SO HIGH that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there are 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.

Referral bonuses exist for a reason.

I added an online drag-and-drop hiring-agent checker, no sign-up required: https://universalresume.app/import?s=hc

It doesn't show the score because of the variability discussed here and only outputs readability/parser-style findings.

I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants.

For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.

And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.

This word (determinism) has a magical effect of warping any online posts it touches. Once you hear it you can almost guarantee it's going to be misguided. At least this time it's actual determinism (same input = same output), not arbitrary unrelated things.

Determinism matters for reproducibility, but do you really want these outputs to be reproducible in this particular case? Making LLM outputs deterministic is relatively trivial, you have to use batch-invariant kernels (if you use batching) and either set the temperature to 0 (don't do that, randomized sampling is here for a reason) or fix the seed (better). It's readily available in a few systems. But this won't make the result more useful, it will just obscure the fact that the agent is genuinely not sure about it - look at the range of the scores it gives! It still won't predict anything but the score will stay the same each time. Do you really want that?

What happens here is they're supplying too little information (just a resume, which is almost at the noise level) and expecting a reply with too broad implications. This is a basic design mistake regardless of whether it uses LLMs. All surveys, tests, laws, and voting systems are extremely sensitive to framing because they work off too little information. But they also don't exist in vacuum, unlike this thing.

I ran the ATS myself and had a similarly quirky experience. I was in the 70s because it couldn't find my GitHub profile, and then it didn't like some of the popular Ruby libraries I'm the author of.

After a few runs it picked things up appropriately. I always got dinged on formal education though.

This stuff is gross.

> The default model is gemma3:4b

That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.

This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.

It's always amazed me that a tech company will pay $300,000+ for a good engineer because talent is so hard to find... meanwhile their recruiter operates unsupported and has a very different idea of what good looks like. Their ATS black-holes >50% of the resumes because its filtering heuristics are garbage, because recruiting selected the ATS because it has a Gmail integration or something, and the ATS's filtering technology was not reviewed by anyone on the engineering or data teams.

Feels like "I Don't Hire Unlucky People" all over again, but with extra tokenmaxxing steps.

https://neonrocket.com/2014/05/rescued-from-the-ashes-i-dont...

I tried this with my CV, and it somehow scored me bonus points for GSoC!

   BONUS POINTS: 5.0
  ------------------------------
     Google Summer of Code (GSoC) participation: +5

Even though I've never done this, and don't claim to have done it in my CV.

This is the new AI reality everyone around seems to want: nondeterministic computing.

There is another name for it: a waste of electricity.

But wait, not waste! Consumers paid for it fully, with nice profit margins.

You and me, paid.

Try using Google Flights or booking.com: the prices shown in the search results list are frequently significantly different from those on a single result page. That's nondeterministic compute in a case where it's easy to spot. But it's not always that easy.

It's all sad, to be honest.

Or maybe your LLM results are being manipulated, 66/99 is a classic hacker dad quantification meme. :)

This insanity only exists because the tech industry is standard-less. No formal education needed, no formal training requirement, no apprenticeship, no software building code, no professional organization. Resumes have never been a good predictor of success - and why would they be?? Even if they're truthful and it's "impressive looking", that doesn't give you any assurance of knowledge, of who they learned under, what they learned, that they passed some minimum criteria. We might as well be rolling dice. So why not an LLM that randomly assigns scores?

I'm a little confused, is this an ATS system that anyone actually uses? If not, I'm not sure how it's better than just asking ChatGPT to score your resume out of 100. Why would you want to optimize your resume for a system no one is using to score it?

If I know the truth value of p and I also know p=>q, then an LLM would be able to deduce the truth value of q - even if the statements aren’t exactly in this form. Generally, LLMs are good with logical inference.

But logical inference itself is limited. You still have to find out if p is true or not - the ground truth.

How do you find that? You would be able to define in the prompt that if resume has p, infer q and do this. But determining the truth value of p is something LLM cannot do.

It’s not a limitation of the LLM. It’s the limitation of logic itself. You take 10 humans and give them the resumes with the same rubrics as the LLM. You’ll get a similar range of scores because everyone would assign different values.

The issue is not in logical inference. It’s in determining the value of p, which takes much more than logic. And current LLMs are limited to being logical.

Hiring and job search has been so hard and AI has amplified the existing problems instead of solving any.

From `resume_evaluation_system_message.jinja`

> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*

> - College, university, or educational institution name

> - CGPA, GPA, or academic grades

I don't understand why they would omit these factors from the evaluation.

It's fair to call out issues with the tool. But I think for individuals searching for jobs, using LLMs as the scapegoat for why it's hard to find a role is not terribly helpful.

In my experience, cold-applying has always worked essentially as a black hole, and LLMs haven't changed that much. The reality is that alternative avenues are always necessary to get the job you want. That could be a third-party recruiter; reaching out to a hiring manager on LinkedIn; or using your network to get referrals. Those continue to work whether the company is using a bone-headed tool like this or not.

> Sometimes my projects “lack architectural complexity”

Well done you! It is difficult to avoid architectural complexity, but imho well worth it.

The takeaway from this for me is that, using an LLM to score anything takes multiple (maybe even many) runs and the result you’ll get is, at best, a sane-ish distribution.

Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.

I am sure there’s some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this.

Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.
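A rough sketch of what the multi-run approach and its cost trade-off look like in numbers. The three scores are the ones from the post title; the function names and the normal-approximation interval are assumptions made for illustration, and with this few runs the interval is crude at best.

```python
import math
import statistics

def summarize_runs(scores, z=1.96):
    """Summarize repeated scores for one resume: mean, spread, rough 95% interval."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)          # sample standard deviation
    stderr = stdev / math.sqrt(len(scores))   # standard error of the mean
    return {
        "n": len(scores),
        "mean": mean,
        "stdev": stdev,
        "ci_low": mean - z * stderr,
        "ci_high": mean + z * stderr,
    }

def runs_needed(stdev, margin, z=1.96):
    """Rough number of runs for the mean to land within +/- margin (normal approx.)."""
    return math.ceil((z * stdev / margin) ** 2)

print(summarize_runs([90, 74, 88]))
print(runs_needed(stdev=10, margin=2))
```

Under these assumptions, a per-run standard deviation of about 10 points needs on the order of a hundred runs per resume to pin the mean down to within 2 points, which is where the budget argument starts to bite.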

There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?

Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.

That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.

So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like?

(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)

Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.

The list of "bonus" criteria and how they come about makes me feel sick.

I am not currently looking for employment, nor am I currently particularly worried about future prospects if I was suddenly in the position of looking for employment.

But if I ended up in a position with nothing to lean on but scattering my CV everywhere, well…

A lot of my major contributions are littered across the internet, private, or even just verbal/consultancy. They're things I did for free, in my spare time.

I also avoid GitHub. If you just look at my GitHub page for extra context, you would likely miss that delivering that very GitHub page likely involved a few bits of code I wrote.

Now, I could do a better job of trying to document this stuff, so it could be easier to find… But also I can't quite imagine how that would work.

> 35 points for open source contributions

> 30 for personal projects

These are insane weights for scoring a software engineer's resume.

It's funny that even after all these years and all this money invested in technology, we still haven't come up with anything better than word-of-mouth for hiring great people. Many serial founders have said that, despite the most stringent interview processes and the most sophisticated filtering pipelines, they still have a higher hit rate with people they've worked with in the past.

This isn't to diminish the whispernet. Rather, it shows just how many important signals cannot be quantized.

Yep, any day now AI is going to be so good we'll never need to think again. What's that, it's just a really expensive random number generator?

The blog post itself has pretty strong un-copy-edited ChatGPT vibes.

Oh ok. So I'll just have to apply 4-5 times to every job to be sure I'm considered. Sounds like a good equilibrium!

A better way to reformulate this problem is for the LLM to be tasked with making a _comparative_ judgement between two CVs. This should prove much more reliable, especially if you give it a third “too close to call” option. You can also ask for clear justifications of preference.
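A minimal sketch of that comparative setup, assuming a `judge` callable that wraps whatever LLM call you use and returns "A", "B", or "TIE". The order swapping and majority vote are additions of mine to offset position bias, not something from the tool, and `toy_judge` is only a stand-in so the example runs.

```python
from collections import Counter
from typing import Callable

Verdict = str  # one of "A", "B", "TIE"

def compare_pair(cv_a: str, cv_b: str,
                 judge: Callable[[str, str], Verdict],
                 rounds: int = 5) -> Verdict:
    """Ask the judge several times, swapping order to offset position bias."""
    votes = Counter()
    for i in range(rounds):
        if i % 2 == 0:
            verdict = judge(cv_a, cv_b)
        else:
            # Swap presentation order, then map the verdict back.
            swapped = judge(cv_b, cv_a)
            verdict = {"A": "B", "B": "A"}.get(swapped, "TIE")
        votes[verdict] += 1
    # Require a strict majority of all rounds; otherwise it is too close to call.
    for side in ("A", "B"):
        if votes[side] > rounds / 2:
            return side
    return "TIE"

def toy_judge(first: str, second: str) -> Verdict:
    """Stand-in judge for demonstration only: prefers the longer text."""
    if abs(len(first) - len(second)) < 10:
        return "TIE"
    return "A" if len(first) > len(second) else "B"

print(compare_pair("x" * 100, "y" * 20, toy_judge))
```

A full ranking would then come from many such pairwise results, at the cost of many more calls than a single absolute score.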

What does ATS mean? Neither the GitHub repo nor the article explains that.

He tried with a tiny model (gemma3:4b), got a range from 66 to 99. Then tried again with a small model (gemini 3.1 flash lite), the range was 48 to 64. Would a frontier model be more consistent? Perhaps this tool was optimized for more capable models?

Disregarding the fact that this thing is completely broken, its grading rubric is ridiculous to begin with (as was mentioned in the article itself, but I must reiterate how completely stupid this is):

> 35 points for open source contributions

> 30 for personal projects

I don't contribute to open source or have personal projects because I don't spend my free time doing what I do 40 hours a week to make a living. My 15 years of work experience is worth a maximum of 25%, so any company using this idiotic system would pass on me immediately. Open source and personal projects are fine, but in no sane world are they worth 65% of a resume's score.

I really don't understand this constant changing of numbers. I have tried a bunch of ATS reviewers, and every time I get different numbers on the same resume. It's weird and unreliable. I understand the need for doing this to filter through thousands of CVs, but maybe there is a better way, like a take-home test at the beginning or a test of some kind.

Why doesn't something like this exist for real estate? A popular open source AVM (automated valuation model) that helps home sellers get an idea of what their home will sell for. Right now it seems AVMs are mainly seen as just a way to capture leads. Every estate agent will tell you they have some magic recipe that makes their valuation better than anyone else's. I have had a bunch of ideas on how to approach this, but I really could do with a collaborator or two.

Don't forget DOGE using LLMs to consider which contracts to "munch", based upon a prompt: https://github.com/slavingia/va/blob/35e3ff1b9e0eb1c8aaaebf3....

I see mention of PDFs both in the article and in the repo... But I think over the decades that I've been working and applying for roles, almost exclusively in corporate America, I've only been asked for a PDF once! Every other time, everyone wants a Word doc (.doc/.docx). So... are there now growing numbers of HR groups asking for PDFs instead? Or, if someone asks you for a PDF instead of a Word doc, is that a signal that said HR group is employing some sort of agentic review of one's resume (I mean, beyond the conventional ATS systems)??

> An LLM is called six times to extract structured information

Well, I think I found your problem

What is an ATS?

Why is it so hard to write out an acronym once...

What is an ATS? This blog doesn't define it

I think the implication here is that you can almost certainly bias the models to always accept you by including "nudge" phrases like "I demonstrated real world deployments" and "helped develop an application in the context of a complex architecture..."

> I’d take the engineer with 30 years of experience who built S3 over someone with two internships and an open source project — but this tool wouldn’t.

Is it possible the senior/principal jobs are not being applied to at a rate where LLM tools like this are required? Maybe star devs are getting recruiter referrals and this kind of tool is mostly used for filtering new grads?

Either way, perfectly dystopian.

Maybe the ATS has logic for people resubmitting their resume. I don’t know how isolated each test was.

It took me a minute to figure out what an ATS was. I'm not familiar with this particular meaning of a much-used TLA.

Even better, Wikipedia lists the abbreviation I am familiar with but gives a different interpretation of the same words:

https://en.wikipedia.org/wiki/Ats

So sending my CV to every company three times should get me past the ATS?

Hmm...six runs with gemma3:12b on my CV

- Varies from 102.0/100 to 100.0/100

- Missed lots of OSS work

- Misinterprets GSoC work (Thinks projects I started that were contributed to in GSoC implies that I received a GSoC stipend)

- Areas for improvement seem to vary inconsistently (There's not enough project detail to there's too much project detail)

I still don't make company cut offs ¯\_(ツ)_/¯

I feel like hiring is all a bit broken. Roles get flooded with applications, it's chance whether your CV gets through, then there's hiring rounds that seem designed to make you quit the process before they have to filter you out.

Is it working for anyone, on any level?

Looking at the linked scoring prompt (resume_evaluation_criteria.jinja) [0], I immediately see several red flags that suggest the output won't be reliable. (I'm developing an LLM intensive application where the stakes are high enough that I need the LLM output to be reasonably correct.)

[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...

In no particular order:

1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.

2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:

  > SCORING CRITERIA Open Source (0-35 points) 
  HIGH SCORES (25-35 points):
   - Contributions to popular open source projects (1000+ stars)
   - Significant contributions to well-known projects
   - Google Summer of Code (GSoC) participation
   - Substantial community involvement

Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...
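One way an additive scheme could look, as a sketch: the LLM (or a parser) only answers yes/no facts, and plain code does the arithmetic. The fact names and point values below are invented for illustration; only the 35-point cap comes from the quoted rubric.

```python
# Hypothetical checklist: each item is a yes/no fact with a fixed point value.
OPEN_SOURCE_POINTS = {
    "merged_pr_in_1000_star_project": 15,
    "maintainer_of_published_package": 10,
    "gsoc_participant": 5,
    "answered_issues_or_reviews": 5,
}
OPEN_SOURCE_CAP = 35  # matches the 0-35 range in the quoted rubric

def score_open_source(facts: dict) -> int:
    """Sum fixed points for each fact that is true, capped at the category max."""
    total = sum(points for key, points in OPEN_SOURCE_POINTS.items()
                if facts.get(key, False))
    return min(total, OPEN_SOURCE_CAP)

print(score_open_source({"gsoc_participant": True,
                         "merged_pr_in_1000_star_project": True}))
```

The judgement call shrinks from "pick a number between 25 and 35" to "is this fact present or not", which is easier to audit and to dispute.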

3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...

- Are all contributions to open source projects with 1000+ stars equal?

- What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the commits in like the last ~6 months at minimum for the project to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.

- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?

Honestly at this point maybe someone should build a tool that scans prompts for adjectives...
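A deliberately naive sketch of such a scanner: it only matches a hand-picked word list, which is an assumption of mine. A real tool would more likely use a part-of-speech tagger.

```python
import re

# Hand-picked vague qualifiers (illustrative, not exhaustive).
VAGUE_WORDS = {"significant", "substantial", "popular", "well-known",
               "complex", "strong", "good", "high", "impressive"}

def find_vague_terms(prompt: str):
    """Return (line_number, word) pairs for vague qualifiers found in the prompt."""
    hits = []
    for line_no, line in enumerate(prompt.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z][A-Za-z-]*", line):
            if word.lower() in VAGUE_WORDS:
                hits.append((line_no, word))
    return hits

sample = """- Contributions to popular open source projects (1000+ stars)
- Significant contributions to well-known projects
- Substantial community involvement"""
print(find_vague_terms(sample))
```

Every hit is a place where the prompt hands a judgement call to the model without defining it.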

4. This sort of thing is just asking for trouble:

  > SCORES MUST NEVER DEPEND ON:
   Candidate's name, gender, or personal demographic information

Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)
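A minimal sketch of that pre-processing step, assuming you already know the candidate's name from the application form. The regexes and placeholder tokens are illustrative, and plenty of demographic signal (schools, locations, club names) would still leak through something this simple.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PRONOUN_RE = re.compile(r"\b(?:he|him|his|she|her|hers)\b", re.IGNORECASE)

def redact_resume(text: str, candidate_name: str) -> str:
    """Replace the candidate's name, emails, and gendered pronouns with placeholders."""
    # Emails first, so a name embedded in an address is removed with it.
    text = EMAIL_RE.sub("[EMAIL]", text)
    for part in candidate_name.split():
        text = re.sub(rf"\b{re.escape(part)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return PRONOUN_RE.sub("[PRONOUN]", text)

print(redact_resume("Jane Doe <jane.doe@example.com>. She built X.", "Jane Doe"))
```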

5. Obviously this one depends on the LLM in question, but instead of writing things like:

  > DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...

The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.
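A sketch of the forced-choice idea on the receiving side. It is plain client-side validation, not any provider's structured-output feature, and the field names and categories are invented for illustration.

```python
import json
from enum import Enum

class Level(str, Enum):
    NONE = "none"
    SOME = "some"
    STRONG = "strong"

# Hypothetical schema: every field must be one of the Level categories.
REQUIRED_FIELDS = {"open_source": Level, "personal_projects": Level}

def parse_evaluation(raw: str) -> dict:
    """Parse a model reply and reject anything outside the allowed categories."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        raise ValueError("reply must be an object with exactly the required fields")
    # Enum lookup raises ValueError for any value that is not a listed category.
    return {name: kind(data[name]) for name, kind in REQUIRED_FIELDS.items()}

print(parse_evaluation('{"open_source": "some", "personal_projects": "none"}'))
```

Points can then be assigned per category in ordinary code, so the model never gets to emit a free-form number.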

6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.

7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?

I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)

8. The prompt also seems unaware of what it's asking the LLM to do:

  > LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores

This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.

This reminds me of my former CTO. He would take a bunch of CVs and randomly throw some of them in a bin. He didn’t want to work with “unlucky” people.

> temperature 0.1 — low, supposedly nudging the model toward deterministic outputs

To be clear, temperature 0 is deterministic and will produce the same output for exact duplicate inputs, across all seed choices.

Provided:

* If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)

* Your kernels are deterministic

* There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)

Upshot:

Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.

To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?

In theory, temperature 0 does make the LLM deterministic.

Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).

However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.
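A toy sketch of both points: temperature 0 handled as a separate argmax branch, and a tiny numerical difference in the logits flipping the greedy pick. The logits are made up, and real inference stacks are far more involved than this.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick an index: greedy argmax at temperature 0, otherwise sample."""
    if temperature == 0:
        # Separate branch: dividing by zero is undefined, so take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Two near-tied logits, as if two runs differed by floating-point noise.
run_a = [3.1400001, 3.1400000, 1.0]
run_b = [3.1400000, 3.1400001, 1.0]
print(sample_token(run_a, 0, random.Random(0)))
print(sample_token(run_b, 0, random.Random(0)))
```

In this sketch the two greedy picks differ even though nothing random is involved. Once one token differs, everything generated after it can diverge too.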

A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.

A spikier distribution is exactly what makes it closer to deterministic. That's not the point, though. Even with greedy (deterministic) decoding, it is still a black box that reacts to its inputs in unpredictable ways. Switching one word around might lead to different scores, for example.

> An alarming number of people don't understand that LLMs work via purely stochastic processes ...

I've been studying AI for 20 years. What really needs to be added to this statement is:

"An alarming number of people don't understand that LLMs work via purely stochastic processes - and so does human thinking. People do NOT arrive at the same conclusion if merely the weather's different. Worse: with human thinking not only do most people not think this is real, a subset of people will actively fight the idea. Of course, depending on the weather"

it's a bad idea in general to use non-1.0 temperature. there is a reason labs strongly recommend using 1.0.

using low temperature is more deterministic, but the cost is that the model becomes "dumber"

Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.

At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.

At one point in the past a major UK medical school (Barts and The London School of Medicine and Dentistry, part of Queen Mary University of London) adopted random selection for qualified candidates. The approach benefitted qualified students from less well-off backgrounds over those who could afford to win at the ever more elaborate (manual at the time) hurdles of resume assessment criteria and effectively game the system. There was an orchestrated campaign against the lottery around "Why gamble with would-be doctors?". Random selection was quietly dropped.

A person's total luck is constant over a lifetime. The remaining half of the candidates already spent some of their luck in this selection, so they'll be on average less lucky than the discarded half.

Or more to the point: there are generally far more qualified applicants than job roles. That is, training and education have greatly expanded over the last couple of decades to produce more and more job seekers, whilst job creation hasn't really kept pace.

This hurts more than it should.

> I fail 65% of the time. Same exact resume, different luck.

As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.

Referral bonuses exist for a reason.

In that case, I have a pre-screening system to sell you. Through state of the art technology, it only lets through the best* 1% of applications.

*According to our proprietary, undisclosed, non-deterministic metric, which may or may not be Math.random

So the logical solution is for candidates to submit multiple applications with slight variations to their contact info, "John Schmidt", "John J. Schmidt", "John J. J. Schmidt", "John Jacob J. Schmidt", "J. J. Jingleheimer Schmidt", etc.

Is it? Or is it a 65% chance of a resume getting ignored before a single human sees it, reducing your pipeline's likelihood of catching qualified candidates by the same?

Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. Otherwise they're just dragging out your hiring process or unnecessarily causing you to ultimately lower your hiring bars.

If you have no requirements for accuracy, you can just advance 35% of applicants at random.

If the first 50 people who apply are all bots, why are you reading resumes in order of submission?

there have got to be better ways to optimize pipelines. maybe set a limit on number of applications for a role based on the number you/your team can reliably go through them. if more are needed then open the role for another wave of applications.

Except the bit about ranking a decades-long S3 engineer lower than an intern with a GitHub repo.

I wonder if you could solve this for programming specifically as follows:

1. Give them some easy leetcode questions. Nothing that a competent programmer would have any problem with.

2. If they pass, ask for a deposit of like $20. Shouldn't be an issue for people who are actually serious.

3. Do more simple leetcode questions but this time on zoom so you can tell if they are using AI. If they pass that they get the deposit back.

(Yeah I know there are real-time interview cheat AI programs but based on what I've seen on demos of them it's super obvious when they're being used.)

Probably not practical but just a thought!

I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants.

Here's the brutal truth about life: it's not inherently fair and no one really cares about your circumstances. That means many people will be able to do things or have opportunities that you cannot have. That doesn't mean you have no power; however, you need to make a conscious long-term effort to get that power one piece at a time.

> The default model is gemma3:4b

That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.

This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.

This sort of model is fine for small problems, when used in the right way. I think there's probably a version of Resume analysis that would work well with this model, but "hey clanker, what projects has this person done" is not the way. You need extraction, cleanup, probably OCR to compare and further clean up, multiple analysis passes per signal with LLMs, judges, etc. None of that needs to be large models, you'll get marginally better performance, but there's very little context, these models will perform well when used correctly.

I would assume at least hackerrank is?

I don’t think the point of a lot of this is to optimize your resume. It’s to show how arbitrary these systems are.

From my understanding this one is used for hiring tech workers only. The (very) widely used Workday application system for ex seems to have its own built-in ATS.

“I'm a little confused, is this an ATS system that anyone actually uses?”

You read my mind. If the answer is “no”, then we can ignore this.

(Almost) everyone’s using some kind of ATS, every ATS is adding AI auto-ranking (and has been trying to for 15 years), and almost all HR people feel like they have too many obviously bad CVs to read. Whether or not someone is using this ATS specifically, if you submit several CVs to several places, your CV is going into at least one magical 8-ball.

Hiring and job search has been so hard and AI has amplified the existing problems instead of solving any.

Wdym, can't you just litter your applications with buzzwords and other bs to automatically get a high score in these systems?

From `resume_evaluation_system_message.jinja`

> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*

> - College, university, or educational institution name

> - CGPA, GPA, or academic grades

I don't understand why they would omit these factors from the evaluation.

> I don't understand why they would omit these factors from the evaluation.

Only hiring MIT graduates sounds great to a lot of tech folks! Automatically rejecting applicants from HBCUs, however, sounds like a lawsuit

As to the GPA thing, I think it's just to stop the LLM glomming onto an obvious numerical grade? LLMs like to rank things by obvious dimensions, and whether someone had a 4.0 or a 3.8 in grad school makes very little difference to their performance 10 years down the line.

Hopefully so that people like me, that dropped out of high school yet have had a successful career as a self-taught engineer, have a chance. [1]

Just kidding, my resumes are sent to /dev/null like everybody else’s.

——

1: In fact, I will be controversial and say that self-taught engineers tend to be the strongest in their own particular niche, because they are powered by sheer desire to learn and improve. I am routinely appalled by how many people go on forums to ask how to learn a new thing, completely unable to self-direct their learning. I blame the modern school system.

The takeaway from this for me is that, using an LLM to score anything takes multiple (maybe even many) runs and the result you’ll get is, at best, a sane-ish distribution.

Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.

Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.

That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.

(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)

My experience with benchmarks and evals is that it can take ~20 runs of a problem for the distribution of answers to start to converge. Ideally you'd know the convergence properties of your algorithm ahead of time and make a Bayesian solution that makes the uncertainty explicit.
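To make the point about repeated runs concrete, here is a minimal sketch of averaging several scoring runs and reporting the spread. It assumes a hypothetical `noisy_score` stand-in that simulates one LLM call as a fixed score plus Gaussian noise; the noise level, the run counts, and the normal-approximation interval are illustrative assumptions, not measurements from the article.

```python
import random
import statistics

def noisy_score(true_score: float, noise_sd: float, rng: random.Random) -> float:
    """Hypothetical stand-in for one LLM scoring call: score plus Gaussian noise, clamped to 0-100."""
    return max(0.0, min(100.0, rng.gauss(true_score, noise_sd)))

def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of repeated scores."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem

rng = random.Random(0)  # fixed seed so the simulation itself is repeatable
for n in (3, 20, 100):
    runs = [noisy_score(80.0, 8.0, rng) for _ in range(n)]
    mean, sem = summarize(runs)
    # 1.96 * SEM is a rough 95% interval under a normal approximation
    print(f"n={n:3d}  mean={mean:5.1f}  +/- {1.96 * sem:4.1f}")
```

The interval shrinks roughly with the square root of the number of runs, which is one way to see why a handful of runs is rarely enough.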

That's a good idea.

The only drawback I see is that you should compare every pair of CVs for best results, and that grows quadratically with the number of CVs. Of course you can settle for fewer comparisons and not perfect results. But then I'm not sure if you can hit a good ratio of quality and token spend.
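For a sense of scale, the number of unordered pairs among n CVs is n(n-1)/2. A quick sketch (the CV counts are arbitrary examples, and it assumes one LLM call per pair):

```python
from math import comb

def pair_count(n_cvs: int) -> int:
    """Number of unordered pairs among n_cvs items, i.e. n * (n - 1) / 2."""
    return comb(n_cvs, 2)

for n in (10, 100, 1000):
    print(n, pair_count(n))  # 45, 4950, 499500 comparisons respectively
```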

It makes sense to me intuitively (though I'm not sure if my reasoning is actually correct).

A worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that its output has high variance. But a better model might "know" enough, so it can be more confident and thus more consistent.

> 35 points for open source contributions

> 30 for personal projects

They are selecting for people who are fine working in their free time. If you contribute to open source you are more likely to contribute to the company on weekends. If instead you have other hobbies or a family that takes up non-work hours you are more likely to drop your pen after forty hours.

I would say people that think the LLM is doing a better job than they are are in for a treat. I did expect the results to be of the same quality as if a human did the job: it averages out and has a big error margin.

The article raises a lot of questions the article already answered.

So sending my CV to every company three times should get me past the ATS?

Is it working for anyone, on any level?

I'm on the other side, and my main tip (at least if there's people like me!) is: avoid the usual AI signs.

For one role we got ~70 applications and all CVs looked obviously AI-written. I don't know whether the people did actually do any of the things mentioned and I don't have the time to find out, so the AI-written CVs are a discard-signal for me. (Either those people delegated a very important task to AI and didn't even bother to check, or they are bad at using AI and don't know it -- I want neither.)

Any CVs that signal they were actually written by a person I will actually look at.

[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...

In no particular order:

2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:

  > SCORING CRITERIA Open Source (0-35 points) 
  HIGH SCORES (25-35 points):
   - Contributions to popular open source projects (1000+ stars)
   - Significant contributions to well-known projects
   - Google Summer of Code (GSoC) participation
   - Substantial community involvement

3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...

- Are all contributions to open source projects with 1000+ stars equal?

- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?

Honestly at this point maybe someone should build a tool that scans prompts for adjectives...

4. This sort of thing is just asking for trouble:

  > SCORES MUST NEVER DEPEND ON:
   Candidate's name, gender, or personal demographic information

5. Obviously this one depends on the LLM in question, but instead of writing things like:

  > DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...

6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.

8. The prompt also seems unaware of what it's asking the LLM to do:

  > LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores

“I'm a little confused, is this an ATS system that anyone actually uses?”

You read my mind. If the answer is “no”, then we can ignore this.

For one, if you go on to Hacker Rank's "Screen" page, they mention the product is used by Stripe/AirBnB/LinkedIn/Atlassian/IBM etc etc. I imagine that there's plenty more companies using it too.

But I'd also assume that their competitors are doing something similar so I don't think we as an industry can just ignore that it's happening.

Wdym, can't you just litter your applications with buzzwords and other bs to automatically get a high score in these systems?

HR market is basically an early google rigging era, where you can place hundreds of keywords at the footer (white text on white background) to start popping up on random searches.

I have been on both sides of the market, and it sucks so bad at both ends. Companies that deeply care about their next hire are struggling to hire, and genuinely great people who are looking are outcompeted by AI slop and AI bulk applying.

It is actually a very hard to solve problem.

I don't understand why they'd hand over those data points over to the model in the first place. If it's in the context window, it's impacting the output. To ensure that no weight is placed on those factors, they should be sanitizing them out before handing the data over to the model.
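A minimal sketch of that sanitizing step, assuming the resume has already been parsed into a dictionary. The field names (`name`, `gender`, `institution`, `gpa`, and so on) are hypothetical and would depend on whatever parser is actually in use.

```python
import copy

# Hypothetical field names; a real parser's schema will differ.
SENSITIVE_FIELDS = {"name", "gender", "date_of_birth", "photo_url"}
SENSITIVE_EDUCATION_FIELDS = {"institution", "gpa"}

def redact_resume(resume: dict) -> dict:
    """Return a copy of the parsed resume with fields the scorer must not see removed."""
    clean = copy.deepcopy(resume)  # leave the caller's data untouched
    for field in SENSITIVE_FIELDS:
        clean.pop(field, None)
    for entry in clean.get("education", []):
        for field in SENSITIVE_EDUCATION_FIELDS:
            entry.pop(field, None)
    return clean

parsed = {
    "name": "Jane Doe",
    "skills": ["Python", "Go"],
    "education": [{"degree": "BSc", "institution": "Example University", "gpa": 3.9}],
}
print(redact_resume(parsed))
```

This only covers structured fields. Free-text sections (project descriptions, summaries) can still leak the same information, so it is a first step rather than a guarantee.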

Hopefully so that people like me, that dropped out of high school yet have had a successful career as a self-taught engineer, have a chance. [1]

Just kidding, my resumes are sent to /dev/null like everybody else’s.

——

I'm a self-taught programmer as well, who dropped out of university, and these factors being omitted would benefit me as well, but I feel like good grades and a good university are still indicators of someone being, or being capable of becoming, a good programmer.

This system would drop a Harvard top graduate for someone having a year of experience in some outsourcing firm.

That's a good idea.

Could probably do an Elo system and sample pairs. E.g.

1. Set the Elo rating of all CVs to 1000

2. Randomly pair up CVs and compare. Winners gain Elo, losers lose Elo.

3. Repeat #2 for a few iterations, then remove the bottom X% of CVs.

4. Repeat 2-3 until the number of remaining CVs is small enough to do an exhaustive comparison.

I don't have a mathematical proof, but I suspect that this is a decent cost-effective approximation of comparing every pair (depending on the parameters)
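The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tested ranking method: `prefer` is a hypothetical callback standing in for the pairwise LLM comparison, and the K-factor of 32, four rounds per phase, and 25% drop rate are arbitrary choices. The sketch stops once `keep` CVs remain and leaves the final exhaustive comparison to the caller.

```python
import random
from typing import Callable, Hashable

def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_filter(
    cvs: list[Hashable],
    prefer: Callable[[Hashable, Hashable], bool],  # hypothetical judge: True if first CV wins
    keep: int = 5,
    rounds: int = 4,
    drop_frac: float = 0.25,
    k: float = 32.0,
    seed: int = 0,
) -> list[Hashable]:
    rng = random.Random(seed)
    rating = {cv: 1000.0 for cv in cvs}          # step 1: everyone starts at 1000
    alive = list(cvs)
    while len(alive) > keep:
        for _ in range(rounds):                  # step 2: random pairings
            rng.shuffle(alive)
            for a, b in zip(alive[::2], alive[1::2]):
                s_a = 1.0 if prefer(a, b) else 0.0
                delta = k * (s_a - expected(rating[a], rating[b]))
                rating[a] += delta
                rating[b] -= delta
        # step 3: drop the lowest-rated fraction, never going below `keep`
        alive.sort(key=lambda cv: rating[cv], reverse=True)
        n_drop = max(1, int(len(alive) * drop_frac))
        alive = alive[: max(keep, len(alive) - n_drop)]
    return alive                                 # step 4 ends here in this sketch

# Toy usage: CV ids are integers and the "judge" simply prefers the larger id.
shortlist = elo_filter(list(range(40)), prefer=lambda a, b: a > b)
print(sorted(shortlist, reverse=True))
```

With a noisy judge, more rounds per phase or a smaller drop fraction would reduce the chance of eliminating a strong CV early, at the cost of more comparisons.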

> you should compare every pair of CVs for best results

Or compare each one to a reference set? Take 5 resumes of existing employees, rank all candidates against that set, maybe you get some useful level prediction into the bargain

Maybe they're selecting for intrinsic motivation. People who enjoy programming to the point they do it for fun, not just because it pays.

Free software work doesn't imply we work for free. We work on our projects, the stuff that we actually enjoy working on. Nobody is going to work on corporate products without adequate compensation.

> If you contribute to open source you are more likely to contribute to the company on weekends

I wonder if that assumption is borne out in reality though?

I'd imagine if someone's OSS contributions are enough of a factor that it's worth hiring them, they're not going to drop it on a whim to work extra hours on the day job.

(Assuming you weed out open source contributions like "I made a todo list app in React but licenced it as MIT" or "I fixed a typo in the docs for NextJS". )

You might have numbers on that but after working in a place with a strict no more than 40 hour policy my view is that people overwork for many reasons. Being an open source enthusiast is not one of them.

I'm not sure that follows. I stopped making open source contributions when I switched from mature companies to startups.

Now all my "non-work" time is spent on startup work. And none of that is visible via GitHub.

If I ever go back into the job market, I'll need three accounts: Peter J Smith, Peter Smith and PJ Smith. They live in #101, #102 and #103, 5607 Jane Street.

This reminds me of my former CTO. He would take a bunch of CVs and randomly throw some of them in a bin. He didn’t want to work with “unlucky” people.

I thought this was only an old urban legend; some people actually use this technique? Especially in a trade supposed to be led by people trained in sciences?

The problem is with this system he only worked with unlucky people.

In theory, temperature 0 does make the LLM deterministic.

>in theory theory, temperature 0 doesn't really exist.

It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a Dirac delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero is, in both theory and practice, turning the usual probability function into greedy sampling.
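A small numeric sketch of that limit, using made-up logits and a plain-Python softmax (no particular inference engine is implied): as the temperature shrinks, the mass concentrates on the largest logit, which is what a direct argmax picks.

```python
import math

def softmax_t(logits: list[float], temperature: float) -> list[float]:
    """Softmax of logits / temperature, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits: list[float]) -> int:
    """The T=0 shortcut: pick the index of the largest logit directly."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.5]  # illustrative values only
for t in (1.0, 0.1, 0.01):
    print(t, [round(p, 4) for p in softmax_t(logits, t)])
print("greedy pick:", greedy(logits))
```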

If you make an exact integer implementation and run with temp=0 it's deterministic.

You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input.
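A minimal sketch of the seeded-sampler idea, using Python's standard `random.Random` in place of a real inference engine's sampler. The token names and probabilities are made up, and the sketch assumes the probabilities themselves are identical from run to run, which is the part other comments in this thread dispute.

```python
import random

def sample_tokens(probs: dict[str, float], n: int, seed: int) -> list[str]:
    """Draw n tokens from a fixed distribution using a sampler seeded by the caller."""
    rng = random.Random(seed)            # the seed is treated as part of the input
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

probs = {"A": 0.8, "B": 0.19, "C": 0.01}  # illustrative distribution
print(sample_tokens(probs, 10, seed=42))
print(sample_tokens(probs, 10, seed=42))  # same seed, same sequence
```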

But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise.

> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run

The implementation does not often differ run by run.

As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap that out with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).

But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.

They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.

> so in principle, setting temperature to 0 _should_ result in deterministic outputs

It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.
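The tie case can be checked numerically with the commenter's own example. This is a plain-Python sketch: the two tied entries keep probability 0.5 each however small the temperature gets, so the limiting distribution is still a coin flip between them.

```python
import math

def softmax_t(logits: list[float], temperature: float) -> list[float]:
    """Softmax of logits / temperature, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

for t in (1.0, 0.1, 0.01):
    print(t, [round(p, 4) for p in softmax_t([1.0, 0.0, 1.0], t)])
```

An argmax shortcut hides this by applying a fixed tie-breaking rule (Python's `max`, for instance, returns the first maximal element), which is an implementation choice rather than something the limit itself dictates.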

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
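The underlying effect is that floating-point addition is not associative, so the order of a sum changes the result. The sketch below shows this on the CPU with ordinary Python floats; it does not reproduce any GPU kernel, and the values are chosen to make the effect obvious.

```python
def sum_in_order(values: list[float]) -> float:
    """Accumulate left to right, the way a simple sequential loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, summed in two different orders.
a = sum_in_order([1e16, 1.0, -1e16])  # the 1.0 is lost when added to 1e16
b = sum_in_order([1e16, -1e16, 1.0])  # the large terms cancel first
print(a, b)                            # 0.0 1.0

# A subtler everyday case:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In a real model the differences per operation are far smaller than in this example, but they feed into every subsequent layer and token.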

There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with greedy decoding enabled (typically what temperature=0 achieves).

Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.

So you would get always the same result, but it could be the wrong one

I mean, the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 for T1, 0.19 for T2, ... and sometimes sample T2.

Yeah, this is the forest that the people arguing about math trees are missing. It doesn't matter that the algorithm is deterministic if the algorithm passes the input through a cryptographic hash function to make a yes/no decision. The result may be perfectly reproducible and still nonsensical in its distribution with respect to its input domain.
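As a deliberately absurd sketch of that point, the function below is fully deterministic and reproducible, yet its decision has nothing to do with the content of the resume. `hash_decision` is a made-up name for illustration, not something from the project under discussion.

```python
import hashlib

def hash_decision(resume_text: str) -> bool:
    """Deterministic but meaningless: accept if the first SHA-256 byte is even."""
    digest = hashlib.sha256(resume_text.encode("utf-8")).digest()
    return digest[0] % 2 == 0

text = "10 years of distributed systems experience"
print(hash_decision(text), hash_decision(text))  # always the same answer twice
print(hash_decision(text + "."))                 # a trivial edit may flip the decision
```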

> An alarming number of people don't understand that LLMs work via purely stochastic processes ...

I've been studying AI for 20 years. What really needs to be added to this statement is:

The same person is not going to give you three different answers within a span of minutes. Especially when nothing fundamentally has changed. People might or might not update their views depending on their biases.

What's even worse, different humans have different weights.

If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.

Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical tween goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".

We expect computers to be consistent on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too.

Test retest reliability is a thing in psychometrics.

A studied example is sampling judicial decisions before lunch and after lunch: judges are more lenient on a full stomach.

It's a bad idea in general to use a non-1.0 temperature. There is a reason labs strongly recommend using 1.0.

Using a low temperature is more deterministic, but the cost is that the model becomes "dumber".

1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default

It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.

Plenty of setups default to lower values than 1.0.

Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.

I would expect that to depend on jurisdiction.

I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is that winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.

They don't need to actually filter/blackhole to have the same virtual effect.

Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking

*scores are generated with AI, mistakes may be made, use only as a guide and verify results

In situations when you get hundreds of applications for one open position (real market now), whatever reduces your pool to the size a human can handle, works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but LLM as a first filter can definitely do the job. You may burn less tokens than the hourly rate of your HR and it will be fairer than just dumping 50% of unread CVs in trash.

Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify.