BONUS POINTS: 5.0
------------------------------
Google Summer of Code (GSoC) participation: +5
Even though I've never done this, and don't claim to have done it in my CV.
Well done you! It is difficult to avoid architectural complexity, but imho well worth it.
> temperature 0.1 — low, supposedly nudging the model toward deterministic outputs
This is not correct (and is briefly touched on later in the piece when he sets temperature to 0). Temperature is not some kind of "deterministic" switch; rather, it affects the sampling distribution (which becomes more "spiky", but is still very much a distribution).
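A minimal Python sketch of that point, assuming plain softmax sampling over made-up logits (the values are illustrative, not taken from any real model). Lowering the temperature concentrates probability on the top option, but the other options keep nonzero probability.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for the given logits at a temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # hypothetical logits for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

At temperature 0.1 the first option dominates, yet the second is still sampled some of the time, which is enough to change a score between runs.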
As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.
35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.
The volume of applicants is SO HIGH that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there’s 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.
Referral bonuses exist for a reason.
For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.
And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.
That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.
This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.
> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*
> - College, university, or educational institution name
> - CGPA, GPA, or academic grades
I don't understand why they would omit these factors from the evaluation.
Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.
I am sure there’s some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this.
Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.
There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?
Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.
That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.
So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like?
(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)
> 35 points for open source contributions
> 30 for personal projects
I don't contribute to open source or have personal projects because I don't spend my free time doing what I do 40 hours a week to make a living. My 15 years of work experience is worth a maximum of 25%, so any company using this idiotic system would pass on me immediately. Open source and personal projects are fine, but in no sane world are they worth 65% of a resume's score.
Is it working for anyone, on any level?
[0] https://github.com/interviewstreet/hiring-agent/blob/main/pr...
In no particular order:
1. The prompt is trying to get the system to do all of the evaluation steps at once. Instead, the system should break down the task of resume evaluation into its subcomponents and have separate prompts for each component. Like "evaluating open source contributions" should be its own task. Same with "assessing the complexity of software projects on the resume." Fwiw, each of the tasks contained within the prompt is woefully underspecified.
2. The prompt leaves spreads of ~10 points up to the LLM, when it's doubtful that humans are that well calibrated. Take for example:
> SCORING CRITERIA Open Source (0-35 points)
HIGH SCORES (25-35 points):
- Contributions to popular open source projects (1000+ stars)
- Significant contributions to well-known projects
- Google Summer of Code (GSoC) participation
- Substantial community involvement
Are all of these 35-point examples? Is one a 26-point example? If not, what's the difference? If an expert can't reliably make the judgement, the LLM is going to struggle too. One partial fix is to get rid of the ranges and just say all of these are worth 30 points. An additive point scheme would be better...
3. The authors of this prompt have left an incredible number of judgement calls up to the LLM, when that's the very thing you want to minimize. Using the same example as above...
- Are all contributions to open source projects with 1000+ stars equal?
- What counts as a "significant contribution"? Doesn't that imply that the LLM has to know or read through all of the commits in like the last ~6 months at minimum for the project to understand what the given contribution meant to the project? That itself isn't impossible with tool usage, but again, that'd be a separate task.
- What on earth counts as "Substantial community involvement"? Why didn't the prompt authors define this, or at least give a few examples?
Honestly at this point maybe someone should build a tool that scans prompts for adjectives...
4. This sort of thing is just asking for trouble:
> SCORES MUST NEVER DEPEND ON:
Candidate's name, gender, or personal demographic information
Just remove this stuff before you send the rest of the resume to the LLM. Even if you ask it not to, it's not a person, it's a very fancy statistical distribution generator. All of the input (including the name) will affect the distribution that gets generated. (This one is not unlike Andreessen's "don't be a sycophant" prompt.)
5. Obviously this one depends on the LLM in question, but instead of writing things like:
> DO NOT RETURN A RESUME SUMMARY. RETURN ONLY THE SCORING EVALUATION IN THE SPECIFIED JSON FORMAT. Analyze the following resume and provide a JSON response with this EXACT structure (all fields are required):...
The system should utilize the "structured output" option, which guarantees a fixed output format. Also, fwiw, the JSON should force the LLM to pick between categorical options as much as possible. Forced-choice structured output should, at least in theory, cut down on hallucinatory responses and constrain judgement calls.
6. One major thing that's not in the prompt is anything about traceability. This system should be designed so that humans can review the logs and make sure this is working as intended.
7. Another thing that is missing in the file is what I'll call evidence of a theory of coding / coder quality. Most of the examples are designed to have the LLM assess proxies for code quality, not code quality itself. Surely both should be taken into account?
I'm not an expert at evaluating coders. But two pretty basic LLM-answerable things I would ask are: How well do a candidate's 5 most recent commit messages match the contents of those commits? Do the claimed technical skills on the resume match their GitHub code? (i.e., if they say they know R, is there any evidence of that on their GitHub?)
8. The prompt also seems unaware of what it's asking the LLM to do:
> LIVE DEMO BONUS: Projects with working live demos should receive 10-20% higher scores
This implies that the LLM can use tools, but even then, I'd be pretty wary of its ability to fully execute this part of the prompt without more detailed instructions, examples, and guidance. There are very likely tons of edge cases here.
I don’t think the point of a lot of this is to optimize your resume. It’s to show how arbitrary these systems are.
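A rough Python sketch of the additive, forced-choice scoring suggested in points 2 and 5 of the numbered list above. The label names and point values are invented for illustration and are not taken from the hiring-agent prompt; only the 35-point cap comes from the quoted range. The idea is that the LLM would only pick labels from a closed set, and the arithmetic would happen in ordinary code.

```python
from enum import Enum

class OpenSourceEvidence(Enum):
    # Hypothetical closed set of labels the model is allowed to return.
    DOCS_FIX = "docs_fix"
    MERGED_PR_POPULAR_PROJECT = "merged_pr_popular_project"
    GSOC_PARTICIPANT = "gsoc_participant"
    PROJECT_MAINTAINER = "project_maintainer"

# Hypothetical point values; each piece of evidence adds a fixed amount.
POINTS = {
    OpenSourceEvidence.DOCS_FIX: 2,
    OpenSourceEvidence.MERGED_PR_POPULAR_PROJECT: 15,
    OpenSourceEvidence.GSOC_PARTICIPANT: 10,
    OpenSourceEvidence.PROJECT_MAINTAINER: 15,
}

MAX_OPEN_SOURCE_POINTS = 35  # cap from the quoted 0-35 range

def score_open_source(labels):
    """Sum fixed points for each distinct label, capped at the section maximum."""
    # Enum lookup raises ValueError for any label outside the allowed set.
    evidence = {OpenSourceEvidence(label) for label in labels}
    return min(MAX_OPEN_SOURCE_POINTS, sum(POINTS[e] for e in evidence))

print(score_open_source(["gsoc_participant", "docs_fix"]))
```

With this shape, the same set of labels always yields the same number, so any remaining run-to-run variance is confined to which labels the model picks.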
You read my mind. If the answer is “no”, then we can ignore this.
Only hiring MIT graduates sounds great to a lot of tech folks! Automatically rejecting applicants from HBCUs, however, sounds like a lawsuit
As to the GPA thing, I think it's just to stop the LLM glomming onto an obvious numerical grade? LLMs like to rank things by obvious dimensions, and whether someone had a 4.0 or a 3.8 in grad school makes very little difference to their performance 10 years down the line.
Just kidding, my resumes are sent to /dev/null like everybody else’s.
——
1: In fact, I will be controversial and say that self-taught engineers tend to be the strongest in their own particular niche, because they are powered by sheer desire to learn and improve. I am routinely appalled by how many people go on forums to ask how to learn a new thing, completely unable to self-direct their learning. I blame the modern school system.
The only drawback I see is that you should compare every pair of CVs for best results, and that grows quadratically with the number of CVs. Of course you can settle for fewer comparisons and not perfect results. But then I'm not sure if you can hit a good ratio of quality and token spend.
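For a sense of scale, the number of distinct pairs among n CVs is n(n-1)/2. A quick Python check, assuming one LLM call per pair:

```python
import math

def pairwise_comparisons(n):
    """Number of distinct unordered pairs among n CVs."""
    return math.comb(n, 2)

for n in (10, 70, 1000):
    print(n, pairwise_comparisons(n))
```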
A worse model may not "know" enough to distinguish between a 70 and a 100 candidate, so it's expected that its output has high variance. But a better model might "know" enough, so it can be more confident and thus more consistent.
For one role we got ~70 applications and all CVs looked obviously AI-written. I don't know whether the people did actually do any of the things mentioned and I don't have the time to find out, so the AI-written CVs are a discard-signal for me. (Either those people delegated a very important task to AI and didn't even bother to check, or they are bad at using AI and don't know it -- I want neither)
Any CVs that signal they were actually written by a person I will actually look at.
Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).
However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.
I've been studying AI for 20 years. What really needs to be added to this statement is:
"An alarming number of people don't understand that LLMs work via purely stochastic processes - and so does human thinking. People do NOT arrive at the same conclusion if merely the weather's different. Worse: with human thinking not only do most people not think this is real, a subset of people will actively fight the idea. Of course, depending on the weather"
using low temperature is more deterministic, but the cost is the model becomes "dumber"
You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input.
But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise.
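A toy Python sketch of the seeded-sampler idea, using the standard `random` module rather than any real inference engine. It assumes the probability distribution itself is reproduced exactly on every run; the `probs` values are made up.

```python
import random

def sample_tokens(probs, seed, n=10):
    """Draw n token indices from probs using a private RNG seeded from the input."""
    rng = random.Random(seed)  # seed is treated as part of the input
    tokens = list(range(len(probs)))
    return rng.choices(tokens, weights=probs, k=n)

probs = [0.6, 0.3, 0.1]  # hypothetical next-token distribution
print(sample_tokens(probs, seed=42))
print(sample_tokens(probs, seed=42))  # same seed, same draws
print(sample_tokens(probs, seed=43))  # different seed, independent draws
```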
The implementation does not often differ run by run.
I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.
Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking
*scores are generated with AI, mistakes may be made, use only as a guide and verify results
If the first 50 people who apply are all bots, why are you reading resumes in order of submission?
1. Give them some easy leetcode questions. Nothing that a competent programmer would have any problem with.
2. If they pass, ask for a deposit of like $20. Shouldn't be an issue for people who are actually serious.
3. Do more simple leetcode questions but this time on zoom so you can tell if they are using AI. If they pass that they get the deposit back.
(Yeah I know there are real-time interview cheat AI programs but based on what I've seen on demos of them it's super obvious when they're being used.)
Probably not practical but just a thought!
It doesn't show the score because of the variability discussed here and only outputs readability/parser-style findings.
But logical inference itself is limited. You still have to find out if p is true or not - the ground truth.
How do you find that? You would be able to define in the prompt that if resume has p, infer q and do this. But determining the truth value of p is something LLM cannot do.
It’s not a limitation of the LLM. It’s the limitation of logic itself. You take 10 humans and give them the resumes with the same rubrics as the LLM. You’ll get a similar range of scores because everyone would assign different values.
The issue is not in logical inference. It’s in determining the value of p, which takes much more than logic. And current LLMs are limited to being logical.
I am not currently looking for employment, nor am I currently particularly worried about future prospects if I was suddenly in the position of looking for employment.
But if I ended up in a position with nothing to lean on but scattering my CV everywhere, well…
A lot of my major contributions are littered across the internet, private, or even just verbal/consultancy. They're things I did for free, in my spare time.
I also avoid GitHub. If you just look at my GitHub page for extra context, you would likely miss that delivering that very GitHub page likely involved a few bits of code I wrote.
Now, I could do a better job of trying to document this stuff, so it could be easier to find… But also I can't quite imagine how that would work.
https://neonrocket.com/2014/05/rescued-from-the-ashes-i-dont...
Well, I think I found your problem
Why is it so hard to write out an acronym once...
But I'd also assume that their competitors are doing something similar so I don't think we as an industry can just ignore that it's happening.
It is actually a very hard to solve problem.
This system would drop a Harvard top graduate for someone having a year of experience in some outsourcing firm.
1. Set the elo of all CVs to 1000 elo
2. Randomly pair up CVs and compare. Winners gain elo, losers lose elo.
3. Repeat #2 for a few iterations, then remove bottom X% of CVs.
4. Repeat 2-3 until the amount of remaining CVs is small enough to do an exhaustive comparison.
I don't have a mathematical proof, but I suspect that this is a decent cost-effective approximation of comparing every pair (depending on the parameters)
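The steps above could be sketched in Python roughly as follows. The `compare` callable is a stand-in for whatever LLM call judges a pair; here it is replaced by a made-up hidden `quality` number so the example runs on its own. The K-factor, number of rounds, and drop fraction are arbitrary assumptions, not tuned values.

```python
import random

def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_prune(cvs, compare, rounds=3, drop_fraction=0.25, final_size=4, k=32, seed=0):
    """Prune CVs with random pairings and Elo updates; return (finalists, comparisons used)."""
    if not 0 < drop_fraction < 1:
        raise ValueError("drop_fraction must be between 0 and 1")
    rng = random.Random(seed)
    ratings = {cv: 1000.0 for cv in cvs}  # step 1: everyone starts at 1000
    pool = list(cvs)
    comparisons = 0
    while len(pool) > final_size:
        for _ in range(rounds):  # steps 2 and 3: a few rounds of random pairings
            rng.shuffle(pool)
            for a, b in zip(pool[::2], pool[1::2]):
                score_a = 1.0 if compare(a, b) == a else 0.0
                delta = k * (score_a - expected_score(ratings[a], ratings[b]))
                ratings[a] += delta  # winner gains what the loser loses
                ratings[b] -= delta
                comparisons += 1
        pool.sort(key=lambda cv: ratings[cv], reverse=True)
        keep = int(len(pool) * (1.0 - drop_fraction))
        keep = max(final_size, min(len(pool) - 1, keep))  # always shrink the pool
        pool = pool[:keep]  # drop the bottom of the ranking
    return pool, comparisons

# Hypothetical stand-in for an LLM judgement: the higher hidden quality wins.
quality = {f"cv{i}": i for i in range(16)}

def compare(a, b):
    return a if quality[a] > quality[b] else b

finalists, used = elo_prune(list(quality), compare)
print(finalists, used, "comparisons vs", 16 * 15 // 2, "for all pairs")
```

The remaining finalists would then get the exhaustive comparison from step 4. A real LLM judge would be noisy, unlike this deterministic stand-in, so how well the shortcut tracks the full ranking is an open question here.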
Or compare each one to a reference set? Take 5 resumes of existing employees, rank all candidates against that set, maybe you get some useful level prediction into the bargain
Free software work doesn't imply we work for free. We work on our projects, the stuff that we actually enjoy working on. Nobody is going to work on corporate products without adequate compensation.
I wonder if that assumption is borne out in reality though?
I'd imagine if someone's OSS contributions are enough of a factor that it's worth hiring them, they're not going to drop it on a whim to work extra hours on the day job.
(Assuming you weed out open source contributions like "I made a todo list app in React but licenced it as MIT" or "I fixed a typo in the docs for NextJS". )
Now all my "non-work" time is spent on startup work. And none of that is visible via GitHub.
It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero is, in both theory and practice, turning the usual probability function into greedy sampling.
But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.
They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.
It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly at random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.
Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
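The underlying effect is that floating-point addition is not associative, which can be seen in plain Python without a GPU. This sketch only shows the rounding effect itself; it says nothing about how large the impact on real logits is.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # sum the first two terms first
right = a + (b + c)  # sum the last two terms first

print(left, right, left == right)
```

The two groupings differ in the last bits of the result, so any reduction whose summation order varies between runs can produce slightly different values.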
If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.
Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical tween goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".
*According to our proprietary, undisclosed, non-deterministic metric, which may or may not be Math.random
Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. Otherwise they're just dragging out your hiring process or unnecessarily causing you to ultimately lower your hiring bars.
Determinism matters for reproducibility, but do you really want these outputs to be reproducible in this particular case? Making LLM outputs deterministic is relatively trivial, you have to use batch-invariant kernels (if you use batching) and either set the temperature to 0 (don't do that, randomized sampling is here for a reason) or fix the seed (better). It's readily available in a few systems. But this won't make the result more useful, it will just obscure the fact that the agent is genuinely not sure about it - look at the range of the scores it gives! It still won't predict anything but the score will stay the same each time. Do you really want that?
What happens here is they're supplying too little information (just a resume, which is almost at the noise level) and expecting a reply with too broad implications. This is a basic design mistake regardless of whether it uses LLMs. All surveys, tests, laws, and voting systems are extremely sensitive to framing because they work off too little information. But they also don't exist in vacuum, unlike this thing.
To be clear:
- randomly filtering "too many" resumes is pretty much allowed (I think)
- but must be actual random independent of the resume (and can be in multiple layers, i.e. random filter > pre-select > random filter > select)
- this isn't the case for AI, as the random aspect is not independent of the actual resume evaluation
- in general you can't make sure the AI doesn't apply systematic biases, and there is high indication that it does do so
- for humans you can train them and order them to ignore their biases. This won't work reliably either, _but now you have delegated the responsibility for illegal biases to the hiring personnel violating the order_. For AI usage, you are responsible no matter what you tell it. Lastly, you can technically show/prove that a specific AI is highly biased in specific contexts, which for human employees is technically possible but not really practical. So this moves "specific, mostly deniable" cases into "systematic, proven bias" territory. In other words, legal risk goes from "limited/no issue" to "people can systematically f-you over if they know you use AI for hiring".
After a few runs it picked things up appropriately. I always got dinged on formal education though.
This stuff is gross.
In my experience, cold-applying has always worked essentially as a black hole, and LLMs haven't changed that much. The reality is that alternative avenues are always necessary to get the job you want. That could be a third-party recruiter; reaching out to a hiring manager on LinkedIn; or using your network to get referrals. Those continue to work whether the company is using a bone-headed tool like this or not.
This isn't to diminish the whispernet. Rather, it shows just how many important signals cannot be quantized.
> 30 for personal projects
These are insane weights for scoring a software engineer's resume.
Even better, Wikipedia lists the abbreviation I am familiar with but gives a different interpretation of the same words:
Is it possible the senior/principal jobs are not being applied to at a rate that LLM tools like this are required? Maybe star devs are getting recruiter referrals and this kind of tool is mostly used for filtering new grads?
Either way, perfectly dystopian.
There is another name for it: a waste of electricity.
But wait, not waste! Consumers paid for it fully, with nice profit margins.
You and me, paid.
Try using google flights, or booking.com: the prices shown in search results list are frequently significantly different from those in a single result. It's a nondeterministic compute when it's easy to spot it. But it's not always that easy.
It's all sad, to be honest.
- Varies from 102.0/100 to 100.0/100
- Missed lots of OSS work
- Misinterprets GSoC work (Thinks projects I started that were contributed to in GSoC implies that I received a GSoC stipend)
- Areas for improvement seem to vary inconsistently (from "there's not enough project detail" to "there's too much project detail")
I still don't make company cut offs ¯\_(ツ)_/¯
Provided:
* If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)
* Your kernels are deterministic
* There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)
Upshot:
Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.
To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?
Ah yes, the much revered cosmological fairness constraint.
The volume makes it infeasible to review everyone for quality, even at an hour scale. The conclusion and solution is inevitable, though I wish it were different. 35% is actually really good if you’re not coming in through a referral.
The current reality is <1% and the person reviewing you is exhausted.
I'm saying this as somebody who most of the time has some side project going on.
Far worse would be different humans having the same weights.
What I'd really love is an actual number for a "human hallucination rate". How often will a random human
1) claim something that is wrong
2) defend the wrong claim and/or logic even when the problem is pointed out to them
(and this of course outside of the usual topics. In politics? I don't care. In religion? Don't care (well, maybe a bit more than politics). Let's say in physics or popular logic or something like that)
After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.
That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
To contextualize this insight in your post and basically just repeat what you are saying: The mistake is not using a non-deterministic system. The mistake could be, in some sense, using it too little. Re-evaluating the same resume 5 times and seeing a high variance in scores is a more useful signal than evaluating it once.
It's totally fine to filter out resumes in a completely random, content-independent way. Grabbing the fourth resume down in the pile and offering them the job is a perfectly fair albeit stupid way to make a hiring decision. However, AIs are very, very good at capturing biases, and it would not at all surprise me if an AI told to filter resumes is going to end up filtering with some biases for things that you definitely do not want to filter on, like the name of the candidate. And it might be that every resume that claims it fixed a typo in a major open source project gets a pass, but resumes that only list their own projects get rejected 60% of the time, so you're losing more good candidates than bad.
Credentialing helps maintain a quality floor. Does this person have basic employable skill? Nothing more. It actually doesn’t help you identify levels of talent and skill which is a universal hiring problem.
We do have a credential - a CS degree. And you can see it is a mixed signal. Employers can choose of their own free will to take risks on employees that do have this credential, or not.
Mandating by law that you must have a CS degree doesn’t seem to help our field as we famously have high performers across the spectrum of formal education.
Also, it doesn't pick up certifications or awards. I tried some PRs people are suggesting with enhancements (https://github.com/Zem-0/hiring-agent), it helps, but overall their ATS is hugely biased towards people with large GitHub contributions to OSS.
Chickens coming home to roost.
Really depends on the program. In my undergrad program there were some very smart CS students who got great grades that really struggled with the programming. Smart and capable people can be bad at programming and lack many qualities that make for a good hire.
(A more charitable interpretation would be that aforementioned CTO was making a joke that didn't land.)
In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
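A small Python check of the tie case, using the softmax([1, 0, 1]) example from earlier in the thread. The temperature value 0.01 is an arbitrary stand-in for "close to zero".

```python
import math

def softmax_t(logits, t):
    """Softmax of logits at temperature t > 0, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp((x - m) / t) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_t([1.0, 0.0, 1.0], 0.01)
print(probs)
```

The two tied maxima split the probability mass evenly no matter how small the temperature gets, so the limit still leaves a coin flip between them unless the implementation breaks ties by a fixed rule such as lowest index.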
I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".
But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.
That’s user-controlled too, not an inherent property of GPUs:
https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
We expect computers to be consistent despite running programs that are not designed to be consistent.
This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.
But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
Because it acts like an irrational gambling machine, I agree it can have an unwanted indirect discrimination effect in general. But it will probably not differentiate "on the grounds of religion or belief, disability, age or sexual orientation". It is possible, but that would take a lot of work for the lawyers to prove to the court.
I believe the more interesting part is the EU AI Act (still not in force in this regard until 2 December 2027). Under it, this will clearly be a high-risk AI system: "AI systems intended to be used for the recruitment or selection of natural persons, in particular to place targeted job advertisements, to analyse and filter job applications, and to evaluate candidates".
Which does not mean prohibited, but it could later turn out that LLMs will be excluded from being used in high-risk AI use cases (falling under article 6 with no exemptions).
Considering that none of the standards are published yet, I have absolutely no idea how they will ensure compliance with the following parts of Article 10 when using LLMs for such tasks: "(f) examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination prohibited under Union law, especially where data outputs influence inputs for future operations; (g) appropriate measures to detect, prevent and mitigate possible biases identified according to point (f)"
I don't think it's technically possible to do so with LLMs in general at the moment, even with the full cooperation of the model providers. Maybe you can do some meaningful audits for smaller models. But the EU AI Act may end up excluding all the generic "using-LLM-but-not-entirely-sure-why" vibe coded approaches from high-risk use cases (in Annex III). Which would make sense.
Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything.
Which means this is one good lawsuit away from being illegal in the US as well. It doesn't even necessarily have to "win", just do well enough in court to scare away anyone else from using this.
And boy oh boy would I hate to be on the receiving end of this lawsuit, trying to prove that my AI screener is completely in compliance with all hiring laws. That sounds like a nightmare.
It's generally illegal under GDPR Article 22.
> The data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.
Exceptions in 22(2) are unlikely to apply. It's hard to argue that it's truly necessary (a), and consent (c) is almost always unavailable in an employment context. (b) might apply, but it requires a specific EU or Member State law to authorize it.
> Resumes have never been a good predictor of success
Applies broadly to the world, it’s not unique to tech
Corpo bullshittery at its finest.
I guess there sadly are many nobodies who do this to hope to become somebody.
He's a 48.0/100, things that make you go Hmmm.
This open-source ATS by HackerRank has been blowing up recently: https://github.com/interviewstreet/hiring-agent
It’s popped up on LinkedIn and Reddit with hundreds, sometimes thousands, of likes.[1] A coworker mentioned it to me in passing a few days ago.
I’ve decided to test it out.
First working run: 90/100. Felt pretty good!
I had some debug prints scattered around from troubleshooting the setup, so I cleaned those up and ran it again.
74/100.
Same resume. Same command. The only thing I changed was deleting print statements.
I disabled DEVELOPMENT_MODE and put it in a loop to run a hundred times.
The scores range from 66 to 99.
If your company’s cutoff sits at 85, I fail 65% of the time. Same exact resume, different luck.
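The failure rate is just the share of runs that land below the cutoff. A minimal sketch of that arithmetic, using a made-up score list as a stand-in (the function name and the numbers are illustrative assumptions, not the actual 100 runs):

```python
def failure_rate(scores, cutoff):
    """Fraction of runs whose score falls below the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s < cutoff) / len(scores)


# Hypothetical scores for illustration only; NOT the real runs from the post.
example_scores = [66, 74, 90, 99, 81, 70, 88, 79, 72, 93]

print(f"min={min(example_scores)} max={max(example_scores)}")
print(f"fail rate at cutoff 85: {failure_rate(example_scores, 85):.0%}")
```

Feeding in the real per-run scores instead of the placeholder list would reproduce the number quoted above.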
Here’s a quick rundown of how the tool works:
Your PDF gets parsed into text. An LLM is called six times to extract structured information — your basics, work history, education, skills, projects, awards. It pulls your GitHub profile, scans your top repos, appends them as extra context. Then everything gets fed into the LLM at once to be graded.
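The flow described above can be sketched roughly as follows. This is not the repo's code: the prompts, the function names, and the `llm` callable are all invented here, and the stub model only records calls so the sketch runs without any LLM.

```python
from typing import Callable

# Section names follow the post's description; the prompts are invented here.
SECTIONS = ("basics", "work history", "education", "skills", "projects", "awards")


def evaluate_resume(resume_text: str, github_context: str, llm: Callable[[str], str]) -> str:
    """Run six extraction calls, then one grading call over everything at once."""
    extracted = {
        section: llm(f"Extract the {section} from this resume:\n{resume_text}")
        for section in SECTIONS
    }
    combined = "\n".join(f"{name}: {value}" for name, value in extracted.items())
    return llm(f"Grade this candidate out of 100.\n{combined}\nGitHub:\n{github_context}")


# Stub model (hypothetical) that only records prompts.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


final = evaluate_resume("resume text", "top repos", fake_llm)
print(len(calls), final)
```

Under this reading, one resume goes through seven sampled model calls, and the grading call sees whatever the six extraction calls happened to produce.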
The scoring is out of 100, with up to 20 bonus points on top:
35 points for open source contributions
30 for personal projects
25 for work experience
10 for technical skills
Up to 20 bonus points for startup experience, a portfolio site, a technical blog, etc.
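As a quick sanity check on those weights, here is a minimal sketch of the rubric as a lookup table. The dictionary keys and the clamping behaviour are assumptions made for illustration; the post does not say how the tool handles out-of-range values.

```python
# Category maxima as listed above; the bonus is capped at 20.
CATEGORY_MAX = {
    "open_source": 35,
    "personal_projects": 30,
    "work_experience": 25,
    "technical_skills": 10,
}
BONUS_MAX = 20


def total_score(category_scores, bonus=0.0):
    """Sum category scores, clamping each to its maximum (assumed behaviour)."""
    base = 0.0
    for name, maximum in CATEGORY_MAX.items():
        raw = category_scores.get(name, 0.0)
        base += min(max(raw, 0.0), maximum)
    return base + min(max(bonus, 0.0), BONUS_MAX)


# The four categories add up to exactly 100.
assert sum(CATEGORY_MAX.values()) == 100

# Share of the base score that depends on open source plus personal projects.
public_share = (CATEGORY_MAX["open_source"] + CATEGORY_MAX["personal_projects"]) / 100
print(public_share)
```

A candidate with full marks for experience and skills but nothing public tops out at 35 before bonus points.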
The default model is gemma3:4b, running at temperature 0.1 — low, supposedly nudging the model toward deterministic outputs.
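Temperature rescales the logits before sampling; it sharpens the distribution but does not switch sampling off. A minimal sketch with made-up logits (the values and the function name are illustrative assumptions, not anything taken from the tool or from gemma3):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T=0 is usually special-cased as argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: two near-tied tokens and one unlikely one.
logits = [2.0, 1.9, 0.0]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With two near-tied tokens like these, the runner-up is still picked roughly 27% of the time at temperature 0.1, which is how a "low" temperature can still flip a borderline judgment.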
Here’s what I found when I looked at those individual categories.
Look at technical skills: I scored 8/10 in 98 out of 100 runs. Nearly perfect consistency. How come? Because technical skills are a checklist. You either know React or you don’t. There’s nothing for an LLM to judge; a five-year-old could match that checklist.
Now look at projects — there’s HUGE variation.
LLMs struggle to make a judgment call like that consistently. Sometimes my projects “lack architectural complexity”, sometimes they “demonstrate real-world deployment”. Which one the LLM spits out is a roll of the dice.
Temperature 0.1 is already low, but even going down to temperature 0 doesn’t fix this. Someone opened a GitHub issue back in October showing scores of 27, 34, 32, 34, 34, 30 across six consecutive runs at temperature 0.[2] This non-determinism isn’t a bug you can just fine-tune away; it’s a fundamental design flaw.
I was worried part of this might be the model. After all, gemma3:4b was a local model running on my machine.
Gemini resulted in a tighter distribution — scores clustered between 48 and 64. But if your cutoff is 60, you’re still failing 28% of the time through no fault of your own.
The Open Source scores have become consistent — that’s a legit improvement. But project scores are still all over the place.
Experience has me the most concerned.
25/25.
Every single run.
I went back and pulled up an old resume — one internship on it.
Also 25/25.
The clue is in the prompt…
The entire thing is two lines long.
No rubric. No examples. No anchors for what earns a 15 versus a 25.
A junior engineer with one internship gets 25/25. A principal engineer with a decade of distributed systems gets 25/25. I get 25/25. Experience has two lines and no anchors — consistent, but useless. Projects has a detailed rubric with examples but it’s the noisiest category — inconsistent, also useless. There are some things that LLMs just can’t do well, no matter how you prompt.
Use an LLM to parse a resume into structured data — great, that’s what they’re good at. Use one to check whether someone knows Python — amazing. Use one to judge whether a candidate’s experience is worth 18 points or 24 points? You get a vibe-check. Something HR teams, bar raisers, and a dozen other initiatives have spent decades trying to avoid.
The 65% weighting on open source + projects doesn’t help either. I’d take the engineer with 30 years of experience who built S3 over someone with two internships and an open source project — but this tool wouldn’t. Some of the best engineers I know have built things that never ended up on GitHub. That’s over half of their score gone before any human looks their way.
If you’re an engineer with any say in how your company handles resume screening: please be very careful with AI-screening tools. A tool that can’t differentiate isn’t filtering for quality; it’s just filtering. You might as well throw out half the resumes and tell the applicants you don’t fuck with bad luck.
Correction (June 28): A reader flagged that the resume_evaluation_criteria.jinja template says “Software Intern” on line 1 — nowhere documented, nowhere else referenced in the repo. The same template that later gives bonus points for “founder roles, co-founder positions, or early-stage engineer roles.” I re-ran with an explicit Senior SWE prompt and got identical results — the scoring dimensions are position-agnostic.
[1] Viral LinkedIn (read at your own risk) and Reddit posts. They both claim the repo was open-sourced recently, but based on commit history it’s more likely that it just blew up recently and has been open sourced since October 2025.
[2] Non-determinism at temperature 0 was flagged in this GitHub issue, opened October 2025.
> torch.bmm() when called on sparse-dense CUDA tensors
And it's not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.
nonetheless, people will defend history as perfect and say those samples, like nepo babies, are "perfect".
The average user has very little. A word processor with inconsistent pagination or a spreadsheet with inconsistent totals is rightly seen as faulty.
If you have 1000 applications for every job, and you know that a bunch of these applications are "a bad fit", to put it mildly, you have to filter. And you cannot realistically give every resume a good, human look. By the time HR would be done, the market has already moved on five times.
So, what is the real difference between being overlooked because HR could only look at the first 100 resumes, or the AI filtered all 1000 resumes down to 100? In the end, a fuckton of potentially great people get their feelings hurt either way.
E.g.:
“Where is the Eiffel Tower Located? One word only.”
“Where is the Effel Tower located? One word only.”
“Where is the Eiffel Tower located? One wor only.”
I’d be very surprised if those got different answers from even a small local model at temp 0.
Instead of spending all those resources on resume filtering, hire resume blind. Instead of using LLMs for a thing they are bad at (subjective decision making) use them to build a deterministic process that isn’t.
Use work sample hiring as the filter. Make the work sample automatic to sign up for and judge.
Here's a realistic proposition. HR just wants to inflate numbers so that they seem busy looking for the right fit. Keep posting open for 1 week, manually filter for another week, invite people, employ one. Plenty of people with degrees looking for jobs right now, I don't see what's the issue with just trying one. Companies desperately look for the "magic" applicant that checks all boxes, while also trying to pay them almost minimum wage.
That suggests determinism though.
I mean I agree with you overall. Either human decision making is a system so complex it appears non-deterministic, or it is deterministic. Practically speaking, we are non-deterministic.
Let's not conflate non-deterministic with inaccurate though. Non-deterministic systems can be 100% accurate. https://en.wikipedia.org/wiki/Las_Vegas_algorithm
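To make the Las Vegas point concrete, here is a minimal sketch using randomized quicksort, a textbook example: the pivot choices (and so the running time) vary from run to run, but the output never does. The function name and sample data are just for illustration.

```python
import random


def randomized_quicksort(items, rng=None):
    """Las Vegas sort: the pivot choice is random, the sorted result never is."""
    if rng is None:
        rng = random.Random()
    if len(items) <= 1:
        return list(items)
    pivot = rng.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return randomized_quicksort(smaller, rng) + equal + randomized_quicksort(larger, rng)


data = [5, 3, 9, 1, 5, 7]

# Run with 20 different seeds and collect the distinct outputs.
results = {tuple(randomized_quicksort(data, random.Random(seed))) for seed in range(20)}
print(results)  # a single distinct result: {(1, 3, 5, 5, 7, 9)}
```

The randomness only affects how the answer is reached. An LLM grader is different in kind, because there the randomness leaks into the answer itself.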
Even on 2 December 2027 it might intentionally not be enforced at all for a while because of that, though I think the current goal is to amend it before then.
> that LLMs will be excluded from being used in high-risk AI use cases
no, it won't, I can guarantee you this. At best they will get additional restrictions over time, as things go wrong. Anyone who could make this happen has way too much interest in it not happening. (Most/all? EU countries' legal systems are overloaded to the point of not working correctly anymore, and were before AI-generated lawsuits and other AI nonsense started. I won't go into detail, but many believe AI assistance (for certain tasks, always with a human making any final decisions) is the only way to get out of this mess.)
> standards are published yet
or exist,
like seriously, this isn't a case of there being non-public WIP standards which will pin all the nitty-gritty details down, but a case of state agencies (and in the last instance judges) having to decide whether a specific standard (or implementation) is sufficient or not.
but also to some degree it shouldn't be tightly coupled to tech standards as there are often many ways to implement the things the law requires and accepting only one is undesirable (and likely wouldn't legally hold up). But having tech standards which are a "guaranteed to be enough if you comply with" (but not the only valid way) would have been preferable, bringing us to the next point
> have absolutely no idea how they will ensure compliance
nor do they know. The original version, before big corpos hijacked it, had exceptions for most of the companies affected now. So it would only have affected a handful of huge companies, which already have many of the required things in place in some form or another. Most likely this would have played out as these companies presenting how their measures are "sufficient" and the agencies then evaluating that and potentially requiring some changes, going back and forth over a longer period and leading to documented cases that act as rough technical standards for "what is sufficient", which they could then pass on to other organizations in the future. But now the law affects not just a handful of companies but thousands, if not tens of thousands. Many are not staffed in a way where such a process could work, or where they could even do the necessary documentation to show "compliance"...
So from a practicability POV, if enforced starting 2027, it currently excludes close to _any_ (meaningful) use of AI, down to a trivial linear regression or similar, including any "old school ML/AI" that any bank uses for risk assessment.
Banking grinding to a halt in December, and there not being any (meaningful) AI startups or adoption at all, is not something anyone (in power in any state organ) wants to see, so guess how much it will be enforced ;)
And as mentioned, the chance of AI as a technology being excluded "in general" is close to none. Maybe specific usages could be excluded (and/or are already excluded), but that's it.
Oh, and as a bonus, a malicious reading of f+g removes any proper privacy protections for any AI usage in a high-risk context, where it is often most relevant... (a more sane reading allows it, with ... tricks).
Implicit bias theory sparked a massive number of studies that suggested everything influenced you from the color of the room, to what the person said to you before entering.
It’s been really hard to replicate and the conclusions that have been drawn are contradictory.
>Which means this is one good lawsuit away from being illegal in the US as well.
Uhh.. what? No that doesn't follow at all.
Screening resumes in a way that correlates to race, gender, etc. is not illegal. This is a fundamental distinction. The law is you cannot use those as filters. But the outcomes likely will be correlated. In fact to ensure they are not correlated you'd have to break the law and control for race, gender etc. Which is racism.
The models don't even get race as an input. If they did and they used it to select then yeah, that lawsuit sounds like it has merit. But a mere correlation in outcomes? In no way illegal whatsoever.
https://news.bloomberglaw.com/litigation/workday-loses-bid-t...
I'm sorry, I'm not following this at all. When you say "better candidates are exponentially more likely to pass the filter", we're still talking about a metric, yes? A metric that can be optimized? Why would switching from a hard cutoff to some sort of stochastic filter weighted by this metric discourage optimization?
Honest question, I'm not American.
There is a difference between
- having a right you can't waive, which is very similar to something being forbidden, and
- having a right you can fully or partially waive.
Furthermore, to some degree you are only "subject to a decision based on ..." if the decision has an effect on you.
In practice wrt. Article 22 this means companies can make a "decision solely based on automated processing[..]" iff they give you a (realistic) chance to object to it in which case they will do a human review of the decision where a human confirms/changes this decision based on reviewing the involved information.
There is a lot of gray area around what a "chance to object" means and when a human review makes a decision no longer "solely based on automated processing" (a human just saying the AI was right clearly doesn't count, but a human constructing a case for why they would have decided the same way, based on why the AI made the decision, can count, iff it's reasonable to assume a human might have come to that decision had it only been reviewed by a human).
Or in other words, GDPR Article 22 is just "so-so" meaningful in the context of hiring.
Like, if the AI made a mistake they have to reevaluate it, but as long as there are other similarly qualified competitors (whom they did hire or are in the process of hiring), it's quite easy to come up with a reason why those are a better choice for them. Or they go through the motions of you being in round 2 or 3 of hiring and then find an excuse not to hire you.
But for anything else I wouldn't.
The entire chain will be affected from the different tokenization on down. Even if it lands in roughly the same semantic area, it doesn't mean it will land there with anything like the same syntactic selections. Anywhere there were multiple near-tokens could easily select a different route based on even minor fluctuations in the starting conditions. It's chaotic.
Give it a try. 4 letter difference. Add a few 100 tokens describing the task, such that the change becomes a tiny fraction of the input.
Discontinuities everywhere.
"Score this resumé. Applicant: Jim ..."
"Score this resumé. Applicant: Greg..."
Is it obvious to anyone that these will have the same modal response?
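One way to check this empirically rather than argue about it is a paired name-swap test: hold the prompt fixed, change only the name, and compare mean scores over repeated runs. A minimal sketch follows; the harness, its names, and the stand-in scorer are all hypothetical, and a real test would plug the actual model call in as `score_fn`.

```python
from statistics import mean


def name_swap_gaps(template, names, score_fn, runs=5):
    """Mean score per name for an otherwise identical prompt, relative to the first name."""
    means = {
        name: mean(score_fn(template.format(name=name)) for _ in range(runs))
        for name in names
    }
    baseline = means[names[0]]
    return {name: value - baseline for name, value in means.items()}


# Stand-in scorer (hypothetical) that ignores the name; a real LLM call would go here.
def name_blind_scorer(prompt: str) -> float:
    return 70.0


gaps = name_swap_gaps(
    "Score this resumé. Applicant: {name} ...",
    ["Jim", "Greg"],
    name_blind_scorer,
)
print(gaps)
```

A name-blind scorer gives a gap of zero by construction. For a sampled LLM, the gap would have to be compared against the run-to-run noise for the same name before reading anything into it.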
This is a highly general answer to a complicated topic; my main point is more that this is not going to be held to the standard of "beyond reasonable doubt", which would be hard to meet.
[1]: https://www.law.cornell.edu/wex/preponderance_of_the_evidenc...
If you know what you’re looking for, you just start skimming them and maybe ranking them based on your own rubric. If it’s an obvious “no” you can usually tell within a 5-second skim. Once you have a handful of high ranking ones, stop, and talk to them. Repeat as necessary until you have a short list of people you’d want to hire. There might be 9900/10000 resumes you never even looked at and maybe one of them would have been slightly better but you can’t let perfection be the enemy of progress. Stand by your convictions of feeling the candidate is qualified and capable and meets what you expect and hire them, get back to business.
Having been in “talent shortage” mode for a long while I’d rather have 10000 resumes than 3. Having to pick one from a suboptimal selection is an awful position to be in, but sometimes a necessity.
I'll let you decide whether that's a dream or a nightmare...
Note that the chance to object must be given before the decision is made, i.e. offering a human review only after the fact is not enough. The human must also have a meaningful chance to actually affect the decision.
If the decision is based on purely objective facts that are actually necessary (like you must have a certain license), then the human and the computer always coming to the same decision is likely correct and compliant, but as soon as you start putting in subjective criteria and the human agrees with 100% of the computer's denials, it becomes a lot harder to demonstrate that the human is actually able to affect the decision, as required by Article 5. Note that the burden of demonstration is on the controller, not on the data subject/DPA.
Objective criteria also aren't always enough by themselves. If both human and computer calculate the same credit score and you must score X points to get a loan, then the human isn't actually able to affect the decision. Essentially the credit score calculation itself ends up being the automated decision, rather than the formal rejection that is later given to the data subject.
Like, YT would have loved to make you opt out of it (and probably has it in their TOS), but there were multiple cases of courts forcing them to handle it properly in the past, as far as I remember.
My _guess_ is that, at least if you don't sign a proper contract, you can always force a human reevaluation. But also only that (so only semi-useful). Also, even with a proper contract it's unclear whether it would be possible in this specific case, due to the contract being fundamentally one-sided/unfair and semi-forced on you if it were widespread on the market for the specific job you are trying to get.