Is there something special about these chess engines that makes SPSA more desirable for these use cases specifically? My intuition is that something like Bayesian optimization could yield stronger optimization results, and that the computational overhead of doing BO would be minimal compared to the time it takes to train and evaluate the models.
This is a response from the author of Viridithas; there is a link to the engine on her webpage.
A quick visit to the homepage suggests that it's probably the latter. I don't want to be rude, and I'm not posting this out of malice, but if someone else is reading this and trying to parse it, I think it might be helpful to compare notes and evaluate whether it's better to discard the article altogether.
Or, if attempting to use SPSA to, say, perform a final post-training tune of the last layers of a neural network, this could be thousands of parameters or more.
That being said, it still seems possible that using a different black-box optimization technique for a fairly constrained set of related magic numbers (say, fewer than 50) might lead to some real performance improvements in these systems; it could be worth reaching out to the lc0 or Stockfish development communities.
The idea of something being "defiantly" NSFW gave me a chuckle.
The video is probably the least bizarre thing there, if that's what you are warning about.
Feds this guy right here ^^
Chess engines have been impossible for humans to beat for well over a decade.
But a position in chess being solved is a specific thing, which is still very far from having happened for the starting position. Chess has been solved up to 7 pieces. Solving basically amounts to some absolutely massive tables that have every variation accounted for, so that you know whether a given position will end in a draw, black win or white win. (https://syzygy-tables.info)
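For anyone curious, those tables can be queried programmatically. A minimal sketch using the python-chess library (this assumes python-chess is installed and Syzygy files have been downloaded locally; the ./syzygy path is my placeholder):

    # Probe a Syzygy tablebase for the win/draw/loss value of a position.
    # Assumes python-chess is installed and WDL files are in ./syzygy.
    import chess
    import chess.syzygy

    with chess.syzygy.open_tablebase("./syzygy") as tb:
        board = chess.Board("8/8/8/8/8/2k5/2q5/K7 b - - 0 1")  # KQ vs K
        wdl = tb.probe_wdl(board)  # 2 = win, 0 = draw, -2 = loss (side to move)
        print(wdl)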
Thanks for the warnings, kind strangers.
Although, setting any kind of hair on fire in public should be punishable, primarily because of the stench of burnt hair.
One of my formative early internet experiences was loading up a video of a man being beheaded with a knife.
Luckily, I realized what was about to happen, and didn't subject myself to the whole thing.
Statisticians and operations researchers have spent a hundred years working out how to do as few experiments as possible to tweak parameters in the ways that give the highest impact, with a statistical basis for trusting that the selections are good.
In the language of information and decision trees, these experiments are trying, in some sense, to "branch" on the entropy-minimizing variables.
The mass delusion of "I don't understand what I'm reading, therefore it must be produced by an llm."
I think it's a pretty serious problem. Not that llm text exists on the internet, but that reasonable people are reflexively closed off to creativity because the mere existence of the possibility that something is created by an llm is in their minds grounds for disqualification.
I haven't verified OP's claim attributed to 'someone on the Stockfish discord', but if true, that's fascinating. There would be nothing left for the engine developers to do but improve efficiency and perhaps increase the win-to-draw ratio.
But I'm not sure whether that guy was guessing or confident about that claim.
https://github.com/official-stockfish/fishtest/wiki/Fishtest...
There's simply a lot of sample efficiency to gain by adapting the experiment to incoming data: a regime where one can repeatedly design n candidates, observe their effects, and repeat m times beats a setting where one must design a single fixed experiment with n*m samples up front.
A common property of llm psychosis is the development of an internal vocabulary that the llm learns, often reusing words but adopting specific meanings; for some reason, "quantum" and "quantic" are very popular for this.
ML isn't my strong suit so I wouldn't be able to explain how, but Cosmo's article is almost entirely a refutation of the points made by the root article. No doubt he is very friendly, as someone would be to anyone interested in their field.
What I can speak about is the general construction of sentences; they read (in the most charitable of interpretations) like text messages:
"Good model vs bad model is ~200 elo, but search is ~1200 elo, so even a bad model + search is essentially an oracle to a good model without, and you can distill from bad model + search ā good model."
I take it that by "is ~X elo" they mean that implementing that strategy results in a gain of 200 ELO? Which would still be undefined, as 1000 to 1200 is not the same as 2800 to 3000, and improvements are of course not cumulative. I get that this reads more like internal notes, but it was published, so there was some expectation that it would be understood by someone else.
For a lot more reasons, the writing reminds me of notes written by me or by loved ones under the influence of drugs. My estimation is that the article was written by a mind that used to be brilliant but is now just echoing that brilliance, trying to keep its higher-order cognitive functions while struggling to maintain a baseline of basic language use. I hope it is reversible, and if per is reading this and my estimation is correct, that they perturb the weights in favour of quitting drugs and see if they win more or not.
Things LLM people can learn from
Since AlphaZero, lc0-style chess engines have been trained with RL. Specifically, you have the engine (search + model) play itself a bunch of times, and train the model to predict the outcome of the game.
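In code, that objective is just outcome regression on self-play positions. A minimal sketch, assuming a python-chess board and hypothetical engine_move/encode helpers standing in for the real search and input encoding:

    # Hedged sketch of AlphaZero-style self-play data generation: play a
    # game with the current engine, then label every position with the
    # final result. engine_move(board) and encode(board) are hypothetical.
    import chess

    def self_play_game():
        board = chess.Board()
        positions = []
        while not board.is_game_over():
            positions.append(encode(board))
            board.push(engine_move(board))  # search + current model
        result = {"1-0": 1.0, "0-1": -1.0, "1/2-1/2": 0.0}[board.result()]
        return [(pos, result) for pos in positions]  # training pairs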
It turns out this isn't necessary. Good model vs bad model is ~200 elo, but search is ~1200 elo, so even a bad model + search is essentially an oracle to a good model without, and you can distill from bad model + search → good model.
So RL was necessary in some sense only one time. Once a good model with search was trained, every future engine (including their competitors!)[1] can distill from that, and doesn't have to generate games (expensive). lc0 trained their premier model, BT4, with distillation and it got worse when you put it in the RL loop.
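If I'm reading the distillation claim right, the change is mostly one of training target: regress on a teacher's searched evaluation instead of the game result. A hedged PyTorch-style sketch (eval_net, teacher_search_eval, and positions are hypothetical names, not lc0's actual code):

    # Hedged sketch of distillation-from-search: the value network is
    # trained to match the teacher's searched score, so no new games
    # need to be generated. All names below are hypothetical.
    import torch
    import torch.nn.functional as F

    optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-4)

    for batch in positions:                       # encoded board tensors
        with torch.no_grad():
            target = teacher_search_eval(batch)   # bad-model + search "oracle"
        loss = F.mse_loss(eval_net(batch), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()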
What makes distillation from search so powerful? People often compare this to distilling from best-of-n in RL, which I think is limited: a chess engine that runs the model on 50 positions is roughly equivalent to a model 30x larger, whereas LLM best-of-50 is generously worth a model 2x larger. Perhaps this was why people wanted test-time search to work so badly when RLVR was right under their noses.
A recent technique is applying the distillation trick at runtime. At runtime, you evaluate early positions with your NN, then search them and get a more accurate picture. If your network says the position is +0.15 pawns better than search says, subtract 0.15 pawns from future evaluations. Your network live adapts to the position it's in!
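In other words, it's a running bias correction between the raw network score and the searched score. A minimal sketch of the idea (nn_eval, search_eval, and the smoothing factor are my assumptions, not lc0's actual implementation):

    # Hedged sketch of runtime adaptation: track the gap between the raw
    # NN eval and the deeper search eval, and subtract that bias from
    # future NN evaluations. nn_eval and search_eval are hypothetical.
    class AdaptiveEval:
        def __init__(self, smoothing=0.1):
            self.smoothing = smoothing
            self.bias = 0.0  # running (nn - search) gap, in pawns

        def corrected(self, position):
            return nn_eval(position) - self.bias

        def observe(self, position):
            gap = nn_eval(position) - search_eval(position)
            self.bias += self.smoothing * (gap - self.bias)  # moving average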
The fundamental training objective of distilling from search is almost but not quite what we actually care about: winning. It's very correlated, but we don't actually care about how well the model estimates one position; we care about how well it performs after search, after looking at 100 positions.
To fix this, lc0 uses a weird technique called SPSA: you randomly perturb the weights in two directions, play a bunch of games, and go the direction that wins more.[2] This works very well and can get +50 elo on small models.[3]
Consider for a moment how insane it is that this works at all. You're modifying the weights in purely random directions. You have no gradient whatsoever. And yet it works quite well! +50 elo is ~1.5x model size or ~a year's worth of development effort!
The main issue with this is that it's wildly expensive. To do a single step you must play thousands of games with dozens of moves and hundreds of position inferences per move.
Like LLMs, you train for a long time on a pseudo-objective that's close to what you want, then a short time on a very expensive and limited objective that's closer to what you want.
The underlying technique of SPSA can be applied to literally any number in your chess program. Modify the number, see if it wins more or loses more, move in the direction that wins more. You have a hand-tuned heuristic that says if there's a checkmate in the search from a position, you should back off by depth 1? Replace that with thousandths-of-a-depth and then tune it with SPSA; it turns out the optimal value is actually to back off by depth 1.09, which nets you 5 elo. You can do this for every number in your search algorithm. You can do something that looks a lot like gradient descent through arbitrary C++ because you have a grading function (winning).
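For the depth back-off example, the tuning loop might look roughly like this (play_match, the step size, and the starting value are all assumptions; real tuners average over far more games per step):

    # Hedged sketch of SPSA-style tuning of a single search constant.
    # play_match(a, b, games) is hypothetical: the win fraction of the
    # engine with back-off a against the engine with back-off b.
    import random

    backoff, step, lr = 1.00, 0.05, 0.02
    for _ in range(1000):
        d = random.choice([-1, +1])
        winrate = play_match(backoff + d * step, backoff - d * step, games=1000)
        backoff += d * lr if winrate > 0.5 else -d * lr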
lc0 uses a standard-ish transformer architecture, which they found to be hundreds of elo better than their old convolution-based models. It's still confusing to me that the transformer biases apply to literally every application imaginable. The only substantial architectural change they use is "smolgen", a system for generating attention biases. They claim smolgen is a ~1.2x throughput hit but an accuracy win equivalent to 2.5x model size. Why is it so good? I find all the explanations poor.
[1] Their primary competitor, Stockfish, uses their data and so do most engines. Some engines don't, mostly because they care about originality.
[2] More specifically, they take a particular part of the weights, and then for every parameter you randomly choose +1 or -1; this is the direction tensor. You then create two versions of the network, one where weight += direction and one where weight -= direction. Then you play these two versions against each other. If positive wins, you do weight += direction * lr. If negative wins, you do weight -= direction * lr.
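A minimal numpy sketch of exactly that step (play_games is hypothetical and stands in for the thousands of games a real run would play):

    # Hedged sketch of footnote [2]: a random +/-1 direction tensor, two
    # perturbed copies of the weights, a match between them, then a step
    # toward the winner. play_games(w_plus, w_minus) is hypothetical and
    # returns the "+" network's win fraction.
    import numpy as np

    def spsa_step(weights, lr):
        direction = np.random.choice([-1.0, 1.0], size=weights.shape)
        if play_games(weights + direction, weights - direction) > 0.5:
            return weights + lr * direction
        return weights - lr * direction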
[3] Up to about 15 elo on larger models.
War was "solved" when someone made a weapon capable of killing all the enemy soldiers, until someone made a weapon capable of disabling the first weapon.
And the play style of AlphaZero wasn't different in a way that needs super-trained chess intuition to see; it's outrageously different if you take a look at the games.
I guess my point is that even if the current situation is basically a 'deadlock', it's been proven that it's not some sort of eternal knowledge of the game as of yet. There's still the possibility that a new type of approach could blow the current top engines out of the water, with a completely different take on the game.
In that hypothetical of running 2 instances of Stockfish against one another on a modern laptop, with the key difference being minutes of compute time, it'd probably be very close to 100% draws, depending on how many games you run. If you run a million games, there are probably some outliers; if you run a hundred, maybe not.
When it comes to actually solved positions, the 7-piece tables take around 1TB of RAM to even run. These tablebases are used by Stockfish when you actually want to run it at peak strength. [3]
[0]: https://tcec-chess.com
[1]: https://lichess.org/broadcast/tcec-s28-leagues--superfinal/m...
[2]: https://lczero.org
[3]: https://github.com/syzygy1/tb
Chess is a 2 player game of perfect, finite information, so by Zermelo's theorem either one side always wins with optimal play or it's a draw with optimal play. The argument from the Discord person simply says that Stockfish computationally can't come up with a way to beat itself. Whether this is true (and it really sounds like a question about depth in search) is separate from whether the game itself is solved, and it very much is not.
Solving chess would be a table that simply lists out the optimal strategy at every node in the game tree. Since this is computationally infeasible, we will certainly never solve chess absent some as yet unknown advance in computation.
Elo is defined such that the expected win-rate of a player should only depend on the difference in Elo rating to their opponent. https://en.wikipedia.org/wiki/Elo_rating_system#Mathematical...
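Concretely, the expected score depends only on the rating difference d, via 1 / (1 + 10^(-d/400)):

    # Expected score for a player rated d points above their opponent.
    def expected_score(d):
        return 1 / (1 + 10 ** (-d / 400))

    print(expected_score(200))  # ~0.76, for 1200 vs 1000 or 3000 vs 2800 alike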
I remember hearing that the starting position is so draw-ish that it's not practical anymore.
IMO AlphaZero's result was partially due to the fact that using more compute also works. Stockfish 10 running on 4x as many CPUs would beat Stockfish 8 by a larger margin than AlphaZero did. To this day, nobody has determined what a "fair" GPU to CPU comparison is.
For what it's worth, Stockfish wins the rematch also. https://tcec-chess.com/#game=13&round=fl&season=cup16
In the TCEC game, I see "2. f4?!", so I'm guessing Stockfish was forced to play some specific opening, i.e. it was forced to make a mistake.
It's also almost certainly the case (though I don't know why you would test it) that Stockfish given the black pieces and extensive pondering would be meaningfully better than Stockfish with time-capped moves. Most games are going to be draws, so practically it would take a while to determine this.
I'm of the view that the actual answer for chess is "It's a draw with optimal play."