    for i = 1 to 100
      sleep 1
    next

Of course, with modern CPUs you gotta adjust your i to something more reasonable, like

    for i = 1 to 1000000000

Works every time.
    xor rcx, rcx
loop:
    inc rcx
    cmp rcx, -1
    jne loop
Something like that, roughly. Except I didn't count all the way to -1, just roughly halfway through the 64-bit space. It stalls for a couple of minutes.
Now, I wasn't doing this to stress profilers or anything fancy like that. I was looking into malware delayed-execution techniques. Sleep() and other "Dear system, I'm gonna nap for a bit, wake me up later" routines have the downside of being API or system-service calls, and a branching decision right after a sleep is a tell-tale red flag for anti-malware systems.
Sure, techniques like this can be detected as well, but I figured I could insert idempotent instructions into the loop's inner basic block to make it look like it's crunching data for legitimate reasons. It just needs to delay execution long enough to fool automated sandbox analysis. I've thought about delaying/frustrating human analysis too, and I'm sure better minds than mine have found better solutions, but making this part of an unpacking routine that relies on the computed value to decrypt the malicious code seems like the obvious thing to do. Which, again, I'm sure has been done, but doing it yourself and figuring out the anti-anti-analysis techniques is fun.
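Roughly the shape I have in mind, as a toy C sketch (the constants, the rotate busywork, and the XOR "decryption" are all made up for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Spin for a long time while looking like data crunching;
           the rotate/xor mix is arbitrary, dependency-chained busywork. */
        uint64_t acc = 0x0123456789abcdefULL;
        for (uint64_t i = 0; i < (1ULL << 31); i++) {
            acc = (acc << 13) | (acc >> 51); /* rotate left by 13 */
            acc ^= i;                        /* data-dependent update */
        }
        /* Derive the "decryption key" from the computed value, so the
           payload stays opaque to anything that skips the loop. */
        uint8_t payload[] = { 0x00, 0x00 }; /* placeholder ciphertext */
        for (size_t j = 0; j < sizeof payload; j++)
            payload[j] ^= (uint8_t)(acc >> (8 * j));
        printf("%02x %02x\n", payload[0], payload[1]);
        return 0;
    }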
This adds enough latency to be noticeable, and I've found pages that were "OK" in prod but unbearable in my local environment. Most of the time it was N+1 queries. Sometimes it was a cache that wasn't working as intended. Sometimes it was simply a feature that "looked cool" but offered no value.
I’m not sure if there is a proxy that would do this locally but I’ve found it invaluable.
Later, on early PCs, we had a Turbo button, but since everyone had it in Turbo mode all the time, it was essentially a way to slow down the machine.
EDIT: Found an image of what I remember as the "C64 Snail". It is called "BREMSE 64" (which is German for brake) in the image.
https://retroport.de/wp-content/uploads/2018/10/bremse64_rex...
The authors have tested a rather obsolete CPU with the 10-year-old Skylake microarchitecture, but more recent Intel/AMD CPUs have special optimizations for both NOP and MOV, handling them at the register-renaming stage, well before the normal execution units, so they may appear to have executed in zero time.
For slowing down, one could use something really slow, like integer division. If that would interfere with the desired register usage, other reliable choices would be add with carry (ADC) or perhaps complement carry flag (CMC). If it is not desired to modify the flags, one can use a RORX instruction for multi-bit rotation (available since Haswell, but not in older Atom CPUs), or one could execute BSWAP (available since 1989, so it exists in all 64-bit CPUs, including any Atom).
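For instance, a flag-preserving filler built from a pair of BSWAPs might look like this (GCC/Clang inline assembly for x86-64; a sketch, not something I have measured):

    #include <stdint.h>

    /* Two BSWAPs restore the original value, so the pair computes
       nothing, yet each must still pass through an execution unit. */
    static inline uint64_t bswap_pair(uint64_t x) {
        __asm__ volatile("bswap %0\n\t"
                         "bswap %0"
                         : "+r"(x));
        return x;
    }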
Really interesting idea. I suppose the underlying assumption is that speeding up a function might reveal a new performance floor that is much higher than the speedup would suggest. Spending time to halve a function's runtime only to discover that overall processing time barely decreased, because another function is now the blocker, would indeed be quite bad: if two 10 ms functions run concurrently, halving one of them saves almost nothing, since the other still takes 10 ms.
Not sure how this handles I/O, or if it is enough to simply delay the results of I/O functions to simulate decreased bandwidth or increased latency.
If you're on Linux, you can use iptables to randomly drop a fraction of packets to simulate bad connections, even for localhost. The TCP retransmits will induce a tunable latency. Be careful with this on a remote host or you may find yourself locked out, unless you can reboot out of band.
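For example, with the statistic match module (the 10% probability is an arbitrary number; run as root, and repeat with -D instead of -A to remove the rule):

    # drop roughly 10% of packets arriving on loopback
    iptables -A INPUT -i lo -m statistic --mode random --probability 0.1 -j DROP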
My first idea is to taskset the code to a particular CPU core, and see if Linux will let me put that core in a low-frequency mode. Has anyone here done this on AMD hardware? If so, were the available frequencies slow enough to be useful for this purpose?
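Something like this is what I'd try first (the core number and paths are examples; whether cpufreq exposes usefully low frequencies depends on the driver and hardware):

    # pin the workload to core 2
    taskset -c 2 ./myprogram

    # clamp core 2 to its lowest advertised frequency
    echo powersave | sudo tee /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
    cat /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_min_freq | \
        sudo tee /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq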
- Making Postgres slower (https://news.ycombinator.com/item?id=44704736)
Taking this approach seems to be effective at surfacing issues that otherwise wouldn't show up in regular testing. I could see this being useful if it was standardized to help identify issues before they hit production.
If you have a business process that is complex and owned by many different groups then causal profiling can be a good tool for dealing with the complexity. For large workflows in particular this can be powerful as the orchestration owner can experiment without much coordination (other than making sure the groups know/agree that some processing might be delayed).
I had the impression that the turbo button was created to slow down new PCs so they could run old software that relied heavily on CPU speed.
However, some Intel/AMD CPUs can execute up to a certain number of consecutive NOPs in zero time, i.e., the NOPs are removed from the instruction stream before reaching the execution units.
In general, no instruction set architecture specifies the time needed to execute an instruction. For every specific CPU model you must search its manual to find the latency and throughput for the instruction of interest, including for NOPs.
Some CPUs, like the Intel/AMD CPUs, have multiple encodings for NOP, with different lengths in order to facilitate instruction alignment. In that case the execution time may be not the same for all kinds of NOPs.
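For reference, the first few recommended multi-byte NOP encodings from the Intel manuals (the longer forms, up to 9 bytes, just add prefix and displacement bytes):

    90             nop                         ; 1 byte
    66 90          nop (operand-size prefix)   ; 2 bytes
    0F 1F 00       nop dword ptr [rax]         ; 3 bytes
    0F 1F 40 00    nop dword ptr [rax+0x0]     ; 4 bytes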
It’s not worth optimizing for situations that do not occur in practice.
The transistors used to detect register clearing using XOR foo,foo, on the other hand, are worth it, as lots of code has that instruction, and removing the data dependency (the instruction technically uses the contents of the foo register, but its result is independent of its value) can speed up code a lot.
The NOPs in effect use up a small fraction of the instruction decode bandwidth, and if they insert enough NOPs they can reduce the number of real instructions that are issued per cycle and slow down the program with a fine degree of precision.
The idea is that these are unavoidably slow, and preferably non-parallelizable, to compute one way, but fast or near-instantaneous to verify. Examples include the Wesolowski VDFs, based on math similar to RSA; MiMC; and the use of zero-knowledge proofs to provide proofs of slow computations.
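The slow direction is essentially just T sequential modular squarings; a toy C sketch (the 64-bit modulus is purely illustrative, real VDFs use an RSA-sized or class-group modulus):

    #include <stdint.h>

    /* Each squaring depends on the previous result, so the loop cannot
       be parallelized; the verifier checks a short proof instead of
       redoing all t squarings. */
    static uint64_t vdf_eval(uint64_t x, uint64_t n, uint64_t t) {
        for (uint64_t i = 0; i < t; i++)
            x = (uint64_t)(((__uint128_t)x * x) % n);
        return x;
    }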
With the AT, it usually slowed down to some arbitrary frequency, so it was more of a gimmick.
Is this related to speculative execution? The high-level description sounds like the NOPs work as sync points.
A double BSWAP is equivalent to a NOP, but no existing CPU does any special optimization for it, and it is very unlikely that any future CPU will, as the instruction is mostly obsolete (nowadays MOVBE is preferable).
NOP cannot ensure a certain slow-down factor, except on exactly the same model of CPU.
However, those NOPs are rarely hot, because most sit outside loop bodies. Nevertheless, there are cases where NOPs end up inside big loops, in order to align branch targets to cache-line boundaries.
That is why many recent Intel/AMD CPUs have special hardware for accelerating NOP execution, which may eliminate the NOPs before reaching the execution units.
However, like integer division, they may clobber registers that the program wants to use for other purposes.
For great slow-downs, when the clobbered registers do not matter, I think CPUID is the best choice, as it serializes execution and has a long execution time on all CPUs.
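For example, as a GCC/Clang inline-assembly sketch (the clobbered registers are declared as outputs so the compiler keeps its own values out of them):

    #include <stdint.h>

    static inline void cpuid_delay(void) {
        /* CPUID serializes the pipeline and is slow on every CPU;
           it overwrites eax/ebx/ecx/edx, hence the four outputs. */
        uint32_t a = 0, b, c, d;
        __asm__ volatile("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
    }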
For small slow-downs I think BSWAP is a good choice, as it modifies only one arbitrary register without affecting the flags, and it is also an unusual enough instruction that it is unlikely to ever receive special optimizations, like NOP and MOV have.
However, multiple BSWAPs must be used, enough to occupy all available execution ports; otherwise, if the rest of the program leaves any execution port free, the BSWAP may execute concurrently with it, requiring no extra time.
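i.e., something like the following, with several independent chains so the scheduler can spread them over all integer ports (the register choice is arbitrary, and the second round restores each register):

    bswap r8
    bswap r9
    bswap r10
    bswap r11
    bswap r8
    bswap r9
    bswap r10
    bswap r11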
The elimination will still take some amount of time, and the smaller that amount is, the better, because it allows dialing in the slowdown more precisely. Of course, how many NOPs you need for a particular basic block has to be measured on the CPU you're profiling on, but that's also the case if you use a more consistently slow instruction, since the other instructions won't necessarily maintain a consistent speed relative to it.
They might be able to skip a plain 0x90, but something like mov rax, rax would still emit a uop into the uop cache before being eliminated later at rename. So at best it would be a fairly limited optimization.
It's also nice because rename is a very predictable choke point, no matter what the rest of the frontend and the backend are busy doing.
The original basic block:

    mov dword ptr [rsp+0x18], r8d
    mov dword ptr [rsp], ecx
    mov qword ptr [rsp+0x20], rsi
    mov ebx, dword ptr [rsi+0x10]
    mov r9d, edx
    cmp edx, 0x1
    jnz 0x... <Block 55>
The same block with NOPs interleaved:

    mov dword ptr [rsp+0x18], r8d
    nop
    mov dword ptr [rsp], ecx
    nop
    mov qword ptr [rsp+0x20], rsi
    nop
    mov ebx, dword ptr [rsi+0x10]
    nop
    mov r9d, edx
    nop
    cmp edx, 0x1
    nop
    jnz 0x... <Block 55>