the following instructions are slower than their common
counterparts (equivalent sequences of simpler instructions)

    loopnz, jcxz

    all of the transcendental x87 instructions
	this doesn't seem to apply to MMX instructions, whose registers
	are aliased onto the x87 floating-point stack

    fbstp, fbld

    lods[bwdq], stos[bwdq], scas[bwdq], movs[bwdq]

    all of these with REP prefixes except rep movsb

	For specifics, consult Agner Fog's tables.
	
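	the fsin/fcos instructions can be wildly inaccurate (notably when
	the result is close to zero, because the internal range reduction
	uses a limited-precision value of pi), so it's better to use glibc's
	implementation when calculating sines and cosines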
	
== Performance notes (Merom) ==
	(cycle counts are the average reciprocal-throughput values
	from Agner Fog's tables)
	
	`mov r,m` 		1 cycle
	`lea r,m`		1 cycle
	`test r,r/i`	0.33 cycles
	`test m,r/i` 	1 cycle
	`bt r,r/i` 		1 cycle
	
	`bt m,r` 		5 cycles
		is slower than (?)
		`mov r,m`
		`bt r,r`
		
	`inc m` 		1 cycle
		is faster than
		`inc r`
		`mov m,r`
		
	`cmp m,imm` 	1 cycle
		is faster than
		`mov r,m`
		`cmp r,imm`
		unless more than 2 compares are done with the same register
		later on
	
== Floating point numbers ==

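	checking whether the qword in rax contains a 0x00 byte can be done
	as follows (movq only loads the low 8 bytes of the xmmword and
	zeroes the rest)

	xorps xmm0,xmm0
	movq xmm1,rax

	pcmpeqb xmm1,xmm0
	;^ stores 0xff for every matched byte, 0x00 otherwise
	pmovmskb ecx,xmm1
	;^ copies the top bit of each of the 16 bytes into bits 0..15 of ecx,
	;  so bit n is set when byte n was 0x00. bits 8..15 are always set
	;  here because movq zeroed the upper half, so only cl is meaningful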
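	The pcmpeqb / pmovmskb sequence above can be sketched in C with SSE2
	intrinsics. This is only a minimal sketch, assuming an x86-64 target
	and GCC or Clang; the function name `zero_byte_mask` is made up for
	illustration and is not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Returns a mask where bit n is set when byte n of v is 0x00. */
unsigned zero_byte_mask(uint64_t v)
{
    __m128i zero = _mm_setzero_si128();               /* xorps xmm0, xmm0 */
    __m128i x    = _mm_cvtsi64_si128((long long)v);   /* movq xmm1, rax (upper half zeroed) */
    __m128i eq   = _mm_cmpeq_epi8(x, zero);           /* pcmpeqb xmm1, xmm0 */
    unsigned mask = (unsigned)_mm_movemask_epi8(eq);  /* pmovmskb ecx, xmm1 */

    /* The upper 8 bytes were zeroed by the load, so bits 8..15 are
       always set. Keep only the bits that describe the input qword. */
    return mask & 0xFFu;
}
```

	A non-zero return value means the qword contains at least one zero
	byte, and the lowest set bit gives the index of the first one.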
	
	=== Moving and converting values ===
	the cvt- family of instructions converts between integer and
	floating-point formats
	
	to move an integer value into xmm1 as a float
	mov eax, 1000
	cvtsi2ss xmm1, eax
	; xmm1 = 1000.0
	
	cvtsi2-- converts an integer to a float. si stands for Scalar Integer
	and
	cvtss2-- converts from a scalar single; cvtss2si goes back to dword/qword
	
	so `cvtsi2ss` converts a dword or qword to a scalar single-precision float
	and `cvtss2si` converts one scalar single-precision float to a dword/qword
	
	whether it's a qword or dword depends on the source/target register size
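	The two conversions can be tried from C through the SSE intrinsics
	that map onto these instructions. A minimal sketch, assuming x86-64
	with GCC or Clang and the default MXCSR rounding mode; the function
	names are invented for illustration. It also shows `cvttss2si`, the
	truncating variant, since `cvtss2si` rounds according to MXCSR
	(round to nearest even by default) rather than truncating.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* cvtsi2ss: dword integer -> scalar single-precision float */
float int_to_float(int v)
{
    __m128 x = _mm_cvtsi32_ss(_mm_setzero_ps(), v);
    return _mm_cvtss_f32(x);
}

/* cvtss2si: float -> dword, rounded according to MXCSR */
int float_to_int_rounded(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* cvttss2si: float -> dword, truncated toward zero */
int float_to_int_truncated(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

	The 64-bit forms exist as well (`_mm_cvtsi64_ss`, `_mm_cvtss_si64`),
	matching the qword register variants mentioned above.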

parameter passing order (System V AMD64, integer and pointer arguments):
	rdi, rsi, rdx, rcx, r8, r9
	
result register: rax

rdx:rax - used for idiv and imul and div and mul

other:
	rsp - stack pointer
	rbp - base/frame pointer, saved by callee
	rbx - saved by callee (us)
	r8-r11 - misc
	r12-r15 - misc, saved by callee
	
r8 to r11 are also called scratch registers.
we do not need to preserve their values as a callee

unpreserved (caller-saved) registers:
	rax, rcx, rdx, rsi, rdi, r8, r9, r10, r11

Conventions: https://en.wikipedia.org/wiki/X86_calling_conventions

== Windows ==

	In Windows, the register order is as follows:
	rcx, rdx, r8, r9
	more info at:
	https://www.nasm.us/xdoc/2.16.02rc5/html/nasmdo12.html#section-12.1

	it's quite different, needs reading

	Even prologue and epilogue code is different
	{{{masm
		;prologue
		mov    [RSP + 8], RCX
		push   R15
		push   R14
		push   R13
		sub    RSP, fixed-allocation-size
		lea    R13, 128[RSP]
	}}}
	{{{masm
		;epilogue
		add      RSP, fixed-allocation-size
		pop      R13
		pop      R14
		pop      R15
		ret
	}}}
	More info here: https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog?view=msvc-170
	
	Apparently, we cannot really use `push` and `pop` for the
	extra parameters on the stack, because they modify the `RSP`
	register, which is probably what causes all these weird stack
	alignment issues (`RSP` must be 16-byte aligned at the `call`).
	Windows requires a minimum of 0x20 bytes of home (shadow) space
	for the four register parameters (rcx, rdx, r8, r9)
	
	The stack parameters in windows:
	[other saved regs ]    rsp+0x40 .. etc.
	[ param 1         ]    rsp+0x38 / rbp+0x10 .. etc.
	[ param 2         ]    rsp+0x30 / rbp+0x8
	[ rbp pushed      ]    rsp+0x28 (we need to skip this one over)
	[ local variables ] <- rbp / rsp+0x20
	[ r9 home         ]    rsp+0x18
	[ r8 home         ]    rsp+0x10
	[ rdx home        ]    rsp+0x8
	[ rcx home        ] <- rsp / rbp-0x20
	-- call happens
	[ return address  ] <- rsp-0x8
	so we can't just push parameters on the stack before the call:
	that would place them below [ rcx home ] and shift
	the home location. It's why we can get seemingly random
	values into functions when this is not considered

== Assembly tricks (NASM) ==

	mov [rbp-SDL_rect.x], word 1
	mov [rbp-SDL_rect.y], word 2
	mov [rbp-SDL_rect.w], word 3
	mov [rbp-SDL_rect.h], word 4
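	; easy encoding of 4 words of values into one 64-bit register:
	; the four stores above are equivalent to the two instructions below
	; when it comes to structure and array initialization
	; (x86 is little-endian, so the lowest word lands at the lowest address).
	; I think this counts as a 'vectorized' store (SWAR, SIMD within a
	; register), but just using the regular 64-bit registers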
	mov rdx, 1 | (2 << 16) | (3 << 32) | (4 << 48)
	mov [rbp-SDL_rect_address], rdx
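	The same packing can be checked from C. This is a sketch under
	assumptions: `struct rect16` is a hypothetical four-word structure
	standing in for the one used above (it is not the real SDL type),
	and the byte layout relies on a little-endian target such as x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical four-word rectangle, 8 bytes in total. */
struct rect16 { uint16_t x, y, w, h; };

/* Same expression as the NASM constant: x | (y << 16) | (w << 32) | (h << 48) */
uint64_t pack4(uint16_t x, uint16_t y, uint16_t w, uint16_t h)
{
    return (uint64_t)x | ((uint64_t)y << 16) |
           ((uint64_t)w << 32) | ((uint64_t)h << 48);
}

/* One 8-byte copy initializes all four fields (little-endian only). */
struct rect16 rect_from_packed(uint64_t packed)
{
    struct rect16 r;
    assert(sizeof r == sizeof packed);
    memcpy(&r, &packed, sizeof r);
    return r;
}
```

	On a big-endian machine the fields would come out in reverse order,
	which is why the trick is tied to x86 here.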
	
	=== PLT ===
	Unix:
	To refer to a function in the PLT, we have to use `wrt ..plt` syntax
	`call SDL_Init wrt ..plt`
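	Windows:
	There is no PLT on Windows, so a plain `call SDL_Init` is enough. NASM's
	`wrt ..imagebase` is a different thing: it produces an address relative
	to the image base, which is useful for tables of offsets rather than calls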
	
	
	=== JMP Tables ===
	Every jump in a jump table should use the 'near' keyword so that
	all entries have the same size. Otherwise NASM may pick the 2-byte
	short encoding for a jump whose target is close and the 5-byte near
	encoding for the others, so the size of a jmp table entry can no
	longer be relied on. Writing `jmp near label` avoids this problem.
	
	

== GCC ==
	Macro names should be ALL_CAPS when it is important to understand that it is a macro
	and all_lowercase when it is meant to be treated as a function and is a macro purely for efficiency reasons.
	In theory we can thus also use inline functions if we just want to inline stuff by default.
	To always inline a function, we must use the following format:
		`inline __attribute__ ((always_inline)) type name(params) { ... }`
	"Function"-like macros thus aren't really necessary or beneficial unless the param type really doesn't matter
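	A small sketch of the two options side by side, assuming GCC or
	Clang (the attribute is a compiler extension). The names `clamp_int`
	and `CLAMP` are made up for illustration.

```c
#include <assert.h>

/* Always-inlined function: typed arguments, each evaluated exactly once. */
static inline __attribute__((always_inline)) int clamp_int(int v, int lo, int hi)
{
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Function-like macro: works for any arithmetic type, but the arguments
   may be evaluated more than once, so avoid side effects such as i++. */
#define CLAMP(v, lo, hi) ((v) < (lo) ? (lo) : ((v) > (hi) ? (hi) : (v)))
```

	`static` is used here so that the definition works in a single
	translation unit without needing a separate external definition.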
	
	=== ISSUES ===
	When interfacing with C *make sure* to set up proper call stacks with function prologues.
	Otherwise we get some sort of stack-based segfault as the C code tries to access memory.
	
	This doesn't show up when the main function acts like _start, since there is no need
	to return from it, just to exit.
	
== CPU BUGS ==
	FSRM (fast short repeat move)
	https://www.techradar.com/pro/security/a-cpu-mystery-intel-just-fixed-a-huge-security-flaw-affecting-nearly-every-cpu-out-there-today
	A CPU hit by this bug basically just breaks completely:
	JMP instructions being ignored, XSAVE and CALL instructions no longer
	correctly recording the RIP instruction pointer.
	A debugger would report impossible states.
	Affects fairly recent CPUs; list of affected models:
	https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
	CVE-2023-23583
	
== Misc info (IRC) ==
12:42  what does it mean to 'move data using non-temporal hint' ?
12:44  like with the MOVNTI instruction
12:44  Non-Temporal SSE instructions (MOVNTI, MOVNTQ, etc.), don't follow the normal cache-coherency rules. Therefore non-temporal stores must be followed by an SFENCE instruction in order for their results to be seen by other processors in a timely fashion.
12:45  The "non temporal" phrase means lacking temporal locality. Caches exploit two kinds of locality - spatial and temporal, and by using a non-temporal instruction you're signaling to the processor that you don't expect the data item be used in the near future.
13:57  I see, thanks. An 'near future' means how long into the future? miliseconds, seconds, cycles?
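
The store-then-fence pattern described in the log can be sketched in C. This is only an illustration, assuming x86-64 with GCC or Clang; `fill_nt` is a made-up name. `_mm_stream_si32` compiles to MOVNTI and `_mm_sfence` to SFENCE.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_stream_si32 (MOVNTI) */
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_sfence (SFENCE) */

/* Fill dst[0..n) with value using non-temporal stores, hinting that the
   data will not be read again soon and need not be kept in the cache. */
void fill_nt(int *dst, size_t n, int value)
{
    assert(dst != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], value);

    /* Order the weakly ordered streaming stores before any later stores,
       so other processors observe them in a timely fashion. */
    _mm_sfence();
}
```

The calling thread always sees its own stores, so the fence matters only when another thread or device reads the buffer.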