The construction I'm trying to produce is called a one-way compression function.
Here's a quote from Wikipedia:
In cryptography, a one-way compression function is a function that transforms two fixed-length inputs into a fixed-length output. The transformation is "one-way", meaning that it is difficult given a particular output to compute inputs which compress to that output.
One-way compression functions are for instance used in the Merkle–Damgård construction inside cryptographic hash functions.
One-way compression functions are often built from block ciphers. Some methods to turn any normal block cipher into a one-way compression function are Davies–Meyer, Matyas–Meyer–Oseas, Miyaguchi–Preneel (single-block-length compression functions) and MDC-2/Meyer–Schilling, MDC-4, Hirose (double-block-length compression functions).
Most of these constructions (as far as I've checked) make use of key scheduling (they initialize a different key for each hash).
I'm trying to find one that doesn't rekey (one that only encrypts/decrypts).
The special characteristic here is that hash collisions are not very important for PoW (there's no issue of, say, a document being faked because someone was able to find some other document that hashes to the same value).
It could be that the method I suggested works well but is slightly more prone to hash collisions than the above constructions (or not). That's what I'm trying to find out.
It is likely that someone already published (or at least researched) something of this sort (possibly more than 20 years ago).
An expert who is familiar with the research may be able to identify the suitable papers.
Honestly, most of the risk I'm seeing is just the embarrassment if it turns out to be flawed in some way. So if you ask in my name ("Some random person on the interwebs sent me this. Is it good enough for PoW?") it may save us both the embarrassment.
As for how to measure HW acceleration, I think a reasonable metric would be simply the ratio of the performance of the best accelerated version to the best unaccelerated version of the hash. The instructions themselves cover one round, which means ciphers or hashes with different numbers of rounds can't be compared to each other directly (I'm not a hardware expert though).
The most comprehensive benchmark database I've found so far for various CPUs is the SiSoftware benchmarks, which seem up-to-date. The top score for a non-engineering-sample CPU is 88 GByte/s on 2x AMD EPYC 7542 32-Core, i.e. 88 / 64 = 1.375 GByte/s per core (I don't know which cipher is used but I assume it's AES-128).
Edit: AMD EPYC 7F72 24-Core gives 57.07 GByte/s. That's roughly 2.38 GByte/s per core. The stock clock speed is 3.70 GHz (assuming it's not overclocked). That averages out to about 1.56 cycles per byte (I'm still not 100% sure which cipher is being used here though).
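The arithmetic behind those per-core figures, as a small sketch. It assumes GByte means 10^9 bytes, that throughput scales evenly across cores, that the 7F72 result is from a single socket, and that every core runs at 3.70 GHz during the benchmark; none of these is confirmed by the benchmark listing.

```python
def per_core_gbytes(total_gbytes_per_s, sockets, cores_per_socket):
    """Aggregate throughput divided evenly over all cores (assumed scaling)."""
    return total_gbytes_per_s / (sockets * cores_per_socket)


def cycles_per_byte(clock_ghz, gbytes_per_s_per_core):
    """Cycles per second divided by bytes per second, both per core."""
    return clock_ghz / gbytes_per_s_per_core


epyc_7542 = per_core_gbytes(88.0, sockets=2, cores_per_socket=32)
epyc_7f72 = per_core_gbytes(57.07, sockets=1, cores_per_socket=24)
cpb_7f72 = cycles_per_byte(3.70, epyc_7f72)

print(f"EPYC 7542: {epyc_7542:.3f} GByte/s per core")  # 1.375
print(f"EPYC 7F72: {epyc_7f72:.3f} GByte/s per core")  # 2.378
print(f"EPYC 7F72: {cpb_7f72:.3f} cycles per byte")    # 1.556
```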