-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
idea how to merge nt and nt-long, and implement other multi-limb hashes based on Merkle–Damgård construction #5534
Comments
IMO our biggest problem is md5crypt (both SIMD and OpenCL). Max length of 15 is seriously limited. Alas, I fail to understand that code so I simply can't fix it. I have tried many times to wrap my head around it. It seems I can't!
We do exactly this already, in the NT kernel, and the multi-block kernel is surprisingly fast. The bigger problem is lock stepping - like you say we need same number of blocks for all candidates in a batch (not necessarily a GWS batch, we might get away with the size of LWS. A warp? I can't remember). For SIMD, heterogenous (in terms of block count needed) batches are a much harder problem as they not merely run slower but fail completely unless a lot of code is added around it. We actually do that in several formats, but not any raw ones. It's not rocket science but it eats performance and is a proper PITA to code. Back to OpenCL, I think a KISS hybrid NT-opencl could simply set a boolean in set_key() when any candidate in current batch is over a single block's length. Any such batches would call the slightly slower kernel. That trivial change should take us a long way but if memory serves me, I tried it and somehow the results must have been disappointing enough that I ended up not using it? I can't remember - need to revisit the old discussions. |
Scalar fallbacks might be added to all formats to handle lengthy candidates. So md5crypt would not skip candidates (remaining slow though) and separate md5crypt-long would become obsolete. I forgot an available optimization: multi-block candidates starting with the same block can hash it once for all. Some modes like |
I wrote a simplified md5crypt pseudocode based on md5crypt-long and my old attempt to implement it in john-devkit. The following is python-like pseudocode. I did not test it. Also final = md5_raw(pw + s + pw)
final = md5_raw(pw + magic + s + # magic is for standard md5crypt only
(final * (125 // 16 + 1))[ : len(pw)] + # the "weird" loop
"".join(pw for bit in bin(len(pw)) if bit == '1'))
for i in range(0, 1000, 2):
t = ''
if i % 3: t += s
if i % 7: t += pw
final = md5_raw(final + t + pw)
t = ''
if (i + 1) % 3: t += s
if (i + 1) % 7: t += pw
final = md5_raw(pw + t + final)
result = final The "weird" loop is to fill As far as I understand, the hashing depends only onto lengths of candidate and salt, and it does not depend onto data. The second Easy way is to group candidates by exact length. Then all hashing calls should have the same length of data. The main loop could use bigger groups. |
I find it interesting to think about scalar branching as coding with co-routines. Imagine While co-routines are possible in C, usually logic is just inverted to do one iteration over explicit state (as in (Curiously it should be possible to use co-routines in js or lua (with luajit) to express ad-hoc dynamic formats, so scalar logic would be in a scripting language and hashing would be done in C with SIMD.) So each candidate would have a state with space for hash, length of data for hashing, the data for hashing, and format's specific state like current loop's id and iteration in the loop. Then straight code is split into parts to operate over state and enqueue all that state for hashing. There should be different queues by number of blocks in data. When some queue gets full vector, actual hashing happens. After hashing, related states are iterated more. Like co-routines get resumed. To avoid recursion, outer loop is needed instead of direct call on addition to queue. If all states are waiting hashing and there are not full vectors then incomplete vectors should be hashed (probably starting from the most filled to hopefully fill others). Currently Writing in such style can mix hashing from different iterations of algorithm, e.g. hashing with different number of rounds in one |
I think most of your reasoning is valid and indeed interesting - in theory. I'm not sure it would work in real world formats, because the ones needing it most (any raw hashing in SIMD or OpenCL) are the ones that are most sensitive to any added code. We've seen several times that very minor tweaks that intuitively should be good in fact resulted in regression. |
"Vector" here can be virtual and bigger than hardware SIMD vector. Interleaving is an example source for such vectors used in john. Other possible use is to wrap each hashing in algorithm with its own loop over multiple sets of data with the same flow. Pushing this idea to extreme, it is possible to hash using opencl and keep scalar state on cpu. So we would branch/iterate on cpu with one state and invoke |
Hm, I think a format can change its So a format might be able to use OTOH with in-format queues for length-driven branching, format should use this trick only if all queues have only full vectors. Probability of that is not obvious to me. And I even more unsure about usefulness for data-driven branching: for |
The trick with mkpc might be useful to utilize memory better: format could allocate one block per candidate (or even less) and just trigger |
Actually it could be done with the existing interface: |
Hashes based on Merkle–Damgård are computed in blocks (called limbs in john's sources): input is split into blocks of fixed size, then for each block compression function is applied to old state and the block producing new state, constant IV is used as old state for the first block. (Also there is length encoded into the last block. But it is not important for this idea.) It is applicable to raw-md4 and nt, raw-md5, raw-sha1, raw-sha256, and so on.
Fast and simple formats support lengths fitting into one block. Also such hashes support reversing of rounds because IV are constant. For two+ limbs, it is not possible to reverse rounds (as far as I know).
A simple format supporting lengths above one block might omit reversing. It would be slower for short hashes.
There might be two formats. Currently there are nt and nt-long for nt. User chooses the correct one. It is not easy to choose automatically because choice depends onto length of candidates that is not known beforehand in some cases.
Other approach is to allow binary to be found in
db_main
by both reversed hash and regular hash. It would require changes to formats interface. Also it would enlarge the db hurting cases like loading HIBP. (It is possible to have separate bitmaps for variants, so in theory using additional memory, it might be implemented without additional collisions.)And I got the idea of hybrid approach that can be enclosed into format. Reversing is just a transformation on a full hash. We may apply it to any particular digest's value. Formats do apply it loading hashes. Moreover formats apply reversing in
cmp_one
on full hash computed by openssl's scalar hashing. So we may use reversing this way: short candidates would be hashed with shortened compression function, longer candidates would be hashed with full compression function and then reversing would be applied for lookup. This way short candidates would be as fast as with special format while longer candidates would be slower slightly than usual.I think hybrid approach provides a viable trade-off between speed and usability. It would be even better with ability to disable reversing at run-time to speed up longer candidates.
125 ASCII chars for NT require 5 blocks of hashing (123 chars require 4 blocks). Everything above 1 block might be packed together for computations. So instead of storing by number of blocks (like all 2-block candidates, all 3-block candidates and so on), it is possible to have mixed sequence of limbs with marked ends. For instance, having vector size 2 it is possible to pack one 4-block candidate together with two 2-block candidates to hash 4 limbs only extracting 1 end after 2 limbs and another 2 ends after 2 more limbs. Simpler approach would be to compute 2 limbs for 2-block candidates and then 4 for lonely 4-block candidate. But the described mixed storage seems overcomplicated. Also 125 chars for raw-md5 fit into 3 blocks and mixed storage might be less profitable.
With or without reversing, hashing of candidates with mixed number of blocks requires some extra buffering to reduce number of incomplete vectors. (Better length-based queuing is not supported currently.)
The text was updated successfully, but these errors were encountered: