# Generating Solana Vanity Addresses Using OpenCL

## Faster Creation of Custom Solana Addresses

I have been researching Web3 related technologies recently. The first threshold to entering the Web3 digital world is owning a digital wallet. Wallet addresses can be understood as bank card numbers in the real world. Many people pursue numbers with good meanings (such as having as many 6s or 8s as possible at the end, or having unique meanings to themselves). The difference is that desirable bank card numbers usually require paying the bank to issue them, while wallet addresses only require you to spend time collision searching to generate them. These kinds of collided addresses are generally called vanity addresses, which are generated to satisfy one’s vanity. In essence and functionality, vanity addresses are no different from other addresses for ordinary users.

When running the address generation algorithm, GPUs have a huge advantage over CPUs: GPU stream processors have orders of magnitude advantages over CPUs, so using GPUs to generate vanity addresses will be much faster. After searching around, although Solana also has a tool called solanitywritten in CUDA, when I ran it on my RTX 3080, it did not perform much better than just using my CPU (someone also gave feedback that it did not achieve the expected performance at all). However, when I ran profanity2, an ETH vanity address generation tool written in OpenCL, the speed was orders of magnitude faster than just using CPU. So I started to study the encryption algorithms used to generate addresses, and decided to write one myself.

## Implementation of Wallet Address Generation Algorithms

For different types of Web3 wallets, the steps to generate addresses are actually very similar. The biggest difference lies in the choice of encryption algorithms:

- Generate a private key, usually generating 32 random bytes;
- Convert the random bytes into a large number and then multiply it by the G point on the elliptic curve to get the public key coordinate point. This process is called derivation;
- Convert this coordinate point back into bytes, and perform some encoding or hash processing on the bytes as the wallet address.

For ETH addresses, the selected elliptic curve is secp256k1, and for Sol addresses, the selected elliptic curve is ed25519. After finding the corresponding coordinate point, ETH will hash the public key with keccak, and take the last 20 bytes and convert to hex as the address; Sol will directly base58 encode the public key as the address.

So, implementing the ed25519 and basee58 two algorithms with OpenCL would get the job done.

Since OpenCL syntax itself is based on C99 extensions, implementing cryptographic algorithms from scratch is not the preferred approach. Finding a C language implementation, verifying there are no issues, and then porting to OpenCL is a safer and more convenient approach.

There are many choices available online. I chose：ed25519and base58。

## How GPUs Can Accelerate Computations

Before when I didn’t really understand OpenCL, I copied some OpenCL code for calculating hashes online as a starting point. But when running 100,000 iterations, it was even much slower than just using the CPU. So I asked an expert in game rendering: “How should GPU accelerated hash calculations be done? Why is my batch calculation slower than CPU?” However, his answer did not resolve my confusion. He believed hash calculations could not be divided into blocks for computation, so using GPUs did not have any advantages. Only computations like image processing that could be divided into fine-grained fragments could demonstrate the superiority of GPUs. His answer made sense to some extent. For a single hash calculation, the GPU would not be faster than the CPU. What I actually needed was batch parallel computing of multiple inputs. Later I figured out that my input and calling method was problematic. I wanted to run 100,000 hash calculations. I should not have called the OpenCL kernel code 100,000 times, but should have called the OpenCL kernel code once, and then set 100,000 workers, and then computed 100,000 times in the kernel code.

Workers can be understood as threads. Their quantity is specified by the global worker size parameter passed when calling the kernel.

Specifically for the address generation algorithm, my approach was: randomly generate 32 bytes as the seed, then set the global worker size to `256 ** 4`

. Each OpenCL thread gets the current thread id respectively, converts it to big endian byte format, and then overwrites the last four bytes. Each thread calculates once. If an address meeting the criteria is found, record it in the output. Then each round of OpenCL invocation, add 1 to the fifth last byte of the seed (carry over if max), to do iterative computation until an address meeting criteria is found.

## Migration to OpenCL

OpenCL programs are not like C programs that compile and directly run. Instead they are divided into two parts:

- Kernel code running on the compute device (i.e. GPU);
- Host code managing kernel compilation, execution, and data transfers.

The ed25519 algorithm and base58 algorithm are placed in the kernel code, and then an entry function is exposed for the host code to call and do subsequent processing of results.

The max obstacle I ran into during migration was the entry function problem. The OpenCL I was using which comes with macOS is version 1.2. Compared to standard C, OpenCL has the additional concept of address spaces, especially for parameters of kernel functions, where the address space has to be specified as global, private, or local.

When I moved to Nvidia devices, I ran into more troublesome issues, because since OpenCL 2.0, function parameters with unspecified address spaces default to generic. If a private address space variable is passed into this function, it causes compile errors. Variables declared by default are private address space, so I had to manually change every called function parameter to generic address space, over 100 places in total (wtf…).

## Differences with profanity2

Profanity2 itself actually fixed the vulnerability in profanity where private key seeds were not generated randomly enough. But additionally it also had a major improvement - it does not generate private key seeds. Instead, it uses public key offsets to calculate different vanity addresses. Even if the public key is leaked, due to properties of elliptic curves, it is still infeasible to reverse derive the private key, maximizing private key security.

When I first learned of this design, I felt it was really cool. So I researched the technical principles behind it in depth, and also wanted Solana addresses to use this method:

- Assume the original private key is
`k`

, offset is`delta`

，then the new private key can be expressed as`k' = k + delta`

. Due to properties of elliptic curves, this additive operation corresponds to point addition on the curve; - The new private key
`k'`

will generate a new public key`P'`

. Since public keys are private keys multiplied by the G point,`P' = (k + delta) * G = k * G + delta * G = P + (delta * G)`

where P is the original public key,`P'`

is the new public key, and`delta * G`

is the offset multiplied by G, representing the effect of moving on the elliptic curve.

Therefore, when profanity2 finds a delta that meets the target, adding delta to the original seed gives the target private key.

However, Solana addresses cannot use the public key + offset method to generate. This is because ed25519 public keys are generated slightly differently: the original private key seed needs to first go through a SHA512 operation to get result H, then use the first 32 bytes of H converted to a large number to multiply by G to get the public key. This also means that even if I offset the public key by delta to get the target address, the input is actually H + delta. We do not know what changes to the original seed will result in SHA512 giving exactly an offset by delta, since SHA512 is also irreversible.

## Result Comparison

I am using an M1 MacBook Air. The hash speed using the official `solana-keygen grind`

command is 0.3 Mh/s.

GPU | Memory Clock | TFLOPS | Hash Speed |
---|---|---|---|

Apple Silicon M1 | - | 2.3 TFLOPS | 1.2 MH/s |

RTX 3080 | 1188 MHz | 29.2 TFLOPS | 23.1 Mh/s |

RTX 4090 | 1313 MHz | 81.9 TFLOPS | 67.8 Mh/s |

Additionally, tools for generating these Vanity Addresses eat GPU compute performance (not graphics or VRAM), so estimating performance of different GPUs using the TFLOPS metric will be closer to real world.

The codehas been open sourced, feel free to compare.