HSM and tape solution to long-term unique account ledger bloat

I believe this to be a (partial?) solution to ledger bloat without compromising on features. I posted it on Reddit but quickly realised it would not reach anyone.

Nodes can be set up with two forms of storage: SSD (fast) and tape (slow, but the cheapest and most reliable form of long-term storage). All activity takes place on the SSDs. The filesystem can use a hierarchical storage manager (HSM) to dynamically migrate data from SSD to tape on a least-recently-used basis and move it back when required.
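To make that concrete, here is a toy sketch of the least-recently-used bookkeeping such an HSM layer would perform; the class and the migrate_to_tape() hook are invented for illustration and are not part of any existing node or filesystem code.

```cpp
// Toy LRU bookkeeping for a two-tier (SSD/tape) store. The class and the
// migrate_to_tape() hook are hypothetical; they only illustrate the policy.
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

class tier_manager {
public:
    using clock = std::chrono::steady_clock;

    // Record an access: the entry becomes the most recently used.
    void touch(const std::string& account, std::size_t bytes) {
        auto existing = entries.find(account);
        if (existing != entries.end()) {
            ssd_bytes -= existing->second.bytes;
            lru.erase(existing->second.position);
            entries.erase(existing);
        }
        auto position = lru.emplace(clock::now(), account);
        entries.emplace(account, entry{position, bytes});
        ssd_bytes += bytes;
    }

    // When the SSD tier exceeds its budget, push the coldest entries to tape.
    void enforce_budget(std::size_t budget_bytes) {
        while (ssd_bytes > budget_bytes && !lru.empty()) {
            auto coldest = lru.begin();
            ssd_bytes -= entries.at(coldest->second).bytes;
            migrate_to_tape(coldest->second); // placeholder: hand off to the copytool
            entries.erase(coldest->second);
            lru.erase(coldest);
        }
    }

private:
    struct entry {
        std::multimap<clock::time_point, std::string>::iterator position;
        std::size_t bytes;
    };

    void migrate_to_tape(const std::string& /*account*/) {
        // A real HSM would copy the data to the tape tier and release the SSD copy.
    }

    std::multimap<clock::time_point, std::string> lru;
    std::unordered_map<std::string, entry> entries;
    std::size_t ssd_bytes{0};
};
```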

This would make transactions from dormant accounts take longer than usual, but I think that is a small price to pay for scalability and cost. I am actually working on a form of this system for supercomputers as part of my job, leveraging S3 (essentially pretending the tapes are cloud storage in order to use an existing protocol), and would be happy to help in any way.


I now see that a previous post evolved into a discussion similar to this one. I didn't notice since the title didn't suggest as much. Edit: it is not quite the same; I don't think this can work with separate nodes for processing and archiving.

I like the idea; tiered databases will be useful in a lot of ways.

Cross-database transactions might be problematic unless we can do it in a failsafe way.

Then we need to pick how we want to slice it; nothing currently separates values like that. Ideally we would use existing database structures.


I'm not exactly sure what you mean by cross-database or separating values. I kind of see it as an extra layer on top of the hierarchical memory computers already use.

If what I think you mean is what you mean, then the logic for choosing data to archive already exists, albeit (as far as I know) only in filesystems intended for supercomputers. It would have to be ported to the filesystem nodes use (ext4?). I think the same least-recently-used logic would apply here, so there is no need to reinvent the wheel in that regard.

Other than that, you need a copytool on the SSD side to issue PUT/GET requests and a daemon in front of the tape to receive them. Since this is offline, there should be no security issues. We use S3 since there is so much existing compatibility with it, and we plan on supporting Swift eventually (obviously you only need one protocol). You can also put a load balancer in front of the tapes to support multiple tapes as storage requirements increase.
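For illustration, the copytool side could look roughly like this, assuming the AWS SDK for C++ and an S3-speaking daemon in front of the tape; the endpoint, bucket and key names are placeholders.

```cpp
// Sketch of a copytool archiving an evicted ledger segment via the S3 protocol.
// Assumes the AWS SDK for C++; the endpoint, bucket and key names are placeholders.
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/GetObjectRequest.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <fstream>
#include <iostream>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::Client::ClientConfiguration config;
        config.endpointOverride = "tape-gateway.local:9000"; // daemon in front of the tape
        config.scheme = Aws::Http::Scheme::HTTP;             // offline/internal network
        Aws::S3::S3Client client(config);

        // PUT: archive a segment the HSM has decided to evict from SSD.
        Aws::S3::Model::PutObjectRequest put;
        put.SetBucket("ledger-archive");
        put.SetKey("segments/0001.ldb");
        put.SetBody(Aws::MakeShared<Aws::FStream>("copytool", "/ssd/segments/0001.ldb",
                                                  std::ios_base::in | std::ios_base::binary));
        if (!client.PutObject(put).IsSuccess()) {
            std::cerr << "archive PUT failed\n";
        }

        // GET: restore the segment when a dormant account becomes active again.
        Aws::S3::Model::GetObjectRequest get;
        get.SetBucket("ledger-archive");
        get.SetKey("segments/0001.ldb");
        auto outcome = client.GetObject(get);
        if (outcome.IsSuccess()) {
            std::ofstream restored("/ssd/segments/0001.ldb", std::ios_base::binary);
            restored << outcome.GetResult().GetBody().rdbuf();
        }
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```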

Thanks for taking the time to humour me by the way.

Right now the ledger is a single file, and from what I can tell these HSM tools operate at the file level. Are there block-level versions? The block store would need some tweaking to export the historical feed, which is essentially the pruned blocks. This could be kept online or offline depending on configuration.

I need to get some sleep and look at the source code to be helpful from here but I will get back to you.


Yeah, it does operate at the file level, so a novel HSM strategy would likely have to be implemented. Good point, the implementation does not have to make any assumption about whether the tape(s) are remote.

If I understand correctly, this should allow any WIP block-pruning strategy to be extended to the most recent account blocks. Would this be done by revising previous hashes?

This would mean a lookup is required to check for account existence upon account creation. Either a list of archived addresses could be stored in a separate file on disk, or tape metadata could be used. Or both, adding granularity by using the on-disk file as a cache.
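A rough sketch of that lookup, with the file format, path and tape query all hypothetical:

```cpp
// Sketch of the account-existence check: a flat file of archived addresses acts
// as a fast cache, with tape metadata as the fallback. The file format, path and
// tape query are hypothetical.
#include <fstream>
#include <string>
#include <unordered_set>

class archived_accounts {
public:
    explicit archived_accounts(const std::string& path) {
        std::ifstream file(path);
        for (std::string address; std::getline(file, address);) {
            cache.insert(address);
        }
    }

    // Called when an open block arrives: the account may already exist but be archived.
    bool exists(const std::string& address) const {
        if (cache.count(address) > 0) {
            return true;
        }
        return query_tape_metadata(address); // slow path, hits the archive tier
    }

private:
    bool query_tape_metadata(const std::string& /*address*/) const {
        // Placeholder: in practice this would ask the tape daemon (e.g. an S3 HEAD
        // request against the archive bucket) whether the account exists there.
        return false;
    }

    std::unordered_set<std::string> cache;
};
```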

Choosing which blocks to prune might be expensive, so it is not really viable to do in real time. I would imagine running it at regular ledger-growth intervals would be best. Since all of the required input data is on-chain (unlike transactions), it could be pre-computed using idle resources. Even if another form of validation is necessary, I don't think it would be too costly for the network, especially if there is some leeway in the timing.
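As a toy sketch of that scheduling, with placeholder thresholds and hooks standing in for the real block store queries and candidate selection:

```cpp
// Toy sketch of triggering candidate selection at regular ledger-growth intervals
// on a background thread. The threshold and the two placeholder hooks stand in
// for real block store queries and the actual pruning-candidate computation.
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

class archival_scheduler {
public:
    archival_scheduler() : worker([this] { run(); }) {}

    ~archival_scheduler() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            stopped = true;
        }
        condition.notify_all();
        worker.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex);
        while (!stopped) {
            // Wake periodically; a real implementation could also wait for idle periods.
            condition.wait_for(lock, std::chrono::minutes(10));
            if (stopped) {
                break;
            }
            auto size = ledger_size_bytes();
            if (size - last_size > growth_threshold) {
                select_candidates(); // pre-compute pruning choices from on-chain data
                last_size = size;
            }
        }
    }

    // Placeholders for querying the block store and computing archive candidates.
    std::uint64_t ledger_size_bytes() const { return 0; }
    void select_candidates() {}

    std::uint64_t last_size{0};
    static constexpr std::uint64_t growth_threshold{1ull << 30}; // e.g. every 1 GiB of growth
    bool stopped{false};
    std::mutex mutex;
    std::condition_variable condition;
    std::thread worker;
};
```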

I'm not sure how viable it is, but an ideal strategy could involve some randomness in the choice of pruned blocks so that it differs by node. Depending on the chosen parameters, 51% validation could be reached without any node having to access its tape data; but of course, in principle they could retrieve the blocks at any time.

With the experimental block pruning we already have a way to delete blocks and have the ledger operate correctly.

Rather than only deleting the block, it could also notify listeners via IPC that the block is being evicted. Using an IPC notification means different archival programs can be built by third parties.
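Purely as an illustration (the socket path and message shape are made up, not the node's actual IPC format), the notification could be as simple as:

```cpp
// Illustration of notifying external archivers over a Unix domain socket when a
// block is evicted. The socket path and message shape are made up; the node's
// real IPC framing would differ.
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Sends a small JSON payload describing the evicted block to a listening archiver
// process; third parties can implement any archival backend behind the socket.
bool notify_eviction(const std::string& block_hash) {
    int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return false;
    }
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, "/tmp/nano_eviction.sock", sizeof(addr.sun_path) - 1);
    bool ok = false;
    if (::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
        std::string message = "{\"event\":\"block_evicted\",\"hash\":\"" + block_hash + "\"}";
        ok = ::write(fd, message.data(), message.size()) == static_cast<ssize_t>(message.size());
    }
    ::close(fd);
    return ok;
}
```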

Am I correct in saying that the experimental block pruning cannot be applied to all blocks of an account? It seems this would be necessary to alleviate spam like the current one.

In that scenario, since there is no need to trust the archive, third parties could just use S3 and cloud storage. The tape is only necessary to keep records internal to nodes for security.

Yeah, doing tiered storage of the account frontiers and pending entries would require support within the block_store code. Any read that fails in fast storage would have to be retried on slow storage. More than likely we'd have to add an additional block processing pipeline to check each tier.
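A sketch of that read path, using a made-up store interface rather than the real block_store API:

```cpp
// Sketch of the tiered read path: try the fast tier first and fall back to the
// slow tier on a miss, promoting the value back to fast storage. The store
// interface is hypothetical, not the node's block_store API.
#include <optional>
#include <string>

class store {
public:
    virtual ~store() = default;
    virtual std::optional<std::string> get(const std::string& key) = 0;
    virtual void put(const std::string& key, const std::string& value) = 0;
};

class tiered_store {
public:
    tiered_store(store& fast_a, store& slow_a) : fast(fast_a), slow(slow_a) {}

    std::optional<std::string> get(const std::string& key) {
        if (auto hit = fast.get(key)) {
            return hit;
        }
        // Miss in fast storage: retry on the slow tier and promote on success.
        auto cold = slow.get(key);
        if (cold) {
            fast.put(key, *cold);
        }
        return cold;
    }

private:
    store& fast;
    store& slow;
};
```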

RocksDB might have support for this type of thing internally as it's very configurable. That would be the first thing to check.
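For what it's worth, RocksDB's Options::db_paths lets you spread SST files across paths with different target sizes, spilling larger (typically older) levels onto the later path. That is compaction-driven placement rather than the per-account least-recently-used behaviour discussed above, so treat this only as a starting point to evaluate; the paths and sizes below are placeholders.

```cpp
// Sketch of pointing RocksDB at two paths with different target sizes so that
// overflow spills onto a second (slower, HSM-managed) device. Paths and sizes
// are placeholders; this is compaction-driven placement, not per-account LRU.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <cassert>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    // SST files fill the first path up to its target size; the remainder goes to
    // the second path, which could be a tape-backed HSM filesystem.
    options.db_paths.emplace_back("/ssd/nano", 64ull << 30);    // fast tier, ~64 GiB target
    options.db_paths.emplace_back("/archive/nano", 1ull << 50); // overflow tier, effectively unbounded

    rocksdb::DB* db = nullptr;
    rocksdb::Status status = rocksdb::DB::Open(options, "/ssd/nano", &db);
    assert(status.ok());
    // ... use db as usual; compaction decides which path each SST file lands on.
    delete db;
    return 0;
}
```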


Cool, will look into that. The cost should only end up applying to new and dormant accounts. If what I said about using different pruning on each node is possible, then most of the time consensus could be reached by nodes that have not archived the relevant block(s). This would allow for a suboptimal solution to be used, as a first pass at least.

I guess blocks/accounts could have a weighted probability of being archived based on account balance and how far back in the ledger they were last active.
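A toy sketch of that weighting; the weighting function itself is arbitrary and only meant to show the shape of the idea.

```cpp
// Toy sketch of weighting archive candidates by dormancy and (inversely) by
// balance. The weighting function is arbitrary; it only shows the shape of the idea.
#include <cstdint>
#include <random>
#include <string>
#include <vector>

struct account_info {
    std::string address;
    std::uint64_t balance;        // raw units
    std::uint64_t blocks_dormant; // ledger growth since the account's last activity
};

// Samples `count` candidates; each node would seed `rng` differently so choices vary.
std::vector<std::string> pick_archive_candidates(const std::vector<account_info>& accounts,
                                                 std::size_t count, std::mt19937_64& rng) {
    if (accounts.empty()) {
        return {};
    }
    // Weight grows with dormancy and shrinks with balance, so small, long-idle
    // accounts are the most likely to be archived.
    std::vector<double> weights;
    weights.reserve(accounts.size());
    for (const auto& account : accounts) {
        weights.push_back(static_cast<double>(account.blocks_dormant + 1) /
                          static_cast<double>(account.balance + 1));
    }
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    std::vector<std::string> chosen;
    for (std::size_t i = 0; i < count; ++i) {
        chosen.push_back(accounts[pick(rng)].address); // sampling with replacement, fine for a sketch
    }
    return chosen;
}
```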


I like this idea. I was wondering if it could be combined with horizontal scaling/sharding of the ledger, such that if a node handling a transaction for an account cannot find that account in its local store, it could pull the frontier block from a configured (and trusted) URL. I think what that ends up looking like is an archive node/cluster which persists all confirmed blocks and supports multiple tiers of storage (e.g. prune to tape/S3).

As well as handling requests for frontier blocks, it could also offer enhanced bootstrap streams of full account histories (for full verification by remote nodes) or ledger shards (frontier blocks only, for local/trusted nodes). In this way, it could support HA/failover in the local trusted consensus cluster and also take remote bootstrap duties away from the sharded consensus nodes.

It feels like something like this would be needed to achieve 100k cps PRs.
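To illustrate the frontier fallback, here is a minimal sketch using libcurl against a hypothetical archive endpoint; the URL layout and response handling are placeholders.

```cpp
// Sketch of the frontier fallback: when an account is missing from the local
// store, fetch its frontier from a configured, trusted archive URL. Uses libcurl;
// the endpoint layout and response handling are placeholders.
#include <curl/curl.h>
#include <string>

namespace {
size_t append_chunk(char* data, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(data, size * nmemb);
    return size * nmemb;
}
}

// Returns the raw frontier payload for the account, or an empty string on failure.
std::string fetch_frontier(const std::string& archive_url, const std::string& account) {
    std::string response;
    CURL* curl = curl_easy_init();
    if (curl != nullptr) {
        std::string url = archive_url + "/frontiers/" + account; // hypothetical endpoint
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_chunk);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
        if (curl_easy_perform(curl) != CURLE_OK) {
            response.clear();
        }
        curl_easy_cleanup(curl);
    }
    return response;
}
```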

RocksDB-Cloud might be a plugin solution:

RocksDB-Cloud is a C++ library that brings the power of RocksDB to AWS, Google Cloud and Microsoft Azure. It leverages the power of RocksDB to provide fast key-value access to data stored in Flash and RAM systems. It provides for data durability even in the face of machine failures by integrations with cloud services like AWS-S3 and Google Cloud Services. It allows a cost-effective way to utilize the rich hierarchy of storage services (based on RAM, NvMe, SSD, Disk Cold Storage, etc) that are offered by most cloud providers. RocksDB-Cloud is developed and maintained by the engineering team at Rockset Inc. Start with rocksdb-cloud/cloud at master · rockset/rocksdb-cloud · GitHub.

Perhaps I don't understand fully, but I would think this falls into the category of horizontal scaling by virtue of what it is. I was not familiar with sharding, but I think I sort of half-suggested something similar above. As I see it, this is independent of sharding and the cost benefits would be multiplicative. In the interest of decentralisation, I would not be comfortable with a single archive; that is why I suggested that each node would have its own archive locally on tape.

I am now thinking (and maybe this is what you mean) that archiving could be another layer to the network, similar to current nodes. It could have its own PRV as well, or be an extended vote by the current PRs (I don't know the correct term for this); as in, nodes could choose which archives to trust. This might be too slow to reach consensus, though, so I still think local per-node tape storage is the best way. In any case, I agree that something like this is needed at some point, but I think it's far enough off. I just wanted to bring it up while I thought of it.

Yeah, that sounds great. I think something like this could be used for many purposes that SSDs are too expensive for.

Cool, I think it's likely that the RocksDB API could be used for this. It probably wouldn't even be that difficult to implement. As a side note, I guess it is a tradeoff between practicality and principle, but personally I would find Nano less attractive if it relied on closed-source cloud storage.

I'm thinking ahead to when Nano needs to handle closer to 100k cps, which is needed for global currency adoption. Disregarding the peer-to-peer bandwidth issues for now, I feel like it will be impossible to vertically scale a node on a single host to 100k cps. So I think that means "a node" on the network has to become a cluster of machines operating in tandem but still presented as a single logical node to the rest of the network.

At that point it probably makes most sense to design this cluster so that different capabilities can scale independently. For example, the bootstrapping/archive service should be able to stream large chunks of the cemented ledger efficiently, whereas the voting service may only need large amounts of RAM and will cement blocks to a different service (what I was calling the archive node, but it is maybe easier to think of it as the archive service). So what I meant by trusted was a service local to the same cluster, rather than going over the Nano network to a remote service (hosted by a different node entirely).