Originally published here.
Last week I made the easy prediction that at re:Invent, AWS would announce more so-called ‘serverless’ capabilities. It’s no secret that they are all-in on moving from server management to service management. I guessed at a few specific possibilities – SFTP-as-a-Service, ‘serverless’ EC2, and a few others.
This week, I want to look at some of the other capabilities provided by AWS and make some predictions as to what announcements we might see. Why should any or all of this matter to you? If you’re in the business of processing, storing, and analyzing large sets of data, these updates may significantly impact the speed, efficiency, and cost at which you’re able to do so.
WEEK 2 PREDICTION: DATA 2.0
While AWS has a number of existing tools to manage data ingestion and processing (e.g. Data Pipeline, Glue, Kinesis), I think adding an orchestration framework optimized for all the steps of a robust data pipeline would let AWS' data analytics tools (Athena, QuickSight, etc.) really shine.
DATA-MAPPING-AS-A-SERVICE
I cut my teeth with data integration on platforms like WebMethods. While it may have had some drawbacks, it was, as a solution set, really excellent at:
- Providing endpoints for data delivery
- Identification of data by location, format, or other specific data elements
- Routing the data to the right processors based on the above features
- Mapping of each data entry from one format to another
- Delivery of transformed data into target location
I can see something akin to a managed Apache NiFi solution, offered in the same manner as AWS' Elasticsearch Service. Tying in the ability to route tasks to Lambda and/or Fargate for execution, supporting Directed Acyclic Graph (DAG) modeling, and tightly integrating with S3 for writing out both intermediate and final data would be a game-changer for products that have to import and process data files – particularly from third parties.
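To make that concrete, here's a minimal sketch (Python with boto3) of the kind of routing glue teams hand-wire today and that such a service could absorb: an S3-triggered Lambda that identifies incoming files by suffix and fans them out to format-specific processors. The processor function names are hypothetical placeholders.

```python
import json
import os
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical mapping from file format to downstream processor function
PROCESSORS = {
    ".csv": "process-csv",
    ".xml": "process-xml",
    ".json": "process-json",
}

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Route on the file suffix; a real service would also inspect
        # content, headers, or sender-specific conventions.
        suffix = os.path.splitext(key)[1].lower()
        target = PROCESSORS.get(suffix)
        if target is None:
            continue  # unrecognized format; dead-letter it in a real pipeline
        lambda_client.invoke(
            FunctionName=target,
            InvocationType="Event",  # async fan-out
            Payload=json.dumps({"bucket": bucket, "key": key}),
        )
```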
S3 LIFECYCLE ON READ TIME
One of my pet peeves with S3 lifecycle management is that moving from the Standard to the Infrequent Access storage class has nothing to do with how frequently the object is actually accessed. While I imagine the underlying architecture of an object store makes tracking this difficult, last-read time would provide a much-needed metric for making storage decisions.
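For context, here's a minimal sketch (Python/boto3) of how lifecycle rules are expressed today, with transitions keyed purely off object age; the bucket name and prefix are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-based-tiering",
                "Filter": {"Prefix": "ingest/"},
                "Status": "Enabled",
                "Transitions": [
                    # 30 days after creation -- regardless of whether the
                    # object was read yesterday or never at all.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```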
DYNAMODB DEEP DOCUMENT MODE
DynamoDB is a great hybrid key-value and document store, and I use it often for small document storage and retrieval. However, the current limits on document size and scan patterns make using DynamoDB as a managed, MongoDB-level solution a challenge. Providing more robust document-centric capabilities, while keeping the scalability, replication, and global presence, would significantly "up the game" for DynamoDB. As a wish-list item, I would like to see the pre-allocation of read and write throughput removed entirely: let each request set an optional throttle, but charge me for what I actually use rather than what I might use. The current autoscaling is a significant improvement over nothing, but it can be improved.
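To illustrate the pre-allocation gripe, here's a minimal sketch (Python/boto3) of creating a table today; the table and attribute names are hypothetical and the capacity figures are arbitrary.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="documents",  # hypothetical table
    AttributeDefinitions=[{"AttributeName": "doc_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "doc_id", "KeyType": "HASH"}],
    # This is the part I'd like to see go away: capacity reserved (and
    # billed) whether or not any requests actually arrive.
    ProvisionedThroughput={"ReadCapacityUnits": 50, "WriteCapacityUnits": 25},
)
```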
RDS – POLYGLOT EDITION
For a while there was an interesting trend of combining multiple database paradigms into a single view – document + graph, and so on. I think AWS may try to dip their toe into this space. By wiring a few of their existing products together behind the scenes – ElasticSearch, Aurora, and Neptune, say – they could offer a solution that tries to combine the best of each storage paradigm. Like most all-in-one tools, I'm honestly not sure whether it would just do each of those things equally poorly. That said, I often recommend a multi-storage architecture to clients, with each store optimized for a particular use case, so there may be something there.
S3 AUTO-CRAWLING AND METRICS
Imagine setting a flag on a data bucket so that whenever a data file drops there, it is automatically classified, indexed, and ready for querying through Athena, Glue, or Hive. Having some high-level metrics on the data within – row counts, average values, and so on – would be useful for other business decisions. Adding SageMaker algorithms for data variance (e.g. random cut forest for discovering outliers and/or trends) to fire off alerts would be incredible, too.
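As a rough sketch of the wiring this would replace, here's a hypothetical Lambda handler (Python/boto3) that an S3 event notification could invoke to kick off a Glue crawler whenever a new file lands; the crawler name is a placeholder.

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    # Re-crawl the landing bucket whenever a new data file arrives so the
    # table definitions stay current for Athena/Hive queries.
    try:
        glue.start_crawler(Name="landing-bucket-crawler")  # hypothetical crawler
    except glue.exceptions.CrawlerRunningException:
        pass  # a crawl is already in flight; the new file will be picked up
```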
WRAPPING IT UP
In closing this week, I think we'll see a number of announcements around data processing as an AWS-centric framework. AWS has most of the parts in play already; having them manage the wiring so you only have to focus on the business value you're extracting from the data would realize the promise of the cloud for data processing.
Going to be at re:Invent? Drop a comment below and let me know what you hope to see there or your thoughts on what’s next.
Top comments (2)
S3's lifecycle management generally leaves much to be desired. I mean, it's great that you could have a multi-stage lifecycle for data. But the fact that your only choice for sub-30-day policies is to go straight to Glacier is kind of dreadful. S3 is potentially great as a repository for nearline/offline storage (i.e., backups) ...but it currently lacks the useful lifecycle capabilities you get used to in legacy products like NetBackup. And, even aside from the whole loss of POSIX attributes if you want to simply sync a filesystem to S3, performance of such is dreadful due to the whole common-key issue. Both the POSIX attributes and common-key problems are solvable, but it's painful to sort the programmatic logic out. Overall, it has the feel of "you guys have been pestering us, here's something to shut you up for a while", but not really a fully-realized HSM.
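A minimal sketch of the usual workaround for the common-key issue mentioned above: salt the key space with a short hash-derived prefix so writes spread across S3 partitions rather than piling onto one shared prefix (the paths shown are hypothetical).

```python
import hashlib

def salted_key(path: str, shards: int = 16) -> str:
    """Prepend a stable, hash-derived shard so keys spread across prefixes."""
    shard = int(hashlib.md5(path.encode()).hexdigest(), 16) % shards
    return f"{shard:02x}/{path}"

# e.g. "backups/host1/etc/passwd" maps to "<shard>/backups/host1/etc/passwd"
```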
Maybe what AWS will introduce is an actual HSM-style interface to S3 or a service-overlay?
Also, I would hope that they're opting to flesh out the EFS offering. Things like: