Last week, Dun & Bradstreet joined Informatica World as a sponsor. Together we are exploring how uniquely leveraged geospatial data can turn climate risk insights into competitive advantage. This is good news for more innovative, resilient supply chains and proactive climate disclosures. Jason Lindauer
-
Are you ready to learn about the #data explosion? Come to IT Xpo Stage 3 at 10:00 to learn how to accelerate your data-driven transformation with DBaaS at #GartnerSYM. Luca Olivari Christophe Bardy
-
Data engineer at UBS | 2X Databricks Certified | Azure databricks | Spark | Java | SQL | Pyspark | Unix | Market Risk | Analytics Backtesting | Agile Scrum | Always open to learn!
🔔 Suggested: read Delta Lake basics before this at --> https://lnkd.in/eHjwsXuU

💡 Using the VACUUM command to optimize storage utilization 💡

**VACUUM command to delete old/unused Delta data files**

Consider a Delta table with some records, along with the transaction logs (JSON files) that describe them. Every insert, update, and delete operation generates a transaction log entry. When we run a DELETE or UPDATE command, the underlying Parquet data files are not actually removed; only a corresponding entry is made in a log file. Likewise, when we run OPTIMIZE to bin-pack smaller files into larger ones, the smaller files are not removed and still exist on storage. So there is a need to delete files that are no longer referenced and have become stale; if such files keep piling up, they eventually amount to a lot of unnecessary data.

The VACUUM command removes data files that are no longer referenced in the latest transaction logs and are older than the retention period, which is 7 days by default. It has to be used with caution: once VACUUM removes the files backing a version of the data, that data is completely lost, and time travel to that version becomes impossible.

The syntax looks like this:

%sql
VACUUM db_name.delta_table_name RETAIN 1 HOURS DRY RUN

DRY RUN means we are not actually running the VACUUM command; it only displays what would be affected by a real run. Doing a DRY RUN first is a safety step to see which data will be deleted before actually deleting it.

As soon as we run the above command, we will get an error, because the retention period specified in the command is less than the default of 168 hours (24*7). If we are sure that we really want to VACUUM below the default retention period, we have to set the config "spark.databricks.delta.retentionDurationCheck.enabled" to false, re-run the VACUUM command with DRY RUN to confirm which data will be deleted, and then run it again without DRY RUN. It will delete all the files that are not referenced in the latest transaction logs.

VACUUM should be performed periodically, as it saves on storage. Note that afterwards, running select * from db_name.table_name version as of 0 will give an error, because those files are permanently deleted.

#deltalake #datalake #bigdata #optimization
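A minimal PySpark sketch of the same workflow, assuming a Databricks/Delta Lake environment where spark is the ambient SparkSession; the table name is a placeholder:

# 1. Dry run first: lists the files VACUUM would delete, without deleting anything.
spark.sql("VACUUM db_name.delta_table_name RETAIN 168 HOURS DRY RUN").show(truncate=False)

# 2. Only if a retention window below the 168-hour default is really needed,
#    disable the safety check (with caution: this can break time travel and
#    concurrent readers of older versions).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM db_name.delta_table_name RETAIN 1 HOURS DRY RUN").show(truncate=False)

# 3. Actual deletion: the same statement without DRY RUN.
spark.sql("VACUUM db_name.delta_table_name RETAIN 1 HOURS")

-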
Data engineer at UBS | 2X Databricks Certified | Azure databricks | Spark | Java | SQL | Pyspark | Unix | Market Risk | Analytics Backtesting | Agile Scrum | Always open to learn!
💡 This post covers a document that explains 'What is Delta Lake?' and what makes Delta Lake preferable to a plain data lake.
💡 Delta Lake optimization techniques in the next posts (a quick sketch of the bin-packing and Z-ordering idea follows below):
1 --> https://lnkd.in/ezjNVHJY - Data skipping with stats
2 --> https://lnkd.in/eRipJuUX - Solving the small-file problem with bin-packing
3 --> https://lnkd.in/e8Vbbkfb - Z-ordering along with bin-packing for data pruning
4 --> https://lnkd.in/enPhy-pb - VACUUM command to delete old/unused Delta data files
5 --> https://lnkd.in/eUFu5bKQ - Photon Engine for Delta
#deltalake #datalake #bigdata #optimization
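For items 2 and 3 in the list above, a minimal sketch of what bin-packing plus Z-ordering looks like in practice, assuming a Databricks/Delta environment where spark is the ambient SparkSession; the table and column names are placeholders:

# OPTIMIZE alone bin-packs small files into larger ones; adding ZORDER BY
# co-locates rows with similar event_date values in the same files.
spark.sql("OPTIMIZE db_name.events ZORDER BY (event_date)")

# Data skipping can now prune whole files for filters on the Z-ordered column.
spark.sql("SELECT * FROM db_name.events WHERE event_date = '2024-01-01'").show()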
-
Explore the intricacies of Delta Lake through my latest Medium article, a comprehensive book review of 'Delta Lake: Up and Running' by Bennie Haelen. Perfect for aspiring data engineers! #dataengineering #deltalake #datalake
-
Our multi-model #database now features #vectorsearch to support efficient and accurate searching of high-dimensional data. Ready to work with vectors in #SurrealDB? Check out our reference guide for all the essentials to get started. 👉 https://sdb.li/3Y1azaX
-
We've all been there. The good news is, your data doesn't have to be in perfect shape to work with TextQL. Not sure where you stand on the spectrum of untidy data? Our data archeologists can audit your data and provide guidance on its readiness for AI-driven querying.
-
The RESTORE feature of Delta Lake has been critical for me in production data lakes. With a single SQL command you can revert your data to a clean state. Take the time to master the RESTORE command for Delta Lake; it will save you time and give you peace of mind! Check out the docs: https://lnkd.in/gzSWjziN #deltalake #datalake
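A minimal sketch of what that looks like, assuming a Databricks/Delta environment where spark is the ambient SparkSession; the table name, version number, and timestamp are placeholders:

# Inspect the table's history to pick the version (or timestamp) to go back to.
spark.sql("DESCRIBE HISTORY db_name.orders").show(truncate=False)

# Revert the table to a known-good version...
spark.sql("RESTORE TABLE db_name.orders TO VERSION AS OF 3")

# ...or to a point in time.
spark.sql("RESTORE TABLE db_name.orders TO TIMESTAMP AS OF '2024-01-01 00:00:00'")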
-
The recording of our panel discussion on the topic of "Data Mesh in Practice: Buzzword Or Real Impact?" at applydata summit is now available for you to watch. It was an absolute pleasure sharing the stage with Swantje Kowarsch, Alexander Czernay, and Sean Gustafson, and I thoroughly enjoyed this discussion! Check out the recording to hear about our experiences with data mesh and my own journey, previously at Thoughtworks and now at FREENOW. #datamesh #datascience #dataleadership https://lnkd.in/ea5AfXqF
-
The Complexity Of Your Pipelines Is Directly Related To The Complexity Of Your Data.

The most important point is that you are trying to preserve the natural shape and content of the data. Anything else and you are potentially screwing things up.

The shape at the source should be the shape at the target. You can eliminate columns, but any columns you add are augmenting the data and may only be derived from values already present in it. There is no magic possible; magical fields don't just appear.

You also clearly need to know the grain at the source, because that should be the grain at the target. Anything else may open you up to presenting misleading data, especially if stakeholders are intimately familiar with the source. #data #database #dataquality
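A small illustrative sketch of the principle, assuming PySpark and hypothetical table and column names: the grain (one row per order line) is preserved, and the only added column is derived from values already present in the row.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("grain-preserving-pipeline").getOrCreate()

# Source grain: one row per order line. (Hypothetical table and columns.)
src = spark.table("src.order_lines")

# Target keeps the same grain: no joins or aggregations that change it.
tgt = (
    src
    .select("order_id", "line_no", "quantity", "unit_price")  # dropping columns is fine
    # The only added column is computed from values already in the row.
    .withColumn("line_total", F.col("quantity") * F.col("unit_price"))
)

tgt.write.mode("overwrite").saveAsTable("tgt.order_lines")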
-
Struggling to manage massive amounts of data? Delta Tables offer a game-changing solution for your data lake! This first part of my series explores:
• What Delta Tables are and how they organize your data for easy access.
• Key features like ACID transactions, scalable metadata, schema enforcement, and time travel. ⏱️
• How Delta Tables compare to traditional data storage solutions. 🆚
Stay tuned for Part 2, where we'll dive deep into Delta superpowers like Delta Logs, time travel, and partition pruning! #bigdata #datamanagement #datalake #deltatable #databricks #apachespark
Link: https://lnkd.in/dYV2jMCn
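A minimal sketch of the basics listed above, assuming a Spark session already configured for Delta (e.g. a Databricks runtime or the delta-spark package); the path and data are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-basics").getOrCreate()

# Write a DataFrame as a Delta table: every write becomes a new, ACID-committed version.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Schema enforcement: appending data with an incompatible schema raises an error
# instead of silently corrupting the table.

# Time travel: read an earlier version of the table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
v0.show()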
-
Globalsoft's D&B connector for SaaS MDM is ready to enrich supplier data with climate risk insights.