Last week, Dun & Bradstreet joined Informatica World as a sponsor. Together we are exploring how uniquely leveraged geospatial data can turn climate risk insights into competitive advantage. This is good news for more innovative, resilient supply chains and proactive climate disclosures. Jason Lindauer
-
Are you ready to learn about the #data explosion? Come to IT Xpo Stage 3 at 10:00 to learn how to accelerate your data-driven transformation with DBaaS at #GartnerSYM. Luca Olivari Christophe Bardy
-
Data engineer at UBS | 2X Databricks Certified | Azure databricks | Spark | Java | SQL | Pyspark | Unix | Market Risk | Analytics Backtesting | Agile Scrum | Always open to learn!
🔔 Suggested: read Delta Lake basics before this at --> https://lnkd.in/eHjwsXuU

💡 Using the VACUUM command to optimize storage utilization 💡

**VACUUM command to delete old/unused Delta data files**

Consider a Delta table with some records, along with the transaction logs (JSON files) that describe them. Every insert, update, and delete operation generates a transaction log entry. When we run a DELETE or UPDATE command, the underlying Parquet data files are not actually removed; only a corresponding entry is made in a log file. Likewise, when we run OPTIMIZE to bin-pack smaller files into larger ones, the smaller files are not removed and still exist on storage. So there is a need to delete files that are no longer referenced and have become stale; if such files keep piling up, they eventually amount to a lot of unnecessary data.

The VACUUM command removes data files that are no longer referenced in the latest transaction logs and are older than the retention period, which is 7 days by default. It has to be used with caution: once VACUUM removes the files backing a version of the data, that data is completely lost, and time travel to that version becomes impossible.

The syntax looks like this:

%sql
VACUUM db_name.delta_table_name RETAIN 1 HOURS DRY RUN

DRY RUN means we are not actually running the VACUUM command; it only displays what would be affected by a real run. Doing a DRY RUN first is a safety step to see which data will be deleted before actually deleting it.

As soon as we run the above command, we will get an error, because the retention period specified in the command is less than the default of 168 hours (24*7). If we are sure that we really want to VACUUM below the default retention period, we have to set the config "spark.databricks.delta.retentionDurationCheck.enabled" to false, re-run the VACUUM command with DRY RUN to confirm which data will be deleted, and then run it again without DRY RUN. It will delete all the files that are not referenced in the latest transaction logs.

VACUUM should be performed periodically, as it saves on storage. Note that afterwards, running select * from db_name.table_name version as of 0 will give an error, because those files are permanently deleted.

#deltalake #datalake #bigdata #optimization
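A minimal PySpark sketch of the same workflow, assuming a Databricks/Delta Lake environment where spark is the ambient SparkSession; the table name is a placeholder:

# 1. Dry run first: lists the files VACUUM would delete, without deleting anything.
spark.sql("VACUUM db_name.delta_table_name RETAIN 168 HOURS DRY RUN").show(truncate=False)

# 2. Only if a retention window below the 168-hour default is really needed,
#    disable the safety check (with caution: this can break time travel and
#    concurrent readers of older versions).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM db_name.delta_table_name RETAIN 1 HOURS DRY RUN").show(truncate=False)

# 3. Actual deletion: the same statement without DRY RUN.
spark.sql("VACUUM db_name.delta_table_name RETAIN 1 HOURS")

-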
Data engineer at UBS | 2X Databricks Certified | Azure databricks | Spark | Java | SQL | Pyspark | Unix | Market Risk | Analytics Backtesting | Agile Scrum | Always open to learn!
💡 This post covers a document that explains 'What is Delta Lake?' and what makes Delta Lake preferable to a plain data lake.
💡 Delta Lake optimization techniques in the next posts (a quick sketch of the bin-packing and Z-ordering idea follows below):
1 --> https://lnkd.in/ezjNVHJY - Data skipping with stats
2 --> https://lnkd.in/eRipJuUX - Solving the small-file problem with bin-packing
3 --> https://lnkd.in/e8Vbbkfb - Z-ordering along with bin-packing for data pruning
4 --> https://lnkd.in/enPhy-pb - VACUUM command to delete old/unused Delta data files
5 --> https://lnkd.in/eUFu5bKQ - Photon Engine for Delta
#deltalake #datalake #bigdata #optimization
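For items 2 and 3 in the list above, a minimal sketch of what bin-packing plus Z-ordering looks like in practice, assuming a Databricks/Delta environment where spark is the ambient SparkSession; the table and column names are placeholders:

# OPTIMIZE alone bin-packs small files into larger ones; adding ZORDER BY
# co-locates rows with similar event_date values in the same files.
spark.sql("OPTIMIZE db_name.events ZORDER BY (event_date)")

# Data skipping can now prune whole files for filters on the Z-ordered column.
spark.sql("SELECT * FROM db_name.events WHERE event_date = '2024-01-01'").show()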
-
Explore the intricacies of Delta Lake through my latest Medium article, a comprehensive book review of 'Delta Lake: Up and Running' by Bennie Haelen. Perfect for aspiring data engineers! #dataengineering #deltalake #datalake
-
Our multi-model #database now features #vectorsearch to support efficient and accurate searching of high-dimensional data. Ready to work with vectors in #SurrealDB? Check out our reference guide for all the essentials to get started. 👉 https://sdb.li/3Y1azaX
-
We've all been there. The good news is, your data doesn't have to be in perfect shape to work with TextQL. Not sure where you stand on the spectrum of untidy data? Our data archeologists can audit your data and provide guidance on its readiness for AI-driven querying.
-
The RESTORE feature of Delta Lake has been critical for me in production data lakes. With a single SQL command you can revert your data to a clean state. Take the time to master the RESTORE command for Delta Lake; it will save you time and give you peace of mind! Check out the docs: https://lnkd.in/gzSWjziN #deltalake #datalake
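A minimal sketch of what that looks like, assuming a Databricks/Delta environment where spark is the ambient SparkSession; the table name, version number, and timestamp are placeholders:

# Inspect the table's history to pick the version (or timestamp) to go back to.
spark.sql("DESCRIBE HISTORY db_name.orders").show(truncate=False)

# Revert the table to a known-good version...
spark.sql("RESTORE TABLE db_name.orders TO VERSION AS OF 3")

# ...or to a point in time.
spark.sql("RESTORE TABLE db_name.orders TO TIMESTAMP AS OF '2024-01-01 00:00:00'")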
-
The recording of our panel discussion on the topic of "Data Mesh in Practice: Buzzword Or Real Impact?" at applydata summit is now available for you to watch. It was an absolute pleasure sharing the stage with Swantje Kowarsch, Alexander Czernay, and Sean Gustafson, and I thoroughly enjoyed this discussion! Check out the recording to hear about our experiences with data mesh and my own journey, previously at Thoughtworks and now at FREENOW. #datamesh #datascience #dataleadership https://lnkd.in/ea5AfXqF
-
The Complexity Of Your Pipelines Is Directly Related To The Complexity Of Your Data.

The most important point is that you are trying to preserve the natural shape and content of the data. Anything else and you are potentially screwing things up.

The shape at the source should be the shape at the target. You can eliminate columns, but any columns you add are augmenting the data and may only be derived from values already present in it. There is no magic possible; magical fields don't just appear.

You also clearly need to know the grain at the source, because that should be the grain at the target. Anything else may open you up to presenting misleading data, especially if stakeholders are intimately familiar with the source. #data #database #dataquality
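A small illustrative sketch of the principle, assuming PySpark and hypothetical table and column names: the grain (one row per order line) is preserved, and the only added column is derived from values already present in the row.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("grain-preserving-pipeline").getOrCreate()

# Source grain: one row per order line. (Hypothetical table and columns.)
src = spark.table("src.order_lines")

# Target keeps the same grain: no joins or aggregations that change it.
tgt = (
    src
    .select("order_id", "line_no", "quantity", "unit_price")  # dropping columns is fine
    # The only added column is computed from values already in the row.
    .withColumn("line_total", F.col("quantity") * F.col("unit_price"))
)

tgt.write.mode("overwrite").saveAsTable("tgt.order_lines")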
-
Struggling to manage massive amounts of data? Delta Tables offer a game-changing solution for your data lake! This first part of my series explores:
• What Delta Tables are and how they organize your data for easy access.
• Key features like ACID transactions, scalable metadata, schema enforcement, and time travel. ⏱️
• How Delta Tables compare to traditional data storage solutions. 🆚
Stay tuned for Part 2, where we'll dive deep into Delta superpowers like Delta Logs, time travel, and partition pruning! #bigdata #datamanagement #datalake #deltatable #databricks #apachespark
Link: https://lnkd.in/dYV2jMCn
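A minimal sketch of the basics listed above, assuming a Spark session already configured for Delta (e.g. a Databricks runtime or the delta-spark package); the path and data are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-basics").getOrCreate()

# Write a DataFrame as a Delta table: every write becomes a new, ACID-committed version.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Schema enforcement: appending data with an incompatible schema raises an error
# instead of silently corrupting the table.

# Time travel: read an earlier version of the table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
v0.show()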
-
Globalsoft's D&B connector for SaaS MDM is ready to enrich supplier data with climate risk insights.