A database is a collection of structured data stored in a computer system, and it can be hosted on-premises or in the cloud. Because databases are designed to make data easy to access, we have compiled our resources here so you can smoothly browse everything you need to know, from database management systems to database languages.
Microsoft's SQL Server is a powerful RDBMS that is used extensively across industries for data storage, retrieval, and analysis. The objective of this article is to help novices progress from SQL Server fundamentals to advanced techniques, using real-world examples derived from nProbe data. nProbe is a well-known network traffic monitoring tool that offers comprehensive insight into network traffic patterns.

## Getting Started With SQL Server

### 1. Introduction to SQL Server

SQL Server provides a comprehensive database management platform that integrates advanced analytics, robust security features, and extensive reporting capabilities. It supports a wide range of data types and functions, enabling efficient data management and analysis.

### 2. Installation

Begin by installing SQL Server. Microsoft offers different editions, including Express, Standard, and Enterprise, to cater to varying needs. The Express edition is free and suitable for learning and small applications. Microsoft's documentation provides a step-by-step installation guide.

### 3. Basic SQL Operations

Learn the fundamentals of SQL, including creating databases and tables and writing basic queries.

Create a database:

```sql
CREATE DATABASE TrafficData;
```

Create a table. Define a table structure to store nProbe data (note the `IDENTITY` property, so that SQL Server generates the primary key automatically on insert):

```sql
CREATE TABLE NetworkTraffic (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    SourceIP VARCHAR(15),
    DestinationIP VARCHAR(15),
    Packets INT,
    Bytes BIGINT,
    Timestamp DATETIME
);
```

## Intermediate SQL Techniques

### 4. Data Manipulation

#### Inserting Data

To insert data into the `NetworkTraffic` table, you might collect information from various sources, such as network sensors or logs.

```sql
INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp)
VALUES ('10.0.0.1', '192.168.1.1', 150, 2048, '2023-10-01T14:30:00');
```

Batch insert to minimize the impact on database performance:

```sql
INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp)
VALUES
    ('10.0.0.2', '192.168.1.2', 50, 1024, '2023-10-01T15:00:00'),
    ('10.0.0.3', '192.168.1.3', 100, 1536, '2023-10-01T15:05:00'),
    ('10.0.0.4', '192.168.1.4', 200, 4096, '2023-10-01T15:10:00');
```

#### Updating Data

You may need to update records as new data becomes available or corrections become necessary. For instance, updating the byte count for a particular traffic record:

```sql
UPDATE NetworkTraffic
SET Bytes = 3072
WHERE ID = 1;
```

Update multiple fields at once:

```sql
UPDATE NetworkTraffic
SET Packets = 180, Bytes = 3072
WHERE SourceIP = '10.0.0.1' AND Timestamp = '2023-10-01T14:30:00';
```

#### Deleting Data

Removing data is straightforward but should be handled with caution to avoid accidental data loss.

```sql
DELETE FROM NetworkTraffic
WHERE Timestamp < '2023-01-01';
```

Conditional delete based on network traffic analysis:

```sql
DELETE FROM NetworkTraffic
WHERE Bytes < 500 AND Timestamp BETWEEN '2023-01-01' AND '2023-06-01';
```

#### Querying Data

Simple queries retrieve basic information from your data set:

```sql
SELECT * FROM NetworkTraffic
WHERE SourceIP = '10.0.0.1';
```

Select specific columns:

```sql
SELECT SourceIP, DestinationIP, Bytes
FROM NetworkTraffic
WHERE Bytes > 2000;
```

#### Aggregate Functions

Aggregate functions are useful for summarizing or analyzing large data sets.
```sql
SELECT AVG(Bytes), MAX(Bytes), MIN(Bytes)
FROM NetworkTraffic
WHERE Timestamp > '2023-01-01';
```

Grouping data for more detailed analysis:

```sql
SELECT SourceIP, AVG(Bytes) AS AvgBytes
FROM NetworkTraffic
GROUP BY SourceIP
HAVING AVG(Bytes) > 1500;
```

#### Join Operations

In scenarios where you have multiple tables, joins are essential. Assume another table, `IPDetails`, stores additional information about each IP.

```sql
SELECT n.SourceIP, n.DestinationIP, n.Bytes, i.Location
FROM NetworkTraffic n
JOIN IPDetails i ON n.SourceIP = i.IPAddress
WHERE n.Bytes > 1000;
```

#### Complex Queries

Combine multiple SQL operations to extract in-depth insights:

```sql
SELECT SourceIP, SUM(Bytes) AS TotalBytes
FROM NetworkTraffic
WHERE Timestamp BETWEEN '2023-01-01' AND '2023-02-01'
GROUP BY SourceIP
ORDER BY TotalBytes DESC;
```

## Advanced SQL Server Features

### 5. Indexing for Performance

Optimizing SQL Server performance through indexing, and leveraging stored procedures for automation, is critical for managing large databases efficiently. Here's an in-depth look at both topics, with practical examples focused on enhancing operations within a network traffic database like the one collected from nProbe.

#### Why Indexing Matters

Indexing speeds up the retrieval of records by reducing the number of disk accesses required when a query is processed. It is especially vital in databases with large volumes of data, where search operations would otherwise become increasingly slow.

#### Types of Indexes

- Clustered indexes: Change the way records are stored, as they sort and store the data rows in the table based on their key values. A table can have only one clustered index.
- Non-clustered indexes: Do not alter the physical order of the data; instead, they create a logical ordering of the data rows and use pointers to the physical rows. Each table can have multiple non-clustered indexes.

#### Example: Creating an Index on Network Traffic Data

Suppose you frequently query the `NetworkTraffic` table to fetch records based on `SourceIP` and `Timestamp`. You can create a non-clustered index to speed up these queries:

```sql
CREATE NONCLUSTERED INDEX idx_networktraffic_sourceip
ON NetworkTraffic (SourceIP, Timestamp);
```

This index particularly improves performance for queries that look up records by `SourceIP` and filter on `Timestamp`, as the index helps locate data quickly without scanning the entire table.

### 6. Stored Procedures and Automation

#### Benefits of Using Stored Procedures

Stored procedures encapsulate SQL code for reuse and automate routine operations. They enhance security, reduce network traffic, and improve performance by minimizing the amount of information sent to the server.

#### Example: Creating a Stored Procedure

Imagine you often need to insert new records into the `NetworkTraffic` table.
A stored procedure that encapsulates the insert operation can simplify the addition of new records:

```sql
CREATE PROCEDURE AddNetworkTraffic
    @SourceIP VARCHAR(15),
    @DestinationIP VARCHAR(15),
    @Packets INT,
    @Bytes BIGINT,
    @Timestamp DATETIME
AS
BEGIN
    INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp)
    VALUES (@SourceIP, @DestinationIP, @Packets, @Bytes, @Timestamp);
END;
```

#### Using the Stored Procedure

To insert a new record, instead of writing a full insert query, you simply execute the stored procedure:

```sql
EXEC AddNetworkTraffic
    @SourceIP = '192.168.1.1',
    @DestinationIP = '10.0.0.1',
    @Packets = 100,
    @Bytes = 2048,
    @Timestamp = '2024-04-12T14:30:00';
```

#### Automation Example: Scheduled Tasks

SQL Server Agent can be used to schedule the execution of stored procedures. For instance, you might want to run a procedure that cleans up old records every night:

```sql
CREATE PROCEDURE CleanupOldRecords
AS
BEGIN
    DELETE FROM NetworkTraffic
    WHERE Timestamp < DATEADD(month, -1, GETDATE());
END;
```

You can schedule this procedure to run automatically at midnight every day using SQL Server Agent, ensuring that your database does not retain outdated records beyond a certain period. By implementing proper indexing strategies and utilizing stored procedures, you can significantly enhance the performance and maintainability of your SQL Server databases. These practices are particularly beneficial in environments where data volumes are large and efficiency is paramount, such as managing network traffic monitoring data.

### 7. Performance Tuning and Optimization

Performance tuning and optimization in SQL Server involve a systematic review of database and system settings to improve the efficiency of your operations. Proper tuning not only enhances the speed and responsiveness of your database but also helps you manage resources more effectively, leading to cost savings and improved user satisfaction.

#### Key Areas for Performance Tuning and Optimization

##### 1. Query Optimization

- Optimize queries: The first step in performance tuning is to ensure that queries are as efficient as possible. This includes selecting only the columns you need, avoiding unnecessary calculations, and using joins effectively.
- Query profiling: SQL Server provides tools like SQL Server Profiler and the Query Store that help identify slow-running queries and bottlenecks in your SQL statements.

Example: Here's how you can use the Query Store to find performance issues (the plan catalog view joins to the query text through `sys.query_store_query`):

```sql
SELECT TOP 10 qt.query_sql_text, rs.avg_duration
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS qp ON q.query_id = qp.query_id
JOIN sys.query_store_runtime_stats AS rs ON qp.plan_id = rs.plan_id
ORDER BY rs.avg_duration DESC;
```

##### 2. Index Management

- Review and adjust indexes: Regularly review the usage and effectiveness of indexes. Unused indexes should be dropped, and missing indexes should be added where significant performance gains can be made.
- Index maintenance: Rebuilding and reorganizing indexes helps maintain performance, especially in databases with heavy write activity.

Example: Rebuild an index using T-SQL:

```sql
ALTER INDEX ALL ON dbo.YourTable
REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = OFF);
```

##### 3. Database Configuration and Maintenance

- Database settings: Adjust settings such as the recovery model, file configuration, and buffer management to optimize performance.
- Routine maintenance: Implement regular maintenance plans that include updating statistics, checking database integrity, and cleaning up old data.

Example: Set up a maintenance plan in SQL Server Management Studio (SSMS) using the Maintenance Plan Wizard.

##### 4. Hardware and Resource Optimization

- Hardware upgrades: Sometimes the best way to achieve performance gains is through hardware upgrades, such as increasing memory, adding faster disks, or upgrading CPUs.
- Resource allocation: Ensure that SQL Server has enough memory and CPU resources allocated, particularly in environments where the server hosts multiple applications.

Example: Configure maximum server memory (in MB):

```sql
EXEC sp_configure 'max server memory', 4096;
RECONFIGURE;
```

##### 5. Monitoring and Alerts

- System monitoring: Continuous monitoring of system performance metrics is crucial. Tools like System Monitor (PerfMon) and Dynamic Management Views (DMVs) in SQL Server provide real-time data about system health.
- Alerts setup: Configure alerts for critical conditions, such as low disk space, high CPU usage, or blocking issues, so that timely action can be taken.

Example: Set up an alert in SQL Server Agent (the `@job_id` value below is a placeholder for the job to run in response):

```sql
USE msdb;
GO

EXEC dbo.sp_add_alert
    @name = N'High CPU Alert',
    @message_id = 0,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1,
    @notification_message = N'SQL Server CPU usage is high.',
    @performance_condition = N'SQLServer:SQL Statistics|Batch Requests/sec|_Total|>|1000',
    @job_id = N'00000000-1111-2222-3333-444444444444';
GO
```

Performance tuning and optimization is an ongoing process, requiring regular adjustments and monitoring. By systematically addressing these key areas, you can ensure that your SQL Server environment runs efficiently and effectively supports your organizational needs.

## Conclusion

Mastering SQL Server is a journey that evolves with practice and experience. From basic operations to advanced features, SQL Server provides a powerful toolset for managing and analyzing data. As your skills progress, you can handle larger datasets like those from nProbe, extracting valuable insights and improving your network's performance and security. For those looking to dive deeper, Microsoft offers extensive documentation and a community rich with resources for exploring more complex SQL Server capabilities.

## Useful References

- nProbe
- SQL Server
- SQL Server performance tuning
I first shared this blog from my session at GIDS 2024. If you attended it, thank you for coming, and I hope you found it useful! If not, well, you have the resources and links anyway – I have written out the talk so that you can follow along with the slides if you need more context. Hopefully, the folks at GIDS will publish the video as well; I will add the link once it's available.

## Key Info

- Slides available here
- GitHub repository – code and instructions on how to get the demo up and running

## Summarized Version of the Talk

I had 30 minutes – so I kept it short and sweet!

### Setting the Context

Foundation models (FMs) are the heart of generative AI. These models are pre-trained on vast amounts of data. Large language models (LLMs) are a class of FMs; for instance, the Claude family from Anthropic, Llama from Meta, etc. You generally access these using dedicated platforms; for example, Amazon Bedrock, a fully managed service with a wide range of models accessible via APIs. These models are pretty powerful, and they can be used standalone to build generative AI apps.

### So Why Do We Need Vector Databases?

To better understand this, let's take a step back and talk about the limitations of LLMs. I will highlight a few common ones.

#### LLM Limitations

- Knowledge cut-off: The knowledge of these models is limited to the data that was current at the time they were pre-trained or fine-tuned.
- Hallucination: Sometimes these models provide an incorrect response, quite "confidently."

#### Lack of Access to External Data Sources

Another limitation is the lack of access to external data sources. Think about it: you can set up an AWS account and start using models on Amazon Bedrock. But if you want to build generative AI applications that are specific to your business needs, you need domain- or company-specific private data (for example, a customer service chatbot that can access customer details, order info, etc.). Now, it's possible to train or fine-tune these models with your data – but it's not trivial or cost-effective. However, there are techniques to work around these constraints – RAG (discussed later) being one of them – and vector databases play a key role.

### Dive Into Vector Databases

Before we get into it, let's understand the following:

#### What Is a Vector?

In simple terms, vectors are numerical representations of text. There is input text (also called a prompt). You pass it through something called an embedding model – think of it as a stateless function – and you get back an array of floating-point numbers. What's important to understand is that vectors capture semantic meaning, so they can be used for relevancy- or context-based search rather than simple text search.
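To make this concrete, here is a minimal sketch of generating an embedding with the AWS SDK for Python (boto3) against Amazon Bedrock. The model ID and region are assumptions for illustration; swap in whatever embedding model you have access to.

```python
import json

import boto3

# Assumes AWS credentials are configured and Bedrock model access has been granted.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="amazon.titan-embed-text-v1",  # assumed embedding model for this sketch
    contentType="application/json",
    accept="application/json",
    body=json.dumps({"inputText": "What is a vector database?"}),
)

payload = json.loads(response["body"].read())
embedding = payload["embedding"]  # the array of floating-point numbers
print(len(embedding), embedding[:5])
```

Text in, fixed-length array of floats out – that output is what gets stored and searched in a vector database.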
#### Types of Vector Databases

I tend to categorize vector databases into two types:

- Vector data type support within existing databases, such as PostgreSQL, Redis, OpenSearch, MongoDB, Cassandra, etc.
- Specialized vector databases, like Pinecone, Weaviate, Milvus, Qdrant, ChromaDB, etc.

This field is also moving very fast, and I'm sure we will see a lot more in the near future! You can run these specialized vector stores on AWS via their dedicated cloud offerings. But I want to quickly give you a glimpse of the choices in the first category that I referred to. The following are supported as native AWS databases:

- Amazon OpenSearch Service
- Amazon Aurora with PostgreSQL compatibility
- Amazon DocumentDB (with MongoDB compatibility)
- Amazon MemoryDB for Redis, which has vector search in preview (at the time of writing)

### Vector Databases in Generative AI Solutions

Here is a simplified view of where vector databases sit in generative AI solutions:

1. You take your domain-specific data and split/chunk it up.
2. You pass the chunks through an embedding model: this gives you vectors, or embeddings.
3. You store these embeddings in a vector database.
4. Applications then execute semantic search queries and combine the results in various ways (RAG being one of them).

### Demo 1 (of 3): Semantic Search With OpenSearch and LangChain

Find the details in the GitHub repository linked earlier.

### RAG: Retrieval Augmented Generation

We covered the limitations of LLMs – knowledge cut-off, hallucination, no access to internal data, etc. Of course, there are multiple ways to overcome them:

- Prompt-engineering techniques: Zero-shot, few-shot, etc. Sure, this is cost-effective, but how would it apply to domain-specific data?
- Fine-tuning: Take an existing LLM and train it using a specific dataset. But what about the infrastructure and costs involved? Do you want to become a model development company or focus on your core business?

These are just a few examples. The RAG technique adopts a middle ground. There are three key parts to a RAG workflow:

- Part 1: Data ingestion. You take your source data (PDF, text, images, etc.), break it down into chunks, pass it through an embedding model, and store the embeddings in the vector database.
- Part 2: Retrieval from the end-user application (e.g., a chatbot). The user sends a query; this input is converted to a vector embedding using the same embedding model that was used for the source data. We then execute a semantic or similarity search to get the top-N closest results.
- Part 3: Augmented generation. These results, also referred to as "context," are combined with the user input and a specialized prompt. Finally, this is sent to an LLM – note that this is not the embedding model; this is a large language model. The added context in the prompt helps the model provide a more accurate and relevant response to the user's query.

### Demo 2 (of 3): RAG With OpenSearch and LangChain

Find the details in the GitHub repository linked earlier.

### Fully Managed RAG Experience: Knowledge Bases for Amazon Bedrock

Another approach is to have a managed solution take care of the heavy lifting. For example, if you use Amazon Bedrock, Knowledge Bases can make RAG easier and more manageable. It supports the entire RAG workflow, from ingestion to retrieval and prompt augmentation, and it supports multiple vector stores for the vector embedding data.

### Demo 3 (of 3): Fully Managed RAG With Knowledge Bases for Amazon Bedrock

Find the details in the GitHub repository linked earlier.

Now, how do we build RAG applications using this? For application integration, this is exposed via APIs:

- RetrieveAndGenerate: Call the API, get the response – that's it. Everything (query embedding, semantic search, prompt engineering, LLM orchestration) is handled!
- Retrieve: For custom RAG workflows, where you simply extract the top-N responses (like semantic search) and integrate the rest as per your choice.
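To illustrate the difference between the two APIs, here is a hedged boto3 sketch; the knowledge base ID and model ARN are placeholders, and the exact response shapes may vary by SDK version.

```python
import boto3

# The Knowledge Bases retrieval APIs live in the Bedrock Agent Runtime service.
client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

KB_ID = "YOUR_KB_ID"  # placeholder knowledge base ID

# RetrieveAndGenerate: retrieval, prompt augmentation, and generation in one call.
answer = client.retrieve_and_generate(
    input={"text": "How do I rebook a cancelled flight?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KB_ID,
            # Placeholder model ARN - use the LLM you want for generation.
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
        },
    },
)
print(answer["output"]["text"])

# Retrieve: fetch the top-N matching chunks and build your own RAG flow around them.
results = client.retrieve(
    knowledgeBaseId=KB_ID,
    retrievalQuery={"text": "How do I rebook a cancelled flight?"},
)
for item in results["retrievalResults"]:
    print(item["content"]["text"])
```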
Specifically, "Knowledge bases for Amazon Bedrock" Code samples for Amazon Bedrock Lots of content and practical solutions in the generative AI community space! Ultimately, there is no replacement for hands-on learning. Head over to Amazon Bedrock and start building! Wrapping Up And, that's it. Like I said, I had 30-mins and I kept it short and sweet! This area is evolving very quickly. This includes vector databases, LLMs (there is one every week - feels like the JavaScript frameworks era!), and frameworks (like LangChain, etc.). It's hard to keep up but remember: the fundamentals are the same. The key is to grasp them - hopefully, this helps with some of it. Happy building!
SQL has a long and proven history. It has survived the fuss around NoSQL. Even if it is not perfect, it has been demonstrated to be the best available language for data. This is no surprise! The story began in the 1960s with the development of databases – an era marked by the introduction of the Integrated Data Store (IDS) at General Electric. However, it was Edgar Codd's relational model that revolutionized data handling. His model, which turned data into a series of tables (or, more strictly, relations), has influenced database systems ever since. This era also saw the birth of SQL (Structured Query Language), which became the standard language for interacting with relational databases, including MariaDB and others.

## The Utility of Relational Database Systems

So, why do we need all this database stuff? Let's imagine you're building an app, maybe a simple to-do list to keep track of your daily tasks. Initially, you might think, "Why not just save each task directly to a file?" After all, my programming language has constructs and libraries to save and read data from disk. Also, implementing this seems straightforward: create a task, write it to a file; delete a task, remove it from the file. These are good points; however, as your app gains traction, users start to accumulate, and suddenly you have thousands of users trying to add, delete, and modify tasks simultaneously.

At this point, the simplicity of files becomes fragile. Imagine one user is updating a task at the exact moment another tries to delete it. Or maybe two users are editing the same task at the same time. With a simple file system, you're likely to end up with corrupted or lost data because there's no inherent mechanism to handle such conflicts. Databases handle these situations gracefully through the ACID properties – essentially, a set of principles that ensures that even if your app crashes midway through an update, the data remains consistent and no half-completed tasks are left hanging. Back to the to-do app example: imagine trying to move your task "Buy groceries" from pending to completed, which also requires changing the last_updated property, but your app crashes right in the middle. With a relational database, it's all or nothing – either the task is marked complete and the last_updated property reflects the new time value, or it's as if you never tried to update it in the first place, avoiding those incorrect half-states.
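That all-or-nothing behavior is simply a transaction. Here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect("todo.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tasks "
    "(id INTEGER PRIMARY KEY, title TEXT, status TEXT, last_updated TEXT)"
)
conn.execute(
    "INSERT INTO tasks (title, status, last_updated) "
    "VALUES ('Buy groceries', 'pending', '2024-01-01T09:00:00')"
)
conn.commit()

# Both UPDATEs run inside one transaction: they commit together or not at all.
with conn:
    conn.execute("UPDATE tasks SET status = 'completed' WHERE title = 'Buy groceries'")
    # If the app crashed here, the transaction would roll back and
    # neither change would persist - no half-updated task.
    conn.execute(
        "UPDATE tasks SET last_updated = '2024-01-02T10:00:00' "
        "WHERE title = 'Buy groceries'"
    )
```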
Now, let's consider data relationships. In your app, tasks might belong to different categories or users. In a file system, maintaining these relationships is cumbersome. You might end up with a separate file for each category or user, but then how do you quickly find all tasks across categories, or ensure two users don't end up with the same task ID? Databases are built to manage complex relationships, making it easy to query all tasks for a specific user or category, or to run more complex queries like "show me the number of completed tasks for user U grouped by category C during the last month."

Security is another biggie. In a file system, if someone gains access to your files, they have your data. Databases offer robust security features, like access controls and encryption, safeguarding your data from unauthorized eyes. And then there's the issue of growth. Your simple to-do app might evolve into a complex enterprise project management tool over time. With a file system, every change can feel like renovating a building with people still inside. Databases are built to be flexible and scalable, meaning they're designed to grow with your needs, whether you're adding new features or handling more users.

In the end, choosing a database over a simple file system is about preparing for success while standing on solid ground. It's about ensuring that as your app grows, your data remains secure, consistent, and manageable, and your users happy. After all, no one likes losing their to-do list to a random crash, or waiting forever for tasks to load because the system is bogged down handling conflicts and searches!

## A Bit of History

It was Edgar Codd who proposed the relational model for databases, and since he was a mathematician, he formalized the concepts by creating what are called relational algebra and relational calculus. All this was theoretical until IBM and others started to implement the concepts in academic and research projects. They also wanted to come up with a standard language for querying data in relational databases. At the University of California, Berkeley, researchers invented QUEL. At IBM, researchers wanted to come up with their own language and started a project – one I perceive more as a game between colleagues – called SQUARE (Specifying Queries As Relational Expressions). This led to a query language with a scientific-like notation full of subscripts and superscripts, which was hard to type on computer keyboards. To solve this, they redefined the language to use only standard characters and, in an ingenious and probably friendly mocking way, called it SEQUEL. This name, however, was a trademark in the UK, which prevented them from using it. They removed the vowels in SEQUEL, and boom! SQL was born. By 1986, SQL would become an ISO and ANSI standard. As a curious historical remark, although its inventors had to rename SEQUEL to SQL, they continued to call it a "sequel." Even today, many software developers and IT professionals still pronounce it "sequel." The name Structured Query Language (SQL) would appear later.

## The Utility of SQL

SQL is a declarative language, meaning that you specify what you want to get, not how to get it. The database is in charge of doing whatever needs to be done to get the data requested. SQL isolates you from database complexity. A database is a complex piece of software with tons of algorithms implemented in it. These algorithms deal with different ways to get data stored on disk or in memory, and different algorithms are more efficient in different circumstances, depending on the query and the dataset. For example, in MariaDB, a component called the query optimizer is in charge of deciding which algorithms to use given a SQL query and statistics gathered on the actual data. The query optimizer analyzes the SQL query, the data structures, the database schema, and the statistical distribution of the data. It then decides whether to use an index, which join algorithm is best, and how to sequence the operations. This process involves a remarkable amount of complexity and mathematical precision, all of which the database manages for you behind the scenes. As a developer, you only need to worry about constructing the query to get the data you need, and let the database figure out whether to use an index (with some datasets, not using an index can be faster), B-trees, hash tables, or even whether to add the data to an in-memory cache, among many other things.
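You can actually watch an optimizer make these decisions. Here is a minimal sketch using Python's built-in sqlite3 module – SQLite's planner rather than MariaDB's, but the principle is identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)")
conn.execute("CREATE INDEX idx_tasks_user ON tasks (user_id)")

# We only declare *what* we want; EXPLAIN QUERY PLAN shows *how* the
# optimizer decided to get it (here, by using the index).
for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM tasks WHERE user_id = 42"):
    print(row)  # e.g. (..., 'SEARCH tasks USING INDEX idx_tasks_user (user_id=?)')
```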
SQL also allows you to handle writes – that is, creating and updating data. It also allows you to define the schema of the database or, in short and over-simplified terms, the tables and their column structure. In fact, there's much more that SQL allows you to do, and its functionality can be divided into four categories:

- Data definition language (DDL): Creating and manipulating the schema.
- Data manipulation language (DML): Inserting, updating, and deleting data in the database.
- Data query language (DQL): Retrieving data from the database.
- Data control language (DCL): Dealing with rights and permissions over the database and its objects.

In my more than 15 years of experience in the industry, I have rarely seen these categories used in a work environment, with the exception of DDL, which refers to activities related to handling database schema updates. These categories are useful mostly in academic circles or in teams implementing relational database management software. However, it's good to know that these terms exist and are used by others, as they help in discussions around database technology. With this in mind, let me briefly touch on one such discussion. Some would say that developers have to deal only with DML and DQL, while DDL and DCL are the DBA's concern. In practice, this division is not so easy to make. Developers need to understand how database objects (like tables and columns) are created and how access to these objects is managed. However, it is true that developers spend most of their time writing SQL statements to modify and query data. You'll see that this book focuses on DML and DQL while explaining the other categories as needed. On the other hand, DBAs are experts on everything database: from infrastructure and general database management to SQL query optimization and migration, a DBA is always a valuable brain to have on your team.

## Conclusion

In conclusion, databases solve real problems that application developers face, thanks to their ability to ensure data integrity through the ACID properties, manage complex relationships, and provide robust security features. I have only scratched the surface here, but this should be enough to give novice IT practitioners a quick refresher on the importance of relational databases and SQL.
As a developer, you may encounter situations where your application's database must handle large amounts of data. One way to manage this data effectively is through database sharding, a technique that distributes data horizontally across multiple servers or databases. Sharding can improve performance, scalability, and reliability by breaking up a large database into smaller, more manageable pieces called shards. In this article, we'll explore the concept of database sharding, discuss various sharding strategies, and provide a step-by-step guide to implementing sharding in MongoDB, a popular NoSQL database.

## Understanding Database Sharding

Database sharding involves partitioning a large dataset into smaller subsets called shards. Each shard contains a portion of the total data and operates independently from the others. Because queries and transactions execute against a single shard rather than the entire dataset, response times are faster and resources are utilized more efficiently.

## Sharding Strategies

There are several sharding strategies to choose from, depending on your application's requirements (a small illustration of the hash-based approach follows the MongoDB walkthrough below):

- Range-based sharding: Data is partitioned based on a specific range of values (e.g., users with IDs 1-1000 in shard 1, users with IDs 1001-2000 in shard 2).
- Hash-based sharding: A hash function is applied to a specific attribute (e.g., user ID), and the result determines which shard the data belongs to. This method ensures a balanced distribution of data across shards.
- Directory-based sharding: A separate lookup service or table is used to determine which shard a piece of data belongs to. This approach provides flexibility in adding or removing shards but may introduce an additional layer of complexity.
- Geolocation-based sharding: Data is partitioned based on the geographical location of the users or resources, reducing latency for geographically distributed users.

## Implementing Sharding in MongoDB

MongoDB supports sharding out of the box, making it a great choice for developers looking to implement sharding in their applications. Here's a step-by-step guide to setting up sharding in MongoDB. We will use the MongoDB shell, which uses JavaScript syntax for writing commands and interacting with the database.

### 1. Set Up a Config Server

The config server stores metadata about the cluster and shard locations. For production environments, use a replica set of three config servers.

```shell
mongod --configsvr --dbpath /data/configdb --port 27019 --replSet configReplSet
```

### 2. Initialize the Config Server Replica Set

This command initiates a new replica set on the MongoDB instance running on port 27019.

```shell
mongo --port 27019
> rs.initiate()
```

### 3. Set Up Shard Servers

Start each shard server with the `--shardsvr` option and a unique `--dbpath`.

```shell
mongod --shardsvr --dbpath /data/shard1 --port 27018
mongod --shardsvr --dbpath /data/shard2 --port 27017
```

### 4. Start the mongos Process

The mongos process acts as a router between clients and the sharded cluster.

```shell
mongos --configdb configReplSet/localhost:27019
```

### 5. Connect to the mongos Instance and Add the Shards

```shell
mongo
> sh.addShard("localhost:27018")
> sh.addShard("localhost:27017")
```

### 6. Enable Sharding for a Specific Database and Collection

```shell
> sh.enableSharding("myDatabase")
> sh.shardCollection("myDatabase.myCollection", {"userId": "hashed"})
```

In this example, we've set up a MongoDB sharded cluster with two shards and used hash-based sharding on the `userId` field.
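To build intuition for what the hashed shard key does, here is a tiny Python sketch of hash-based routing. MongoDB's hashed index does not literally compute MD5 modulo the shard count; this is just the idea in miniature.

```python
import hashlib

NUM_SHARDS = 2  # mirrors the two-shard cluster above

def shard_for(user_id: str) -> int:
    # Hash the shard key, then map the digest onto one of the shards.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for uid in ["user-1", "user-2", "user-3", "user-4"]:
    print(f"{uid} -> shard {shard_for(uid)}")
```

Because a good hash spreads keys uniformly, documents land evenly across the shards regardless of any ordering in the key itself.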
Now, data in the `myCollection` collection will be distributed across the two shards, improving performance and scalability.

## Conclusion

Database sharding is an effective technique for managing large datasets in your application. By understanding the different sharding strategies and implementing them with MongoDB, you can significantly improve your application's performance, scalability, and reliability. With this guide, you should now have a solid understanding of how to set up sharding in MongoDB and apply it to your own projects. Happy learning!
In the ever-expanding digital landscape, where data is generated at an unprecedented rate, the architecture of databases stands as the bedrock of efficient data management. With the rise of Big Data and cloud technologies, alongside the integration of Artificial Intelligence (AI), the realm of database architectures has undergone a profound transformation. This article delves into the intricate world of database architectures, exploring their adaptation to Big Data and cloud environments while also dissecting the evolving impact of AI on their structure and functionality.

As organizations grapple with the challenges of handling vast amounts of data in real time, the significance of robust database architectures becomes increasingly apparent. From the traditional foundations of Relational Database Management Systems (RDBMS) to the flexible solutions offered by NoSQL databases and the scalability of cloud-based architectures, the evolution continues to meet the demands of today's data-driven landscape. Furthermore, the convergence of AI technologies introduces new dimensions to database management, enabling intelligent query optimization, predictive maintenance, and the emergence of autonomous databases. Understanding these dynamics is crucial for navigating the complexities of modern data ecosystems and leveraging the full potential of data-driven insights.

## The Traditional Foundation: Relational Database Management Systems (RDBMS)

Traditionally, Relational Database Management Systems (RDBMS) have been the stalwarts of data management. Characterized by structured data organized into tables with predefined schemas, RDBMS ensure data integrity and transactional reliability through the ACID (Atomicity, Consistency, Isolation, Durability) properties. Examples of RDBMS include MySQL, Oracle, and PostgreSQL.

## Embracing the Complexity of Big Data: NoSQL Databases

The advent of Big Data necessitated a shift from the rigid structures of RDBMS to more flexible solutions capable of handling massive volumes of unstructured or semi-structured data. Enter NoSQL databases, a family of database systems designed to cater to the velocity, volume, and variety of Big Data (Kaushikkumar Patel, 2024). NoSQL databases come in various forms, including document-oriented databases, key-value stores, column-family stores, and graph databases, each optimized for specific data models and use cases. Examples include MongoDB, Cassandra, and Apache HBase.

## Harnessing the Power of the Cloud: Cloud-Based Database Architectures

Cloud-based database architectures leverage the scalability, flexibility, and cost-efficiency of cloud infrastructure to provide on-demand access to data storage and processing resources. Through models such as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Database as a Service (DBaaS), organizations can choose the level of abstraction and management that suits their needs. Multi-cloud and hybrid cloud architectures further enhance flexibility by enabling workload distribution across multiple cloud providers or integration with on-premises infrastructure (Hichem Moulahoum and Faezeh Ghorbanizamani, 2024). Notable examples include Amazon Aurora, Google Cloud Spanner, and Microsoft Azure Cosmos DB.

## Data Flow and Storage: On-Premises vs. Cloud Databases

Understanding data flow and storage is crucial for managing both on-premises and cloud databases effectively.
Here's a breakdown, with a database architecture diagram for each scenario:

### On-Premises Database

Explanation:

- Application server: Interacts with the database, initiating data creation, retrieval, and updates.
- Data extraction: This process, often using Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) methodologies, pulls data from various sources, transforms it into a format compatible with the database, and loads it.
- Database: The core storage location, managing and organizing data using specific structures such as relational tables or NoSQL document stores.
- Storage: The physical storage devices, such as hard disk drives (HDDs) or solid-state drives (SSDs), holding the database files.
- Backup system: Regular backups are crucial for disaster recovery and ensuring data availability.

Data flow:

1. Applications interact with the database server, sending data creation, retrieval, and update requests.
2. The ETL/ELT process extracts data from various sources, transforms it, and loads it into the database.
3. Data is persisted within the database engine, organized by its specific structure.
4. Storage devices physically hold the database files.
5. Backups are periodically created and stored separately for data recovery purposes.

### Cloud Database

Explanation:

- Application server: As in the on-premises scenario, this interacts with the database, but through an API gateway or SDK provided by the cloud service provider.
- API gateway/SDK: This layer acts as an abstraction, hiding the underlying infrastructure complexity and providing a standardized way for applications to interact with the cloud database.
- Cloud database: A managed service offered by cloud providers that handles database creation, maintenance, and scaling automatically.
- Cloud storage: The cloud provider's storage infrastructure, where database files and backups are stored.

Data flow:

1. Applications interact with the cloud database through the API gateway or SDK, sending data requests.
2. The API gateway/SDK translates the requests and interacts with the cloud database service.
3. The cloud database service manages data persistence, organization, and retrieval.
4. Data is stored within the cloud provider's storage infrastructure.

### Key Differences

- Management: On-premises databases require in-house expertise for setup, configuration, maintenance, and backups. Cloud databases are managed services, with the provider handling these aspects, freeing up IT resources.
- Scalability: On-premises databases require manual scaling of hardware resources, while cloud databases offer elastic scaling, automatically adjusting to meet changing needs.
- Security: Both options require security measures like access control and encryption. However, cloud providers often have robust security infrastructure and compliance certifications.

## The Convergence of AI and Database Architectures

The integration of Artificial Intelligence (AI) into database architectures heralds a new era of intelligent data management solutions. AI technologies such as machine learning and natural language processing augment database functionality by enabling automated data analysis, prediction, and decision-making. These advancements not only streamline operations but also unlock new avenues for optimizing database performance and reliability.

### Intelligent Query Optimization

In the realm of intelligent query optimization, AI-powered techniques are changing how databases handle complex queries.
By analyzing workload patterns and system resources in real time, AI algorithms dynamically adjust query execution plans to enhance efficiency and minimize latency. This proactive approach ensures optimal performance, even in the face of fluctuating workloads and evolving data structures.

### Predictive Maintenance

Predictive maintenance, empowered by AI, transforms how organizations manage database health and stability. By leveraging historical data and predictive analytics, AI algorithms forecast potential system failures or performance bottlenecks before they occur. This foresight enables proactive maintenance strategies, such as resource allocation and system upgrades, mitigating downtime and optimizing database reliability.

### Autonomous Databases

Autonomous databases represent the pinnacle of AI-driven innovation in database architectures. These systems leverage AI algorithms to automate routine tasks, including performance tuning, security management, and data backups. By autonomously optimizing database configurations and addressing security vulnerabilities in real time, autonomous databases minimize operational overhead and enhance system reliability. This newfound autonomy allows organizations to focus on strategic initiatives rather than routine maintenance tasks, driving innovation and efficiency across the enterprise.

## Looking Towards the Future: Trends and Challenges

As the trajectory of database architectures unfolds, a spectrum of trends and challenges demands our attention:

### Edge Computing

The proliferation of Internet of Things (IoT) devices and the rise of edge computing architectures herald a shift towards decentralized data processing. This necessitates the development of distributed database solutions capable of efficiently managing and analyzing data at the network edge, optimizing latency and bandwidth usage while ensuring real-time insights and responsiveness.

### Data Privacy and Security

In an era of burgeoning data volumes, the preservation of data privacy and security assumes paramount importance (Jonny Bairstow, 2024). As regulatory frameworks tighten and cyber threats escalate, organizations must navigate the intricate landscape of data governance to ensure compliance with stringent regulations and fortify defenses against evolving security vulnerabilities, safeguarding sensitive information from breaches and unauthorized access.

### Federated Data Management

The proliferation of disparate data sources across diverse systems and platforms underscores the need for federated data management solutions. Federated database architectures offer a cohesive framework for seamless integration and access to distributed data sources, facilitating interoperability and enabling organizations to harness the full spectrum of their data assets for informed decision-making and actionable insights.

### Quantum Databases

The advent of quantum computing heralds paradigm shifts in database architectures, promising exponential leaps in computational power and algorithmic efficiency. Quantum databases, leveraging the principles of quantum mechanics, hold the potential to revolutionize data processing by enabling faster computations and more sophisticated analytics for complex data sets. As quantum computing matures, organizations must prepare to embrace these transformative capabilities, harnessing quantum databases to unlock new frontiers in data-driven innovation and discovery.

## Conclusion

The evolution of database architectures mirrors the relentless march of technological progress.
From the rigid structures of traditional RDBMS to the flexibility of NoSQL databases and the scalability of cloud-based solutions, databases have adapted to meet the evolving needs of data-intensive applications. Moreover, the integration of AI augments database functionality, paving the way for more intelligent and automated data management solutions. As we navigate the future, addressing emerging challenges and embracing innovative technologies will be essential in shaping the next generation of database architectures.

## References

- Kaushikkumar Patel (2024). "Mastering Cloud Scalability: Strategies, Challenges, and Future Directions: Navigating Complexities of Scaling in Digital Era."
- Hichem Moulahoum and Faezeh Ghorbanizamani (2024). "Navigating the development of silver nanoparticles based food analysis through the power of artificial intelligence."
- D. Dhinakaran, S.M. Udhaya Sankar, D. Selvaraj, and S. Edwin Raja (2024). "Privacy-Preserving Data in IoT-based Cloud Systems: A Comprehensive Survey with AI Integration."
- Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, et al. (2024). "AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences."
- Jonny Bairstow (2024). "Navigating the Confluence: Big Data Analytics and Artificial Intelligence - Innovations, Challenges, and Future Directions."
A new paradigm has emerged in software engineering and architecture, reshaping how we approach data storage and manipulation: polyglot persistence. Let's draw an analogy to human language skills to unravel this concept. Imagine a polyglot, fluent in multiple languages. This individual can effortlessly switch between Portuguese, French, English, and others, choosing the best language for communication depending on the context, such as speaking Portuguese with Brazilians or French with French people. This flexibility dramatically increases their ability to communicate effectively and opens up numerous opportunities that would not be accessible if they were monolingual.

Polyglot persistence embodies a similar principle in data management. It refers to the strategy of using different data storage technologies to handle varying data storage needs most effectively. Just as a multilingual person selects the most suitable language for a conversation, a software engineer or architect leveraging polyglot persistence chooses the most appropriate database technology based on the specific requirements of each part of their application. This approach acknowledges a crucial truth: no single database can be the best at everything. Like human languages, each database type has nuances, strengths, and weaknesses. Relational databases might excel at complex transactional systems, whereas NoSQL databases like MongoDB or Cassandra might better handle large volumes of unstructured data. Graph databases like Neo4J shine when managing highly interconnected data, and key-value stores like Redis offer unmatched speed for simple queries.

## Understanding the Law of the Instrument

As succinctly expressed by Abraham Maslow in 1966, the Law of the Instrument states that "if the only tool you have is a hammer, it is tempting to treat everything as if it were a nail." This metaphor cautions against the tendency to rely excessively on familiar tools or methods, highlighting a cognitive bias toward a narrow approach to problem-solving and innovation. It emphasizes the limitations of adhering strictly to a single technology or approach, regardless of the complexity or variety of the task at hand. In software engineering and database management, this principle illuminates the dangers of relying solely on a specific database technology, such as SQL, for every application need. The essence of the Law of the Instrument is a reminder for professionals to cultivate a multilingual mindset, recognizing that each problem may require a distinct solution or a combination of technologies. By expanding our toolkit to encompass diverse programming languages, frameworks, and database systems, we enhance our ability to choose the most effective solution for each unique challenge, thereby improving our applications' robustness, scalability, and efficiency.

## Polyglot Persistence Unleashed

Unlock the power of polyglot persistence with "Polyglot Persistence Unleashed." This comprehensive guide takes you on a transformative journey, illustrating the integration of MongoDB, Cassandra, Neo4J, Redis, and Couchbase within enterprise Java architecture. It delves deep into NoSQL databases, Jakarta EE, and MicroProfile, empowering you with the prowess to architect and implement sophisticated data storage solutions for robust and scalable Java applications. From in-depth exploration to practical examples, optimization strategies, and pioneering insights, this book is your ultimate guide to revolutionizing data management in your Java applications.
## Unlocking Data Access Excellence With Jakarta Data

The Jakarta Data specification is a beacon of innovation for Java developers. It offers a potent API that effortlessly bridges the diverse worlds of relational and NoSQL databases, and it fosters seamless integration of data access and manipulation, adhering to a domain-centric architecture that simplifies persistence complexities. Explore how Jakarta Data can transform your application's data management, streamlining data operations and elevating your project's overall efficiency and performance.

## Conclusion

The adoption of polyglot persistence is not just a trend but a necessity in the age of versatile and demanding software applications. It echoes Abraham Maslow's Law of the Instrument, reminding us that relying solely on familiar tools (or databases, in this context) limits our ability to solve problems creatively and effectively. As software engineers and architects, expanding our database technology repertoire allows us to tailor our solutions more precisely to our application's needs, much like a polyglot who can easily navigate different cultural contexts. In embracing polyglot persistence, we unlock a world of possibilities, enabling more dynamic, efficient, and scalable solutions.
Extra panel in the link: https://turnoff.us/geek/too-many-indexes/#extra_panel
User-defined functions (UDFs) are a very useful feature supported in SQL++ (see the UDF documentation). Couchbase 7.6 introduces improvements that allow for more debuggability and visibility into UDF execution. This blog will explore two new features in Couchbase 7.6 in the world of UDFs:

- Profiling for SQL++ statements executed in JavaScript UDFs
- EXPLAIN FUNCTION to access query plans of SQL++ statements within UDFs

The examples in this blog require the travel-sample dataset to be installed; see the documentation on installing sample buckets.

## Profiling SQL++ Executed in JavaScript UDFs

Query profiling is a debuggability feature that SQL++ offers. When profiling is enabled for a statement's execution, the result of the request includes a detailed execution tree with timing and metrics for each step of the statement's execution. In addition to being returned in the results of the statement, the profiling information can also be accessed for the request in the `system:active_requests` and `system:completed_requests` system keyspaces. To dive deeper, see request profiling in SQL++.

In Couchbase 7.0, profiling was extended to subqueries, including subqueries within inline UDFs. However, in versions before Couchbase 7.6, profiling did not extend to SQL++ statements within JavaScript UDFs. In those earlier versions, to profile statements within a JavaScript UDF, the user had to open the function's definition, individually run each statement within the UDF, and collect their profiles. This additional step is no longer needed in 7.6.0! Now, when profiling is enabled, if the statement contains JavaScript UDF execution, profiles for all SQL++ statements executed in the UDF are also collected. This UDF-related profiling information is available in the request output and in the `system:active_requests` and `system:completed_requests` system keyspaces as well.

### Example 1

Create a JavaScript UDF "js1" in a global library "lib1" via the REST endpoint or via the UI:

```javascript
function js1() {
    var query = SELECT * FROM default:`travel-sample`.inventory.airline LIMIT 1;
    var res = [];
    for (const row of query) {
        res.push(row);
    }
    query.close();
    return res;
}
```

Create the corresponding SQL++ function:

```sql
CREATE FUNCTION js1() LANGUAGE JAVASCRIPT AS "js1" AT "lib1";
```

Execute the UDF with profiling enabled (in the cbq shell, profiling is typically turned on with `\SET -profile timings;`):

```sql
EXECUTE FUNCTION js1();
```

The response to the statement above will contain the following. In the `profile` section of the returned response, the `executionTimings` subsection contains a field `~udfStatements`:

- `~udfStatements`: An array of profiling information that contains an entry for every SQL++ statement within the JavaScript UDF.

Every entry within the `~udfStatements` section contains:

- `executionTimings`: The execution tree for the statement, with metrics and timing information for every step of the statement's execution.
- `statement`: The statement string.
- `function`: The name of the function where the statement was executed; this is helpful for identifying the UDF that executed the statement when there are nested UDF executions.
JavaScript { "requestID": "2c5576b5-f01d-445f-a35b-2213c606f394", "signature": null, "results": [ [ { "airline": { "callsign": "MILE-AIR", "country": "United States", "iata": "Q5", "icao": "MLA", "id": 10, "name": "40-Mile Air", "type": "airline" } } ] ], "status": "success", "metrics": { "elapsedTime": "20.757583ms", "executionTime": "20.636792ms", "resultCount": 1, "resultSize": 310, "serviceLoad": 2 }, "profile": { "phaseTimes": { "authorize": "12.835µs", "fetch": "374.667µs", "instantiate": "27.75µs", "parse": "251.708µs", "plan": "9.125µs", "primaryScan": "813.249µs", "primaryScan.GSI": "813.249µs", "project": "5.541µs", "run": "27.925833ms", "stream": "26.375µs" }, "phaseCounts": { "fetch": 1, "primaryScan": 1, "primaryScan.GSI": 1 }, "phaseOperators": { "authorize": 2, "fetch": 1, "primaryScan": 1, "primaryScan.GSI": 1, "project": 1, "stream": 1 }, "cpuTime": "468.626µs", "requestTime": "2023-12-04T20:30:00.369+05:30", "servicingHost": "127.0.0.1:8091", "executionTimings": { "#operator": "Authorize", "#planPreparedTime": "2023-12-04T20:30:00.369+05:30", "#stats": { "#phaseSwitches": 4, "execTime": "1.918µs", "servTime": "1.125µs" }, "privileges": { "List": [] }, "~child": { "#operator": "Sequence", "#stats": { "#phaseSwitches": 2, "execTime": "2.208µs" }, "~children": [ { "#operator": "ExecuteFunction", "#stats": { "#itemsOut": 1, "#phaseSwitches": 4, "execTime": "22.375µs", "kernTime": "20.271708ms" }, "identity": { "name": "js1", "namespace": "default", "type": "global" } }, { "#operator": "Stream", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 2, "execTime": "26.375µs" }, "serializable": true } ] }, "~udfStatements": [ { "executionTimings": { "#operator": "Authorize", "#stats": { "#phaseSwitches": 4, "execTime": "2.626µs", "servTime": "7.166µs" }, "privileges": { "List": [ { "Priv": 7, "Props": 0, "Target": "default:travel-sample.inventory.airline" } ] }, "~child": { "#operator": "Sequence", "#stats": { "#phaseSwitches": 2, "execTime": "4.375µs" }, "~children": [ { "#operator": "PrimaryScan3", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 7, "execTime": "22.082µs", "kernTime": "1.584µs", "servTime": "791.167µs" }, "bucket": "travel-sample", "index": "def_inventory_airline_primary", "index_projection": { "primary_key": true }, "keyspace": "airline", "limit": "1", "namespace": "default", "optimizer_estimates": { "cardinality": 187, "cost": 45.28617059639748, "fr_cost": 12.1780009122802, "size": 12 }, "scope": "inventory", "using": "gsi" }, { "#operator": "Fetch", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 10, "execTime": "18.376µs", "kernTime": "797.542µs", "servTime": "356.291µs" }, "bucket": "travel-sample", "keyspace": "airline", "namespace": "default", "optimizer_estimates": { "cardinality": 187, "cost": 192.01699202888378, "fr_cost": 24.89848658838975, "size": 204 }, "scope": "inventory" }, { "#operator": "InitialProject", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 7, "execTime": "5.541µs", "kernTime": "1.1795ms" }, "discard_original": true, "optimizer_estimates": { "cardinality": 187, "cost": 194.6878862611588, "fr_cost": 24.912769445246838, "size": 204 }, "preserve_order": true, "result_terms": [ { "expr": "self", "star": true } ] }, { "#operator": "Limit", "#stats": { "#itemsIn": 1, "#itemsOut": 1, "#phaseSwitches": 4, "execTime": "6.25µs", "kernTime": "333ns" }, "expr": "1", "optimizer_estimates": { "cardinality": 1, "cost": 24.927052302103924, "fr_cost": 24.927052302103924, "size": 204 } }, { "#operator": 
"Receive", "#stats": { "#phaseSwitches": 3, "execTime": "10.324833ms", "kernTime": "792ns", "state": "running" } } ] } }, "statement": "SELECT * FROM default:`travel-sample`.inventory.airline LIMIT 1;", "function": "default:js1" } ], "~versions": [ "7.6.0-N1QL", "7.6.0-1847-enterprise" ] } } } Query Plans With EXPLAIN FUNCTION SQL++ offers another wonderful capability to access the plan of a statement with the EXPLAIN statement. However, the EXPLAIN statement does not extend to plans of statements within UDFs, neither inline nor JavaScript UDFs. In earlier versions, to analyze the query plans for SQL++ within a UDF, it would require the user to open the function’s definition and individually run an EXPLAIN on all the statements within the UDF. These extra steps will be minimized in Couchbase 7.6 with the introduction of a new statement: EXPLAIN FUNCTION. This statement does exactly what EXPLAIN does, but for SQL++ statements within a UDF. Let’s explore how to use the EXPLAIN FUNCTION statement! Syntax explain_function ::= 'EXPLAIN' 'FUNCTION' function function refers to the name of the function. For more detailed information on syntax, please check out the documentation. Prerequisites To execute EXPLAIN FUNCTION, the user requires the correct RBAC permissions. To run EXPLAIN FUNCTION on a UDF, the user must have sufficient RBAC permissions to execute the function. The user must also have the necessary RBAC permissions to execute the SQL++ statements within the UDF function body as well. For more information, refer to the documentation regarding roles supported in Couchbase. Inline UDF EXPLAIN FUNCTION on an inline UDF will return the query plans of all the subqueries within its definition (see inline function documentation). Example 2: EXPLAIN FUNCTION on an Inline Function Create an inline UDF and run EXPLAIN FUNCTION on it. 
Inline UDF
EXPLAIN FUNCTION on an inline UDF will return the query plans of all the subqueries within its definition (see the inline function documentation).
Example 2: EXPLAIN FUNCTION on an Inline Function
Create an inline UDF and run EXPLAIN FUNCTION on it.
SQL
CREATE FUNCTION inline1() { ( SELECT * FROM default:`travel-sample`.inventory.airport WHERE city = "Zachar Bay" ) };
SQL
EXPLAIN FUNCTION inline1();
The results of the above statement will contain:
function: The name of the function on which EXPLAIN FUNCTION was run
plans: An array of plan information that contains an entry for every subquery within the inline UDF
JavaScript
{ "function": "default:inline1", "plans": [ { "cardinality": 1.1176470588235294, "cost": 25.117642854609013, "plan": { "#operator": "Sequence", "~children": [ { "#operator": "IndexScan3", "bucket": "travel-sample", "index": "def_inventory_airport_city", "index_id": "2605c88c115dd3a2", "index_projection": { "primary_key": true }, "keyspace": "airport", "namespace": "default", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 12.200561852726496, "fr_cost": 12.179450078755286, "size": 12 }, "scope": "inventory", "spans": [ { "exact": true, "range": [ { "high": "\\"Zachar Bay\\"", "inclusion": 3, "index_key": "`city`", "low": "\\"Zachar Bay\\"" } ] } ], "using": "gsi" }, { "#operator": "Fetch", "bucket": "travel-sample", "keyspace": "airport", "namespace": "default", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.082370508382763, "fr_cost": 24.96843677065826, "size": 249 }, "scope": "inventory" }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Filter", "condition": "((`airport`.`city`) = \\"Zachar Bay\\")", "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.100006681495888, "fr_cost": 24.98421650449632, "size": 249 } }, { "#operator": "InitialProject", "discard_original": true, "optimizer_estimates": { "cardinality": 1.1176470588235294, "cost": 25.117642854609013, "fr_cost": 24.99999623833438, "size": 249 }, "result_terms": [ { "expr": "self", "star": true } ] } ] } } ] }, "statement": "select self.* from `default`:`travel-sample`.`inventory`.`airport` where ((`airport`.`city`) = \\"Zachar Bay\\")" } ] }
JavaScript UDF
SQL++ statements within JavaScript UDFs can be of two types, as listed below. EXPLAIN FUNCTION works differently based on the way the SQL++ statement is called. Refer to the documentation to learn more about calling SQL++ in JavaScript functions.
1. Embedded SQL++
Embedded SQL++ is "embedded" in the function body, and its detection is handled by the JavaScript transpiler. EXPLAIN FUNCTION can return query plans for embedded SQL++ statements.
2. SQL++ Executed by the N1QL() Function Call
SQL++ can also be executed by passing a statement as a string argument to the N1QL() function. When the function is parsed for SQL++ statements to explain, the dynamic string in the function argument cannot be reliably resolved until runtime. For this reason, EXPLAIN FUNCTION does not return the query plans for SQL++ statements executed via N1QL() calls; instead, it returns the line numbers where the N1QL() function calls are made, counted from the beginning of the function definition. The user can then map these line numbers to the actual function definition and investigate further.
Example 3: EXPLAIN FUNCTION on an External JavaScript Function
Create a JavaScript UDF "js2" in a global library "lib1" via the REST endpoint or via the UI.
JavaScript
function js2() {
    // SQL++ executed by a N1QL() function call
    var query1 = N1QL("UPDATE default:`travel-sample` SET test = 1 LIMIT 1");
    // Embedded SQL++
    var query2 = SELECT * FROM default:`travel-sample` LIMIT 1;
    var res = [];
    for (const row of query2) {
        res.push(row);
    }
    query2.close();
    return res;
}
Create the corresponding SQL++ function.
SQL
CREATE FUNCTION js2() LANGUAGE JAVASCRIPT AS "js2" AT "lib1";
Run EXPLAIN FUNCTION on the SQL++ function.
SQL
EXPLAIN FUNCTION js2;
The results of the statement above will contain:
function: The name of the function on which EXPLAIN FUNCTION was run
line_numbers: An array of line numbers, counted from the beginning of the JavaScript function definition, where N1QL() function calls are made
plans: An array of plan information that contains an entry for every embedded SQL++ statement within the JavaScript UDF
JavaScript
{ "function": "default:js2", "line_numbers": [ 4 ], "plans": [ { "cardinality": 1, "cost": 25.51560885530435, "plan": { "#operator": "Authorize", "privileges": { "List": [ { "Target": "default:travel-sample", "Priv": 7, "Props": 0 } ] }, "~child": { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "Sequence", "~children": [ { "#operator": "PrimaryScan3", "index": "def_primary", "index_projection": { "primary_key": true }, "keyspace": "travel-sample", "limit": "1", "namespace": "default", "optimizer_estimates": { "cardinality": 31591, "cost": 5402.279801258844, "fr_cost": 12.170627071041082, "size": 11 }, "using": "gsi" }, { "#operator": "Fetch", "keyspace": "travel-sample", "namespace": "default", "optimizer_estimates": { "cardinality": 31591, "cost": 46269.39474997121, "fr_cost": 25.46387878667884, "size": 669 } }, { "#operator": "Parallel", "~child": { "#operator": "Sequence", "~children": [ { "#operator": "InitialProject", "discard_original": true, "optimizer_estimates": { "cardinality": 31591, "cost": 47086.49704894546, "fr_cost": 25.489743820991595, "size": 669 }, "preserve_order": true, "result_terms": [ { "expr": "self", "star": true } ] } ] } } ] }, { "#operator": "Limit", "expr": "1", "optimizer_estimates": { "cardinality": 1, "cost": 25.51560885530435, "fr_cost": 25.51560885530435, "size": 669 } } ] }, { "#operator": "Stream", "optimizer_estimates": { "cardinality": 1, "cost": 25.51560885530435, "fr_cost": 25.51560885530435, "size": 669 }, "serializable": true } ] } }, "statement": "SELECT * FROM default:`travel-sample` LIMIT 1 ;" } ] }
Constraints
If the N1QL() function has been aliased in a JavaScript function definition, EXPLAIN FUNCTION will not be able to return the line numbers where the aliased function was called. Example of such a function definition:
JavaScript
function js3() {
    var alias = N1QL;
    var q = alias("SELECT 1");
}
If the UDF contains nested UDF executions, EXPLAIN FUNCTION does not support generating the query plans of SQL++ statements within these nested UDFs.
Summary
Couchbase 7.6 introduces new features to debug UDFs that help users easily peek into UDF execution.
Helpful References
1. JavaScript UDFs: A guide to JavaScript UDFs; Creating an external UDF
2. EXPLAIN statement
The Paramount+ streaming platform has outdone itself this NFL season, shattering viewership records during the AFC Championship Game and now the Super Bowl, hailed as the "most-watched telecast in history" with 123.4 million average viewers. Over 200 million people tuned in to the game at some point, approximately ⅔ of the population of the United States. It also set a new benchmark as the most-streamed Super Bowl ever. In anticipation of this level of interest, Paramount+ finalized its migration to a multi-region architecture early in 2023. Since then, the streaming platform has been operating across multiple regions in Google Cloud and running on a distributed SQL database that functions across multiple distant locations. Prior to this, the database tier posed the biggest architectural challenge, prompting the team to begin the search for a multi-master distributed database:
"Paramount+ was hosted on a single master (aka read/write) database. A single vertically scaled master database can only carry us forward so far. While the team considered sharding the data and spreading it out, our past experience taught us that this would be a laborious process. We started looking for a new multi-master capable database with the criteria that we had to make sure that we stick to a relational database due to the existing nature of the application. This narrowed down the criteria, and after some internal research and POCs we narrowed it down to a new player in the database space called YugabyteDB." - Quote from a Paramount+ team member
So how can you achieve this level of app scalability and high availability across multiple regions? In this blog, I'll use a sample application to analyze how services like Paramount+ can scale in a multi-region setup.
The Key Component to Getting a Multi-Region Architecture Right
Scaling the application tier across multiple regions is usually a no-brainer: pick the most suitable cloud regions, deploy application instances there, and use a global load balancer to automatically route and load balance user requests. Things get more complicated when dealing with a multi-region database deployment, especially for transactional applications requiring low latency and data consistency. It is possible to achieve global data consistency by deploying a database instance with a single primary that handles all the user read and write requests. However, this approach means that only users near the cloud region hosting the database (US East, in this example) will experience low latency for read-write requests. Users farther from the database's cloud region will face higher latency since their requests travel longer distances. Additionally, an outage in the server, data center, or region hosting the database can make the application unavailable. Therefore, getting the database right is crucial when designing a multi-region service or application. Now, let's experiment using YugabyteDB, the distributed database Paramount+ used for the Super Bowl and their global streaming platform.
Two YugabyteDB Design Patterns for Multi-Region Applications
YugabyteDB is a distributed SQL database built on PostgreSQL, essentially a distributed version of PostgreSQL. Usually, the database is deployed in a multi-node configuration spanning several servers, availability zones, data centers, or regions. The database shards data across all nodes and then distributes the load by having all the nodes process read and write requests. Transactional consistency is ensured with the Raft consensus protocol, which replicates changes synchronously among the cluster nodes.
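Because YugabyteDB presents itself as PostgreSQL, you can inspect the cluster with plain SQL. For example, the yb_servers() function (the same function YugabyteDB smart drivers use for connection load balancing) lists the nodes and their placement. A minimal sketch, with the column list trimmed for brevity:
SQL
-- Lists the cluster's nodes along with their cloud/region/zone placement;
-- smart drivers use this information to balance connections across nodes.
SELECT host, cloud, region, zone FROM yb_servers();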
In multi-region database deployments, the latency between regions has the biggest impact on application performance. While there is no one-size-fits-all solution for multi-region deployments (with YugabyteDB or any other distributed transactional database), you can pick from several design patterns for global applications and configure your database so that it works best for your application workloads. YugabyteDB offers eight commonly used design patterns to balance read-write latency with two key aspects of highly available systems: the recovery time objective (RTO) and recovery point objective (RPO). Now, let's review two of the design patterns from our list of eight — global database and follower reads — by looking into the latency of our sample multi-region application.
Design Pattern #1: Global Database
The global database design pattern assumes that a database is spread across multiple (i.e., three or more) regions or zones. If there's a failure in one zone/region, the nodes in the other regions/zones will detect the outage within seconds (RTO) and continue serving application workloads without any loss of data (RPO=0). With YugabyteDB, you can reduce the number of cross-region requests by defining a preferred region. All the shard/Raft leaders will be located in the preferred region, delivering low-latency reads for the users near that region and predictable latency for those farther away. I provisioned a three-node YugabyteDB cluster across US East, Central, and West, with the US East region configured as the preferred region. Each region hosts an application instance that is connected to the node in the preferred region (US East). In this configuration, the round-trip latency between an application instance and the database varies with distance from the preferred region. For example, the app instance in US East is 5 ms away from the preferred region, while the instance in US West is 65 ms away. The US West and Central app instances are not connected directly to the database nodes in their local regions, because those nodes would still automatically route all requests to the leaders in the preferred region. Our sample application is a movie recommendation service that takes user questions in plain English and uses a generative AI stack (OpenAI, Spring AI, and the PostgreSQL pgvector extension) to provide users with relevant movie recommendations. Suppose you are in the mood for a space adventure movie with an unexpected ending. You connect to the movie recommendation service and send the following API request:
Shell
http GET {app_instance_address}:80/api/movie/search \ prompt=='a movie about a space adventure with an unexpected ending' \ rank==7 \ X-Api-Key:superbowl-2024
The application performs a vector similarity search by comparing an embedding generated for the prompt parameter to the embeddings of the movie overviews stored in the database.
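Under the hood, a search like this can be expressed with pgvector's distance operators. Here is a minimal sketch, assuming a hypothetical movie table with an overview_vector embedding column and a :prompt_vector parameter bound by the application; the actual query and schema used by the sample app may differ:
SQL
-- <=> is pgvector's cosine distance operator; 1 - distance gives cosine similarity.
-- :prompt_vector is the embedding generated for the user's prompt (assumed parameter name).
SELECT id, title, overview, release_date, vote_average
FROM movie
WHERE 1 - (overview_vector <=> :prompt_vector) >= 0.7  -- presumably what the rank==7 parameter maps to
ORDER BY overview_vector <=> :prompt_vector
LIMIT 3;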
It then identifies the most relevant movies and sends back the following response in JSON format:
JSON
{ "movies": [ { "id": 157336, "overview": "Interstellar chronicles the adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage.", "releaseDate": "2014-11-05", "title": "Interstellar", "voteAverage": 8.1 }, { "id": 49047, "overview": "Dr. Ryan Stone, a brilliant medical engineer on her first Shuttle mission, with veteran astronaut Matt Kowalsky in command of his last flight before retiring. But on a seemingly routine spacewalk, disaster strikes. The Shuttle is destroyed, leaving Stone and Kowalsky completely alone-tethered to nothing but each other and spiraling out into the blackness of space. The deafening silence tells them they have lost any link to Earth and any chance for rescue. As fear turns to panic, every gulp of air eats away at what little oxygen is left. But the only way home may be to go further out into the terrifying expanse of space.", "releaseDate": "2013-09-27", "title": "Gravity", "voteAverage": 7.3 }, { "id": 13475, "overview": "The fate of the galaxy rests in the hands of bitter rivals. One, James Kirk, is a delinquent, thrill-seeking Iowa farm boy. The other, Spock, a Vulcan, was raised in a logic-based society that rejects all emotion. As fiery instinct clashes with calm reason, their unlikely but powerful partnership is the only thing capable of leading their crew through unimaginable danger, boldly going where no one has gone before. The human adventure has begun again.", "releaseDate": "2009-05-06", "title": "Star Trek", "voteAverage": 7.4 } ], "status": { "code": 200, "success": true } }
The response speed of the application and the read latency of this API call depend on which application instance received and processed your request: If the request originates from US East, the latency can be as low as 9 ms, since the database leaders are just a few milliseconds away from the US East-based application instance. However, latency is much higher for the application instances in US Central and West. This is because they must perform the vector similarity search on the US East database leaders and then receive and process a large result set with detailed information about the suggested movies.
Note: The numbers above are not meant as the baseline of a performance benchmark. I ran a simple experiment on commodity VMs with a handful of shared vCPUs and didn't perform any optimizations for software stack components. The results were just a quick functional test of this multi-region deployment.
Now, what if you want the application to generate movie recommendations at low latency regardless of the users' location? How can you achieve low-latency reads across all regions? YugabyteDB supports several design patterns that can achieve this, including follower reads.
Design Pattern #2: Follower Reads
The follower reads pattern lets the application instances in secondary regions read from local nodes/followers instead of going to the database leaders in the preferred region. This makes reads as fast as if the leaders were local, although the followers may not hold the most current data at the time of the request. To use this pattern, I had to:
1. Connect the application instances from US Central and West to the database nodes in their respective regions.
2. Enable follower reads by setting the following flags for the database session.
SQL
SET session characteristics as transaction read only; SET yb_read_from_followers = true;
With this configuration, read latency is similar across all the regions. There might be a 30 ms data lag on the database node in US Central and a 65 ms data lag on the US West node. Why? My multi-region cluster is configured with a replication factor of 3, which means a transaction is considered committed once two nodes out of three confirm the changes. So, if the US East and Central nodes have acknowledged a transaction, the US West node might still be recording the change, which explains the lag during follower reads.
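If your application can tolerate only a bounded amount of staleness, YugabyteDB also lets you cap how far behind follower reads may be. A minimal sketch, assuming the yb_follower_read_staleness_ms session parameter (which defaults to 30 seconds):
SQL
-- Follower reads return data as of (now - staleness bound);
-- lowering the bound trades a little latency for fresher data.
SET session characteristics as transaction read only;
SET yb_read_from_followers = true;
SET yb_follower_read_staleness_ms = 15000;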
Despite potential data lags, the entire data set on the followers always remains in a consistent state (across all tables and other database objects). YugabyteDB ensures data consistency through its transactional sub-system and the Raft consensus protocol, which replicates changes synchronously across the entire multi-region cluster. Now, let's use follower reads to send the same HTTP request to the US Central and West instances:
Shell
http GET {app_instance_address}:80/api/movie/search \ prompt=='a movie about a space adventure with an unexpected ending' \ rank==7 \ X-Api-Key:superbowl-2024
This time, the read latency across all the regions is consistently low and comparable.
Note: The application instance in US East doesn't need to use the follower reads pattern as long as it can work directly with the leaders in the preferred region.
A Quick Note on Multi-Region Writes
So far, we've used the global database with the preferred region and follower reads design patterns to ensure low-latency reads across distant locations. This configuration can tolerate region-level outages with an RTO measured in seconds and RPO=0 (no data loss). The tradeoff in this configuration is write latency. Since YugabyteDB has to keep a consistent copy of data across all regions, cross-region latency affects the time needed for the Raft consensus protocol to synchronize changes across all locations. For example, suppose you want to watch the movie "Interstellar." You add it to your watch list with the following API call to the movie recommendation service (note: 157336 is Interstellar's internal ID):
Shell
http PUT {app_instance_address}:80/api/library/add/157336 X-Api-Key:superbowl-2024
The latency in my application setup was as follows: Write latency was lowest for requests originating from the US East-based application instance directly connected to the database node in the preferred region (US East). Latency for writes from the other locations was higher because their requests had to travel to the leaders in the preferred region before a transaction could be executed and replicated across the entire cluster. Does this mean that write latency is always high in a multi-region configuration? Not necessarily. YugabyteDB offers several design patterns that allow you to achieve low-latency reads and writes in a multi-region setting. One such pattern is latency-optimized geo-partitioning, where user data is pinned to the locations closest to the users, resulting in single-digit millisecond latency for both reads and writes.
Summary
Paramount+'s successful transition to a multi-region architecture shows that, with the right design patterns, you can build applications that tolerate region-level outages, scale, and perform at low latencies across distant locations. The Paramount+ tech team learned the art of scaling by creating a streaming platform that accommodates millions of users during peak periods with low latency and uninterrupted service. Implementing a multi-region setup correctly is essential. If you pick the right design pattern, you, too, can build multi-region applications that scale and tolerate all sorts of outages.
Docker has become an essential tool for developers, offering consistent and isolated environments without installing full-fledged products locally. The ideal setup for microservice development using Spring Boot with MySQL as the backend often involves a remotely hosted database. However, for rapid prototyping or local development, running a MySQL container through Docker offers a more streamlined approach. I encountered a couple of issues while attempting to set up this configuration with the help of Docker Desktop for a proof of concept. An online search revealed a lack of straightforward guides on integrating Spring Boot microservices with MySQL in Docker Desktop; most resources focus primarily on containerizing the Spring Boot application. Recognizing this gap, I decided to write this short article.
Prerequisites
Before diving in, we must have the following:
A foundational understanding of Spring Boot and microservices architecture
Familiarity with Docker containers
Docker Desktop installed on our machine
Docker Desktop Setup
We can install Docker Desktop using this link. Installation is straightforward, and the steps can be navigated efficiently, as illustrated in the accompanying screenshots.
Configuring the MySQL Container
When we launch Docker Desktop for the first time after installation, it walks us through some standard questions, and we can skip the registration part. Once the desktop app is ready, we need to search for the MySQL image, as shown below: We need to click Pull and then Run the container. Once we run the container, the settings dialog will pop up, as shown below. Enter the settings as follows:
MYSQL_ROOT_PASSWORD: This environment variable specifies the password that will be set for the MySQL root superuser account.
MYSQL_DATABASE: This environment variable allows us to specify the name of a database to be created on image startup. If a user/password was supplied (see below), that user will be granted superuser access (corresponding to GRANT ALL) to this database.
MYSQL_USER, MYSQL_PASSWORD: These variables are used to create a new user and set that user's password. This user will be granted superuser permissions for the database specified by the MYSQL_DATABASE variable.
Upon running the container, Docker Desktop displays logs indicating the container's status. We can now connect to the MySQL instance using tools like MySQL Workbench to manage database objects.
Spring Application Configuration
In the Spring application, we can add the following settings to application.properties.
Properties files
spring.esign.datasource.jdbc-url=jdbc:mysql://localhost:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false spring.esign.datasource.username=e-sign spring.esign.datasource.password=Password1
We opted for a custom prefix, spring.esign, over the default spring.datasource for our database configuration within the Spring Boot application. This approach shines in scenarios where the application requires connections to multiple databases.
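Note that the e-sign schema and user referenced in these properties must already exist. If they were not created through the container's MYSQL_DATABASE/MYSQL_USER environment variables, we can create them manually, for example from MySQL Workbench. A minimal sketch matching the credentials above:
SQL
-- Values mirror the application.properties entries shown above.
CREATE DATABASE IF NOT EXISTS `e-sign`;
CREATE USER IF NOT EXISTS 'e-sign'@'%' IDENTIFIED BY 'Password1';
GRANT ALL PRIVILEGES ON `e-sign`.* TO 'e-sign'@'%';
FLUSH PRIVILEGES;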
To enable this custom configuration, we need to define the Spring Boot configuration class ESignDbConfig:
Java
@Configuration
@EnableTransactionManagement
@EnableJpaRepositories(
    entityManagerFactoryRef = "eSignEntityManagerFactory",
    transactionManagerRef = "eSignTransactionManager",
    basePackages = "com.icw.esign.repository")
public class ESignDbConfig {

    @Bean("eSignDataSource")
    @ConfigurationProperties(prefix = "spring.esign.datasource")
    public DataSource geteSignDataSource() {
        return DataSourceBuilder.create().type(HikariDataSource.class).build();
    }

    @Bean(name = "eSignEntityManagerFactory")
    public LocalContainerEntityManagerFactoryBean eSignEntityManagerFactory(
            EntityManagerFactoryBuilder builder,
            @Qualifier("eSignDataSource") DataSource dataSource) {
        return builder.dataSource(dataSource).packages("com.icw.esign.dao").build();
    }

    @Bean(name = "eSignTransactionManager")
    public PlatformTransactionManager eSignTransactionManager(
            @Qualifier("eSignEntityManagerFactory") EntityManagerFactory entityManagerFactory) {
        return new JpaTransactionManager(entityManagerFactory);
    }
}
@Bean("eSignDataSource"): This method defines a Spring bean for the eSign module's data source. The @ConfigurationProperties(prefix="spring.esign.datasource") annotation automatically maps and binds all configuration properties starting with spring.esign.datasource from the application's configuration files (like application.properties or application.yml) to this DataSource object. The method uses DataSourceBuilder to create and configure a HikariDataSource, a highly performant JDBC connection pool. This means the eSign module uses a dedicated database whose connection parameters are isolated from other modules or the main application database.
@Bean(name = "eSignEntityManagerFactory"): This method creates a LocalContainerEntityManagerFactoryBean, which is responsible for creating the EntityManagerFactory that manages the JPA entities specific to the eSign module. The EntityManagerFactory is configured to use the eSignDataSource for its database operations and to scan the package com.icw.esign.dao for entity classes. Only entities in this package or its subpackages will be managed by this EntityManagerFactory and can thus access the eSign database.
@Bean(name = "eSignTransactionManager"): This defines a PlatformTransactionManager bound to the eSign module's EntityManagerFactory. This transaction manager ensures that all database operations performed by entities managed by the eSignEntityManagerFactory are wrapped in transactions. It enables the application to manage transaction boundaries, roll back operations on failures, and commit changes when operations succeed.
Repository
Now that we have defined the configuration, we can create repository classes and build the other objects required for the API endpoint.
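The repository below retrieves document eSign information through the MySQL stored procedure p_get_doc_esign_info. The article does not show the procedure's body, so here is a minimal sketch of what it might look like; the table and column names are assumptions:
SQL
-- Hypothetical table and columns; the result rows are mapped to the DocESignMaster entity.
DELIMITER //
CREATE PROCEDURE p_get_doc_esign_info(IN v_doc_uuid VARCHAR(36))
BEGIN
    SELECT * FROM doc_esign_master WHERE doc_uuid = v_doc_uuid;
END //
DELIMITER ;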
Java
@Repository
public class ESignDbRepository {

    private static final Logger logger = LoggerFactory.getLogger(ESignDbRepository.class);

    private static final String P_GET_DOC_ESIGN_INFO = "p_get_doc_esign_info";

    @Qualifier("eSignEntityManagerFactory")
    @Autowired
    private EntityManager entityManager;

    @Autowired
    ObjectMapper objectMapper;

    public List<DocESignMaster> getDocumentESignInfo(String docUUID) {
        StoredProcedureQuery proc =
                entityManager.createStoredProcedureQuery(P_GET_DOC_ESIGN_INFO, DocESignMaster.class);
        proc.registerStoredProcedureParameter("v_doc_uuid", String.class, ParameterMode.IN);
        proc.setParameter("v_doc_uuid", docUUID);
        try {
            return (List<DocESignMaster>) proc.getResultList();
        } catch (PersistenceException ex) {
            logger.error("Error while fetching document eSign info for docUUID: {}", docUUID, ex);
        }
        return Collections.emptyList();
    }
}
@Qualifier("eSignEntityManagerFactory"): Specifies which EntityManagerFactory should be used to create the EntityManager, ensuring that the correct database configuration is used for eSign operations.
Conclusion
Integrating Spring Boot microservices with Docker Desktop streamlines development and testing. This guide walked through the essential steps of setting up a Spring Boot application and ensuring seamless communication with a MySQL container hosted in Docker Desktop. This quick setup is useful for proofs of concept or for an isolated local development environment.
Abhishek Gupta
Principal Developer Advocate,
AWS
Artem Ervits
Principal Sales Engineer,
Snowflake
Otavio Santana
Award-winning Software Engineer and Architect,
OS Expert
Denis Magda
Head of DevRel,
Yugabyte