Cloud Architecture Resources

DZone's Featured Cloud Architecture Resources

You Can Shape Trend Reports — Participate in DZone Research Surveys + Enter the Raffles!

By Caitlin Candelmo

Hello, DZone Community! We have several surveys in progress as part of our research for upcoming Trend Reports. We would love for you to join us by sharing your experiences and insights (anonymously if you choose) — readers just like you drive the content that we cover in our Trend Reports. you can find details for each research survey below Over the coming months, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Data Engineering Research As a continuation of our annual data-related research, we're consolidating our database, data pipeline, and data and analytics scopes into a single 12-minute survey that will guide help the narratives of our July Database Systems Trend Report and data engineering report later in the year. Our 2024 Data Engineering Survey explores: Database types, languages, and use cases Distributed database design + architectures Data observability, security, and governance Data pipelines, real-time processing, and structured storage Vector data and databases + other AI-driven data capabilities Join the Data Engineering Research You'll also have the chance to enter the $500 raffle at the end of the survey — five random people will be drawn and will receive $100 each (USD)! Cloud and Kubernetes Research This year, we're combining our annual cloud native and Kubernetes research into one 10-minute survey that dives further into these topics as they relate to both one another and at the intersection of security, observability, AI, and more. DZone's research will be informing these Trend Reports: May – Cloud Native: Championing Cloud Development Across the SDLC September – Kubernetes in the Enterprise Our 2024 Cloud Native Survey covers: Microservices, container orchestration, and tools/solutions Kubernetes use cases, pain points, and security measures Cloud infrastructure, costs, tech debt, and security threats AI for release management + monitoring/observability Join the Cloud Native Research Don't forget to enter the $750 raffle at the end of the survey! Five random people will be selected to each receive $150 (USD). Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Publications team More

Why Cloud Native Is Vital to Your Organization's APIs: The Impact Could Be More Than Expected

By John Vester

CORE

Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices. A recent conversation with a fellow staff engineer at a Top 20 technology company revealed that their underlying infrastructure is self-managed and does not leverage cloud-native infrastructure offered by major providers like Amazon, Google, or Microsoft. Hearing this information took me a minute to comprehend given how this conflicts with my core focus on leveraging frameworks, products, and services for everything that doesn't impact intellectual property value. While I understand the pride of a Top 20 technology company not wanting to contribute to the success of another leading technology company, I began to wonder just how successful they could be if they utilized a cloud-native approach. That also made me wonder how many other companies have yet to adopt a cloud-native approach… and the impact it is having on their APIs. Why Cloud? Why Now? For the last 10 years, I have been focused on delivering cloud-native API services for my projects. While cloud adoption continues to gain momentum, a decent percentage of corporations and technology providers still utilize traditional on-premises designs. According to The Cloud in 2021: Adoption Continues report by O'Reilly Media, Figure 1 provides a summary of the state of cloud adoption in December 2021. Figure 1. Cloud technology usage Image adapted from The Cloud in 2021: Adoption Continues, O'Reilly Media Since the total percentages noted in Figure 1 exceed 100%, the underlying assumption is that it is common for respondents to maintain both a cloud and on-premises design. However, for those who are late to enter the cloud native game, I wanted to touch on some common benefits that are recognized with cloud adoption: Focus on delivering or enhancing laser-focused APIs — stop worrying about and managing on-premises infrastructure. Scale your APIs up (and down) as needed to match demand — this is a primary use case for cloud adoption. Reduce risk by expanding your API presence — leverage availability zones, regions, and countries. Describe the supporting API infrastructure as code (IaC) — faster recovery and expandability into new target locations. Making the transition toward cloud native has become easier than ever, with the major providers offering free or discounted trial periods. Additionally, smaller platform-as-a-service (PaaS) providers like Heroku and Render provide solutions that allow teams to focus on their products and services and not worry about the underlying infrastructure design. The Cloud Native Impact on Your API Since this Trend Report is focused on modern API management, I wanted to focus on a few of the benefits that cloud native can have on APIs. Availability and Latency Objectives When providing APIs for your consumers to consume, the concept of service-level agreements (SLAs) is a common onboarding discussion topic. This is basically where expectations are put into easy-to-understand wording that becomes a binding contract between the API provider and the consumer. Failure to meet these expectations can result in fees and, in some cases, legal action. API service providers often take things a step further by establishing service-level objectives (SLOs) that are even more stringent. The goal here is to establish monitors and alerts to remediate issues before they breach contractual SLAs. But what happens when the SLOs and SLAs struggle to be met? This is where the primary cloud native use case can assist. If the increase in latency is due to hardware limitations, the service can be scaled up vertically (by increasing the hardware) or horizontally (by adding more instances). If the increase in latency is driven by geographical location, introducing service instances in closer regions is something cloud native providers can provide to remedy this scenario. API Management As your API infrastructure expands, a cloud-native design provides the necessary tooling to ease supportability and manageability efforts. From an infrastructure perspective, the underlying definition of the service is defined using an IaC approach, allowing the service itself to become defined in a single location. As updates are made to that base design, those changes can be rolled out to each target service instance, avoiding any drift between service instances. From an API management perspective, cloud native providers include the necessary tooling to manage the APIs from a usage perspective. Here, API keys can be established, which offer the ability to impose thresholds on requests that can be made or features that align with service subscription levels. Cloud Native !== Utopia While APIs flourish in cloud native implementations, it is important to recognize that a cloud-native approach is not without its own set of challenges. Cloud Cost Management CloudZero's The State Of Cloud Cost Intelligence 2022 report concluded that only 40% of respondents indicated that their cloud costs were at an expected level as noted in Figure 2. Figure 2. Cloud native cost realities Image adapted from The State Of Cloud Cost Intelligence, CloudZero This means that 60% of respondents are dealing with higher-than-expected cloud costs, which ultimately impact an organization's ability to meet planned objectives. Cloud native spending can often be remediated by adopting the following strategies: Require team-based tags or cloud accounts to help understand levels of spending at a finer grain. Focus on storage buckets and database backups to understand if the cost is in line with the value. Engage a cloud business partner that specializes in cloud spending analysis. Account Takeover The concept of accounts becoming "hacked" is prevalent in social media. At times, I feel like my social media feed contains more "my account was hacked" messages than the casual updates I was tuning in to read. Believe it or not, the concept of account takeover is becoming a common fear for cloud native adopters. Imagine starting your day only to realize you no longer have access to any of your cloud-native services. Soon thereafter, your customers begin to flood your support lines to ask what is going on… and where the data they were expecting to see with each API call is. Another potential consequence is that the APIs are shut down completely, forcing customers to seek out competing APIs. Remember, your account protection is only as strong as your weakest link. Make sure to employ everything possible to protect your account and move away from simple username + password account protection. Disaster Recovery It is also important to recognize that cloud native is not a replacement for maintaining a strong disaster recovery posture. Understand the impact of availability zone and region-wide outages — both are expected to happen. Plan to implement immutable backups — avoid relying on traditional backups and snapshots. Leverage IaC to establish all aspects of cloud native — and test it often. Alternative Flows Exist While a cloud-native approach provides an excellent landscape to help your business and partnerships be successful, there are likely use cases that present themselves as alternative flows for cloud native adoption: Regulatory requirements for a given service can often present themselves as an alternative flow and not be a candidate for cloud native adoption. Point of presence requirements can also become a blocker for cloud native adoption when the closest cloud-native location is not close enough to meet the established SLAs and SLOs. On the Other Side of API Cloud Adoption By adopting a cloud-native approach, it is possible to extend an API across multiple availability zones and geographical regions within a given point of presence. Figure 3. Multi-region cloud native adoption In Figure 3, each region contains an API service instance in three different geographical regions. Additionally, each region contains an API service instance running in three different availability zones — each with its own network and power source. In this example, there are nine distinct instances running across the United States. By introducing a global common name, consumers always receive a service response from the least-latent and available service instance. This approach easily allows for entire regions to be taken offline for disaster recovery validation without any interruptions of service at the consumer level. Conclusion Readers familiar with my work may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional: Focus your time on delivering features/functionality that extend the value of your intellectual property. Leverage frameworks, products, and services for everything else. —John Vester When I think about my conversion with the staff engineer at the Top 20 tech company, I wonder how much more successful his team would be without having to worry about the underlying infrastructure being managed with their on-premises approach. While the other side of cloud native is not without challenges, it does adhere to my mission statement. As a result, projects that I have worked on for the last 10 years have been able to remain focused on meeting the needs of API consumers while staying in line with corporate objectives. From an API perspective, cloud native offers additional ways to adhere to my personal mission statement by describing everything related to the service using IaC and leveraging built-in tooling to manage the APIs across different availability zones and regions. Have a really great day! This is an excerpt from DZone's 2024 Trend Report, Modern API Management: Connecting Data-Driven Architectures Alongside AI, Automation, and Microservices.Read the Free Report More

Decoding Synchronous and Asynchronous Communication in Cloud-Native Applications

By Gaurav Gaur

CORE

Optimize Your Machine Learning Batch Inference Jobs Using AWS

By Sai Sharanya Nalla

Analyze and Debug Quarkus-Based AWS Lambda Functions With X-Ray

By Jeroen Reijn

CORE

Kubernetes: Future or Fad?

Kubernetes has pretty much become synonymous with container orchestration, and all competition is either assimilated (or rewritten to be based on Kubernetes, see Openshift), or basically disappeared into the void (sorry Nomad). Does that mean that development will slow down now? Or is it the beginning of something even bigger? Maybe Kubernetes is on the verge of becoming a generic name, just like Kleenex, or maybe a verb like “to google.” Years ago, somebody asked me in an interview what I think of Docker, and if I see a future for containerization. At the time my answer was quick and easy. First of all, containerization wasn’t a new concept. BSD, Solaris, as well as other systems, had them for years. They were kind of new to Linux though (at least in a widespread fashion), so they were here to stay. It was the next logical evolutionary step to virtualization. Docker, however, at least in my mind, was different. Towards Docker, I simply answered, “It’s the best tool we have today, but I hope it won’t be the final solution we can come up with.” While Docker turned around and is just coming back, the tooling we use today is unanimously built upon the specs defined by the open container initiative (OCI) and its OCI image format. So what will the future hold for Kubernetes? Is it going to stay or is it going to step into the abyss and will it “just be another platform that was replaced by something else,” as Michael Levan wrote in The Future of Kubernetes. The Rise of Kubernetes Jokingly, when looking up container orchestration in the dictionary, you’ll probably find the synonym Kubernetes, but it took Kubernetes about a decade to get to where it is today. Initially built by Google on the key learnings of Borg, Google’s internal container orchestration platform, Kubernetes was released in September 2014. By the release of Kubernetes, Borg itself was already over a decade old. By 2013 many of the original team members of the Borg started to look into the next step. Project 7 was born. At its release, Kubernetes was still using Docker underneath. A combination that most probably helped elevate Kubernetes' popularity. Docker was extremely popular at the time, but people started to find insufficiencies when trying to organize and run large numbers of containers. Kubernetes was about to fix it. With its concepts of building blocks, independently deployed and the actors (or agents) it was easy to extend but still understandable. In addition, resources are declarative by nature (written as JSON or YAML files), which enables version control of those definitions. Ever since the release, Kubernetes enabled more and more use cases, hence more companies started using it. I think a major step for the adoption was the release of Helm in 2016 which simplified the deployment process for more complex applications and enabled an “out of the box” experience. Now Kubernetes was “easy” (don’t quote me on easy though!). Today every cloud provider and their mothers offer a managed Kubernetes environment. Due to a set of standard interfaces, those services are mostly interchangeable. One of the big benefits of Kubernetes. Anyhow, it’s only mostly. The small inconsistencies and implementations or performance differences may give you quite the time of your life. Not a good one though. But let’s call it “run everywhere” because it mostly is. A great overview of the full history of Kubernetes, with all major milestones, can be found in The History of Kubernetes by Ferenc Hámori. Kubernetes, Love or Hate When we look into the community, opinions on Kubernetes diverge. Many people point out the internal (yet hidden) complexity of Kubernetes. A complexity that only increases with new features and additional functionality (also third-party) being added. This complexity is real, a reason why folks like Kelsey Hightower or Abdelfettah Sghiouar call Kubernetes a platform to build platforms (listen to our Cloud Commute podcast episode with Abdelfettah), meaning it should be used by the cloud providers (or company internal private cloud teams) to build a platform for container deployment, but it shouldn’t be used by the developers or just everyone. However, Kelsey also claimed that Kubernetes is a good place to start, not the endgame. Kelsey Hightower on Kubernetes being a platform to build platforms On the other end of the spectrum, you have people who refer to Kubernetes as the operating system of the cloud area. And due to the extensibility and feature richness, they’re probably not too far off. Modern operating systems have mostly one job, abstract away the underlying hardware and its features. That said, Kubernetes abstracts away many aspects of the cloud infrastructure and the operational processes necessary to run containers. In that sense, yes, Kubernetes is probably a Cloud OS. Especially since we started to see implementations of the Kubernetes running on operating systems other than Linux. Looking at you Microsoft. If you’re interested in learning more about the idea of Kubernetes as a Cloud OS, Natan Yellin from Robusta Dev wrote a very insightful article named Why Kubernetes will sustain the next 50 years. What Are the Next Steps? The most pressing question for Kubernetes as it stands today, is what will be next? Where will it evolve? Are we close to the end of the line? Looking back at Borg, a decade in, Google decided it was time to reiterate on orchestration and build upon the lessons learned. Kubernetes is about to hit its 10-year anniversary soon. So does that mean it’s time for another iteration? Many features in Kubernetes, such as secrets, were fine 10 years ago. Today we know that an encoded “secret” is certainly not enough. Temporary user accounts, OIDC, and similar technologies can and are integrated into Kubernetes already, increasing the complexity of it. Looking beyond Kubernetes, technology always runs in three phases, the beginning or adoption phase, the middle where everyone “has” to use it, and the end, where companies start to phase it out. Personally, I feel that Kubernetes is at its height right now, standing right in the middle. That doesn’t give us any prediction about the time frame for hitting the end though. At the moment it looks like Kubernetes will keep going and growing for a while. It doesn’t show any signs of slowing down. Other technologies, like micro virtual machines, using kata containers or Firecracker, are becoming more popular, offering higher isolation (hence security), but aren’t as efficient. The important element though, they offer a CRI-compatible interface. Meaning, they can be used as an alternative runtime underneath Kubernetes. In the not-too-distant future, I see Kubernetes offering multiple runtime environments, just as it offers multiple storage solutions today. Enabling running simple services in normal containers, but moving services with higher isolation needs to micro VMs. And there are other interesting developments, based on Kubernetes, too. Edgeless Systems implements a confidential computing solution, provided as a Kubernetes distribution named Constellation. Confidential computing makes use of CPU and GPU features that help to hardware-encrypt memory, not only for the whole system memory space, but per virtual machine, or even per container. That enables a whole set of new use cases, with end-to-end encryption for highly confidential calculations and data processes. While it’s possible to use it outside Kubernetes, the orchestration and operation benefits of running those calculations inside containers, make them easy to deploy and update. If you want to learn more about Constellation, we had Moritz Eckert from Edgeless Systems in our podcast not too long ago. Future or Fad? So, does Kubernetes have a bright future and will stand for the next 50 years, or will we realize that it is not what we’re looking for very soon-ish? If somebody would ask me today what I think about Kubernetes, I think I would answer similarly to my Docker answer. It is certainly the best tool we have today, making it the to-go container orchestration tool of today. Its ever-increasing complexity makes it hard to see the same in the future though. I think there are a lot of new lessons learned again. It’s probably time for a new iteration. Not today, not tomorrow, but somewhere in the next few years. Maybe this new iteration isn’t an all-new tool, but Kubernetes 2.0, who knows - but something has to change. Technology doesn’t stand still, the (container) world is different from what it was 10 years ago. If you asked somebody at the beginning of containerization, it was all about how containers have to be stateless, and what we do today. We deploy databases into Kubernetes, and we love it. Cloud-nativeness isn’t just stateless anymore, but I’d argue a good one-third of the container workloads may be stateful today (with ephemeral or persistent state), and it will keep increasing. The beauty of orchestration, automatic resource management, self-healing infrastructure, and everything in between is just too incredible to not use for “everything.” Anyhow, whatever happens to Kubernetes itself (maybe it will become an orchestration extension of the OCI?!), I think it will disappear from the eyes of the users. It (or its successor) will become the platform to build container runtime platforms. But to make that happen, debug features need to be made available. At the moment you have to look way too deep into Kubernetes or agent logs to find out and fix issues. The one who never had to find out why a Let’s Encrypt certificate isn’t updating may raise a hand now. To bring it to a close, Kubernetes certainly isn’t a fad, but I strongly hope it's not going to be our future either. At least not in its current incarnation.

By Christoph Engelbert

3 Ways Blockchain Reinforces Data Integrity in the Cloud

People initially became interested in blockchain several years ago after learning about it as a decentralized digital ledger. It supports transparency because no one can change information stored on it once added. People can also watch transactions as they happen, further enhancing visibility. But how does blockchain support the integrity of cloud-stored data? 3 Ways Blockchain Supports the Integrity of Cloud-Stored Data 1. Protecting and Facilitating the Sharing of Medical Records Technological advancements have undoubtedly improved the ease of sharing medical records between providers. When patients go to new healthcare facilities, all involved parties can easily see those individuals’ histories, treatments, test results, and more. Such records keep everyone updated about what’s happened to patients, which significantly reduces the likelihood of redundancies and confusion that could extend a health management timeline. Cloud computing has also accelerated information-sharing efforts within healthcare and other industries. It allows medical professionals to access and collaborate through scalable platforms. Many healthcare workers also appreciate how they can access cloud apps from anywhere. That convenience supports physicians who must travel for continuing medical education events, travel nurses, surgeons who split their time between multiple hospitals, and others who often work from numerous locations. However, despite these cloud computing benefits, a security-related downside is platforms use a centralized infrastructure to allow record sharing across users. That characteristic leaves cloud tools open to data breaches. In one case, researchers proposed addressing this shortcoming with a blockchain architecture to authenticate users and enable opportunities for sharing medical records securely. The group prioritized blockchain due to its immutability while seeking to create a system that allowed patients and their providers to share and store medical records securely. The researchers also wanted to design something that was not at risk of data loss or other failures. The researchers implemented so-called “special recognition keys” to identify medical-related specifics, such as identifying doctors, patients, and hospitals. When testing their system, some metrics studied included the time to complete a transaction and how well the communication-related attributes performed. The outcomes suggested the researchers’ approach worked far better than existing solutions. 2. Improving Access Control Data breaches can be costly, catastrophic events. Although there’s no single solution for preventing them, people can make meaningful progress by focusing on access control. One of the most convenient things about the cloud is it allows all authorized users to access content regardless of their location. However, as the number of people engaging with a cloud platform increases, so does the risk of compromised credentials that could allow hackers to enter networks and wreak havoc. Many corporate leaders have prioritized cloud-first strategies. That approach can strengthen cybersecurity because service providers have numerous security features to supplement internal measures. Additionally, cloud-based backup capabilities facilitate faster data recovery if cyberattacks occur. However, research suggests some access control practices used by cloud administrators have significant shortcomings that could make cyberattacks more likely. For example, one study about access management for cloud platforms found 49% of administrators store passwords in a spreadsheet. That’s a huge security risk for many reasons, but it also highlights the need for better password hygiene practices. Fortunately, the blockchain is well-positioned to solve this problem. In one example, researchers developed a blockchain system that uses attribute-based encryption technology to improve how cloud users access content. The setup also contains an audit contract that dynamically manages who can use the cloud and when. The team’s creation built a fine-grained and searchable system that maintained access control by strengthening cloud security and getting the desired results without excessive computing power. Results also showed this system increased storage capacity. When the group performed a security analysis on their blockchain creation, they found it stopped chosen-plaintext attacks and cybersecurity breaches based on guessed keywords. A theoretical examination and associated experiments suggested this tool worked better from a computing power and storage efficiency perspective than comparable alternatives. 3. Curbing Emerging Technologies’ Potential Threats Even as new technologies show tremendous progress and excite people about the future, some individuals specifically investigate how they could harm others through technological advancements. Developments associated with ChatGPT and other generative AI tools are excellent examples. Indeed, these chatbots can save people time by assisting them with tasks such as idea generation or outline creation. However, because these tools create believable-sounding paragraphs in seconds, some cybercriminals use generative artificial intelligence (genAI) chatbots to write phishing emails much faster than before. It’s easy to imagine the ramifications of a cybercriminal who writes a convincing phishing message and uses it to access someone’s cloud-stored information. ChatGPT runs on a cloud platform built by OpenAI, which created the chatbot. A lesser-known issue affecting data integrity is OpenAI uses interactions with the tool to train future versions of the algorithms. People can opt out of having their conversations become part of the training, but many people haven’t or don’t know the process for doing it. As workers eagerly tested ChatGPT and similar tools, some committed potential security breaches without realizing it. Consider if a web developer enters a proprietary code string into ChatGPT and asks the tool for help debugging it. That seemingly minor decision could result in sensitive information becoming part of training data and no longer being carefully protected by the developer’s employer. Some leaders quickly established rules for appropriate usage or banned generative AI tools to address these threats. A February 2024 study also showed some workers kept entering sensitive information when using ChatGPT despite knowing the associated risks. It’s still unclear how the blockchain will support data integrity for people using cloud-based generative AI tools, but many professionals are upbeat about the potential. Conclusion: Using Blockchain for Cloud Data Protection Entities ranging from government agencies to e-commerce stores use cloud platforms daily. These options are incredibly convenient because they eliminate geographical barriers and allow people to use them through an active internet connection anywhere in the world. However, many cloud tools store sensitive data, such as health records or payment details. Since cloud platforms hold such a wealth of information, hackers will likely continue targeting them. Although most cloud providers have built-in security features, cybercriminals continually seek ways to circumvent such protections. The examples here show why the blockchain is an excellent candidate for much-needed additional safeguards.

By Emily Newton

Operation and Network Administration Management of Telecom 5G Network Functions Using Openshift Kubernetes Tools

The world of Telecom is evolving at a rapid pace, and it is not just important, but crucial for operators to stay ahead of the game. As 5G technology becomes the norm, it is not just essential, but a strategic imperative to transition seamlessly from 4G technology (which operates on OpenStack cloud) to 5G technology (which uses Kubernetes). In the current scenario, operators invest in multiple vendor-specific monitoring tools, leading to higher costs and less efficient operations. However, with the upcoming 5G world, operators can adopt a unified monitoring and alert system for all their products. This single system, with its ability to monitor network equipment, customer devices, and service platforms, offers a reassuringly holistic view of the entire system, thereby reducing complexity and enhancing efficiency. By adopting a Prometheus-based monitoring and alert system, operators can streamline operations, reduce costs, and enhance customer experience. With a single monitoring system, operators can monitor their entire 5G system seamlessly, ensuring optimal performance and avoiding disruptions. This practical solution eliminates the need for a complete overhaul and offers a cost-effective transition. Let's dive deep. Prometheus, Grafana, and Alert Manager Prometheus is a tool for monitoring and alerting systems, utilizing a pull-based monitoring system. It scrapes, collects, and stores Key Performance Indicators (KPI) with labels and timestamps, enabling it to collect metrics from targets, which are the Network Functions' namespaces in the 5G telecom world. Grafana is a dynamic web application that offers a wide range of functionalities. It visualizes data, allowing the building of charts, graphs, and dashboards that the 5G Telecom operator wants to visualize. Its primary feature is the display of multiple graphing and dashboarding support modes using GUI (Graphical user interface). Grafana can seamlessly integrate data collected by Prometheus, making it an indispensable tool for telecom operators. It is a powerful web application that supports the integration of different data sources into one dashboard, enabling continuous monitoring. This versatility improves response rates by alerting the telecom operator's team when an incident emerges, ensuring a minimum 5G network function downtime. The Alert Manager is a crucial component that manages alerts from the Prometheus server via alerting rules. It manages the received alerts, including silencing and inhibiting them and sending out notifications via email or chat. The Alert Manager also removes duplications, grouping, and routing them to the centralized webhook receiver, making it a must-have tool for any telecom operator. Architectural Diagram Prometheus Components of Prometheus (Specific to a 5G Telecom Operator) Core component: Prometheus server scrapes HTTP endpoints and stores data (time series). The Prometheus server, a crucial component in the 5G telecom world, collects metrics from the Prometheus targets. In our context, these targets are the Kubernetes cluster that houses the 5G network functions. Time series database (TSDB): Prometheus stores telecom Metrics as time series data. HTTP Server: API to query data stored in TSDB; The Grafana dashboard can query this data for visualization. Telecom operator-specific libraries (5G) for instrumenting application code. Push gateway (scrape target for short-lived jobs) Service Discovery: In the world of 5G, network function pods are constantly being added or deleted by Telecom operators to scale up or down. Prometheus's adaptable service discovery component monitors the ever-changing list of pods. The Prometheus Web UI, accessible through port 9090, is a data visualization tool. It allows users to view and analyze Prometheus data in a user-friendly and interactive manner, enhancing the monitoring capabilities of the 5G telecom operators. The Alert Manager, a key component of Prometheus, is responsible for handling alerts. It is designed to notify users if something goes wrong, triggering notifications when certain conditions are met. When alerting triggers are met, Prometheus alerts the Alert Manager, which sends alerts through various channels such as email or messenger, ensuring timely and effective communication of critical issues. Grafana for dashboard visualization (actual graphs) With Prometheus's robust components, your Telecom operator's 5G network functions are monitored with diligence, ensuring reliable resource utilization, tracking performance, detection of errors in availability, and more. Prometheus can provide you with the necessary tools to keep your network running smoothly and efficiently. Prometheus Features The multi-dimensional data model identified by metric details uses PromQL (Prometheus Querying Language) as the query language and the HTTP Pull model. Telecom operators can now discover 5G network functions with service discovery and static configuration. The multiple modes of dashboard and GUI support provide a comprehensive and customizable experience for users. Prometheus Remote Write to Central Prometheus from Network Functions 5G Operators will have multiple network functions from various vendors, such as SMF (Session Management Function), UPF (User Plane Function), AMF (Access and Mobility Management Function), PCF (Policy Control Function), and UDM (Unified Data Management). Using multiple Prometheus/Grafana dashboards for each network function can lead to a complex and inefficient 5G network operator monitoring process. To address this, it is highly recommended that all data/metrics from individual Prometheus be consolidated into a single Central Prometheus, simplifying the monitoring process and enhancing efficiency. The 5G network operator can now confidently monitor all the data at the Central Prometheus's centralized location. This user-friendly interface provides a comprehensive view of the network's performance, empowering the operator with the necessary tools for efficient monitoring. Grafana Grafana Features Panels: This powerful feature empowers operators to visualize Telecom 5G data in many ways, including histograms, graphs, maps, and KPIs. It offers a versatile and adaptable interface for data representation, enhancing the efficiency and effectiveness of your data analysis. Plugins: This feature efficiently renders Telecom 5G data in real-time on a user-friendly API (Application Programming Interface), ensuring operators always have the most accurate and up-to-date data at their fingertips. It also enables operators to create data source plugins and retrieve metrics from any API. Transformations: This feature allows you to flexibly adapt, summarize, combine, and perform KPI metrics query/calculations across 5G network functions data sources, providing the tools to effectively manipulate and analyze your data. Annotations: Rich events from different Telecom 5G network functions data sources are used to annotate metrics-based graphs. Panel editor: Reliable and consistent graphical user interface for configuring and customizing 5G telecom metrics panels Grafana Sample Dashboard GUI for 5G Alert Manager Alert Manager Components The Ingester swiftly ingests all alerts, while the Grouper groups them into categories. The De-duplicator prevents repetitive alerts, ensuring you're not bombarded with notifications. The Silencer is there to mute alerts based on a label, and the Throttler regulates the frequency of alerts. Finally, the Notifier will ensure that third parties are notified promptly. Alert Manager Functionalities Grouping: Grouping categorizes similar alerts into a single notification system. This is helpful during more extensive outages when many 5G network functions fail simultaneously and when all the alerts need to fire simultaneously. The telecom operator will expect only to get a single page while still being able to visualize the exact service instances affected. Inhibition: Inhibition suppresses the notification for specific low-priority alerts if certain major/critical alerts are already firing. For example, when a critical alert fires, indicating that an entire 5G SMF (Session Management Function) cluster is not reachable, AlertManager can mute all other minor/warning alerts concerning this cluster. Silences: Silences are simply mute alerts for a given time. Incoming alerts are checked to match the regular expression matches of an active silence. If they match, no notifications will be sent out for that alert. High availability: Telecom operators will not load balance traffic between Prometheus and all its Alert Managers; instead, they will point Prometheus to a list of all Alert Managers. Dashboard Visualization Grafana dashboard visualizes the Alert Manager webhook traffic notifications as shown below: Configuration YAMLs (Yet Another Markup Language) Telecom Operators can install and run Prometheus using the configuration below: YAML prometheus: enabled: true route: enabled: {} nameOverride: Prometheus tls: enabled: true certificatesSecret: backstage-prometheus-certs certFilename: tls.crt certKeyFilename: tls.key volumePermissions: enabled: true initdbScriptsSecret: backstage-prometheus-initdb prometheusSpec: retention: 3d replicas: 2 prometheusExternalLabelName: prometheus_cluster image: repository: <5G operator image repository for Prometheus> tag: <Version example v2.39.1> sha: "" podAntiAffinity: "hard" securityContext: null resources: limits: cpu: 1 memory: 2Gi requests: cpu: 500m memory: 1Gi serviceMonitorNamespaceSelector: matchExpressions: - {key: namespace, operator: In, values: [<Network function 1 namespace>, <Network function 2 namespace>]} serviceMonitorSelectorNilUsesHelmValues: false podMonitorSelectorNilUsesHelmValues: false ruleSelectorNilUsesHelmValues: false Configuration to route scrape data segregated based on the namespace and route to Central Prometheus. Note: The below configuration can be appended to the Prometheus mentioned in the above installation YAML. YAML remoteWrite: - url: <Central Prometheus URL for namespace 1 by 5G operator> basicAuth: username: name: <secret username for namespace 1> key: username password: name: <secret password for namespace 1> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 1> action: keep - url: <Central Prometheus URL for namespace 2 by 5G operator> basicAuth: username: name: <secret username for namespace 2> key: username password: name: <secret password for namespace 2> key: password tlsConfig: insecureSkipVerify: true writeRelabelConfigs: - sourceLabels: - namespace regex: <namespace 2> action: keep Telecom Operators can install and run Grafana using the configuration below. YAML grafana: replicas: 2 affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: "app.kubernetes.io/name" operator: In values: - Grafana topologyKey: "kubernetes.io/hostname" securityContext: false rbac: pspEnabled: false # Must be disabled due to tenant permissions namespaced: true adminPassword: admin image: repository: <artifactory>/Grafana tag: <version> sha: "" pullPolicy: IfNotPresent persistence: enabled: false initChownData: enabled: false sidecar: image: repository: <artifactory>/k8s-sidecar tag: <version> sha: "" imagePullPolicy: IfNotPresent resources: limits: cpu: 100m memory: 100Mi requests: cpu: 50m memory: 50Mi dashboards: enabled: true label: grafana_dashboard labelValue: "Vendor name" datasources: enabled: true defaultDatasourceEnabled: false additionalDataSources: - name: Prometheus type: Prometheus url: http://<prometheus-operated>:9090 access: proxy isDefault: true jsonData: timeInterval: 30s resources: limits: cpu: 400m memory: 512Mi requests: cpu: 50m memory: 206Mi extraContainers: - name: oauth-proxy image: <artifactory>/origin-oauth-proxy:<version> imagePullPolicy: IfNotPresent ports: - name: proxy-web containerPort: 4181 args: - --https-address=:4181 - --provider=openshift # Service account name here must be "<Helm Release name>-grafana" - --openshift-service-account=monitoring-grafana - --upstream=http://localhost:3000 - --tls-cert=/etc/tls/private/tls.crt - --tls-key=/etc/tls/private/tls.key - --cookie-secret=SECRET - --pass-basic-auth=false resources: limits: cpu: 100m memory: 256Mi requests: cpu: 50m memory: 128Mi volumeMounts: - mountPath: /etc/tls/private name: grafana-tls extraContainerVolumes: - name: grafana-tls secret: secretName: grafana-tls serviceAccount: annotations: "serviceaccounts.openshift.io/oauth-redirecturi.first": https://[SPK exposed IP for Grafana] service: targetPort: 4181 annotations: service.alpha.openshift.io/serving-cert-secret-name: <secret> Telecom Operators can install and run Alert Manager using the configuration below. YAML alertmanager: enabled: true alertmanagerSpec: image: repository: prometheus/alertmanager tag: <version> replicas: 2 podAntiAffinity: hard securityContext: null resources: requests: cpu: 25m memory: 200Mi limits: cpu: 100m memory: 400Mi containers: - name: config-reloader resources: requests: cpu: 10m memory: 10Mi limits: cpu: 25m memory: 50Mi Configuration to route Prometheus Alert Manager data to the Operator's centralized webhook receiver. Note: The below configuration can be appended to the Alert Manager mentioned in the above installation YAML. YAML config: global: resolve_timeout: 5m route: group_by: ['alertname'] group_wait: 30s group_interval: 5m repeat_interval: 12h receiver: 'null' routes: - receiver: '<Network function 1>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 1>" - receiver: '<Network function 2>' group_wait: 10s group_interval: 10s group_by: ['alertname','oid','action','time','geid','ip'] matchers: - namespace="<namespace 2>" Conclusion The open-source OAM (Operation and Maintenance) tools Prometheus, Grafana, and Alert Manager can benefit 5G Telecom operators. Prometheus periodically captures all the status of monitored 5G Telecom network functions through the HTTP protocol, and any component can be connected to the monitoring as long as the 5G Telecom operator provides the corresponding HTTP interface. Prometheus and Grafana Agent gives the 5G Telecom operator control over the metrics the operator wants to report; once the data is in Grafana, it can be stored in a Grafana database as extra data redundancy. In conclusion, Prometheus allows 5G Telecom operators to improve their operations and offer better customer service. Adopting a unified monitoring and alert system like Prometheus is one way to achieve this.

By BINU SUDHAKARAN PILLAI

Best Practices for Migration of COTS Applications to Cloud

The first step in a Cloud Adoption journey for any enterprise is Application Portfolio Analysis. During this assessment, we see custom in-house (Bespoke) applications, Commercial-Off-The-Shelf (COTS) applications, Software-as-a-Service (SaaS) applications, etc. The constitution of these applications in the portfolio varies between enterprises and industries. As an outcome of the assessment, the applications are dispositioned into one of the seven common migration strategies (7-R’s of Migration: Retire, Retain, Refactor, Replatform, Repurchase, Rehost, and Relocate) and arrive at a roadmap for cloud migration. While COTS applications are generally perceived as low-hanging fruits during cloud migrations, they come with their own challenges. For example, the currency of technical stack (Operating System, Product Versions, Frameworks, Databases, etc.), managing licenses and adherence to security requirements in the cloud, etc. Understanding these challenges is critical to arriving at migration strategies. Let us deep dive a bit and uncover them. Challenges and Best Practices From our experiences in cloud migrations, we have observed some common challenges and best practices to mitigate them. These have helped us in successful migrations for many clients. Disposition COTS products are used by enterprises as a ready-made solution for their business needs. However, in many cases, these applications trail other evolving applications and become outdated or difficult to integrate with other modern applications. Some applications do not undergo any upgrades/enhancements and outlive the usual application lifecycle. It becomes challenging to migrate such applications and requires due diligence, stakeholder concurrence, etc. Best Practices Perform a comprehensive assessment, considering the business needs, performance, compatibility, risks, and cost-benefit study. Involving all stakeholders like Business, IT, Independent Software Vendor (ISV), etc. is important for the best outcome of this assessment. Based on the analysis, choose one of the following strategies. Rehost the application and database (lift-and-shift the virtual machine (as-is)) to the Infrastructure-as-a-Service (IaaS) compute instance on the cloud. Rehost the application to IaaS and re-platform the database to a Platform-as-a-Service (PaaS) instance on the cloud. Replatform to ISV Managed SaaS solution. This solution can be from the same ISV or from another ISV that provides an easy migration path to their solution. The following challenges will also contribute to the dispositioning of the COTS applications. ISV Support Pure COTS applications are usually straightforward to migrate provided the vendor supports and certifies the application to run on the target cloud platform. Some ISVs provide customized versions of their product to suit the business needs of an enterprise. For such applications, the Vendor has their share of ownership and responsibility to maintain them on the client's premises. While these applications are managed by an application team, the knowledge of the application, the intrinsic details, and the application roadmap is with the vendor. The ISVs have their own timelines and schedules for their releases, and they would require sufficient notice to have their personnel engaged to support the migration. This impacts the migration plan of the COTS application and its dependencies. Best Practices Start engaging with the ISV early in the migration journey preferably in the planning stage. Some ISVs would require a new professional services contract to be engaged for providing support. Understand the rules of engagement and ensure the contract clearly details the roles and responsibilities of the ISV. Some COTS products require certification by the ISV to run on the cloud. Understand the requirements and clearly document them as this would have cost implications. There are also scenarios when enterprises decide to migrate without vendor support due to cost and they have in-house expertise on the product. Target Cloud Support Most enterprises incorrectly assume the COTS product is compatible with any target platform. However, there are instances where they realize that the products do not work as intended after the migration journey and the vendor refuses to support the chosen platform/technology stack. Some examples are: The ISV might not have an immediate plan for supporting the target operating system or the target Database in the cloud. One of the objectives of moving to the cloud is to provide for High Availability and Disaster Recovery and some of the COTS products may not support front-ending by a Load Balancer or provide support for Clustering. Best Practices During assessment, determine the ISV support for the target technology stack. Some applications would require modifications/upgrades to support. It’s good to ask specific questions regarding the support like Does the COTS product support Windows 2022 in AWS Cloud? If yes, does it require modifications to support it? Does the COTS product which currently uses SQL Server on-premises support migration to RDS SQL Server in AWS Cloud? It’s also good to have the ISV involved in the design of the target architecture. Getting the target architecture approved by the ISV is a good practice. Security Requirements One of the common issues encountered during migrations is that COTS products do not adhere to the Security Policies defined by the Information Security(InfoSec) Team. For Example, privileged credentials used by COTS applications are often embedded in cleartext within application configuration files, database tables, scripts, etc. The credentials would not be changed frequently, and the same would also be used across multiple environments or in other applications as well. This is perceived as a security vulnerability by the InfoSec team. Best Practices Understand from the ISV if there is a direct integration available to a secure vault like Azure Key Vault, AWS Secrets Manager, HashiCorp Vault, etc. Alternatively, the ISV must provide a patch to encrypt the password stored locally on the server, or database. Provide for rotation of credentials based on policies like the criticality of the application or based on data sensitivity requirements. Some COTS products do not provide support for integration with vaults due to the legacy software stack and the application cannot be modified. In such cases, an exception from the InfoSec team is sought with a remediation plan. For example, a product upgrade, or enhancement, say within 6 months after the migrations, as it involves cost, potential change in integrations with other applications, etc. Licenses ISVs use different types of licensing models for their products. Some offer one-time, perpetual licenses, while others require enterprises to renew the licenses (subscription-based). Similarly, some licenses are tied to the server's metadata (IP address, MAC address, or hostname) while others are portable. During migrations, another common issue observed is licensing conflict. Inadequate licenses restrict the application from functioning on the cloud while the license is already tied to the running instance at on-premises or vice-versa. There could also be a change in the licensing model when the COTS product is moved to the cloud. For example, moving from an on-premises deployment to a SaaS model would require a move from a perpetual licensing model to a subscription-based one. Best Practices Understand the current licensing model, the number of licenses available, and request for additional licenses. Some ISVs provide temporary licenses that will allow the application to run simultaneously on-premises and on the target cloud platform. Understand if there are license checks done by the installed software. Some COTS applications send internet egress traffic for license validation. This would help in planning for firewall rules during migration. Team Organization and Co-ordination In the case of business-owned applications, the IT presence will be limited to providing platform support. Also, for customized COTS products, the ISV is a key participant and contributor in the migration. Involving them late in the play is a common mistake and it causes delay, the need for expedited engagement, and in turn proves to be an expensive affair. Along with identifying the contributors from IT (infra and database support), business (testers), etc. to support the migrations, placing the right ownership of actions on the ISV is also important. Best Practices ISV team should have tasks assigned in the project plan and it is necessary to communicate the tasks and the timeline to them as early as possible. ISV should have clearly defined responsibilities. For example, during the installation of the COTS software in the cloud, the application team would perform the installation with support from the ISV or the ISV themselves might perform the installation. These activities should be listed in the runbook against the ISV listed as task owner. ISVs might also require access to the cloud environment during migration or for later support. Access requirements for the ISV can be evaluated again during migration and provided for. Integrations While most modern COTS products support enhanced security controls, you will come across a few products that use non-secure ports or integration mechanisms for communication. For example, http ports (80, 8080) or FTP (21). In the cloud, one of the security controls enforced is the encryption of data in transit. Additionally, the other applications having an affinity with the COTS product may take a modernization path, involving a change in the framework(Struts to Spring Boot), data models(XML to JSON), etc. This may require some changes to the COTS product. Application Remediation for such enhancements would require a considerable amount of time and testing. Best Practices Start the identification of these integrations much earlier and work with the ISV for the changes to the COTS application. Factor efforts for comprehensive testing. Despite the identification of such requirements early in the migration, it’s a possibility that changes couldn’t be made to the COTS product due to timelines and various other factors. In such cases, it's normal to get an exception approval from InfoSec for allowing these ports. We can also understand from the ISV if there are automation possibilities like enabling CI/CD pipelines, configuration management, etc. that will reduce the manual effort and errors in deployment. Automation can also assist in enabling faster recovery in case of an outage. Data Migration Ensuring data remains secure during and after the migration is a significant challenge. This is required as enterprises must consider data encryption, access controls, and compliance requirements such as GDPR, HIPAA, or PCI-DSS. COTS applications often have large volumes of data stored in various formats and structures. Migrating this data to the cloud while maintaining its integrity and consistency can be complex, especially if the data is spread across multiple sources or databases. Best Practices Evaluate the options to encrypt the data in transit and at rest with the ISV and other stakeholders, as it may require changes to the product. Understand the complexity of the data and its structure by doing a thorough analysis. Work with the ISV if there are proprietary tools for migration of the data and obtain clearance for using the tool from the InfoSec organization. This is a longer process hence it is essential to be addressed very early in the migration life cycle. Plan for incremental data migration to ensure data integrity during cutover to the target. Other Challenges Containerization Containerizing a COTS product is a popular solution as the application can benefit from isolation, portability, scalability, and efficient utilization of resources. While the benefits are huge, this migration path is usually tricky because the ISVs may not have container images and even if they agree to build a container image, they may not have the resources to maintain images on a continuous basis. So, it’s necessary to understand these intricacies before proceeding with containerization. Refactor Refactor the COTS application to a custom application. This is usually perceived as a project by itself, driven by a strong business case and it involves considerable cost, time, and manpower. This could lead to build-vs-buy decisions or even buying and building (customizations). It's advised to take this route only when you have in-house knowledge about the application. Conclusion Every migration provides a lot of learnings and insights to carry forward into successive migrations. Based on our migration experiences for various enterprises across industries, we have shared the challenges and the mitigations that have helped us overcome them. As mentioned earlier, the number of COTS applications would vary across enterprises and industries and the challenges are similar. Addressing these challenges early in the migration cycle would help in the cloud journey while ensuring maximum benefits from the cloud for your application portfolio.

By Rakesh Rao

Implementing EKS Multi-Tenancy Using Capsule (Part 1)

What Is Multi-Tenancy? Tenancy enables users to share cluster infrastructure among: Multiple teams within the organization Multiple customers of the organization Multi-environments of the application Shared clusters save costs and simplify administration. Security and isolation are key factors to consider when cluster resources are to be shared. Two prominent isolation models to achieve multi-tenancy are hard and soft tenancy models. The key difference between these models lies in the level of isolation provided between tenants. Soft tenancy has a lower level of isolation and uses mechanisms like namespaces, quotas, and limits to restrict tenant access to resources and prevent them from interfering with each other while hard tenancy has stronger isolation. Often involves separate clusters or virtual machines for each tenant, with minimal shared resources. Kubernetes Native Services in Multi-Tenant Implementations Kubernetes has a built-in namespace model to create logical partitions of the cluster as isolated slices. Though basic levels of tenancy can be achieved, using namespaces has some limitations: Implementing advanced multi-tenancy scenarios, like Hierarchical Namespaces (HNS) or exposing Container as a Service (CaaS) becomes complicated because of the flat structure of Kubernetes namespaces. Namespaces have no common concept of ownership. Tracking and administration challenges persist if the team controls multiple namespaces. Enforcing resource quotas and limits fairly across all tenants requires additional effort. Only highly privileged users can create namespaces. This means that whenever a team wants a new namespace, they must raise a ticket to the cluster administrator. While this is probably acceptable for small organizations, it generates unnecessary toil as the organization grows. To solve this problem, Kubernetes provides the Hierarchical Namespace Controller (HNC), which allows the user to organize the namespaces into hierarchies. Namespaces are organized in a tree structure, where child namespaces inherit resources and policies from parent namespaces. While HNC supports a soft-tenancy approach leveraging existing namespaces, however, is a newer project still under incubation in the Kubernetes community. Other wide projects that provide similar capabilities are Capsule, Rafay, Kiosk, etc. In this article series, we will discuss implementing multi-tenant solutions using the Capsule framework. Capsule is a commercially supported open-source project that implements multi-tenancy using virtual control planes. Each tenant gets a dedicated control plane with its own API server and etcd instance, creating a virtualized Kubernetes cluster experience. Capsule is one of the recommended platforms by the Kubernetes community for multi-tenancy. Major components of the Capsule framework include: Capsule controller: Aggregates multiple namespaces in a lightweight abstraction called Tenant. Capsule policy engine: Achieves tenant isolation by the various Network and Security Policies, Resource Quota, Limit Ranges, RBAC, and other policies defined at the tenant level. A user who owns the tenant is called a Tenant Owner. There is a small contrast between the roles of a tenant owner and namespace administrator. Listed below are the roles and responsibilities of the cluster admin, the tenant owner, and the namespace administrator. Install Capsule Framework We will use the AWS EKS cluster to perform the exercise. This article assumes you have already created an EKS cluster "eks-cluster1" and the following software is already installed on your local machine. AWS CLI (Version 2) Kubectl (v1.21) Curl (8.1.2) Helm Charts (3.8.2) Go Lang (v1.20.6) Capsule can be installed in the two ways listed below: Using YAML Installer AWS CLI (Version 2) PowerShell aws eks --region us-east-1 update-kubeconfig --name eks-cluster1 kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/master/config/install.yaml If you face any error in applying the YAML file, re-running the same command should fix the problem. If you see the status of the pod as “ImagePullback” or “errImagePull,” delete the pod of deployment (not the deployment). Using Helm As a cluster admin or root user, run the following commands to install using Helm. PowerShell aws eks --region us-east-1 update-kubeconfig --name eks-cluster1 helm repo add clastix https://clastix.github.io/charts helm install capsule clastix/capsule -n capsule-system --create-namespace Verify Capsule Installation What gets installed with the Capsule framework: Namespace: capsule-system Deployments in Namespace: capsule-controller-manager Services Exposed: capsule-controller-manager-metrics-service capsule-webhook-service Secrets in Namespace: capsule-ca capsule-tls Webhooks: In Kubernetes, webhooks are a mechanism for external services to interact with the Kubernetes API server during the lifecycle of API requests. They act like HTTP callbacks, triggered at specific points in the request flow. This allows external services to perform validations or modifications on resources before they are persisted in the cluster. There are two main types of webhooks used in Kubernetes for admission control: Mutating Admission Webhooks and Validating Admission Webhooks. The following webhooks are installed: capsule-mutating-webhook-configuration capsule-validating-webhook-configuration Custom Resource Definitions (CRDs): CRDs allow the user to extend the API and introduce new types of resources beyond the built-in ones. Imagine them as blueprints for creating your own custom resources that can be managed alongside familiar resources like Deployments and Pods. The CRDs below are installed: capsuleconfigurations.capsule.clastix.io globaltenantresources.capsule.clastix.io tenantresources.capsule.clastix.io tenants.capsule.clastix.io Cluster Roles capsule-namespace-deleter capsule-namespace-provisioner Cluster Role Bindings capsule-manager-rolebinding capsule-proxy-rolebinding Follow the below steps to check if Capsule is installed properly: Login in to verify the following commands as a root user or cluster administrator. This should list ‘capsule-system’ namespace. PowerShell aws eks --region us-east-1 update-kubeconfig --name eks-cluster1 kubectl get ns Run the below commands to see capsule-related components. PowerShell kubectl -n capsule-system get deployments kubectl -n capsule-system get svc kubectl -n capsule-system delete deployment capsule-controller-manager kubectl get mutatingwebhookconfigurations kubectl get validatingwebhookconfigurations Get capsule CRDs installed. PowerShell kubectl get crds If any of the CRDs are missing, apply the respective kubectl command mentioned below. Please note the Capsule version in the said URL, your mileage may vary according to the desired upgrading version. PowerShell kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/v0.3.3/charts/capsule/crds/globaltenantresources-crd.yaml kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/v0.3.3/charts/capsule/crds/tenant-crd.yaml kubectl apply -f https://raw.githubusercontent.com/clastix/capsule/v0.3.3/charts/capsule/crds/tenantresources-crd.yaml View the clusterroles and rolesbindings by running the below commands kubectl get clusterrolebindings kubectl get clusterroles Verify the resource utilization of the framework. PowerShell kubectl -n capsule-system get pods kubectl top pod <<pod name>> -n capsule-system --containers The Capsule framework creates one pod replica. The CPU (cores) should be around 3m and Memory (bytes) around 26Mi. Verify the tenants available by running the below command as Cluster admin. The result should be “No Resources Found.” PowerShell kubectl get tenants Summary In this part, we have understood what multi-tenancy is, different types of tenant isolation models, challenges with Kubernetes native services, and installing the Capsule framework on AWS EKS. In the next part, we will further deep-dive into creating tenants and policy management.

By Phani Krishna Kollapur Gandla

Amazon Bedrock: Leveraging Foundation Models With Quarkus and AWS

Bedrock is the new Amazon service that democratizes the users' access to the most up-to-date Foundation Models (FM) made available by some of the highest-ranked AI actors. Their list is quite impressive and it includes but isn't limited to: Titan Claude Mistral AI Llama2 ... Depending on your AWS region, some of these FMs might not be available. For example, as per this post, in my region, which is eu-west-3 the only available FMs are Titan and Mistral AI, but things are changing very fast. So, what's the point of using this service which, apparently, doesn't do anything else than give you access to other FMs? Well, the added value of Amazon Bedrock is to expose via APIs all these FMs, giving you the opportunity to easily integrate generative AI in your applications, through ubiquitous techniques like Serverless or REST. This is what this post is trying to demonstrate. So, let's go! A Generative AI Gateway The project chosen in order to illustrate this post is showing a Generative AI Gateway, where the user is given access to a certain number of FMs, each one being specialized in a different type of use case like, for example, text generation, conversational interfaces, text summarization, image generation, etc. The diagram below shows the general architecture of the sample application. The sample application architecture diagram As you can see, the sample application consists of the following components: A web front-end that allows the user to select an FM, to configure its parameters, like the temperature, the max tokens, etc. and to start the dialog with it, for example asking questions. Our application being a Quarkus one, we are using here the quarkus-primefaces extension. An AWS REST Gateway that aims at exposing dedicated endpoints, depending on the chosen FM. Here we're using the quarkus-amazon-lambda-rest extension which, as you'll see soon, is able to automatically generate the SAM (Serverless Application Model) template required to deploy the REST Gateway to AWS. Several REST endpoints processing POST requests and aiming at invoking the chosen FM via a Bedrock client. The FM responses are brought back to our web application, through the REST Gateway. Let's look now in greater detail at the implementation. The REST Gateway The module bedrock-gateway-api of our Maven multi-module project, implements this component. It consists of a Quarkus RESTeasy API exposing several endpoints which are processing POST requests, having the user interaction as input parameters, and returning the FM responses. The input parameters are strings and, in the case where the user requests result in a really large amount of text, they are input files. The endpoints process these POST requests by converting the associated input into an FM-specific syntax, including the following parameters: The temperature: a real number between 0 and 1 which aims at influencing the FM's predictability. A lower value consists of a more predictable output while a higher one will generate a more random response. The top P: a real number between 0 and 1 whose value is supposed to select the most likely tokens in a distribution. A lower value results in a more limited number of choices for the response. The max-tokens: an integer value representing the maximum number of words that the FM will process for any given request. The Bedrock documentation is at your disposal in order to bring you all the required missing details concerning the parameters above. The Bedrock client used to interact with the FM service is instantiated as shown below: Java private final BedrockRuntimeAsyncClient client = BedrockRuntimeAsyncClient.builder().region(Region.EU_WEST_3).build(); This requires using the following Maven artifact: XML <dependency> <groupId>software.amazon.awssdk</groupId> <artifactId>bedrockruntime</artifactId> </dependency> There is a synchronous and an asynchronous Bedrock client and, given the relative latency generally associated with an FM invocation, we have chosen the 2nd one. The Web Front-End The Web front-end is a simple Jakarta Faces application implemented using the PrimeFaces library as well as the Facelets notation in order to define the layouts. If this architecture choice might surprise the reader more to JavaScript/TypeScript-based front-ends, then please have a look at this article. The only special thing to be noticed is the way it uses the Microprofile JAX-RS Client implementation by Quarkus to call the AWS REST Gateway. Java @RegisterRestClient @Path("/bedrock") @Produces(MediaType.TEXT_PLAIN) @Consumes(MediaType.APPLICATION_JSON) public interface BedrockAiEndpoint { @POST @Path("mistral2") Response callMistralFm (BedrockAiInputParam bedrockAiInputParam); @POST @Path("titan2") Response callTitanFm (BedrockAiInputParam bedrockAiInputParam); } This interface is all that's required, Quarkus will generate from it the associated implementation client class. Running the Sample Application The application can be run in two ways: Executing locally the AWS REST Gateway and the associated AWS Lambda endpoints; Executing in the cloud the AWS REST Gateway and the associated AWS Lambda endpoints. Running Locally The shell script named run-local.sh runs locally the AWS REST Gateway together with the associated AWS Lambda endpoints. Here is the code: Shell #!/bin/bash mvn -Durl=http://localhost:3000 clean install sed -i 's/java11/java17/g' bedrock-gateway-api/target/sam.jvm.yaml sam local start-api -t ./bedrock-gateway-api/target/sam.jvm.yaml --log-file ./bedrock-gateway-api/sam.log & mvn -DskipTests=false failsafe:integration-test docker run --name bedrock -p 8082:8082 --rm --network host nicolasduminil/bedrock-gateway-web:1.0-SNAPSHOT ./cleanup-local.sh The first thing that we need to do here is to build the application by running the Maven command. This will result, among others, in a Docker image named nicolasduminil/bedrock-gateway-web which is dedicated to run the web front-end. It also will result in the generation by Quarkus of the SAM template (target\sam.jvm.yam) that creates the AWS CloudFormation stack containing the AWS REST Gateway together with the endpoints AWS Lambda functions. For some reason, the Quarkus quarkus-amazon-lambda-rest extension used for this purpose configures the runtime as being Java 11 and, even after having contacted the support, I didn't find any way to change that. Accordingly, the sed command is used in the script to modify the runtime to be Java 17. Then, the sam cli is used to run the command start-api which will execute locally the gateway with the required endpoints. Next, we are in the position to run the integration tests, on behalf of the Maven failsafe plugin. We couldn't do it while initially running the build as the local stack wasn't deployed yet. Last but not least, the script starts a Docker container running the nicolasduminil/bedrock-gateway-web image, created previously by the quarkus-container-image-jib extension. This is our front end. Now, in order to test it, you can jump to the next section which explains how. Running in the Cloud The script named `deploy.sh`, shown below, deploys in the cloud our application: Shell #!/bin/bash mvn -pl bedrock-gateway-api -am clean install sed -i 's/java11/java17/g' bedrock-gateway-api/target/sam.jvm.yaml RANDOM=$$ BUCKET_NAME=bedrock-gateway-bucket-$RANDOM STACK_NAME=bedrock-gateway-stack echo $BUCKET_NAME > bucket-name.txt aws s3 mb s3://$BUCKET_NAME sam deploy -t bedrock-gateway-api/src/main/resources/template.yaml --s3-bucket $BUCKET_NAME --stack-name $STACK_NAME --capabilities CAPABILITY_IAM API_ENDPOINT=$(aws cloudformation describe-stacks --stack-name $STACK_NAME --query 'Stacks[0].Outputs[0].OutputValue' --output text) mvn -pl bedrock-gateway-web -Durl=$API_ENDPOINT clean install docker run --name bedrock -p 8082:8082 --rm --network host nicolasduminil/bedrock-gateway-web:1.0-SNAPSHOT This time things are a bit more complicated. The Maven build in the script's first line uses the -pl switch to select only the bedrock-gateway-api module. This is because, in this case, we don't know in advance the AWS RESY Gateway URL, which the other module, bedrock-gateway-web needs in order to it the Microprofile JAX-RS client. Next, the sed command serves the same purposes as previously but, in order to deploy our stack in the cloud, we need an S3 bucket. And since the S3 bucket names have to be unique worldwide, we need to generate them randomly and store them in a text file, such that to be able to find them later, when it comes to destroying it. Now, it's time to deploy our CloudFormation stack. Please notice the way we catch the associated URL, by using the --query and the --output option. This is the moment to build the bedrock-gateway-web module as we have now the AWS REST Gateway URL, which we're passing as an environment variable, via the -D option of Maven. At this point, we only have to start our Docker container and start testing. Testing the Application In order to test the application, be it locally or in the cloud, proceed as follows: Clone the repository: Shell $ git clone https://github.com/nicolasduminil/bedrock-gateway.git cdin the root directory: Shell $ cd bedrock-gateway Run the start script (run-local.sh or deploy.sh). The execution might take a while, especially if this is the first time you're running it. Fire your preferred browser to http://localhost:8082. You'll be presented with the screen below: Using the menu bar, select the Titan sandbox. A new screen will be presented to you, as shown below. Using the sliders, configure as you wish the parameters Temperature, Top P and Max tokens. Then type in the text area labeled Prompt your question the chosen FM. Its response will display in the rightmost text area labeled Response. Please use different combinations of parameters to notice the differences between the two FM responses. And in the case you're testing in the cloud, don't forget to run the script cleanup.sh when finished, such that to avoid being invoiced. Have fun!

By Nicolas Duminil

CORE

Analyze Your ALB/NLB Logs With ClickHouse

In the dynamic world of cloud computing, data engineers are constantly challenged with managing and analyzing vast amounts of data. A critical aspect of this challenge is effectively handling AWS Load Balancer Logs. This article examines the integration of AWS Load Balancer Logs with ClickHouse for efficient log analysis. We start by exploring AWS’s method of storing these logs in S3 and its queuing system for data management. The focus then shifts to setting up a log analysis framework using S3 and ClickHouse, highlighting the process with Terraform. The goal is to provide a clear and practical guide for implementing a scalable solution for analyzing AWS NLB or ALB access logs in real time. To understand the application of this process, consider a standard application using an AWS Load Balancer. Load Balancers, as integral components of AWS services, direct logs to an S3 bucket. This article will guide you through each step of the process, demonstrating how to make these crucial load-balancer logs available for real-time analysis in ClickHouse, facilitated by Terraform. However, before delving into the specifics of Terraform’s capabilities, it’s important to first comprehend the existing infrastructure and the critical Terraform configurations that enable the interaction between S3 and SQS for the ALB. Setting Up the S3 Log Storage Begin by establishing an S3 bucket for ALB log storage. This initial step is vital and involves linking an S3 bucket to your ALB. The process starts with creating an S3 Bucket, as demonstrated in the provided code snippet (see /example_projects/transfer/nlb_observability_stack/s3.tf#L1-L3). ProtoBuf resource "aws_s3_bucket" "nlb_logs" { bucket = var.bucket_name } The code snippet demonstrates the initial step of establishing an S3 bucket. This bucket is specifically configured for storing AWS ALB logs, serving as the primary repository for these logs. ProtoBuf resource "aws_lb" "alb" { /* your config */ dynamic "access_logs" { for_each = var.access_logs_bucket != null ? { enabled = true } : {} content { enabled = true bucket = var.bucket_name prefix = var.access_logs_bucket_prefix } } } Next, we configure an SQS queue that works in tandem with the S3 bucket. The configuration details for the SQS queue are outlined here. ProtoBuf resource "aws_sqs_queue" "nlb_logs_queue" { name = var.sqs_name policy = <<POLICY { "Version": "2012-10-17", "Id": "sqspolicy", "Statement": [ { "Effect": "Allow", "Principal": "*", "Action": "sqs:SendMessage", "Resource": "arn:aws:sqs:*:*:${var.sqs_name}", "Condition": { "ArnEquals": { "aws:SourceArn": "${aws_s3_bucket.nlb_logs.arn}" } } } ] } POLICY } This code initiates the creation of an SQS queue, facilitating the seamless delivery of ALB logs to the designated S3 bucket. As logs are delivered, they are automatically organized within a dedicated folder: Regularly generated new log files demand a streamlined approach for notification and processing. To establish a seamless notification channel, we'll configure an optimal push notification system via SQS. Referencing the guidelines outlined in Amazon S3's notification configuration documentation, our next step involves the creation of an SQS queue. This queue will serve as the conduit for receiving timely notifications, ensuring prompt handling and processing of newly generated log files within our S3 bucket. This linkage is solidified through the creation of the SQS queue (see /example_projects/transfer/nlb_observability_stack/s3.tf#L54-L61). ProtoBuf resource "aws_s3_bucket_notification" "nlb_logs_bucket_notification" { bucket = aws_s3_bucket.nlb_logs.id queue { queue_arn = aws_sqs_queue.nlb_logs_queue.arn events = ["s3:ObjectCreated:*"] } } The configurations established thus far form the core infrastructure for our log storage system. We have methodically set up the S3 bucket, configured the SQS queue, and carefully linked them. This systematic approach lays the groundwork for efficient log management and processing, ensuring that each component functions cohesively in the following orchestrated setup: The illustration above showcases the composed architecture, where the S3 bucket, SQS queue, and their interconnection stand as pivotal components for storing and managing logs effectively within the AWS environment. Logs are now in your S3 bucket, but reading these logs may be challenging. Let’s take a look at a data sample: Plain Text tls 2.0 2024-01-02T23:58:58 net/preprod-public-api-dt-tls/9f8794be28ab2534 4d9af2ddde90eb82 84.247.112.144:33342 10.0.223.207:443 244 121 0 15 - arn:aws:acm:eu-central-1:840525340941:certificate/5240a1e4-c7fe-44c1-9d89-c256213c5d23 - ECDHE-RSA-AES128-GCM-SHA256 tlsv12 - 18.193.17.109 - - "%ef%b5%bd%8" 2024-01-02T23:58:58 The snippet above represents a sample of the log data residing within the S3 bucket. Understanding this data's format and content will help us to build an efficient strategy to parse and store it. Let’s move this data to DoubleCloud Managed Clickhouse. Configuring VPC and ClickHouse With DoubleCloud The next step involves adding a Virtual Private Cloud (VPC) and a managed ClickHouse instance. These will act as the primary storage systems for our logs, ensuring secure and efficient log management (see /example_projects/transfer/nlb_observability_stack/network.tf#L1-L7). ProtoBuf resource "doublecloud_network" "nlb-network" { project_id = var.project_id name = var.network_name region_id = var.region cloud_type = var.cloud_type ipv4_cidr_block = var.ipv4_cidr } Next, we’ll demonstrate how to integrate a VPC and ClickHouse into our log storage setup. The following step is to establish a ClickHouse instance within this VPC, ensuring a seamless and secure storage solution for our logs (see /example_projects/transfer/nlb_observability_stack/ch.tf#L1-L35). ProtoBuf resource "doublecloud_clickhouse_cluster" "nlb-logs-clickhouse-cluster" { project_id = var.project_id name = var.clickhouse_cluster_name region_id = var.region cloud_type = var.cloud_type network_id = resource.doublecloud_network.nlb-network.id resources { clickhouse { resource_preset_id = var.clickhouse_cluster_resource_preset disk_size = 34359738368 replica_count = 1 } } config { log_level = "LOG_LEVEL_INFORMATION" max_connections = 120 } access { data_services = ["transfer"] ipv4_cidr_blocks = [ { value = var.ipv4_cidr description = "VPC CIDR" } ] } } data "doublecloud_clickhouse" "nlb-logs-clickhouse" { project_id = var.project_id id = doublecloud_clickhouse_cluster.nlb-logs-clickhouse-cluster.id } Integrating S3 Logs With ClickHouse To link S3 and ClickHouse, we utilize DoubleCloud Transfer, an ELT (Extract, Load, Transform) tool. The setup for DoubleCloud Transfer includes configuring both the source and target endpoints. Below is the Terraform code outlining the setup for the source endpoint (see /example_projects/transfer/nlb_observability_stack/transfer.tf#L1-L197). ProtoBuf resource "doublecloud_transfer_endpoint" "nlb-s3-s32ch-source" { name = var.transfer_source_name project_id = var.project_id settings { object_storage_source { provider { bucket = var.bucket_name path_prefix = var.bucket_prefix aws_access_key_id = var.aws_access_key_id aws_secret_access_key = var.aws_access_key_secret region = var.region endpoint = var.endpoint use_ssl = true verify_ssl_cert = true } format { csv { delimiter = " " // space as delimiter advanced_options { } additional_options { } } } event_source { sqs { queue_name = var.sqs_name } } result_table { add_system_cols = true table_name = var.transfer_source_table_name table_namespace = var.transfer_source_table_namespace } result_schema { data_schema { fields { field { name = "type" type = "string" required = false key = false path = "0" } field { name = "version" type = "string" required = false key = false path = "1" } /* Rest of Fields */ field { name = "tls_connection_creation_time" type = "datetime" required = false key = false path = "21" } } } } } } } This Terraform snippet details the setup of the source endpoint, including S3 connection specifications, data format, SQS queue for event notifications, and the schema for data in the S3 bucket. Next, we focus on establishing the target endpoint, which is straightforward with ClickHouse (see /example_projects/transfer/nlb_observability_stack/transfer.tf#L199-L215). ProtoBuf resource "doublecloud_transfer_endpoint" "nlb-ch-s32ch-target" { name = var.transfer_target_name project_id = var.project_id settings { clickhouse_target { clickhouse_cleanup_policy = "DROP" connection { address { cluster_id = doublecloud_clickhouse_cluster.nlb-logs-clickhouse-cluster.id } database = "default" password = data.doublecloud_clickhouse.nlb-logs-clickhouse.connection_info.password user = data.doublecloud_clickhouse.nlb-logs-clickhouse.connection_info.user } } } } The preceding code snippets for the source and target endpoints can now be combined to create a complete transfer configuration, as demonstrated in the following Terraform snippet (see /example_projects/transfer/nlb_observability_stack/transfer.tf#L217-L224). ProtoBuf resource "doublecloud_transfer" "nlb-logs-s32ch" { name = var.transfer_name project_id = var.project_id source = doublecloud_transfer_endpoint.nlb-s3-s32ch-source.id target = doublecloud_transfer_endpoint.nlb-ch-s32ch-target.id type = "INCREMENT_ONLY" activated = false } With the establishment of this transfer, a comprehensive delivery pipeline takes shape: The illustration above represents the culmination of our efforts — a complete delivery pipeline primed for seamless data flow. This integrated system, incorporating S3, SQS, VPC, ClickHouse, and the orchestrated configurations, stands ready to handle, process, and analyze log data efficiently and effectively at any scale. Exploring Logs in ClickHouse With ClickHouse set up, we now turn our attention to analyzing the data. This section guides you through querying your structured logs to extract valuable insights from the well-organized dataset. To begin interacting with your newly created database, the ClickHouse-client tool can be utilized: Shell clickhouse-client \ --host $CH_HOST \ --port 9440 \ --secure \ --user admin \ --password $CH_PASSWORD Begin by assessing the overall log count in your dataset. A straightforward query in ClickHouse will help you understand the scope of data you’re dealing with, providing a baseline for further analysis. Shell SELECT count(*) FROM logs_alb Query id: 6cf59405-2a61-451b-9579-a7d340c8fd5c ┌──count()─┐ │ 15935887 │ └──────────┘ 1 row in set. Elapsed: 0.457 sec. Now, we'll focus on retrieving a specific row from our dataset. Executing this targeted query allows us to inspect the contents of an individual log entry in detail. Shell SELECT * FROM logs_alb LIMIT 1 FORMAT Vertical Query id: 44fc6045-a5be-47e2-8482-3033efb58206 Row 1: ────── type: tls version: 2.0 time: 2023-11-20 21:05:01 elb: net/*****/***** listener: 92143215dc51bb35 client_port: 10.0.246.57:55534 destination_port: 10.0.39.32:443 connection_time: 1 tls_handshake_time: - received_bytes: 0 sent_bytes: 0 incoming_tls_alert: - chosen_cert_arn: - chosen_cert_serial: - tls_cipher: - tls_protocol_version: - tls_named_group: - domain_name: - alpn_fe_protocol: - alpn_be_protocol: - alpn_client_preference_list: - tls_connection_creation_time: 2023-11-20 21:05:01 __file_name: api/AWSLogs/******/elasticloadbalancing/eu-central-1/2023/11/20/****-central-1_net.****.log.gz __row_index: 1 __data_transfer_commit_time: 1700514476000000000 __data_transfer_delete_time: 0 1 row in set. Elapsed: 0.598 sec. Next, we'll conduct a simple yet revealing analysis. By running a “group by” query, we aim to identify the most frequently accessed destination ports in our dataset. Shell SELECT destination_port, count(*) FROM logs_alb GROUP BY destination_port Query id: a4ab55db-9208-484f-b019-a5c13d779063 ┌─destination_port──┬─count()─┐ │ 10.0.234.156:443 │ 10148 │ │ 10.0.205.254:443 │ 12639 │ │ 10.0.209.51:443 │ 13586 │ │ 10.0.223.207:443 │ 10125 │ │ 10.0.39.32:443 │ 4860701 │ │ 10.0.198.39:443 │ 13837 │ │ 10.0.224.240:443 │ 9546 │ │ 10.10.162.244:443 │ 416893 │ │ 10.0.212.130:443 │ 9955 │ │ 10.0.106.172:443 │ 4860359 │ │ 10.10.111.92:443 │ 416908 │ │ 10.0.204.18:443 │ 9789 │ │ 10.10.24.126:443 │ 416881 │ │ 10.0.232.19:443 │ 13603 │ │ 10.0.146.100:443 │ 4862200 │ └───────────────────┴─────────┘ 15 rows in set. Elapsed: 1.101 sec. Processed 15.94 million rows, 405.01 MB (14.48 million rows/s., 368.01 MB/s.) Conclusion This article has outlined a comprehensive approach to analyzing AWS Load Balancer Logs using ClickHouse, facilitated by DoubleCloud Transfer and Terraform. We began with the fundamental setup of S3 and SQS for log storage and notification, before integrating a VPC and ClickHouse for efficient log management. Through practical examples and code snippets, we demonstrated how to configure and utilize these tools for real-time log analysis. The seamless integration of these technologies not only simplifies the log analysis process but also enhances its efficiency, offering insights that are crucial for optimizing cloud operations. Explore the complete example in our Terraform project here for a hands-on experience with log querying in ClickHouse. The power of ClickHouse in processing large datasets, coupled with the flexibility of AWS services, forms a robust solution for modern cloud computing challenges. As cloud technologies continue to evolve, the techniques and methods discussed in this article remain pertinent for IT professionals seeking efficient and scalable solutions for log analysis.

By Andrei Tserakhau

Cybersecurity in the Cloud: Integrating Continuous Security Testing Within DevSecOps

Cloud computing has revolutionized software organizations' operations, offering unprecedented scalability, flexibility, and cost-efficiency in managing digital resources. This transformative technology enables businesses to rapidly deploy and scale services, adapt to changing market demands, and reduce operational costs. However, the transition to cloud infrastructure is challenging. The inherently dynamic nature of cloud environments and the escalating sophistication of cyber threats have made traditional security measures insufficient. In this rapidly evolving landscape, proactive and preventative strategies have become paramount to safeguard sensitive data and maintain operational integrity. Against this backdrop, integrating security practices within the development and operational workflows—DevSecOps—has emerged as a critical approach to fortifying cloud environments. At the heart of this paradigm shift is Continuous Security Testing (CST), a practice designed to embed security seamlessly into the fabric of cloud computing. CST facilitates the early detection and remediation of vulnerabilities and ensures that security considerations keep pace with rapid deployment cycles, thus enabling a more resilient and agile response to potential threats. By weaving security into every phase of the development process, from initial design to deployment and maintenance, CST embodies the proactive stance necessary in today's cyber landscape. This approach minimizes the attack surface and aligns with cloud services' dynamic and on-demand nature, ensuring that security evolves in lockstep with technological advancements and emerging threats. As organizations navigate the complexities of cloud adoption, embracing Continuous Security Testing within a DevSecOps framework offers a comprehensive and adaptive strategy to confront the multifaceted cyber challenges of the digital age. Most respondents (96%) of a recent software security survey believe their company would benefit from DevSecOps' central idea of automating security and compliance activities. This article describes the details of how CST can strengthen your cloud security and how you can integrate it into your cloud architecture. Key Concepts of Continuous Security Testing Continuous Security Testing (CST) helps identify and address security vulnerabilities in your application development lifecycle. Using automation tools, it analyzes your complete security structure and discovers and resolves the vulnerabilities. The following are the fundamental principles behind it: Shift-left approach: CST promotes early adoption of safety measures by bringing security testing and mitigation to the start of the software development lifecycle. This method reduces the possibility of vulnerabilities in later phases by assisting in the early detection and resolution of security issues. Automated security testing: Critical to CST is automation, which allows for consistent and quick evaluation of security measures, scanning for vulnerabilities, and code analysis. Automation ensures consistent and rapid security evaluation. Continuous monitoring and feedback: As part of CST, safety incidents and feedback chains are monitored in real-time, allowing security vulnerabilities to be identified and fixed quickly. Integrating Continuous Security Testing Into the Cloud Let's explore the phases involved in integrating CST into cloud environments. Laying the Foundation for Continuous Security Testing in the Cloud To successfully integrate Continuous Security Testing (CST), you must prepare your cloud environment first. Use a manual tool like OWASP or an automated security testing process to perform a thorough security audit and ensure your cloud environments are well-protected to lay a robust groundwork for CST. Before diving into integrating Continuous Security Testing (CST) within your cloud infrastructure, it's crucial to lay a solid foundation by meticulously preparing your cloud environment. This preparatory step involves conducting a comprehensive security audit to identify vulnerabilities and ensure your cloud architecture is fortified against threats. Leveraging tools such as the Open Web Application Security Project (OWASP) for manual evaluations or employing sophisticated automated security testing processes can significantly aid this endeavor. Conduct a detailed inventory of all assets and resources within your cloud architecture to assess your cloud environment's security posture. This includes everything from data storage solutions and archives to virtual machines and network configurations. By understanding the full scope of your cloud environment, you can better identify potential vulnerabilities and areas of risk. Next, systematically evaluate these components for security weaknesses, ensuring no stone is left unturned. This evaluation should encompass your cloud infrastructure's internal and external aspects, scrutinizing access controls, data encryption methods, and the security protocols of interconnected services and applications. Identifying and addressing these vulnerabilities at this stage sets a robust groundwork for the seamless integration of Continuous Security Testing, enhancing your cloud environment's resilience to cyber threats and ensuring a secure, uninterrupted operation of cloud-based services. By undertaking these critical preparatory steps, you position your organization to leverage CST effectively as a dynamic, ongoing practice that detects emerging threats in real-time and integrates security seamlessly into every phase of your cloud computing operations. Establishing Effective Security Testing Criteria The cornerstone of implementing Continuous Security Testing (CST) within cloud ecosystems is meticulously defining the security testing requirements. This pivotal step involves identifying a holistic suite of testing methodologies encompassing your security landscape, ensuring thorough coverage and protection against potential vulnerabilities. A multifaceted approach to security testing is essential for a robust defense strategy. This encompasses a variety of criteria, such as: Vulnerability scanning: Systematic examination of your cloud environment to identify and classify security loopholes. Penetration testing: Simulated cyber attacks against your system to evaluate the effectiveness of security measures. Compliance inspections: Assessments to ensure that cloud operations adhere to industry standards and regulatory requirements. Source code analysis: Examination of application source code to detect security flaws or vulnerabilities. Configuration analysis: Evaluation of system configurations to identify security weaknesses stemming from misconfigurations or outdated settings. Container security analysis: Analysis focused on the security of containerized applications, including their deployment, management, and orchestration. Organizations can proactively identify and rectify security vulnerabilities within their cloud architecture by selecting the appropriate mix of these testing criteria. This proactive stance enhances the overall security posture and embeds a culture of continuous improvement and vigilance across the cloud computing landscape. Adopting a comprehensive and systematic approach to security testing ensures that your cloud environment remains resilient against evolving cyber threats, safeguarding your critical assets and data effectively. Choosing the Right Security Testing Tools for Automation The transition to automated security testing tools is critical for achieving faster and more accurate security assessments, significantly reducing the manual effort, workforce involvement, and resources dedicated to routine tasks. A diverse range of tools exists to support this need, including Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and safety measures for Infrastructure as Code (IaC) etc. These technologies are easy to integrate into Continuous Integration/Continuous Deployment (CI/CD) pipelines and improve security by finding and fixing vulnerabilities before development. More than half of DevOps teams conduct SAST scans, 44% conduct DAST scans, and almost 50% inspect containers and dependencies as part of their security measures. However, when choosing the right tool for automation, consider features like ease of use, the ability to get updated with the vulnerability, and ROI vs. the cost of the tool. When choosing the right automation tools, evaluating them based on several critical factors beyond their primary functionalities is vital. The ease of integration into existing workflows, their capacity for timely updates in response to new vulnerabilities, and the balance between their cost and the return on investment they offer are crucial considerations. These factors ensure that the selected tools enhance security measures and align with the organization's overall security strategy and resource allocation, facilitating a more secure and efficient development lifecycle. Continuous Monitoring and Improvement The bedrock of maintaining an up-to-date and secure cloud infrastructure lies in the practices of continuous monitoring and iterative improvement throughout the entirety of its lifecycle. Integrate your cloud log with Security Information and Event Management (SIEM) capabilities to get centralized security intelligence and initiate continuous monitoring and improvement. Similarly, ELK Stack (Elasticsearch, Logstash, Kibana) is another tool that can help you visualize, collect, and analyze your log data. Regularly monitoring your security landscape and adapting based on the insights gleaned from testing and monitoring outputs are essential. Such a proactive approach not only aids in preemptively identifying and mitigating potential threats but also ensures that your security framework remains robust and adaptive to the ever-evolving cyber threat landscape. Strategic Risk Management and Mitigation Efforts Effective security management requires a strategic approach to evaluating and mitigating vulnerabilities, guided by their criticality, exploitability, and potential repercussions for the organization. Utilizing threat modeling techniques enables a targeted allocation of resources, focusing on areas of highest risk to reduce exposure and avert potential security incidents. Following identifying critical vulnerabilities, devising and executing a comprehensive risk mitigation strategy is imperative. This strategy should encompass a range of solutions tailored to diminish the identified risks, including the deployment of software patches and updates, the establishment of enhanced security protocols, the integration of additional safeguarding measures, or even the strategic overhaul of existing systems and processes. Organizations can fortify their defenses by prioritizing and systematically addressing vulnerabilities based on severity and impact, ensuring a more secure and resilient operational environment. Benefits of Continuous Security Testing in the Cloud There are numerous benefits of using continuous security testing in cloud environments. Early vulnerability detection: Using CST, you can identify security issues early on and address them before they pose a risk. Enhanced security quality: To better defend your cloud infrastructure against cyberattacks, security testing gives it an additional layer of protection. Enhanced innovation and agility: CST enables faster release cycles by identifying risks early on, allowing you to take proactive measures to counter them. Enhanced team collaboration: CST promotes collaboration between different teams to cultivate a culture of collective accountability for security. Compliance with industry standards: By routinely assessing its security controls and procedures, you can lessen the possibility of fines and penalties for noncompliance with corporate policies and legal requirements. Conclusion In the rapidly evolving landscape of cloud computing, Continuous Security Testing (CST) emerges as a cornerstone for safeguarding cloud environments against pervasive cyber threats. By weaving security seamlessly into the development fabric through automation and vigilant monitoring, CST empowers organizations to detect and neutralize vulnerabilities preemptively. The adoption of CST transcends mere risk management; it fosters an environment where security, innovation, and collaboration converge, propelling businesses forward. This synergistic approach elevates organizations' security posture and instills a culture of continuous improvement and adaptability. As businesses navigate the complexities of the digital age, implementing CST positions them to confidently address the dynamic nature of cyber threats, ensuring resilience and securing their future in the cloud.

By Prithvish Kovelamudi

Understanding the 2024 Cloud Security Landscape

With technology and data growing at an unprecedented pace, cloud computing has become a no-brainer answer for enterprises worldwide to foster growth and innovation. As we swiftly move towards the second quarter of 2024, predictions by cloud security reports highlight the challenges of cloud adoption in the cloud security landscape. Challenges Gartner Research forecasts a paradigm shift in adopting public cloud Infrastructure as a Service (IaaS) offerings. By 2025, a staggering 80% of enterprises are expected to embrace multiple public cloud IaaS solutions, including various Kubernetes (K8s) offerings. This growing reliance on cloud infrastructure raises the critical issue of security, which the Cloud Security Alliance painfully highlights. According to the Cloud Security Alliance(CSA), only 23% of organizations report full visibility into their cloud environments. This lack of visibility, despite the vast potential of cloud technologies, can make organizations susceptible to potential threats within their infrastructure. Another issue that compounds the cloud visibility issues even further is duplicate alerts. A staggering 63% of organizations face duplicate security alerts, hindering security teams' ability to sort genuine threats from noise. The challenge above can be mitigated using a unified security approach, but it has been discovered that 61% of organizations are utilizing between 3 to 6 different tools. The landscape becomes more complicated to understand, highlighting the urgency of covering gaps in security defense mechanisms. A well-defined security defense mechanism minimizes manual intervention from security teams and promotes the need for automation and streamlined processes in operations. Security teams spending most of their time on manual tasks associated with security alerts not only discourages efficient resource use but also diminishes the productivity of teams working towards addressing critical security vulnerabilities. CSA statistics reveal that only a mere 18% of organizations take more than four days to remediate critical vulnerabilities, underscoring the urgency of this issue. Such delays leave systems vulnerable to potential breaches and compromises and highlight the pressing need for action. Moreover, the recurrence of vulnerabilities within a month of remediation underscores the necessity for proactive team collaboration. According to CSA, inefficient collaboration between security and development teams inadvertently creates defense gaps and heightens the risk of exploitation. By promoting communication between these critical teams, organizations can better strengthen their defenses and mitigate security threats. It is clear that the cloud security landscape requires a more comprehensive approach to gaining visibility into cloud environments. By implementing the best practices outlined below, organizations can move closer to their objective of establishing secure and resilient cloud infrastructure. Best Practices This section will delve into the essential pillars of cloud security for safeguarding your cloud assets, beginning with the following: Unified Security One of the main challenges in cloud security adoption is the lack of a unified security framework. A Unified Security Framework comprises various tools and processes that collect information from different systems and display it cohesively on one screen. When compared with traditional security tools which require their own set of architecture to work and then require additional add-ons to collect data, unified security solutions are a better way to gain a holistic view of an organization's security posture. The Unified Security framework consolidates various security processes, such as threat intelligence, access controls, and monitoring capabilities, to streamline visibility and management while facilitating collaboration between different teams, such as IT, security, and compliance. Zero Trust Architecture (ZTA) Zero Trust Architecture (ZTA) uses a "never trust, always verify" approach. All the stages of cloud data communication, regardless of their location in the cloud hierarchy, should be protected with verification mechanisms and adhere to zero-trust solutions. An effective zero-trust solution implemented over a cloud architecture should inspect all the unencrypted and encrypted traffic before it reaches its desired destination, with the access requests for the requested data verified beforehand for their identity and requested content. Adaptive custom access control policies should be implemented that not only change contexts based on the attack surface but also eliminate the risk of any false movements that compromise the functionality of devices. By adopting the zero-trust practices mentioned, organizations can implement robust identity and access management (IAM) with granular protection for applications, data, networks, and infrastructure. Encryption Everywhere Data encryption is a major challenge for many organizations, which can be mitigated by encrypting data at rest and in transit. An encryption-as-a-service solution can be implemented, which provides centralized encryption management for authorizing traffic across data clouds and centers. All application data can be encrypted with one centralized encryption workflow, which ensures the security of sensitive information. The data will be governed by identity-based policies, which ensure cluster communication is verified and services are authenticated based on trusted authority. Moreover, encrypting data across all layers of the cloud infrastructure—including applications, databases, and storage—increases the overall consistency and automation of cloud security. Automated tools can streamline the encryption process while making it easier to apply encryption policies consistently across the entire infrastructure. Continuous Security Compliance Monitoring Continuous security compliance monitoring is another crucial pillar for strengthening the cloud security landscape. Organizations specifically operating in healthcare (subject to HIPAA regulations) and payments (under PCI DSS guidelines) involve rigorous assessment of infrastructure and processes to protect sensitive information. To comply with these regulations, continuous compliance monitoring can be leveraged to automate the continuous scanning of cloud infrastructure for compliance gaps. The solutions can analyze logs and configuration for security risks by leveraging the concept of "compliance as code," where security considerations are embedded into every stage of the software development lifecycle (SDLC). By implementing these streamlined automated compliance checks and incorporating them into each stage of development, organizations can adhere to regulatory mandates while maintaining agility in cloud software delivery. Conclusion To conclude, achieving robust cloud security necessitates using a Unified Security approach with a Zero-Trust Architecture through continuous encryption and compliance monitoring. By adopting these best practices, organizations can strengthen their defenses against evolving cyber threats, safeguard sensitive data, and build trust with customers and stakeholders.

By Rajat Gupta

Automatic Snapshots Using Snapshot Manager

Overview AWS has a Lambda function that can be leveraged to take automatic snapshots of the KDA (Kinesis Data Analytics) applications that are running in a specific region. Refer the Lambda function source code here. Users can modify the Lambda function to their needs. Note: As of August 30, 2023, Amazon Kinesis Data Analytics (KDA) has been renamed to Amazon Managed Service for Apache Flink. What Does It Do? Takes snapshots of all the running KDA applications. Get a list of the number of snapshots each KDA application has. Check if the number of snapshots is greater than the threshold number specified. Delete any old snapshots if they exceed the threshold number. Resources Used Lambda Functions CloudWatch events (Amazon EventBridge) SNS topic AWS KDA (Kinesis Data Analytics) AWS DynamoDB table (optional) IAM Roles/Policies Required "Resources": "*" CloudWatch Logs Policies "logs:CreateLogStream" "logs:CreateLogGroup" "logs:PutLogEvents" Kinesis Analytics Policies "kinesisanalytics:ListApplicationSnapsots" "kinesisanalytics:DescribeApplicationSnapshot" "kinesisanalytics:DescribeApplication" "kinesisanalytics:DeleteApplicationSnapshot" "kinesisanalytics:ListApplications" "kinesisanalytics:CreateApplicationSnapshot" Amazon SNS Policies "sns:Publish" Architecture When a schedule is being created, each time a CloudWatch event is run, it triggers the Snapshot Manager lambda function which will perform the snapshotting and then send out the notification on to the SNS topic. Note: DynamoDB might not be used in most cases (but it's shown in the diagram if it's required to store metadata about snapshots in some cases). What Happens Inside the Lambda Function? When the Lambda function is invoked, it first reads all the environment variables. Then, a call is made to retrieve a list of all running KDA applications. Next, it iterates through the list to check if each KDA application is in a running state. If not, or if it's not currently healthy, it throws an error and sends a notification to the specified SNS topic. If the KDA application is running, a snapshot is taken for it. The snapshot creation process is initiated. If it times out, it throws an error and sends a notification to the SNS topic. Upon successful completion of the snapshot, the deletion of the older snapshots is initiated. It retrieves a list of all snapshots and checks whether the application's snapshot count exceeds the maximum number configured as an environment variable. If the number exceeds the maximum, the oldest snapshot is deleted. Resource Creation SNS Topic Create an SNS topic to send out emails/any other notifications on failed snapshots. Once that is done, we also need to create a subscription to that topic. Both success and failure snapshots are sent to the topic. That can be changed to only send notifications on failures rather than every successful one. Once the failures we need to send out notifications We might need an email to get the notifications Configure the message to also send out the failure messages Alternatively, you can also send a notification to Slack Lambda Firstly, we need to create a Lambda function. Select Python as the language and x86_64 as the architecture. Once done, we need to paste the Snapshot Manager Python code into the editor. We'll also need to make a few config changes as mentioned below. Change the timeout to be around "X" minutes (configurable) to make sure that all the KDA application snapshots are completed. Add a few environment variables: aws_region: To get a list of all the applications running in that region number_of_older_snapshots_to_retain: To retain the number of old snapshots snapshot_creation_wait_time_seconds: Wait for this before you check whether the snapshot has been completed or not. We perform a check for 4 times and then fail if it doesn't get completed after 4 retries. agents: Number of parallel snapshots that can be taken at a single time. realm: Get a list of pipelines in which the realm name sns_topic_arn: The ARN of the SNS topic on which we send out the email notification in case of a failure. Amazon EventBridge CloudWatch events have been renamed to Amazon EventBridge. We can create a rule that executes as specified. When creating a rule, we can specify a schedule for when the expression should run. This can be a cron job expression or a fixed rate. We need to determine how often this rule needs to be invoked, whether it's every "X" days, hours, or minutes. Once the schedule is defined, we need to select the Lambda function that was previously created. Sample Settings Number of old snapshot retentions: 100 (to keep at least 2 days of snapshots) Frequency: 30 minutes Number of parallel runs to take snapshots (parallelism): 5 Failure Scenarios Here are some failure scenarios and how snapshots help or not: Scenario 1 A KDA pipeline has restarted due to an error in the Flink. In this scenario, the KDA pipeline automatically restarts. However, it does not take periodic snapshots to recover from failures. Checkpointing, a feature in Flink that runs approximately every 30 seconds to 1 minute, comes into play here. The pipeline will recover from the latest checkpoint. Scenario 2 A KDA pipeline is stuck in a restart loop due to changes, such as bad configurations or source/sync changes. When attempting to update the KDA application, the update fails because the application tries to take a snapshot before the update is performed. Since the KDA pipeline is stuck in a restart loop, taking a snapshot fails. In such cases, the snapshot manager's periodic snapshots become essential. First, we need to stop the pipeline from running, perform the required changes/updates, and then restart the application from the latest snapshot. No data will be lost, but there might be duplicates because the pipeline will reprocess all messages from the last snapshot. A dedupe functionality can be added to de-duplicate messages. Scenario 3 In some cases, even a single snapshot cannot be taken because the application is stuck in a restart loop. In such situations, we adjust the configuration and restart the pipeline without any snapshots since none were taken. In this particular case, there will be a loss of data.

By Shailendra Suryawanshi

Cloud Architecture

DZone's Featured Cloud Architecture Resources

Top Cloud Architecture Experts

The Latest Cloud Architecture Topics