MongoDB Developer

Coding with MongoDB - news for developers, tips and deep dives

February 22, 2021

Developer

Exploring Data with MongoDB Atlas, Databricks, and Google Cloud

MongoDB Atlas supports Google Cloud (GC), enabling you to easily spin up managed MongoDB clusters within GC in minutes. We’re excited to share that Databricks recently launched Databricks on GC, giving customers the freedom to move and analyze their data within GC and MongoDB Atlas. With the latest update to Databricks, it’s now easier to get started with a cloud-first approach on GC that leverages MongoDB Atlas, with its flexible data model designed for modern applications, and Databricks for more advanced analytics use cases.

The following tutorial illustrates how to use MongoDB Atlas on GC and Databricks. We’ll use sample sales data in MongoDB Atlas and calculate the rolling average using Databricks on GC. This tutorial covers the following:

- How to read data from MongoDB Atlas on GC into Spark
- How to run the MongoDB Connector for Spark as a library in Databricks
- How to use the PySpark libraries to perform rolling averages of sales data
- How to write these averages back to MongoDB so they are accessible to applications

Create Databricks Workspace

To provision a new Databricks workspace, you will need to have a GC project already created. If you do not already have a Databricks cluster deployed on GC, follow the online documentation to create one.

Note: It is important to follow the documentation, because there are a few key settings you will need to make in your GC project, such as enabling the container.googleapis.com, storage.googleapis.com, and deploymentmanager.googleapis.com services and adjusting certain Google Cloud quotas before creating your Databricks cluster.

In this example we have already created the Google Cloud project mongodb-supplysales and are ready to go to the Google Marketplace and add Databricks to our project. Within your Google project, click on “Marketplace” and enter “Databricks” in the search box. Click on the resulting tile and follow the instructions. Once your Databricks cluster is created, navigate to it with the URL provided. Here you can create a new workspace. Once you’ve created your workspace, you will be able to launch it from the URL provided. Logging into your workspace brings up the welcome screen.

In this article, we will create a notebook to read data from MongoDB and use the PySpark libraries to perform the rolling average calculation. We can create our Databricks cluster by selecting the “+ Create Cluster” button from the Clusters menu.

Note: For the purposes of this walkthrough we chose only one worker and preemptible instances; in a production environment you would want to include more workers and autoscaling.

Before we create our cluster, we have the option under Advanced Options to provide Spark configuration variables. A common Spark config setting is to define spark.mongodb.output.uri and spark.mongodb.input.uri. First we need to create the MongoDB Atlas cluster so we have a connection string to enter for these values. At this point, open a new browser tab and navigate to MongoDB Atlas.

Prepare a MongoDB Atlas Instance

Once in the MongoDB Atlas portal, you will need to do the following before you can use Atlas with Databricks:

- Create your MongoDB Atlas cluster
- Define user credentials for use in the Spark connector
- Define network access
- Add sample data (optional for this article)

Create Your MongoDB Atlas Cluster

If you already have a MongoDB Atlas account, log in and create a new Atlas cluster.
If you do not have an account, you can sign up and set up a free cluster. Once your account is set up, you can create a new Atlas cluster by using the “+ New Cluster” dialog. MongoDB provides a free tier for Google Cloud. Once you provide a cluster name and click on “create,” Atlas will take approximately five to seven minutes to create your Atlas cluster.

Define Database Access

By default there are no users created in an Atlas cluster. To create an identity for our Spark cluster to connect to MongoDB Atlas, launch the “Add New Database User” dialog from the Database Access menu item. Notice that there are three options for authentication to MongoDB Atlas: Password, Certificate, and AWS IAM authentication. Select “Password,” and enter a username and password. Atlas provides granular access control: For example, you could restrict this user account to work only with a specific Atlas cluster, or define the account as temporary and have Atlas expire it after a specific time period.

Define Network Access

MongoDB Atlas does not allow any connection from the internet by default. You need to include MongoDB Atlas as part of a VPC peering or AWS PrivateLink configuration. If you do not have that set up with your cloud provider, you need to specify from which IP addresses Atlas can accept incoming connections. You can do this via the “Add IP Address” dialog in the Network Access menu. In this article, we will add “0.0.0.0/0,” allowing access from anywhere, because we don’t know specifically which IP our Databricks cluster will be running on. MongoDB Atlas can also make this IP access list entry temporary, which is useful for situations where you need to allow broad access for a limited time.

Add Sample Data

Now that we have added our user account and allowed network access to our Atlas cluster, we need to add some sample data. Atlas provides several sample collections that are accessible from the menu item on the cluster. In this example, we will use the sales collection within the sample_supplies database.

Update Spark Configuration with Atlas Connection String

Copy the MongoDB Atlas connection string by clicking on the Connect button and selecting “Connect your application.” Copy the contents of the connection string and note the placeholders for username and password. You will have to change those to your own credentials. Return to your Databricks workspace. Under Advanced Options in your Databricks cluster configuration, paste the connection string for both the spark.mongodb.output.uri and spark.mongodb.input.uri variables. Note that you will need to update the credentials in the MongoDB Atlas connection string with those you defined previously. For simplicity in your PySpark code, change the default database in the connection string from MyFirstDatabase to sample_supplies. (This is optional, because you can always define the database name via Spark configuration options at runtime.)

Start the Databricks Cluster

Now that your Spark config is set, start the cluster.

Note: If the cluster fails to start, check the event log and view the JSON tab. For example, an error will appear there if you forgot to increase the SSD storage quota.

Add MongoDB Spark Connector

Once the cluster is up and running, click on “Install New” from the Libraries menu. Here we have a variety of ways to create a library, including uploading a JAR file or downloading the Spark connector from Maven.
In this example, we will use Maven and specify org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 as the coordinates. Click on “Install” to add our MongoDB Spark Connector library to the cluster.

Note: If you get the error message “Maven libraries are only supported on Databricks Runtime version 7.3 LTS, and versions >= 8.1,” you can download the MongoDB Spark Connector JAR file from http://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/3.0.1/ and then upload it to Databricks by using the Upload menu option.

Create a New Notebook

Click on the Databricks home icon from the menu and select “Create a blank notebook.” Attach this new notebook to the cluster you created in the previous step. Because we defined our MongoDB connection string as part of the cluster’s Spark config, your notebook already has the MongoDB Atlas connection context. In the first cell, paste the following:

from pyspark.sql import SparkSession

pipeline = "[{'$match': { 'items.name': 'printer paper' }}, {'$unwind': { path: '$items' }}, {'$addFields': { totalSale: { '$multiply': [ '$items.price', '$items.quantity' ] } }}, {'$project': { saleDate: 1, totalSale: 1, _id: 0 }}]"

salesDF = spark.read.format("mongo").option("collection", "sales").option("pipeline", pipeline).option("partitioner", "MongoSinglePartitioner").load()

Run the cell to make sure you can connect to the Atlas cluster.

Note: If you get an error such as “MongoTimeoutException,” make sure your MongoDB Atlas cluster has the appropriate network access configured.

The notebook gives us a schema view of what the data looks like. Although we could have continued to transform the data in the MongoDB pipeline before it reached Spark, let’s use PySpark to transform it. Create a new cell and enter the following:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

salesAgg = salesDF.withColumn('saleDate', F.col('saleDate').cast('date')).groupBy("saleDate").sum("totalSale").orderBy("saleDate")

w = Window.orderBy('saleDate').rowsBetween(-7, 0)
df = salesAgg.withColumn('rolling_average', F.avg('sum(totalSale)').over(w))
df.show(truncate=False)

In this cell we perform some additional transformation of the data, such as grouping it by saleDate and summing totalSale per day. Once the data is in our desired format, we define a window covering the past seven entries and then add a column to our data frame that is a rolling average of the total sales data. Once the code is executed, the notebook displays our new dataframe with the rolling average column.

Once we have performed our analytics, we can write the data back to MongoDB for additional reporting, analytics, or archiving. In this scenario, we are writing the data back to a new collection called sales-averages:

df.write.format("mongo").option("collection", "sales-averages").save()

You can see the data by using the Collections tab within the MongoDB Atlas cluster UI. With the data in MongoDB Atlas, you can now leverage many of the services available, including Atlas Online Archive, Atlas Search, and Atlas Data Lake.

Summary

The integration between MongoDB Atlas, Google Cloud, and Databricks enables you to gain deep insights into your data and gives you freedom to move and analyze data as your needs evolve. Check out the resources below for more information:

- Getting started with MongoDB Atlas
- MongoDB Spark Connector
- MongoDB Atlas on Google Cloud
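One note on the Spark configuration step above: instead of baking the connection string into the cluster-level Spark config, the MongoDB Spark Connector also accepts it per read or write through the uri option. The snippet below is a hedged sketch, not part of the original tutorial; the cluster host, credentials, and the sales-sample scratch collection are placeholders you would replace with your own.

# In a Databricks notebook the `spark` session is already defined.
# Placeholder Atlas connection string; substitute your own credentials and host.
uri = "mongodb+srv://user:password@cluster0.example.mongodb.net/sample_supplies"

salesDF = spark.read.format("mongo") \
    .option("uri", uri) \
    .option("collection", "sales") \
    .load()  # overrides spark.mongodb.input.uri for this read only

salesDF.limit(10).write.format("mongo") \
    .option("uri", uri) \
    .option("collection", "sales-sample") \
    .mode("append") \
    .save()  # overrides spark.mongodb.output.uri for this write only

Supplying the URI per operation can be handy when a notebook needs to read from one cluster and write to another without changing the cluster-wide Spark config.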

May 11, 2021
Developer

MongoDB Query API Webinar: FAQ

Last week we held a live webinar on the MongoDB Query API and our lineup of idiomatic programming language drivers. There were many great questions during the session, and in this post I want to share the most frequently asked ones with you. But first, here is a quick summary of what MongoDB Query API is all about, in case you are unfamiliar with it.

What is MongoDB Query API?

MongoDB is built upon the document data model. The document model is designed to be intuitive, flexible, universal, and powerful. You can easily work with a variety of data, and because documents map directly to the objects in your code, they fit naturally into your app development experience. MongoDB Query API lets you work with data as code and build any class of application faster by giving you extensive query capabilities natively in any modern programming language. Whether you’re working with transactional data, looking for search capabilities, or trying to run sophisticated real-time analytics, MongoDB Query API can meet your needs.

MongoDB Query API has some unique features, such as its expressive query language, primary and secondary indexes, powerful aggregations and transformations, on-demand materialized views, and more — enabling you to work with data of any structure, at any scale. Some key features to highlight:

Indexes

To optimize any workload and query pattern, you can take advantage of a large set of index types like multi-key (for arrays), wildcard, geospatial, and more, and index any field no matter how deeply nested it is within your documents. Fully featured secondary indexes are document-optimized and include partial, unique, case-insensitive, and sparse indexes.

Aggregation Pipeline

The aggregation pipeline lets you group, transform, and analyze your data to support any class of workload. You can choose from dozens of aggregation stages and over 200 operators to build modular and expressive pipelines. You can also use low-code tools like MongoDB Compass to drag and drop stages, examine intermediate output, and export the pipeline to your programming language of choice.

On-Demand Materialized Views

The powerful $merge aggregation stage allows you to combine the results of your aggregation pipeline with existing collections to update and enrich data without having to recompute your entire data set. You can output results to sharded and unsharded collections while simultaneously defining indexes on each view.

Geospatial and Graph

- Utilize MongoDB’s built-in, native ability to store and run queries against geospatial data
- Use operators like $graphLookup to quickly traverse connected data sets

These are just a few of the features we highlighted in the MongoDB Query API webinar. No matter what type of application you are thinking of building or managing, MongoDB Query API can meet your needs as the needs of your users and application change.

FAQs for MongoDB Query API

Here are the most common questions asked during the webinar:

Do we have access to the data sets presented in this webinar?
Yes, you can easily create a cluster and load the sample data sets into Atlas. Instructions on how to get started are here.

How can I access full-text search capabilities?
Text search is a standard feature of MongoDB Atlas. You can go to cloud.mongodb.com to try it out using sample data sets.

Does the VS Code plugin support aggregation?
Yes, it does. You can learn more about the VS Code plugin on our docs page.
If you need to pass variable values in the aggregation, say the price range from the app as an input, how would you do that?
This is no different than sending a query: since you construct your aggregation in your application, you simply fill in the field you want with the value or variable in your code. (A minimal sketch is included at the end of this post.)

Is there any best practice document on MongoDB Query API for achieving stable performance while utilizing minimum resources?
Yes, we have tips and tricks on optimizing performance by utilizing indexes, filters, and tools here.

Does MongoDB support the use of multiple different indexes to meet the needs of a single query?
Yes, this can be accomplished by the use of compound indexes. You can learn more about it in our docs here.

If you work with big data and create a collection, is it smarter to create indexes first or after the collection is filled (regarding the time to create a collection)?
It is better to create the indexes first, as they will take less time to create if the collection is empty, but you still have the option to create an index once the data is in the collection. MongoDB’s indexing capabilities provide several benefits:

- When building indexes, there is no impact on your app’s availability, since the index operation is online.
- Flexibility to add and remove indexes at any time.
- Ability to hide indexes to evaluate the impact of removing them before officially dropping them.

Where do I go to learn more?
Here are some resources to help you get started:

- MongoDB Query API page
- MongoDB University
- MongoDB Docs

You can also check out the webinar replay here.
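To make the variable-passing answer above concrete, here is a minimal PyMongo sketch. It assumes a hypothetical products collection with price and category fields (not part of the webinar data sets), and a placeholder connection string; the price bounds arrive as ordinary Python variables and slot into the pipeline like any other value.

from pymongo import MongoClient

# Hypothetical connection string and collection; adjust to your own cluster.
client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
products = client["shop"]["products"]

def average_price_by_category(min_price, max_price):
    # The price range comes in as function arguments and is placed
    # directly into the $match stage, just like any other Python value.
    pipeline = [
        {"$match": {"price": {"$gte": min_price, "$lte": max_price}}},
        {"$group": {"_id": "$category", "avgPrice": {"$avg": "$price"}}},
        {"$sort": {"avgPrice": -1}},
    ]
    return list(products.aggregate(pipeline))

print(average_price_by_category(10, 50))

The same pattern applies in any driver: the pipeline is just a data structure in your language, so application variables can be inserted straight into it.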

May 7, 2021
Developer

How to Get Started with MongoDB Atlas and Confluent Cloud

Every year more and more applications are leveraging the public cloud and reaping the benefits of elastic scale and rapid provisioning. Forward-thinking companies such as MongoDB and Confluent have embraced this trend, building cloud-based solutions such as MongoDB Atlas and Confluent Cloud that work across all three major cloud providers.

Companies across many industries have been leveraging Confluent and MongoDB to drive their businesses forward for years. From insurance providers gaining a customer-360 view for a personalized experience to global retail chains optimizing logistics with a real-time supply chain application, the connected technologies have made it easier to build applications with event-driven data requirements. The latest iteration of this technology partnership simplifies getting started with a cloud-first approach, ultimately improving developers’ productivity when building modern cloud-based applications with data in motion.

Today, the MongoDB Atlas source and sink connectors are generally available within Confluent Cloud. With Confluent’s cloud-native service for Apache Kafka® and these fully managed connectors, setup of your MongoDB Atlas integration is simple. There is no need to install Kafka Connect or the MongoDB Connector for Apache Kafka, or to worry about scaling your deployment. All the infrastructure provisioning and management is taken care of for you, enabling you to focus on what brings you the most value — developing and releasing your applications rapidly.

Let’s walk through a simple example of taking data from a MongoDB cluster in Virginia and writing it into a MongoDB cluster in Ireland. We will use a Python application to write fictitious data into our source cluster.

Step 1: Set Up Confluent Cloud

First, if you’ve not done so already, sign up for a free trial of Confluent Cloud. You can then use the Quick Start for Apache Kafka using Confluent Cloud tutorial to create a new Kafka cluster. Once the cluster is created, you need to enable egress IPs and copy the list of IP addresses. This list of IPs will be used as an IP allow list in MongoDB Atlas. To locate this list, select “Cluster Settings” and then the “Networking” tab. Keep this tab open for future reference: you will need to copy these IP addresses into the Atlas cluster in Step 2.

Step 2: Set Up the Source MongoDB Atlas Cluster

For a detailed guide on creating your own MongoDB Atlas cluster, see the Getting Started with Atlas tutorial. For the purposes of this article, we have created an M10 MongoDB Atlas cluster using the AWS cloud in the us-east-1 (Virginia) data center to be used as the source, and an M10 MongoDB Atlas cluster using the AWS cloud in the eu-west-1 (Ireland) data center to be used as the sink. Once your clusters are created, you will need to configure two settings in order to make a connection: database access and network access.

Network Access

You have two options for allowing secure network access from Confluent Cloud to MongoDB Atlas: You can use AWS PrivateLink, or you can secure the connection by allowing only specific IP connections from Confluent Cloud to your Atlas cluster. In this article, we cover securing via IPs. For information on setting up using PrivateLink, read the article Using the Fully Managed MongoDB Atlas Connector in a Secure Environment.

To accept external connections in MongoDB Atlas via specific IP addresses, launch the “IP Access List” entry dialog under the Network Access menu.
Here you add all the IP addresses that were listed in Confluent Cloud from Step 1. Once all the egress IPs from Confluent Cloud are added, you can configure the user account that will be used to connect from Confluent Cloud to MongoDB Atlas. Configure user authentication in the Database Access menu.

Database Access

You can authenticate to MongoDB Atlas using username/password, certificates, or AWS identity and access management (IAM) authentication methods. To create a username and password that will be used for the connection from Confluent Cloud, select the “+ Add New Database User” option from the Database Access menu. Provide a username and password and make a note of this credential, because you will need it in Step 3 and Step 4 when you configure the MongoDB Atlas source and sink connectors in Confluent Cloud.

Note: In this article we are creating one credential and using it for both the MongoDB Atlas source and MongoDB Atlas sink connectors. This is because both of the clusters used in this article are from the same Atlas project.

Now that the Atlas cluster is created, the Confluent Cloud egress IPs are added to the MongoDB Atlas allow list, and the database access credentials are defined, you are ready to configure the MongoDB Atlas source and MongoDB Atlas sink connectors in Confluent Cloud.

Step 3: Configure the Atlas Source

Now that you have two clusters up and running, you can configure the MongoDB Atlas connectors in Confluent Cloud. To do this, select “Connectors” from the menu, and type “MongoDB Atlas” in the Filters textbox.

Note: When configuring the MongoDB Atlas source and MongoDB Atlas sink, you will need the connection hostname of your Atlas clusters. You can obtain this hostname from the MongoDB connection string. An easy way to do this is by clicking on the “Connect” button for your cluster, which launches the Connect dialog. You can choose any of the Connect options; for purposes of illustration, if you click on “Connect using MongoDB Compass,” you will see the full connection string, and its hostname portion is what you will use when configuring the source and sink connectors in Confluent Cloud.

Configuring the MongoDB Atlas Source Connector

Selecting “MongoDbAtlasSource” from the list of Confluent Cloud connectors presents you with several configuration options. The “Kafka Cluster credentials” choice is an API-based authentication that the connector will use for authentication with the Kafka broker. You can generate a new API key and secret by using the hyperlink. Recall that the connection host is obtained from the MongoDB connection string; details on how to find this are described at the beginning of this section.

The “Copy existing data” choice tells the connector, upon initial startup, to copy all the existing data in the source collection into the desired topic. Any changes to the data that occur during the copy process are applied once the copy is completed.

By default, messages from the MongoDB source are sent to the Kafka topic as strings. The connector supports outputting messages in formats such as JSON and AVRO. Recall that the MongoDB source connector reads change stream data as events, and change stream event metadata is wrapped in the message sent to the Kafka topic. If you want just the message contents, you can set the “Publish full document only” output option to true.
Note: For source connectors, the number of tasks will always be “1”; otherwise you run the risk of duplicate data being written to the topic, because multiple workers would effectively be reading from the same change stream. To scale the source, you could create multiple source connectors and define a pipeline that looks at only a portion of the collection. Currently, this capability for defining a pipeline is not yet available in Confluent Cloud.

Step 4: Generate Test Data

At this point, you can run your Python data generator application and start inserting data into the Stocks.StockData collection at your source. This will cause the connector to automatically create the topic “demo.Stocks.StockData.” To use the generator, git-clone the stockgenmongo folder in the above-referenced repository and launch the data generation as follows:

python stockgen.py -c "<MongoDB connection URL>"

where the MongoDB connection URL is the full connection string obtained from the Atlas source cluster. An example connection string is as follows:

mongodb+srv://kafkauser:kafkapassword123@democluster.lkyil.mongodb.net

Note: You might need to pip-install pymongo and dnspython first. If you do not wish to use this data generator, you will need to create the Kafka topic before configuring the MongoDB Atlas sink. You can do this by using the Add a Topic dialog in the Topics tab of the Confluent Cloud administration portal. (A minimal stand-in for the generator is sketched at the end of this post.)

Step 5: Configure the MongoDB Atlas Sink

Selecting “MongoDB Atlas Sink” from the list of Confluent Cloud connectors will present you with several configuration options. After you pick the topic to source data from Kafka, you will be presented with additional configuration options. Because you chose to write your data in the source by using JSON, you need to select “JSON” as the input message format. The Kafka API key is an API key and secret used for connector authentication with Confluent Cloud. Recall that you obtain the connection host from the MongoDB connection string; details on how to find this are described at the beginning of Step 3.

The “Connection details” section allows you to define behavior such as creating a new document for every topic message or updating an existing document based upon a value in the message. These behaviors are known as document ID and write model strategies. For more information, check out the MongoDB Connector for Apache Kafka sink documentation. If the order of the data in the sink collection is not important, you can spin up multiple tasks to gain an increase in write performance.

Step 6: Verify Your Data Arrived at the Sink

You can verify that the data has arrived at the sink via the Atlas web interface. Navigate to the collection data via the Collections button. Now that your data is in Atlas, you can leverage many of the Atlas platform capabilities, such as Atlas Search, Atlas Online Archive for easy data movement to low-cost storage, and MongoDB Charts for point-and-click data visualization. Here is a chart created in about one minute using the data generated from the sink cluster.

Summary

Apache Kafka and MongoDB help power many strategic business use cases, such as modernizing legacy monolithic systems, single views, batch processing, and event-driven architectures, to name a few. Today, Confluent Cloud and MongoDB Atlas provide fully managed solutions that enable you to focus on the business problem you are trying to solve versus spinning your wheels in infrastructure configuration and maintenance.
Register for our joint webinar to learn more!
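If you just want a few documents flowing without cloning the generator repository, the sketch below is a minimal stand-in, not the actual stockgen.py script: it uses PyMongo to write random stock ticks into the Stocks.StockData collection on the source cluster. The connection string and ticker symbols are placeholders.

import random
import time
from datetime import datetime, timezone

from pymongo import MongoClient

# Placeholder connection string for the source Atlas cluster; replace with your own.
client = MongoClient("mongodb+srv://kafkauser:password@democluster.example.mongodb.net")
collection = client["Stocks"]["StockData"]

symbols = ["MDB", "CFLT", "GOOG"]

# Insert a fictitious price tick every second; the source connector picks these
# up from the change stream and publishes them to demo.Stocks.StockData.
while True:
    doc = {
        "symbol": random.choice(symbols),
        "price": round(random.uniform(100, 500), 2),
        "tx_time": datetime.now(timezone.utc),
    }
    collection.insert_one(doc)
    time.sleep(1)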

May 6, 2021
Developer

Reducing Queue Times by Using Speculative Execution

When solving concurrency problems in software, the simplest solution is often to make the trickiest part of the problem serial. Here at MongoDB, this is exactly the approach we took to implement a commit queue, where engineers submit code changes to be tested and then merged into a repository. This worked well for many smaller repositories, but for large ones such as the MongoDB Server, testing submissions one at a time proved to be too slow, with engineers sometimes waiting hours for their code to finally make it into the repository. To solve this challenge, we introduced some speculative execution on top of our original approach, which reduced the wait time for a typical week by 62%.

Background

Many of the engineers at MongoDB submit their code changes to a commit queue, which runs a basic set of tests on these changes before merging them into the correct repository. The main difference between the commit queue running the tests and an engineer running the tests is that the commit queue tests with the latest changes to the code base, whereas an engineer has checked out the code base at some point in the past. To ensure that it has the latest changes, the commit queue tests only one set of changes at a time before either merging the changes if the tests pass, or rejecting them and notifying the author if the tests fail. This serial approach makes the system easy to understand, but it also presents an optimization opportunity to reduce the time spent waiting for tests to start.

Design Approach

Parallelization

The only part of this system we needed to keep serial was the part that merged changes into the repository, because this ensures that changes are merged in the same order in which they were submitted. By far the slowest part of the commit queue is actually running the tests, and this is the work we wanted to split among multiple machines. Let’s assume as an example that all submissions to the commit queue take 10 minutes to run. Let’s also assume that in one day there are 30 submissions to the commit queue at roughly the same time. With the previous requirement that the queue run serially, it would take 300 minutes to get through all the submissions. If we parallelize testing the submissions among 30 machines, it would take only 10 minutes of actual time from when the last change was submitted to get through all the submissions.

Speculative Execution

With a serial queue, each successful submission checks out the latest code in the repository, applies its changes, runs tests, then commits the code back to the repository before the next submission starts. If we do these steps in parallel, however, checking out the latest code in the repository will not include the changes from submissions that would have merged before the one being tested. Parallelizing our tests therefore requires some extra steps to ensure that submissions run tests with the code changes from prior submissions. In order to know which code changes should be applied to which tests, the commit queue must still maintain the concept of an order for each submission. That way, the third entry in the queue knows that it must apply the changes from the first and second entries, in addition to its own code changes. If any test for a submission fails, it’s rejected from the queue and any submissions after it are rerun without the code changes from the one that failed.
If all tests for a submission finish running, the submission waits to be merged until the one immediately in front of it is merged.

Performance Considerations

Testing with merged code changes like this requires that most of the tests pass; otherwise the system will do a lot more work than it would have done if it tested submissions one at a time, and we lose all the benefits of parallelism. In the worst-case scenario where nothing passes, the nth submission in the queue would need to be restarted each time something in front of it fails, leading to 1 + 2 + ... + n = n(n+1)/2 total test runs across all submissions. This means that if engineers add 10 submissions to the commit queue, the new parallel approach runs tests as many as 55 times, whereas with the old serial approach the tests would always run 10 times.

Maybe this worst-case scenario isn’t a big deal if the majority of submissions pass (and at MongoDB, 85% of them do). However, we’d like to guarantee that an unusually bad day doesn’t make the machines running the tests do an excessive amount of unneeded work. To make this guarantee possible, we inserted a checkpoint into the queue, so that only the batch of submissions in front of the checkpoint is running tests. In the example of 10 submissions to the queue, placing the checkpoint after submission No. 3 would mean that the first three submissions start running tests while submissions No. 4 and later wait until the first three finish. It’s totally possible that everything still fails, but adding this checkpoint prevents us from doing too much extra work. With the checkpoint, a queue of length n would run at most

(n \ f) * f(f + 1)/2 + (n % f)((n % f) + 1)/2

total submission test runs, where f is the position of the checkpoint, \ is the integer division operator, and % is the modulus operator. If engineers add 10 submissions to the queue and the checkpoint is after submission No. 3, this hybrid approach would run tests as many as 19 times, compared with 55 for a fully parallel and 10 for a fully serial approach. The following infographic helps visualize this example. Colors represent the current status of the submission: green means successful, red means failed, yellow means in progress, and gray means not yet started.

Results

The graph below depicts the average length of time a submission would wait before it started running tests for a representative week when processing submissions serially. Contrast these times with the graph below it, which shows a representative week with the hybrid approach. For the depicted weeks, the overall average wait time dropped from 1,238 seconds with the serial approach to 469 seconds with the hybrid approach — a reduction of 62%.

Conclusion

With this hybrid approach of parallelizing the longest-running parts of the system but keeping key parts serial, we were able to reap the benefits of each approach. We saw drastic reductions in wait times while still maintaining the concept of an ordering for our commit queue. What led us to this approach were the requirements that the result should be noticeably faster for typical sizes of the problem (a queue with one to nine submissions), but could not be drastically slower in the worst-case scenario. These two guiding principles will often yield designs that work well in real-world scenarios, even though they may not handle all edge cases gracefully.
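As a quick sanity check on those worst-case figures, here is a small Python sketch (not MongoDB's actual scheduler, just the arithmetic) that counts worst-case test runs for the serial, fully parallel, and checkpointed approaches.

def serial_runs(n):
    # Serial queue: every submission runs its tests exactly once.
    return n

def parallel_runs(n):
    # Fully parallel worst case: the i-th submission reruns once for each
    # failure in front of it, giving 1 + 2 + ... + n runs in total.
    return n * (n + 1) // 2

def checkpoint_runs(n, f):
    # Hybrid worst case: full batches of size f, plus one partial batch.
    full_batches = n // f
    remainder = n % f
    return full_batches * parallel_runs(f) + parallel_runs(remainder)

print(serial_runs(10))          # 10
print(parallel_runs(10))        # 55
print(checkpoint_runs(10, 3))   # 19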

April 27, 2021
Developer

Accelerate Data Modernization with Infosys Data Model Converter

Are you in the process of migrating applications from a relational database to MongoDB? If so, you’re likely trying to understand and decide how your enterprise data should be modeled. Our previous blog discussed how Infosys Data Services Suite can help enterprises move data seamlessly from legacy relational databases to MongoDB. But moving data is only one part of the puzzle. The more significant step is choosing the target data model, or schema design, a process that usually requires several hours of highly skilled talent. That’s why we created this follow-up blog to help you get started.

Rethinking Schema Design

Ultimately, schema design can be the difference between an inefficient, disorganized database and a strategic one that empowers the entire company. Schema design in MongoDB requires a change in perspective for data architects, developers, and database administrators. They have to:

- Rethink the legacy relational data model, which flattens data into rigid two-dimensional tabular structures of rows and columns. The new data model is a rich and dynamic one with embedded sub-documents and arrays.
- Rethink how the data platform works. In relational databases, it is extremely difficult to change the data platform as the application evolves. In MongoDB, however, the apps and APIs come first and the data platform dynamically accommodates the data.

Getting Schema Design Right

Begin the schema design process by considering the application’s requirements. You’ll want to model the data in a way that leverages the flexibility of the document model. In schema migrations, it may seem easy at first to simply mirror the flat schema of the relational database in the document model. However, this negates the advantages enabled by the rich and embedded data structures of the document model. For example, data that belongs to a parent-child relationship in two RDBMS tables can be collapsed (embedded) into a single document in MongoDB (a short sketch of this pattern appears later in this post).

The application’s data access patterns should also drive schema design, with a specific focus on:

- The read/write ratio of database operations and whether it is more important to optimize the performance of one operation over another
- The types of queries and updates performed by the database
- The lifecycle of the data and growth rate of documents

Simplifying Schema Design with Infosys Data Model Converter

Infosys has developed a solution called Infosys Data Model Converter that takes the source relational schema and the above-mentioned signals as inputs and automatically provides target MongoDB schema suggestions. Infosys Data Model Converter is available as part of Infosys Modernization Suite, which accelerates enterprises’ modernization journeys. Each schema suggestion is accompanied by a detailed analysis report. The data modeler can use this as a starting point and iterate over the schema to arrive at the final MongoDB schema. The Infosys Data Model Converter reduces 50-60% of the effort typically spent on schema design.
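To illustrate the parent-child example mentioned above, here is a small PyMongo sketch. It is a hand-written illustration with hypothetical customer and order fields, not output from Infosys Data Model Converter: two relational tables (customers and orders) collapse into a single customer document with an embedded orders array.

from pymongo import MongoClient

# Hypothetical local connection; replace with your own cluster.
client = MongoClient("mongodb://localhost:27017")
customers = client["modernized"]["customers"]

# In an RDBMS this would be a customers row plus several orders rows
# joined by customer_id; in MongoDB the child rows become an embedded array.
customers.insert_one({
    "customer_id": 123,
    "name": "Ada Lovelace",
    "orders": [
        {"order_id": 1001, "total": 59.90, "status": "shipped"},
        {"order_id": 1002, "total": 17.25, "status": "processing"},
    ],
})

# A single query now returns the parent and its children together.
print(customers.find_one({"customer_id": 123}))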
Key Features

- Boosts productivity by augmenting the migration of RDBMS to NoSQL databases
- Saves time by automatically extracting schema, query, and data patterns from an existing RDBMS
- Comprehensively analyzes the RDBMS entity relations, data, and read-and-write patterns
- Applies a rich set of rules and generates a fully compliant NoSQL target state data model
- Offers flexibility by externalizing the rules for organization-specific customizations
- Connects and deploys the model to the target NoSQL platform with sample data

Discover more ways in which Infosys can help you unlock value from modernization. Contact us for any modernization questions.

April 15, 2021
Developer

MongoDB Connector for Apache Kafka 1.5 Available Now

Today, MongoDB has released version 1.5 of the MongoDB Connector for Apache Kafka! This article highlights some of the key features of the new release, which also continues to improve the overall quality and stability of the connector.

DeleteOne Write Model Strategy

When messages arrive on Kafka topics, the MongoDB sink connector reads them and by default will upsert them into the MongoDB cluster specified in the sink configuration. However, what if you don’t always want to upsert them? This is where write model strategies come in and give you the flexibility to define what to do with each document. While the concept of write strategies is not new to the connector, this release adds a new write strategy called DeleteOneBusinessKeyStrategy. It is useful when a topic contains records identifying data that should be removed from a collection in the MongoDB sink.

Consider the following: You run an online store selling fashionable face masks. As part of your architecture, the website sends orders to a Kafka topic, “web-orders,” which upon message arrival kicks off a series of actions such as sending an email confirmation and inserting the order details into an “Orders” collection in a MongoDB cluster. A sample Orders document:

{
  _id: ObjectId("6053684f2fe69a6ad3fed028"),
  'customer-id': 123,
  'order-id': 100,
  order: { lineitem: 1, SKU: 'FACE1', quantity: 1 }
}

This process works great. However, when a customer cancels an order, we need another business process to update our inventory, send a cancellation email, and remove the order from our MongoDB sink. In this scenario a cancellation message is sent to another Kafka topic, “canceled-orders.” For messages in this topic, we don’t just want to upsert them into a collection; we want to read the message from the topic and use a field within the document to identify the documents to delete in the sink. For this example, let’s use the order-id key field and define a sink connector using the DeleteOneBusinessKeyStrategy as follows:

"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"topics": "FaceMaskWeb.OrderCancel",
"connection.uri": "mongodb://mdb1",
"database": "FaceMaskWeb",
"collection": "Orders",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.DeleteOneBusinessKeyStrategy",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"document.id.strategy.partial.value.projection.type": "AllowList",
"document.id.strategy.partial.value.projection.list": "order-id",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": false,
"document.id.strategy.overwrite.existing": true

Now when messages arrive in the “FaceMaskWeb.OrderCancel” topic, the “order-id” field is used to delete documents in the Orders collection. For example, using the sample document above, if we put this value into the OrderCancel topic:

{ "order-id": 100 }

it would cause the document in the Orders collection with an order-id value of 100 to be deleted. (A sketch of producing such a message appears at the end of this post.)

For a complete list of write model strategies, check out the MongoDB Kafka Connector sink documentation.

Qlik Replicate

Qlik Replicate is recognized as an industry leader in data replication and ingestion. With this new release of the connector, you can now replicate and stream heterogeneous data from data sources like Oracle, MySQL, PostgreSQL, and others to MongoDB via Kafka and the Qlik Replicate CDC handler.
To configure the MongoDB Connector for Apache Kafka to consume Qlik Replicate CDC events, use “com.mongodb.kafka.connect.sink.cdc.qlik.rdbms.RdbmsHandler” as the value for the change data capture handler configuration parameter. The handler supports insert, refresh, read, update, and delete events.

Errant Record Reporting

Kafka Connect, the service which manages connectors that integrate with a Kafka deployment, has the ability to write records to a dead letter queue (DLQ) topic if those records could not be serialized or deserialized. Starting with Apache Kafka version 2.6, support was added for error reporting within sink connectors. This gives sink connectors the ability to send individual records to the DLQ if the connector deems the records to be invalid or problematic. For example, if you are projecting fields in the sink that do not exist in the Kafka message, or if your sink is expecting a JSON document and the message arrives in a different format, an error is written to the DLQ instead of failing the connector.

Various Improvements

As with every release of the connector, we are constantly improving the quality and functionality, and this release is no different. You’ll also see pipeline errors now showing up in the Kafka Connect logs, and the sink connector can now be configured to write to the dead letter queue!

Next Steps

- Download the latest MongoDB Connector for Apache Kafka 1.5 from the Confluent Hub!
- Read the MongoDB Connector for Apache Kafka documentation.
- Questions or need help with the connector? Ask the Community.
- Have a feature request? Provide feedback or file a JIRA.
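To round out the DeleteOneBusinessKeyStrategy example above, here is a minimal Python sketch that publishes a cancellation message to the configured topic. It assumes the kafka-python package and a hypothetical local broker address; the sink connector configured earlier then deletes the matching Orders document.

import json

from kafka import KafkaProducer

# Hypothetical broker address; replace with your Kafka bootstrap servers.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

# The sink connector projects "order-id" as the business key, so this message
# causes the Orders document with order-id 100 to be deleted.
producer.send("FaceMaskWeb.OrderCancel", {"order-id": 100})
producer.flush()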

April 7, 2021
Developer

Dive Deeper into Chart Data with New Drill-Down Capability

With the latest release of MongoDB Charts, you’re now able to dive deeper into the data that’s aggregated in your visualizations. At a high level, we generally create charts, graphs, and visualizations of our data to answer questions about our business or products. Oftentimes, we need to “double click” on those visualizations to get insight into each individual data point that makes up the line, bar, column, etc.

How the drill-down functionality works:

- Step 1: Right-click on the data point you are interested in drilling down into
- Step 2: Click “show data for this item”
- Step 3: View the data in tabular or document format

Each view can be better for different circumstances. For data without too many fields or no nested arrays, it might be quicker and more easily viewed in a table. On the other hand, the JSON view allows you to explore the structure of documents and click into arrays.

Scenarios where more detailed information can help:

Data visualization use cases are relatively broad, but oftentimes they fall into three main categories: monitoring data, finding insights, and embedding analytics into applications. I’ll be focusing on the first two of these three, as there are many different ways you could potentially build drill-down into data via embedded charts. (Read more about our click events and embedded analytics.)

For data or performance monitoring purposes, we’re not speaking so much about the performance of your actual database and its underlying infrastructure, but the performance of the application or system built on top of the database. Imagine I have an application or website that takes reviews. If I build a chart like the one below, where I want to easily see when an interaction hits a threshold I want to dive deeper into, I now have the ability to quickly see the document that created that data point. This chart shows app ratings given after a user session in an app. For this example, we want to dive into any rating that was below a 3 (out of 5). This scatter plot shows I have two such ratings that cross that threshold. With the drill-down capability, I can easily see all the details captured in that user session.

For finding new insights, let’s imagine I’m tracking how many transactions happen on my ecommerce site over time. In the column chart below, you can see I have purchases by month for the last year and a half (note, there’s a gap because this example is for a seasonal business!). Just by glancing at the chart, I can quickly see purchases have increased over time, and my in-app purchases have increased my overall sales. However, I want to see more about the documents that were aggregated to create those columns, so I can quickly see details about the transaction amount and location without needing to create another chart or dashboard filter.

In both examples, I was able to answer a deeper-level question that the original chart couldn’t answer on its own. We hope this new feature helps you and your stakeholders get more out of MongoDB Charts, regardless of whether you’re new to it or have been visualizing your Atlas data with it for months, if not years! If you haven’t tried Charts yet, you can get started for free by signing up for MongoDB Atlas and deploying a free tier cluster.
