MongoDB Refresher – Schema Design and Performance

In continuation to MongoDB refresher, listing down the next major items to look into once you understand the operational aspects MongoDB.
These are Schema Design and Performance.

Schema Design

With any database management system, one of the crucial aspect is the architecture of the database. MongoDB is pitched as schema-less, but that doesn’t mean that there is no need to think through how data should be placed within the collections.

Following pointers can assist you in building your next DB design:

1.  The schema design of MongoDB is based on the access pattern of the application. Unlike relational DB schema, we don’t build an agnostic schema, which is normalized so that any type of access pattern is usually supported(maybe after 20 joins, inner queries, co-related queries!). Instead, with MongoDB, we do it based on the what type of queries or data would be related to each other according to our application’s needs. With MongoDB you think of what queries your application needs answer for, and build a schema around those queries.

2. MongoDB doesn’t have joins to you have to embed such data into the document itself. Only in special cases where the data size of a document may go over 16mb limit per document imposed by underlying engine, that you have to make separate collection for such data.
What is embedding? you might ask. From the relational DBMS world you can say, it is when you put a related entity(document) inside another. Here is an example(image from MongoDB website):

Data model with embedded fields that contain all related information.

 

3. Operation on a single document is always atomic. So in a way you can do without need of having transactions, as the data is already embedded in a single document instead of being in multiple tables like a relational DBMS.

4. One to One relations :

When to embed?
–  You need the operation to be atomic

When not to embed?
– The size of document may become greater than 16mb by embedding.
– Embedded document isn’t updated/accessed or used frequently, As this will lead to fetching unnecessary data into memory.

5. Mapping one to many :
Take example of a city having many people.
In such a case  usually go for separate collections and put id of the smaller collection in larger one. So in this case put the city:_id in the people document.

 

6. Many to Many:
A Student/Teacher or Book/Author would be examples of such relationships.
If the collections are going to be smaller then you can choose to embed one into the other i.e both documents have embedded documents of related data.
For example: Books-Authors, would be a smaller data-set so could be embedded into both the documents.
If your documents are large in number(and/or size) another way would be to keep an array of _ids in both the collections.

Advantage of embedding data is low disk latency(faster reads) and atomicity.
Disadvantage would be if the embedded data grows, you might hit the 16mb limit and might need to de-fragment data. Duplicate and redundant data is another concern depending on use case. 

As a side note, you can use GridFS if you want to store blob data which is larger than 16mb. It will allow upto 128mb.

 

Performance

1. There are two storage engines for MongoDB MMAP(Deprecated as of 4.0) and WiredTiger.

2. MMAP
a. Based on memory management provided by OS
b. Has collection level locking.

3. WiredTiger Engine
a. Has document level locking.
b. An update is a new entry instead of an actual update.
c. It uses compression so takes up lesser space.

4. Indexes.
a. Allow for faster retrieval of data.
b. But slows down the writes as the indexes have to be updated.
c. Make your indexes carefully, think through your query requirements.

5. db.collectionName.createIndex({a:1}); will create index on field a. To create compound index pass an object with multiple keys like {a:1,b:1}

6. Use db.collectionName.getIndexes() to get the indexes on a collection.

7. Multi-Key indexes are formed when one of the fields in the document is an array .

8. You can form indexes with dot notation in the same way that you retrieve a column’s value.
db.collectionName.createIndex({"a.b":1});

9. Sparse indexes: when you have a field missing in some of the documents and you want to create an index on that field, then you create
an index on that field using a sparse index.
db.collectionName.createIndex({"a.b":1}, {unique:true, sparse:true});

10. Index creation works in the foreground and the background. The foreground is fast but blocking background is slow but non-blocking. You can specify it as
db.collectionName.createIndex({"a.b":1}, {unique:true, sparse:true, background:true});

11. Explain: syntax db.collectionName.explain().find({"something":1});
Older MongoDB version supported `db.collectionName.find({“something”:1}).explain()`

12 You can change the verbosity of explain command by passing in “executionStats” and “allPlansExecuation” as parameters.

13. Covered Queries : queries which can be satisfied just from the indexes. For such the query to work the “conditions” and “selections” should fall within the indexes.

14. Choosing index : MongoDB server compares the best index for the query using some kind of threshold query, then puts it into a cache for future queries. this caches lives for 1000 writes to the collection or till the sever is restarted or a index is updated.

15. How big should index be? an index should fit in the memory(RAM) for queries to be faster.

16. Geospatial indexes : MongoDB supports Geospatial data, where the 2d points can be stored as an array of x and y points inside of a collection. You can put a special index called “2d” on the field, this will allow you to do queries like db.collectionName.find({location:{$near:[x,y]}});

17. Geospatial sphere indexes : another type of index that you can use for earth location data. is “2dsphere”
E.g of how you might build a collection that has a that “earth location” and a query
{ "_id" : { "$oid" : "535471aaf28b4d8ee1e1c86f" }, "store_id" : 8, "loc" : { "type" : "Point", "coordinates" : [ -37.47891236119904, 4.488667018711567 ] } }

Here “type” and “coordinates” are reserved names
Now to query something like that you will do.

db.stores.find({
    loc: {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [-130, 39]
            },
            $maxDistance: 1000000
        }
    }
})

18. Text Indexes: These allow you to use full text search on a field. You can also get relevancy using the appropriate query. Mentioned below is a query for text index and search query with relevancy.
db.reviews.createIndex( { comments: "text" } )

db.reviews.find( 
{ $text: { $search: "very good" } }, 
{ score: { $meta: "textScore" } } 
).sort( { score: { $meta: "textScore" } } );

19. Selectivity of an index: Take an example of an index created by student_id and class_id and another index on only class_id.
Assuming students can be in large numbers.
If you do a query like – where student_id greater than 50 and class_id = 1, the query execution will take greate amount of time if it uses the student_id and class_id index. As number of classes would be relatively less you should use the class_id index. You can use the .hint() function at end of a query to specify which index to use.

db.collectionName.find(
   { name: "John Doe", age: { $gt: "10" } }
).hint( { age: 1 } )

 

20. Structuring your indexes:

The better your indexes the better will be your queries. If you always sort your result-set by some field, then think of how you can include that field into the index, so that your queries become faster.
An example: you are sorting a student record by grade, in such a case we can include the grade field in the index itself.
The explain feature of MongoDB will help you a lot with understanding how queries run.

Keep this in mind while building an index.
Equality field: field on which queries will perform an equality test.
Sort field: field on which queries will specify a sort.
Range field: field on which queries perform a range test.

and this should be the order in your index:

Equality fields before range fields
Sort fields before range fields
Equality fields before sort fields

21. You can set profiling level to log slow queries, this will help you in narrowing down the queries that need improvement. You can set the profiler on and off, using setProfilingLevel, this will start logging query stats in system.profile collection which you can query to get various results.

22. Similar to the Unix top, MongoDB has a utility called mongotop which lets you see what collections are being utilized in the given time-frame.

23. Use mongostat to know details of the server.

Leave a Reply

Your email address will not be published. Required fields are marked *