Python to Cosmos

Another step in my Python journey. This time I want to explore what it takes to work with CosmosDB in Python. I document the progress as I go and then update the overview.

In the end, I learned these things:

  1. Python package management (pip) – consuming packages from the community
  2. Installing Python 3 on a Mac and getting it to work alongside the preinstalled Python 2.7
  3. The Azure Cosmos Python SDK
  4. More confidence in writing Python code

Things look simple and easy in the abstract. But the devil is in the detail, at the implementation level.

Python Package Management – PIP

Just like NuGet in the .NET ecosystem, pip allows developers to consume packages created by the community. According to the documentation, when Python 2.7 or 3.x is installed, pip is installed as well.

However, when I ran it on my Mac, it said pip is not a valid command. The solution is described on Stack Overflow. In short, run this command

sudo -H python -m ensurepip

After that, run this command to see what you can do with PIP

pip

Most of the time, the install command is used.

pip install {package_name}

Cosmos

Cosmos has a Python SDK, available via the azure-cosmos package. There is also a step-by-step instruction. If one wishes to see sample code, MS provides it as well

Installing the package did not go well. If this command does not work

sudo -H pip install azure-cosmos

Then try this one. It worked well for me

sudo -H pip install azure-cosmos --upgrade --ignore-installed six

Python Versions

Mac OS comes with Python 2.7.10 preinstalled, so pip is attached to that version. I want the latest version, which is 3.7.4 at the moment, and I definitely do not want to mess up the current one. Mac OS has many built-in functions that depend on Python – well, at least that is my guess.

brew install python

This installs the latest available Python version with the new command python3, along with pip3. Let's install azure-cosmos again for Python 3

sudo -H pip3 install azure-cosmos --upgrade --ignore-installed six

Get Hands Dirty

Most of the code is out there with a few searches. My objective is just to get the flow and see how code is organized in Python.
The first step is to create an Azure CosmosDB account. By creating a new resource group, I can clean everything up when it is no longer in use.

Import Cosmos Package

To consume a library, it needs to be imported into the file, similar to a using directive in C#

# Import cosmos client and alias
import azure.cosmos.cosmos_client as cosmos_client

# To ensure it is valid
print(cosmos_client)

Using an alias allows us to write readable code. Instead of writing the full path azure.cosmos.cosmos_client, we just write cosmos_client or whatever alias we please. Executing the above code displays the cosmos_client module

<module 'azure.cosmos.cosmos_client' from '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/azure/cosmos/cosmos_client.py'>

So far so good!

Config to Cosmos

To connect to and work with Cosmos, we need a few pieces of information. I usually prefer to wrap them in a single config object. In Python, a dictionary is perfect for this.

# Config for cosmos database. This is a dictionary in python
'''
Working with CosmosDB requires
    ENDPOINT: The URI identifying the Cosmos server.
    MASTER_KEY: Authorization key, just like username/password in SQL.
    DATABASE: The name of the database.
    COLLECTION: Or CONTAINER name.
Operations are per collection in Cosmos.
'''
config = {
    'ENDPOINT':'',
    'MASTER_KEY':'',
    'DATABASE':'',
    'COLLECTION':''
}

ENDPOINT and MASTER_KEY come from the Azure CosmosDB account; you should be able to find them under the Keys section.
DATABASE and COLLECTION come from users – well, from me actually. I am playing around with Cosmos using Python, so I want to create them dynamically. Let's write a few lines to ask for them

# Prompt users for database name and collection
config['DATABASE'] = input("Database: ")
config['COLLECTION'] = input("Collection: ")

Very straightforward! Ask users for input with a prompt. OK! Ready to create my very first database

# Import the library and create an alias. Looks similar to a namespace in C#
import azure.cosmos.cosmos_client as cosmos_client

print(cosmos_client)
# Config for cosmos database. This is a dictionary in python
'''
Working with CosmosDB requires
    ENDPOINT: The URI identifying the Cosmos server.
    MASTER_KEY: Authorization key, just like username/password in SQL.
    DATABASE: The name of the database.
    COLLECTION: Or CONTAINER name.
Operations are per collection in Cosmos.
'''
config = {
    'ENDPOINT':'Fill in Azure CosmosDB account endpoint',
    'MASTER_KEY':'Fill in the primary key',
    'DATABASE':'',
    'COLLECTION':''
}
# Prompt users for database name and collection
config['DATABASE'] = input("Database: ")
config['COLLECTION'] = input("Collection: ")

print(config)

# 1. Initialize the client that can talk to the server
client = cosmos_client.CosmosClient(
    url_connection=config['ENDPOINT'], 
    auth={
        'masterKey':config['MASTER_KEY']
        })

# 2. Create a database
db = client.CreateDatabase({'id':config['DATABASE']})

print(db)

The code does two things:

  1. Initialize a client instance
  2. Ask it to create a database

So how does it look?

<module 'azure.cosmos.cosmos_client' from '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/azure/cosmos/cosmos_client.py'>
Database: Python-102
Collection: C102
{
    'ENDPOINT': 'Secret value', 
    'MASTER_KEY': 'Secret value', 
    'DATABASE': 'Python-102', 
    'COLLECTION': 'C102'
}
{
    'id': 'Python-102', 
    '_rid': 'CzhnAA==', 
    '_self': 'dbs/CzhnAA==/', 
    '_etag': '"00000900-0000-1800-0000-5d3be5650000"', 
    '_colls': 'colls/', 
    '_users': 'users/', 
    '_ts': 1564206437
}

The database Python-102 is created. The returned value is a dictionary containing essential information about the database. One important piece is the _self key, which is used to identify the database when creating a collection. Here it is, creating a collection

# 3. Create a collection/container
collection = client.CreateContainer(
        db['_self'],
        {'id':config['COLLECTION']})

print(collection)

The cosmos_client exposes many APIs for working with CosmosDB. It is easy to create a document, query documents, and do other advanced stuff if I wish to.

Say Something

The task sounds easy in my head. The devil appears when I get my hands dirty in code. Though it is not a tough challenge, by writing code and this blog post at the same time, I gained a triple outcome.

I have started to love Python. It is a pleasure to write code in. I have been using Visual Studio and Windows for my whole career and only started to play around with a Mac recently. Python helps me get comfortable with the Mac. This post is about Python and Cosmos, but I had to write down my feelings after all.

What’s next? Write some utilities for myself. The goal is still to explore more about Python.

Welcome to Python

Having worked with C# for a decade, it is time to learn a new language. The options popping into my head – they must of course be different from C#, so Java is excluded immediately – are Python, Ruby, Clojure, … the scripting kind of languages. I know some JavaScript, so I definitely do not want to dig deeper into that.

I remember from the Stack Overflow 2018 survey that Python holds a good position. I have also heard somewhere that Python is awesome. So here I come, Python.

If one wishes to know what Python is, ask Google. It gives you all the awesome resources ranging from introduction to intermediate to advanced. The community is strong as well.

This post is my journey, my thoughts as I started to learn Python. It also serves as my documentation. If someone asks me how to get started with Python, I can give them the link. Well, at least that is what I thought.

So let’s get started!

Environment

Each developer has their own favorite environment. I come from the .NET world, so Visual Studio Code is my tool. Of course, I could use Visual Studio, but Visual Studio Code seems the better choice.

  1. Visual Studio Code + Python Extension
  2. Install Python on Windows. Download and follow the instructions at python.org

Installing both Visual Studio Code and Python is fast. Everything is ready in a matter of minutes.

Quick Overview

Some notes about Python:

  1. The Python file extension is .py, e.g. hello-python.py.
  2. You can take a single Python file and run it with the python command line.
  3. The interactive shell is powerful. From the console/terminal/PowerShell, type the command python and the Python shell is ready.
  4. Dynamic typing. The type is known at execution time.
  5. You can write a "Hello World" program in a single line of code, print("Hello World"), directly from the shell.

I can write functional code or object oriented code.
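As a tiny illustration – a minimal sketch assuming nothing beyond the standard library – here is the same computation in both styles:

```python
# Functional style: build a list of squares with map (or a comprehension)
numbers = [1, 2, 3, 4]
squares = list(map(lambda n: n * n, numbers))
print(squares)  # [1, 4, 9, 16]

# Object-oriented style: wrap the same behavior in a small class
class Squarer:
    def __init__(self, numbers):
        self.numbers = numbers

    def squares(self):
        return [n * n for n in self.numbers]

print(Squarer(numbers).squares())  # [1, 4, 9, 16]
```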

Everything in Python is an object. There are 2 functions that I find very useful – type() and id().

name = "Thai Anh Duc"

duplicatedname = "Thai Anh Duc"

# Want to know the type?
type(name)

# Want to know where it is stored in memory?
id(name)

# Are they equal? Are they the same?
name == duplicatedname

So name equals duplicatedname. But they are not necessarily the same object. By using the id() function, we can see whether they are stored in different locations.
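The equal-but-not-the-same idea is easiest to see with lists, since a list literal always creates a new object (small strings may be interned by Python, which can muddy the picture):

```python
a = [1, 2, 3]
b = [1, 2, 3]

# Equal in value...
print(a == b)  # True
# ...but two distinct objects in memory
print(a is b)  # False
print(id(a) == id(b))  # False

# Assigning does not copy; both names point at the same object
c = a
print(c is a)  # True
```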

Materials

I learned from:

  1. Pluralsight Courses: It is always my first and default option when learning technologies.
  2. Tech Beamer: Very detailed, with examples and explanations.

Conventions

Python uses indentation for code blocks instead of the curly braces ({}) used in C#. There is a coding standard, PEP 8. At the beginning, I do not worry too much about it; I just focus on some simple building blocks. Fortunately, Visual Studio Code checks my code and suggests corrections.

This block of code is enough for me to remember what is important

if(1 > 0):
    print("Of course, it is.")
for index in [1, 2, 3, 4]:
    print("hello {0}. This is a loop, display by string format".format(index))
    print("Same code block")

def i_am_a_function(parameters):
    # Comment what the function does
    print(parameters)

Data Types

Data types are crucial to code, even in a dynamically typed language. As a developer, you have to know what kind of data you are dealing with: are we talking about a string, an integer, a big decimal, a date, …?

Refer to Tech Beamer for the details of each type. A special note on the number types. There are 3 of them:

  1. int – for integer values. There is no limit beyond what the computer can hold, unlike C#, where each integer type has a limit (Int16, Int32, Int64).
  2. float – for floating-point values, with a precision of roughly 15 decimal digits.
  3. complex – for complex numbers with a real and an imaginary part. For example, complex(2, 1) represents 2 + 1j.
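A minimal sketch of the three numeric types (the printed values are from CPython 3.x):

```python
# int: arbitrary precision, limited only by available memory
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# float: double precision, roughly 15 significant digits
print(0.1 + 0.2)  # 0.30000000000000004, not exactly 0.3

# 7 / 3 is plain float division in Python 3
print(7 / 3)  # 2.3333333333333335

# complex: a real and an imaginary part
z = complex(2, 1)  # same as 2 + 1j
print(z)  # (2+1j)
print(isinstance(z, complex))  # True
```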

Because typing is dynamic, it is sometimes a good idea to check the type first. Python has the isinstance function

name = "Thai Anh Duc"
# It is a string
isinstance(name, str)

age = 36
# It is an int
isinstance(age, int)

bank_amount = 1.1
# It is a float
isinstance(bank_amount, float)

Function and Class

One might not need a class; a function can be enough.

def append_lower(name):
    # Lower case the name and then append " python" to the end
    return name.lower() + " python"

class Python101:
    # User-defined class names should be CapWords (PEP 8)
    static_constant_field = "Static field, access at the class level"
    def __init__(self, comment):
        # Constructor
        # :param self: like this in C#
        # :param comment: constructor parameter, which tells Python that a parameter is required
        self.comment = comment # Declare an instance field and assign a value

    def say_hi(self):
        print(self.comment)

print(Python101.static_constant_field)

p101 = Python101("You are awesome")
p101.say_hi()

The code speaks for itself.

Summary

It is quite easy for a C# developer to get started with Python. With the above building blocks, I am ready to write some toy programs to play around with Python. There are many things I want to see how Python does. The initial list is:

  1. IO
  2. Network
  3. Threading

The list goes on as I learn. I am ready to go next.

Observation – Watch Out Boundaries

When I first started my software development career, I thought writing software was hard. And it is true. However, my definition of writing software at that time was different. What I really meant was that my code met the functional requirements (and even that was not always true) and ran. That was before I had seen code in production. So everything worked fine on my machine.

Over time, I have had chances to bring my code into production and see it running. No surprise: it did not always work well there. Every developer knows the famous "It works on my machine". And the code might work well in test. But it always has problems in production.

Why? There is no single answer to that problem. Software is developed by developers with different levels of skill, experience, and intelligence. Even a team with the most talented developers in the world still ships products with bugs. So I am not trying to find a solution for the problem. Instead, I embrace and observe the fact. There is no silver bullet, but there are tips and tricks to prevent as much as possible.

From my own experience, from Pluralsight courses, from YouTube – from any source I have touched – I want to document what I have observed. I do not intend to go into the detail of each item. Rather, I want a list with some explanations and references. The details vary project by project.

SQL

If you build applications with SQL Server as the data storage, watch out for 2 common unbounded patterns

  1. Select N+1
  2. Select * without top n

Ask Google about "Select N+1" and you will know what it is immediately. There are detailed explanations with code examples. Developers who have worked with an ORM usually know it very well.
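To illustrate, here is a minimal sketch of the pattern in Python with an in-memory SQLite database; the tables and data are made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, dept_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO department VALUES (?, ?)", [(1, "IT"), (2, "HR")])
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, 1, "Anna"), (2, 1, "Bob"), (3, 2, "Carol")])

# Select N+1: one query for the departments, then one more query per department
departments = conn.execute("SELECT id, name FROM department").fetchall()
for dept_id, _name in departments:
    conn.execute("SELECT name FROM employee WHERE dept_id = ?", (dept_id,)).fetchall()
# Total queries issued: 1 + N (here 1 + 2)

# The fix: a single JOIN fetches the same data in one round trip
rows = conn.execute(
    "SELECT d.name, e.name FROM department d JOIN employee e ON e.dept_id = d.id"
).fetchall()
print(len(rows))  # 3
```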

The second one is a bit trickier. It happens when your applications issue a query to SQL in this pattern

SELECT * FROM dbo.Employee WHERE [Predicate]

In the test environment, there is no problem. In production, however, the data is huge and that query might return millions of records.

These days many applications do not talk to the database directly. Instead there is an ORM/LINQ to SQL layer in between. And this piece of code is not uncommon

var employeesByName = ctx.Employees.Where(x => x.Name.StartsWith("Smith")).ToList();

Some ORMs might put a limit on the generated queries. What developers need to do is review all the generated queries.

The rule of thumb: always control the number of returned records. Put a max on everything.
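A minimal sketch of the rule, again with an in-memory SQLite database (SQLite uses LIMIT where SQL Server uses TOP; the table and the cap are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(i, f"Employee {i}") for i in range(10_000)])

MAX_ROWS = 100

# Unbounded: might return millions of records in production
all_rows = conn.execute("SELECT * FROM employee").fetchall()

# Bounded: the application controls the worst case
capped = conn.execute("SELECT * FROM employee LIMIT ?", (MAX_ROWS,)).fetchall()
print(len(all_rows), len(capped))  # 10000 100
```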

Connections

There are many kinds of connections that applications make – to the database, to external services. When making such calls, there are some things to watch out for

  1. Timeout: Make sure a timeout value is set on everything. Modern frameworks usually have default values; just make sure one exists and you are aware of it.
  2. Close connections properly. Just imagine what happens if you leave your door open. Bad things happen.
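The second point is easy to demonstrate in Python, where a context manager guarantees the close; a minimal sketch with SQLite standing in for any connection (note that sqlite3's own with-block only manages transactions, which is why closing() is used):

```python
import sqlite3
from contextlib import closing

# closing() guarantees the connection is closed, even if the body raises
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("SELECT 1")

# The connection is closed here; using it again raises ProgrammingError
try:
    conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True
print(closed)  # True

# For network calls, pass an explicit timeout instead of trusting defaults,
# e.g. (host and port are placeholders, shown but not executed here):
# import socket
# sock = socket.create_connection(("example.com", 80), timeout=5)
```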

JSON – Serialization and Deserialization

JSON is cool. Developers work with JSON every day in one form or another. Many take it for granted, and we rarely pay attention to the size of the data. I once experienced such a problem here – hidden cost of an architecture.
So if you have to use JSON directly in your code, ask these questions

  1. Do I have to use it? Are there any other options?
  2. What size? Is the size under control?

Some might argue that RAM is cheap, so why should we care so much about size? Yes, RAM is cheap, but it has a limit. Once that limit is reached, your application will freeze or crash. And if your applications run in the cloud, there is a cost for everything they consume.
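One cheap guard is to measure the serialized payload before caching or sending it; a minimal Python sketch (the payload shape and the cap are arbitrary examples):

```python
import json

payload = {"items": [{"id": i, "name": f"Item {i}"} for i in range(1000)]}

serialized = json.dumps(payload)
size_bytes = len(serialized.encode("utf-8"))
print(size_bytes)

# Guard against unbounded payloads with an explicit cap
MAX_PAYLOAD_BYTES = 1_000_000
assert size_bytes < MAX_PAYLOAD_BYTES, "payload too large"

# Deserializing doubles the memory picture: the string and the
# resulting object both live in RAM at the same time
restored = json.loads(serialized)
print(len(restored["items"]))  # 1000
```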

Enumerable, List, ForEach

Do developers ever write code without loops? Have we ever wondered how many items are in a list? When talking about a list, we should be aware that all items are stored in memory. So the size really matters here, even when it seems trivial.
Another trap is the Enumerable. An Enumerable represents a sequence that is mostly expected to be iterated only once. By nature, we do not know the size of a sequence (there is no Count property on the IEnumerable interface). Therefore, when calling .ToList(), be aware of all the nasty things that can happen.
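Python has the same trap with generators, which makes the point easy to demonstrate:

```python
# A generator is a sequence that can be iterated only once
gen = (n * n for n in range(5))

first_pass = list(gen)
second_pass = list(gen)   # the generator is already exhausted
print(first_pass)   # [0, 1, 4, 9, 16]
print(second_pass)  # []

# list() materializes everything in memory at once: fine for 5 items,
# dangerous when the source is unbounded or huge
```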

I brought these up here for reference. They might not be the things that take down production, but they are nice to be aware of as well.

There might be more boundaries to watch out for. These are what I have come up with so far. What are yours?

C# 7 Tuple Better Test Assertion

Recently I read C# in Depth, 4th edition, where I met the new tuple design in C# 7. It is a really cool feature. Besides the syntactic sugar, it offers capabilities that developers can leverage.

At the time of reading it, I was tasked with writing unit tests at my job. That triggered my memory of semantic comparison with Likeness. The main idea of semantic comparison is to compare 2 objects by certain properties; it allows developers to define what equality means. Tuples support equality by default, so maybe I can use a tuple to accomplish the same thing as Likeness.

In this post, I will write a simple unit test without Likeness or tuples, then refactor it with Likeness, and finally use a tuple. Let's explore some code.

public class Product
{
    public Guid Id { get; set;}
    public string Name { get; set;}
    public double Price { get; set;}
    public string Description { get; set;}
}

[TestFixture]
public class ProductTests
{
    [Test]
    public void Test_Are_Products_Same()
    {
        var expectedProduct = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        };

        var reality = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        };

        // Assert that 2 products are the same. Id is ignored
    }
}

The task is simple. How are we going to assert that the 2 products are the same?

Old Fashion

Very simple. We simply assert property by property.

[TestFixture]
public class ProductTests
{
    [Test]
    public void Test_Are_Products_Same()
    {
        var expectedProduct = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        };

        var reality = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        };

        Assert.AreEqual(expectedProduct.Name, reality.Name);
        Assert.AreEqual(expectedProduct.Price, reality.Price);
        Assert.AreEqual(expectedProduct.Description, reality.Description);
    }
}

Some might think we should override the Equals method in the Product class. I do not think that is a good idea. The definitions of equality in production and in unit tests can be tremendously different. Be careful before overriding equality.

The Product class has 3 properties (excluding the Id property), so the code still looks readable. Now think about a situation where there are 10 properties.

Likeness – Semantic Comparison

There is a blog post explaining it in detail. In this demo, we can rewrite our simple test.

[TestFixture]
public class ProductTests
{
    [Test]
    public void Test_Are_Products_Same()
    {
        var expectedProduct = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        }.AsSource()
        .OfLikeness<Product>();

        var reality = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        };

        Assert.AreEqual(expectedProduct, reality);
    }
}

Likeness is a powerful tool in your testing toolbox. Check it out if you are interested.

Tuple – Customized

The idea is that we can produce a tuple containing the asserted properties and compare the tuples. This allows us to flatten the structure if we wish.

[TestFixture]
public class ProductTests
{
    [Test]
    public void Test_Are_Products_Same()
    {
        var expectedProduct = (Name: "C#", Price: 10d, Description: "For the purpose of demoing test");

        var reality = new Product
        {
            Name = "C#",
            Price = 10,
            Description = "For the purpose of demoing test"
        };

        Assert.AreEqual(expectedProduct, (reality.Name, reality.Price, reality.Description));
    }
}

It might not look much different from the Likeness approach. And I do not say which approach is better. It is just another way of doing things.

Summary

So which approach is better? None of them. Each has its own advantages and disadvantages. They are options in your toolbox; how they are used depends on you, the developer. I will definitely take advantage of the new tuples in both production and unit test code.

The Process of Making Elegant Unit Tests

Unit tests are part of the job developers do while building software. Some developers might not write unit tests, but, IMO, the majority do. If you are one of those, how do you treat your unit test code compared to the production code?

  1. Do you think about maintainability?
  2. Will you refactor the test to make it better? Note: I use the term refactoring from the Refactoring book by Martin Fowler.
  3. Have you used the advantages that the test framework offers?

To be honest, I had not thought about it much. In the beginning of my career, I wrote tests that followed the current structure of the projects. I did not question it much. Over time, I started to feel the pain, so I made changes – Unit Test from Pain to Joy – which served me well in that project.

Write Tests

Recently, I had a chance to work on another project. I was tasked with writing unit tests (and integration tests) to get used to the system, and with making tests runnable with multiple credentials, AKA login users.

Here is the test – not a real one, of course, but the idea is the same. I want to run the test with different credentials. The username and password must be passed as parameters. This is very useful when you look at a test report: by parameterizing, the report will show the values passed to the test.

[TestFixture]
internal class MultiCredentialsTest
{
    [Test]
    [TestCase("read_user", "P@ssword")]
    [TestCase("write_user", "P@ssword")]
    public void RunWithDifferentCredentials(string username, string password)
    {
        // The test body goes here.
    }
}

And there will be many of them. It worked as expected. But there are potential problems. Can you guess them?

  1. What happens when one of the test users changes either username or password?
  2. What if we want to add more test users to the test suites?

Think about a situation where there are hundreds or even thousands of them. It will be a pain. I needed a solution to centralize the test data. My process had started.

Make Them Better – Manageable

It was time to look at what NUnit offers. NUnit supports TestCaseSource. You should check it out first if you do not know it. In a nutshell, it allows developers to centralize test data in a manageable manner. That was exactly what I was looking for.

I created a TestCredentialsSource to produce the same test data. I would have preferred the name TestCredentialsFactory, but "source" seems to fit better in the unit test context.

internal class TestCredentialsSource
{
    public static IEnumerable<object[]> ReadWriteUsers = new List<object[]>
    {
        new object[]{"read_user", "P@ssword"},
        new object[]{"write_user", "P@ssword"}
    };
}

The test was rewritten as version V1. Here are the 2 versions for comparison.

[TestFixture]
internal class MultiCredentialsTest
{
    [Test]
    [TestCase("read_user", "P@ssword")]
    [TestCase("write_user", "P@ssword")]
    public void RunWithDifferentCredentials(string username, string password)
    {
        // The test body goes here.
    }

    [Test]
    [TestCaseSource(typeof(TestCredentialsSource), "ReadWriteUsers")]
    public void RunWithDifferentCredentials_V1(string username, string password)
    {
        // The test body goes here.
    }
}

In the new test, I did not have to deal with test values. The test data was encapsulated in the TestCredentialsSource with the ReadWriteUsers static field.

Make Them Even Better – Reuse and Duplication

It was good with a known, specific set of users. But certain tests wanted to run with one specific user. That was fairly easy with another field in TestCredentialsSource

internal class TestCredentialsSource
{
    public static IEnumerable<object[]> ReadWriteUsers = new List<object[]>
    {
        new object[]{"read_user", "P@ssword"},
        new object[]{"write_user", "P@ssword"}
    };

    public static IEnumerable<object[]> SpecificUser = new List<object[]>
    {
        new object[]{"special_user", "P@ss12345"}
    };
}

What if I wanted to test with only "read_user"? What if I wanted to combine "read_user" with "special_user" for another test? One option was to define those combinations in TestCredentialsSource, which was still fine because everything stayed manageable in a single file. But it was awkward.

Was there any better alternative?

Yes, there was. Let’s encapsulate the data in a class. Welcome to TestCredentials class.

internal class TestCredentials
{
    public string Username { get; }
    public string Password { get; }
    public TestCredentials(string username, string password)
    {
        Username = username;
        Password = password;
    }
    ///<summary>
    /// Convert the object into an array of properties object which can be used by the TestDataSource
    ///</summary>
    public object[] ToTestSource() => new object[] { Username, Password };

    public static TestCredentials ReadUser = new TestCredentials("read_user", "P@ssword");
    public static TestCredentials WriteUser = new TestCredentials("write_user", "P@ssword");
    public static TestCredentials SpecialUser = new TestCredentials("special_user", "P@ss12345");
}

The class supplied 3 predefined instances for constructing the needed credentials. This was the single place where the data was provided, without any duplication.
The TestCredentialsSource became much cleaner

internal class TestCredentialsSource
{
    public static IEnumerable<object[]> ReadWriteUsers = new List<object[]>
    {
        TestCredentials.ReadUser.ToTestSource(),
        TestCredentials.WriteUser.ToTestSource()
    };

    public static IEnumerable<object[]> SpecificUser = new List<object[]>
    {
        TestCredentials.SpecialUser.ToTestSource()
    };
}

Cool! The data was gone from the source definition. But there was still one thing I did not like much – the setup of "SpecificUser" in TestCredentialsSource. Having a source for a single value did not sound right to me.

There was a solution: convert TestCredentials itself into a source that NUnit can understand by implementing IEnumerable<TestCaseData>. TestCaseData is defined by the NUnit framework

internal class TestCredentials : IEnumerable<TestCaseData>
{
    public string Username { get; }
    public string Password { get; }
    public TestCredentials(string username, string password)
    {
        Username = username;
        Password = password;
    }
    ///<summary>
    /// Convert the object into an array of properties object which can be used by the TestDataSource
    ///</summary>
    public object[] ToTestSource() => new object[] { Username, Password };

    public static TestCredentials ReadUser = new TestCredentials("read_user", "P@ssword");
    public static TestCredentials WriteUser = new TestCredentials("write_user", "P@ssword");
    public static TestCredentials SpecialUser = new TestCredentials("special_user", "P@ss12345");

    public IEnumerator<TestCaseData> GetEnumerator()
    {
        return new List<TestCaseData>
            {
                new TestCaseData(Username, Password)
            }.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

With this in place, I could write the 2 tests below. There was no limit on the combinations of test credentials I could make.

[Test]
[TestCaseSource(typeof(TestCredentials), "SpecialUser")]
public void RunWithDifferentCredentials_SpecialUser(string username, string password)
{
    // The test body goes here.
}

[Test]
[TestCaseSource(typeof(TestCredentials), "SpecialUser")]
[TestCaseSource(typeof(TestCredentials), "WriteUser")]
public void RunWithDifferentCredentials_CombinedUsers(string username, string password)
{
    // The test body goes here.
}

The End

The good result did not come by accident. I could have stopped at any step in the process. By pushing a little further, by asking the right questions, the end result was more than I had expected.
If you are writing test code, why not look at it again and ask questions? Give it a try and see how far it takes you.