Code Bug Fix: Cosmos Db – insert if not exists


Dear fellow Cosmos members,

I am creating an Azure Function (Python) with a Cosmos DB trigger that:

  • reads data from source_collection
  • writes data to target_collection (only if document does not exist)

To achieve “insert if not exists” functionality I configured target_collection with a unique key on “name”.

Example:

  1. write to target_collection {"name": "John", "id": "1"} => OK
  2. write to target_collection [{"name": "John", "id": "2"}, {"name": "Mary", "id": "3"}] => System.Private.CoreLib: Exception while executing function: Functions.CosmosTrigger. Microsoft.Azure.DocumentDB.Core: Entity with the specified id already exists in the system.

The unique key “name”: “John” already exists in target_collection, so outdocs.set() (where outdocs is func.Out[func.Document]) fails with a 409 conflict and “Mary” never gets written to target_collection.

Important: when I upload a JSON array of items from Data Explorer to target_collection, each conflict returns an error but all of the other items still get processed.
When I write the same array from the Azure Function, the first conflict raises an error and the remaining items never make it to target_collection.

My questions:

  1. Is there a way to ignore the raised errors and process the whole array of items with func.Out[func.Document].set()?
  2. Alternatively, could you please advise me what would be the best way to implement bulk “insert if not exists” behaviour in Cosmos DB?
My current function looks like this:

import azure.functions as func


def main(docs: func.DocumentList, outdocs: func.Out[func.Document]) -> str:
    compl_docs = func.DocumentList()

    compl_docs_dict = {
        "name": "John",
        "id": "2"
    }
    compl_docs.append(func.Document.from_dict(compl_docs_dict))

    compl_docs_dict = {
        "name": "Mary",
        "id": "3"
    }
    compl_docs.append(func.Document.from_dict(compl_docs_dict))

    # the output binding writes the whole list at once; one 409 conflict aborts the batch
    outdocs.set(compl_docs)

SOLUTION
This is how insert-if-not-exists can be done in Cosmos DB:

  1. Create a Cosmos client in your code and use it to execute a stored procedure that sends the docs from the Azure Function to Cosmos (see the sketch below).
  2. Add a stored procedure to your Cosmos collection that handles conflicts. You can take Microsoft's bulkUpsert_v_2 sample and modify it so that when a 409 conflict occurs, nothing is written to Cosmos.
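
For step 1, a rough sketch using the azure-cosmos Python package could look like this – only a sketch, with the account URL, key, database name, partition key value and stored procedure id as placeholders; the procedure itself is the modified bulkUpsert from step 2:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = (client.get_database_client("<database>")
                   .get_container_client("target_collection"))

docs = [{"name": "John", "id": "2"}, {"name": "Mary", "id": "3"}]

# A stored procedure executes inside a single partition, so every document passed
# in one call must share the partition key value supplied here.
container.scripts.execute_stored_procedure(
    sproc="bulkUpsertIgnoreConflicts",   # placeholder name for the modified procedure
    partition_key="<partition key value>",
    params=[docs],
)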

Code Bug Fix: Count regex matches in one column by values in another column with pandas


I am working with pandas and have a dataframe that contains a list of sentences and people who said them, like this:

 sentence                 person
 'hello world'              Matt
 'cake, delicious cake!'    Matt
 'lovely day'               Maria
 'i like cake'             Matt
 'a new day'                Maria
 'a new world'              Maria

I want to count non-overlapping matches of regex strings in sentence (e.g. cake, world, day) by the person. Note each row of sentence may contain more than one match (e.g cake):

person        'day'        'cake'       'world'
Matt            0            3             1
Maria           2            0             1

So far I am doing this:

rows_cake = df[df['sentence'].str.contains(r"cake")]
counts_cake = rows_cake.value_counts()

However this str.contains gives me rows containing cake, but not individual instances of cake.

I know I can use str.count(r"cake") on rows_cake. However, in practice my dataframe is extremely large (> 10 million rows) and the regexes I am using are quite complex, so I am looking for a more efficient solution if possible.
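
For reference, the straightforward pandas version of that idea looks roughly like this – a sketch only, where the three example words stand in for my real regexes and the sample data from above is rebuilt inline:

import pandas as pd

df = pd.DataFrame({
    'sentence': ['hello world', 'cake, delicious cake!', 'lovely day',
                 'i like cake', 'a new day', 'a new world'],
    'person': ['Matt', 'Matt', 'Maria', 'Matt', 'Maria', 'Maria'],
})

# count non-overlapping matches of each pattern per sentence, then sum per person
patterns = {'day': r'day', 'cake': r'cake', 'world': r'world'}
counts = pd.DataFrame({name: df['sentence'].str.count(pat)
                       for name, pat in patterns.items()})
print(counts.groupby(df['person']).sum())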

Maybe you should first extract the sentences themselves and then use re to do your optimized regex work, like this:

for row in df.itertuples(index=False):
    do_some_regex_stuff(row[0], row[1])  # in this case row[0] is the sentence, row[1] is the person

As far as I know, itertuples is quite fast (see note no. 1 here). So the only optimization problem you have is with the regex itself.

I came up with a rather simple solution, but I can't claim it to be the fastest or most efficient.

import pandas as pd
import numpy as np

# to be used with read_clipboard()
'''
sentence    person
'hello world'   Matt
'cake, delicious cake!' Matt
'lovely day'    Maria
'i like cake'   Matt
'a new day' Maria
'a new world'   Maria
'''

df = pd.read_clipboard()
# print(df)

Output:

                  sentence person
0            'hello world'   Matt
1  'cake, delicious cake!'   Matt
2             'lovely day'  Maria
3            'i like cake'   Matt
4              'a new day'  Maria
5            'a new world'  Maria


# if the list of keywords is fix and relatively small
keywords = ['day', 'cake', 'world']

# for each keyword and each string, counting the occourance
for key in keywords:
    df[key] = [(len(val.split(key)) - 1) for val in df['sentence']]

# print(df)

Output:

                 sentence person  day  cake  world
0            'hello world'   Matt    0     0      1
1  'cake, delicious cake!'   Matt    0     2      0
2             'lovely day'  Maria    1     0      0
3            'i like cake'   Matt    0     1      0
4              'a new day'  Maria    1     0      0
5            'a new world'  Maria    0     0      1


# create a simple pivot with what data you needed
df_pivot = pd.pivot_table(df,
                          values=['day', 'cake', 'world'],
                          columns=['person'],
                          aggfunc=np.sum).T

# print(df_pivot)

Final Output:

        cake  day  world
person
Maria      0    2      1
Matt       3    0      1

Open to suggestions if this seems to be a good approach especially given the volume of data. Eager to learn.

Since this primarily involves strings, I would suggest taking the computation out of Pandas – plain Python is faster than Pandas in most cases when it comes to string manipulation:

#read in data
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

#create a dictionary of persons and sentences : 
from collections import defaultdict, ChainMap
d = defaultdict(list)
for k,v in zip(df.person, df.sentence):
    d[k].append(v)


d = {k:",".join(v) for k,v in d.items()}

#search words
strings = ("cake", "world", "day")

#get count of words and create a dict
m = defaultdict(list)
for k,v in d.items():
    for st in strings:
        m[k].append({st:v.count(st)})

res = {k:dict(ChainMap(*v)) for k,v in m.items()}


print(res)
{'Matt': {'day': 0, 'world': 1, 'cake': 3},
 'Maria': {'day': 2, 'world': 1, 'cake': 0}}

output = pd.DataFrame(res).T

       day  world   cake
Matt    0     1     3
Maria   2     1     0

Test the speeds and see which one is better. It would be useful for me and others as well.


Code Bug Fix: regex: eliminate all numbers between slashes


Expected: http://some_url.com/api/v1/1/2/3/4" -> http://some_url.com/api/v1/*/*/*/*/
What I use:

re.sub(r"/d+/?", "/*/", str(url), flags=re.IGNORECASE)

Actual: http://some_url.com/api/v1/*/2/*/4/

The problem is that the /? in your pattern also consumes the slash that separates two numbers, so the following number no longer starts with a / and gets skipped. You may use

/\d+(?=/|$)
/\d+(?![^/])

and replace with /*. See the regex demo.

In Python:

url = re.sub(r"/d+(?=/|$)", "/*", url)

Details

  • / – a / char
  • \d+ – 1+ digits
  • (?![^/]) – a negative lookahead that fails the match if the next char is not a character other than / (so, end of string or / are required immediately to the right of the current location, same as with the positive (?=/|$) lookahead)

See Python demo online:

import re
url = 'http://some_url.com/api/v1/1/2/3/4'
url = re.sub(r"/d+(?=/|$)", "/*", url)
print(url)
# => http://some_url.com/api/v1/*/*/*/*

You could use

/\d+(?=/|$)

See a demo on regex101.com.


Code Bug Fix: How to merge multiple dataframe columns within a common dataframe in pandas in fastest way possible?


I need to perform the following operation on a pandas dataframe df inside a for loop with 50 iterations or more:

Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4.

The columns that are common to all 5 dataframes – df, df1, df2, df3 and df4 – are A, B, C and D.

EDIT

The shape of each dataframe is different: df is the master dataframe with the maximum number of rows, while the other 4 dataframes each have fewer rows than df and differ from one another. So while merging the columns, the rows from both dataframes need to be matched on the key columns first.

Input df
A  B  C  D   X    Y    Z    W
1  2  3  4  nan  nan  nan  nan
2  3  4  5  nan  nan  nan  nan
5  9  7  8  nan  nan  nan  nan
4  8  6  3  nan  nan  nan  nan
df1
A  B  C  D   X    Y    Z    W
2  3  4  5  100  nan  nan  nan
4  8  6  3  200  nan  nan  nan
df2
A  B  C  D   X    Y    Z    W
1  2  3  4  nan  50  nan  nan
df3
A  B  C  D   X    Y    Z     W
1  2  3  4  nan  nan  1000  nan
4  8  6  3  nan  nan  2000  nan
df4
A  B  C  D   X    Y    Z    W
2  3  4  5  nan  nan  nan  25
5  9  7  8  nan  nan  nan  35
4  8  6  3  nan  nan  nan  45
Output df
A  B  C  D   X    Y    Z    W
1  2  3  4  nan  50   1000  nan
2  3  4  5  100  nan   nan  25
5  9  7  8  nan  nan   nan  35
4  8  6  3  200  nan  2000  45

What is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just one line of code instead?

Any help will be appreciated. Many thanks in advance.
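
For context, a compact sketch of the combine_first idea (assuming A, B, C and D together uniquely identify a row in every dataframe, with df…df4 as defined above) would be to index all the frames by the key columns and chain combine_first over them:

import functools

# index every frame by the shared keys, then let combine_first fill the NaN
# cells of df from df1..df4 in turn
key = ['A', 'B', 'C', 'D']
frames = [d.set_index(key) for d in (df, df1, df2, df3, df4)]
out = functools.reduce(lambda left, right: left.combine_first(right), frames)

# restore df's original row order and layout
out = out.reindex(frames[0].index).reset_index()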


Code Bug Fix: Module ‘cv2’ has no ‘VideoCapture’ member


import numpy
import cv2

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()

    cv2.imshow('frame', frame)
    if cv2.waitKey(20) & 0xFF == ord('q'):
        break

# release the capture and close the window after the loop ends
cap.release()
cv2.destroyAllWindows()

This is my code and I get these errors:

  1. Module 'cv2' has no 'VideoCapture' member
  2. Module 'cv2' has no 'imshow' member
  3. Module 'cv2' has no 'waitKey' member
  4. Module 'cv2' has no 'destroyAllWindows' member


Import statements shouldn't be on the same line.

import numpy
import cv2

Also make sure that you have installed OpenCV using:

pip install opencv-python


Code Bug Fix: folium Choropleth colors showing grey only


I'm trying to show happiness levels by country using a folium Choropleth; however, it doesn't work and all countries are just grey. This is what I get:

[screenshot: the rendered map shows every country in grey]

json file: https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json

csv file: https://drive.google.com/file/d/1aI5tILdPYyx0yjbgPPZcJHEqxbZ4oPBg/view?usp=sharing

and this is my code:

import folium
import pandas as pd


country_geo = 'world-countries.json'

country_data = pd.read_csv('Happiness_Dataset_2016.csv')

bins = list(country_data['Happiness Score'].quantile([0, 0.25, 0.5, 0.75, 1]))

m = folium.Map(location=[0,0], zoom_start=2)


folium.Choropleth(
    geo_data=country_geo,
    name='choropleth',
    data=country_data,
    columns=['Country', 'Happiness Score'],
    Key_on='feature.properties.name',
    fill_color='BuPu',
    fill_opacity=0.2,
    line_opacity=0.5,
    legend_name='Happiness Rates (%)',
    bins=bins,
    reset=True
).add_to(m)
# folium.LayerControl().add_to(m)
m
m.save('worldmap.html')

Here is the error:

Key_on='feature.properties.name',

Keyword arguments in Python are case-sensitive, so the capitalised Key_on is not recognised as key_on and the happiness data never gets joined to the GeoJSON features – which is why every country comes out grey. Modify it as:

key_on='feature.properties.name',

and you get the map rendered with the expected colors.


Code Bug Fix: how to ajax post form as json type to python then get data correct way?


JS part:

$('#btnUpdate').click(function(){
    var formData = JSON.stringify($("#contrast_rule_set").serializeArray());
    $.ajax({
        type: "POST",
        url: "./contrast_rule_set",
        data: formData,
        success: function(){},
        dataType: "json",
        contentType: "application/json"
    });
})

Python part:

from flask import request, jsonify
import json

@app.route('/get_test', methods=['GET', 'POST'])
def get_test():
    web_form_data = request.json
    print(web_form_data)
    print(type(web_form_data))
    print(jsonify(web_form_data))
    print(json.dumps(web_form_data))

The Python console prints something like:

[{'name': 'logic_1', 'value': '1'}, {'name': 'StudyDescription_1', 'value': ''}, {'name': 'SeriesDescription_1', 'value': 'C\+'}, {'name': 'ImageComments_1', 'value': ''}, {'name': 'logic_2', 'value': '1'}, {'name': 'StudyDescription_2', 'value': '\-C'}, {'name': 'SeriesDescription_2', 'value': '\-C'}, {'name': 'ImageComments_2', 'value': '\-C'}, {'name': 'logic_3', 'value': '1'}, {'name': 'StudyDescription_3', 'value': ''}, {'name': 'SeriesDescription_3', 'value': '\+C'}, {'name': 'ImageComments_3', 'value': '\+C'}]
<class 'list'>

How can I convert this list into the JSON data type I want (should the HTML/JS side or the Python side be adjusted)?
I hope to get the data shaped like the following (the format comes from another JSON file of mine):

 {
        'Logic': 'AND',
        'StudyDescription': '',
        'SeriesDescription': 'C+',
        'ImageComments': ''
    },
    {
        'Logic': 'NOT',
        'StudyDescription': '-C',
        'SeriesDescription': '-C',
        'ImageComments': '-C'
    },
    {
        'Logic': 'AND',
        'StudyDescription': '',
        'SeriesDescription': '+C',
        'ImageComments': '+C'
    }
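
One way to get there on the Python side is to group the posted fields by their numeric suffix. This is only a sketch: it assumes the existing Flask app object, reuses the ./contrast_rule_set route from the JS above, and translating the numeric logic value into labels such as 'AND'/'NOT' would still need your own lookup table:

from collections import defaultdict
from flask import request, jsonify

@app.route('/contrast_rule_set', methods=['POST'])
def contrast_rule_set():
    # request.json is the serializeArray() list:
    # [{'name': 'logic_1', 'value': '1'}, {'name': 'StudyDescription_1', ...}, ...]
    groups = defaultdict(dict)
    for item in request.json:
        field, _, index = item['name'].rpartition('_')  # 'StudyDescription_2' -> ('StudyDescription', '2')
        groups[index][field] = item['value']

    # one dict per rule, ordered by the numeric suffix
    rules = [groups[i] for i in sorted(groups, key=int)]
    return jsonify(rules)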


Code Bug Fix: Getting output from a Jupyterhub notebook back to my local PC – suggestions?


We have a very large, multi-functional Excel spreadsheet that gathers data from various databases and process the data sets. This has grown over the years to be a monster, and the data sets have grown also – Excel is struggling. We now have a shiny new JupyterHub environment – and we are starting doing “good things” for specific parts of the process with pandas.

However, we are not going to be able to replace all of the spreadsheet's features overnight, so I'm thinking I'd like to "hollow out" some of the key "heavy lifting" parts from Excel into Python/pandas-based processes on JupyterHub, and get that data back into Excel for the final publication/massaging steps – but I'm stuck.

Our IT team say that we cannot access any of our (windows based) shares from the Jupyterlab environment and are pretty intransigent about this – so how can I get data out invisibly/transparently?

I'd like to be able to invoke a Jupyter notebook from a command line/curl on my PC or similar, preferably without a browser being shown, have the notebook run automatically (behind the scenes it accesses our DB and crunches the big data), and somehow get the data back onto my PC (all without interaction).

I realise this is a big question, and I am not so much after code, more a design pattern/approach and key places to look/investigate to at least allow me to figure out the implementation (as you may have guessed I am a newbie at this).

I can get the data into a download link no problem – but this needs to be "non-interactive":

import pandas as pd
import numpy as np

def csv_download_link(df, csv_file_name, delete_prompt=True):
    """Display a download link to load a data frame as csv from within a
    Jupyter notebook."""
    df.to_csv(csv_file_name, index=False)
    from IPython.display import FileLink
    display(FileLink(csv_file_name))


df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

csv_download_link(df, 'df.csv')

So – how else can I do this completely programmatically from my client PC?

Many thanks

Ben
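
One pattern that might fit – a sketch only, with the Hub URL, user name and token as placeholders: run the notebook non-interactively on the server (for example with papermill or jupyter nbconvert --to notebook --execute), let it write its output file, then pull that file back to the PC through the Jupyter contents API, which JupyterHub proxies for each user and which accepts an API token instead of a browser login:

import base64
import requests

HUB = "https://jupyterhub.example.com"   # placeholder: your Hub's address
USER = "ben"                             # placeholder: your Hub user name
TOKEN = "<api-token>"                    # generated from the Hub's token page
headers = {"Authorization": f"token {TOKEN}"}

# download df.csv, written by the notebook, via the contents API
resp = requests.get(
    f"{HUB}/user/{USER}/api/contents/df.csv",
    params={"type": "file", "format": "base64"},
    headers=headers,
)
resp.raise_for_status()

with open("df.csv", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))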


Linux HowTo: Anaconda Prompt Closing after Keras installation


So I recently downloaded Keras through the Anaconda prompt (I’ve downloaded other packages successfully). It actually worked, as I was able to import Keras in Jupyter Notebook. However, now I get the following issue when I open the Anaconda prompt:

C:\Users\[User Name]>python C:\Users\[User Name]\Anaconda3\etc\keras\load_config.py  1>temp.txt
python: can't open file 'C:\Users\[User]': [Errno 2] No such file or directory

C:\Users\[User Name]>set /p KERAS_BACKEND= 0<temp.txt

C:\Users\[User Name]>del temp.txt

C:\Users\[User Name]>python -c "import keras"  1>nul 2>&1

Things to Note:

  • At first, I simply cannot type or enter anything into the window.
  • After closing and reopening it, the same message opens up, but it is followed by a bunch of text I can't read because the window immediately closes afterward.
  • In the error message provided, the third line has [User] instead of the full user folder name. That is because my user's folder name on my computer is "FirstName LastName" – there is a space, which I'm wondering might be the issue.
  • I don't want to uninstall Anaconda, as I have a project due in a few days and don't want any installation or loss-of-data issues.

Yes, the space in your user name is the issue.
If your user name is Jay kishan Panjiyar, type

python "C:\Users\Jay kishan Panjiyar\Anaconda3\etc\keras\load_config.py"  > temp.txt

at the prompt, using quotes to tell the system that the string with spaces in it is all one filename.
Or, if you're in your home directory (C:\Users\Jay kishan Panjiyar) when you do this (as your illustration suggests), it should be good enough to say

python Anaconda3\etc\keras\load_config.py  > temp.txt

P.S. Plain > is equivalent to 1>.

This problem is faced by many Python developers. Don't worry – try to resolve it by following the guidelines below:

  1. Uninstall Keras first (you can delete the Keras files by going inside the folder where the package is installed).
  2. Go to the location C:\Users\username\AppData\Local\Continuum\anaconda3\etc\conda\activate.d
     You can see the Keras batch files inside both activate.d and deactivate.d, which run every time the Anaconda prompt is opened. DELETE them.
  3. Reinstall Keras.


Code Bug Fix: Read CSV/Excel files from SFTP file, make some changes in those files using Pandas, and save back


I want to read some CSV/Excel files from a secure SFTP folder, make some changes to those files (fixed changes in each file, like removing column 2), upload them to a Postgres DB, and also upload them to a different SFTP path, all in Python.

What's the best way to do this?

I have made a connection to the SFTP server using the pysftp library and am reading the files:

import pysftp
import pandas as pd

myHostname = "*****"
myUsername = "****"
myPassword = "***8"
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None

sftp = pysftp.Connection(host=myHostname, username=myUsername,
                         password=myPassword, cnopts=cnopts)
print("Connection succesfully stablished ... ")
sftp.chdir('test/test')
# sftp.pwd
a = []
for i in sftp.listdir_attr():
    with sftp.open(i.filename) as f:
        df = pd.read_csv(f)

How should I proceed with the upload to the DB, and with making those changes to the CSV files permanent?

You have the download part done.

For the upload part, see How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python? – While it’s for Paramiko, pysftp Connection.open method behaves identically to Paramiko SFTPClient.open, so the code is the same.

Full code can be like:

with sftp.open("/remote/path/data.csv", "r+", bufsize=32768) as f:
    # Download CSV contents from SFTP to memory
    df = pd.read_csv(f)

    # Modify as you need (just an example)
    df.at[0, 'Name'] = 'changed'

    # Upload the in-memory data back to SFTP
    f.seek(0)
    df.to_csv(f, index=False)
    # Truncate the remote file in case the new version of the contents is smaller
    f.truncate(f.tell())

The above updates the same file. If you want to upload to a different file, use this:

# Download CSV contents from SFTP to memory
with sftp.open("/remote/path/source.csv", "r") as f:
    df = pd.read_csv(f)

# Modify as you need (just an example)
df.at[0, 'Name'] = 'changed'

# Upload the in-memory data back to SFTP
with sftp.open("/remote/path/target.csv", "w", bufsize=32768) as f:
    df.to_csv(f, index=False)

For the purpose of bufsize, see:
Writing to a file on SFTP server opened using pysftp “open” method is slow


Obligatory warning: Do not set cnopts.hostkeys = None, unless you do not care about security. For the correct solution see Verify host key with pysftp.

That’s several questions in one question 🙂

I would suggest going with this approach (a quick sketch follows the list):

  1. Make a local copy of the file (not sure how big it is; there is no point shuffling it back and forth between your local machine and the SFTP server). You can use the get method.
  2. Make the operations on your data with pandas, then dump it back to CSV with the to_csv method.
  3. Load the data into Postgres using either pandas.io or pure SQLAlchemy. Check the docs here.
  4. Upload the file to the destination you want with the put method.
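
A minimal sketch of those four steps – the paths, table name and connection string are placeholders, and it reuses the sftp connection from the question:

import pandas as pd
from sqlalchemy import create_engine

local_path = "data.csv"

# 1. copy the remote file to the local machine
sftp.get("test/test/data.csv", local_path)

# 2. make the changes with pandas (example: drop the second column) and dump back to CSV
df = pd.read_csv(local_path)
df = df.drop(df.columns[1], axis=1)
df.to_csv(local_path, index=False)

# 3. load the data into Postgres
engine = create_engine("postgresql://user:password@host:5432/dbname")
df.to_sql("my_table", engine, if_exists="append", index=False)

# 4. upload the modified file to a different SFTP path
sftp.put(local_path, "processed/data.csv")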