Code Bug Fix: print the value of column which has the maximum decimal point [closed]


This is my dataframe:

     col1       col2
0    12.13      13.13
1    100.133    12.19994
2    11.16664   140.13
3    9.13       2.13
4    3.23       10.13

Now I want, from each column, the value with the greatest number of decimal places.

OUTPUT:

      maximum_de_point
col1          11.16664
col2          12.19994

Something like this should work

import pandas as pd
import numpy as np

col1 = [
    12.13,
    100.133,
    11.16664,
    9.13,
    3.23
]

col2 = [
    13.13,
    12.19994,
    140.13,
    2.13,
    10.13
]

df = pd.DataFrame(np.array([col1, col2]).T, columns=['col1','col2'])

# decimal lengths of col1
len1 = df['col1'].astype('str').apply(lambda a: len(a.split('.')[1]))

# decimal lengths of col2
len2 = df['col2'].astype('str').apply(lambda a: len(a.split('.')[1])) 

# get value at that index
col1_max = df['col1'][len1[len1 == max(len1)].index].tolist()[0]
col2_max = df['col2'][len2[len2 == max(len2)].index].tolist()[0]

One option is splitting the values with str.split, taking the str.len of the decimal part and finding the idxmax of each column, then looking up the values at the resulting row/column positions:

df_ixmax = df.astype(str).apply(lambda x: x.str.split('.').str[1].str.len()).idxmax(0)
df_ixmax[:] = df.lookup(*df_ixmax.reset_index().values[:,::-1].T)

df_ixmax
col1    11.16664
col2    12.19994
dtype: float64
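Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on newer versions the same idea can be written with plain indexing instead; a rough sketch, using the same df as above:

dec_len = df.astype(str).apply(lambda x: x.str.split('.').str[1].str.len())
idx = dec_len.idxmax(0)   # row label of the longest decimal part per column
result = pd.Series({c: df.at[r, c] for c, r in idx.items()})
print(result)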

Or we could also use decimal.Decimal, which lets us obtain the number of decimal places from the named tuple returned by as_tuple(), and then index the dataframe from the result, similarly to above:

from decimal import Decimal 

# as_tuple().exponent is minus the number of decimal digits, hence the argmin below
ix = [[Decimal(str(x)).as_tuple().exponent for x in col] for col in df.values.T]
max_vals = df.values[np.array(ix).argmin(1), np.arange(df.shape[1])]
pd.Series(max_vals, index=df.columns)

col1    11.16664
col2    12.19994
dtype: float64


Code Bug Fix: python pandas dataframe to matlab struct using scipy.io


I am trying to save a pandas dataframe to a matlab .mat file using scipy.io.

I have the following:

array1 = np.array([1,2,3])
array2 = np.array(['a','b','c'])
array3 = np.array([1.01,2.02,3.03])
df = DataFrame({1:array1, 2:array2,3:array3}, index=('array1','array2','array3'))
recarray_ = df.to_records()
## Produces:
# rec.array([('array1', 1, 'a', 1.01), ('array2', 2, 'b', 2.02),
#   ('array3', 3, 'c', 3.03)], 
#  dtype=[('index', 'O'), ('1', '<i4'), ('2', 'O'), ('3', '<f8')])
scipy.io.savemat('test_recarray_struct.mat', {'struct':df.to_records()})

In Matlab, I would expect this to produce a struct containing three arrays (one int, one char, one float), but what it actually produces is a struct containing 3 more structs, each containing four variables: 'index', 1, '2', 3. When trying to select 1, '2' or 3 I get the error 'The variable struct(1, 1).# does not exist.'

Can anyone explain the expected behaviour and how best to save DataFrames to .mat files?

I am using the following workaround in the meantime. Please let me know if you have a better solution:

a_dict = {col_name : df[col_name].values for col_name in df.columns.values}

## optional if you want to save the index as an array as well:
# a_dict[df.index.name] = df.index.values
scipy.io.savemat('test_struct_to_mat.mat', {'struct':a_dict})

I think what you need is to create the dataframe like this:

df = DataFrame({'array1':array1, 'array2':array2,'array3':array3})

and save it like this:

scipy.io.savemat('test_recarray_struct.mat', {'struct':df.to_dict("list")})

So the code should be something like:

# ... import appropriately
array1 = np.array([1,2,3])
array2 = np.array(['a','b','c'])
array3 = np.array([1.01,2.02,3.03])
df = DataFrame({'array1':array1, 'array2':array2,'array3':array3})
scipy.io.savemat('test_recarray_struct.mat', {'struct':df.to_dict("list")})
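As a quick sanity check (a sketch; the filename is the one used above), the file can be loaded back with scipy.io.loadmat and the struct's field names inspected:

from scipy.io import loadmat

mat = loadmat('test_recarray_struct.mat')
print(mat['struct'].dtype.names)  # should list ('array1', 'array2', 'array3')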


Code Bug Fix: Fill pandas columns with conditions


I’m trying to fill a column C with conditions: if the value of column B is None, then fill column C with the value of column A. If column B is not None, then fill column C with the value 3

I tried:

import pandas
df = pandas.DataFrame([{'A': 5, 'B': None, 'C': ''},
                   {'A': 2, 'B': "foo", 'C': ''},
                   {'A': 6, 'B': "foo", 'C': ''},
                   {'A': 1, 'B': None, 'C': ''}])

df["C"] = df["B"].apply(lambda x: 3 if (x != None) else df["A"])

My output:

TypeError: object of type 'int' has no len()

I know the problem is df["A"], but I don't know how to solve it.

Good output:

df = pandas.DataFrame([{'A': 5, 'B': None, 'C': 5},
                   {'A': 2, 'B': "foo", 'C': 3},
                   {'A': 6, 'B': "foo", 'C': 3},
                   {'A': 1, 'B': None, 'C': 1}])

Use numpy.where, testing for None with Series.isna (the original apply fails because the lambda returns the entire df["A"] Series rather than a single value):

df["C"] = np.where(df["B"].isna(), df['A'], 3)
#alternative
#df["C"] = df['A'].where(df["B"].isna(), 3)
   print (df)
 A     B  C
0  5  None  5
1  2   foo  3
2  6   foo  3
3  1  None  1
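If you would rather keep the apply pattern from the question, a row-wise version along these lines should also work (a sketch, not part of the answer above):

import pandas

df["C"] = df.apply(lambda row: row["A"] if pandas.isna(row["B"]) else 3, axis=1)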


Code Bug Fix: Python Pandas read_csv float64 to string


I'm reading the CSV file below with read_csv:

day,order,variant,variant_name
2020-05-04,OR_001,1000000548952,Product1
2020-05-04,OR_001,1000000056488,Product4
2020-05-04,OR_002,1000000528985,Product2

When I read this into a dataframe and then print the variant column:

print(df_SalesOrders["variant"])

I get the below output.

0     1.000001e+12
1     1.000000e+12
2     1.000001e+12

Could someone please let me know how I can preserve the original number, which I believe is being treated as float64.

I tried the code below, however that didn't help.

myarr = df_SalesOrders.variant.astype(str)

Any help is much appreciated.
Thanks.

You need to assign your change back to your dataframe:

df_SalesOrders.variant = df_SalesOrders.variant.astype('int64')

Alternatively, to change only how the floats are displayed, call:

pd.options.display.float_format = '{:,.4f}'.format 
df
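If the goal is simply to keep the values exactly as they appear in the file, another common option (not mentioned above) is to tell read_csv up front to parse the column as a string; a sketch, with a placeholder filename:

import pandas as pd

df_SalesOrders = pd.read_csv('SalesOrders.csv', dtype={'variant': str})
print(df_SalesOrders['variant'])  # values stay exactly as written in the file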


Code Bug Fix: Filling in data based on index


I have a multiindex dataframe with years and months as shown:


            A       B       C       D       E       F       G       H       I
2019    8   15.0    15.0    8.0     9.0     18.0    27.0    56.0    21.0    168.0
        9   20.0    21.0    11.0    12.0    26.0    37.0    73.0    41.0    241.0
        10  25.0    39.0    20.0    19.0    51.0    49.0    133.0   74.0    411.0
        11  32.0    65.0    34.0    26.0    110.0   110.0   193.0   147.0   718.0
        12  36.0    72.0    38.0    31.0    122.0   152.0   205.0   167.0   821.0
2020    1   42.0    73.0    39.0    35.0    131.0   179.0   205.0   173.0   876.0
        2   32.0    71.0    37.0    30.0    113.0   141.0   212.0   151.0   787.0
        3   29.0    60.0    32.0    26.0    99.0    120.0   187.0   145.0   700.0
        4   20.0    32.0    16.0    17.0    45.0    62.0    108.0   82.0    381.0
        5   16.0    28.0    15.0    13.0    37.0    38.0    96.0    71.0    314.0

And I want to append another dataframe, which has a monthly value for each column, to cover the rest of the year.


    A       B       C       D       E       F       G       H       I                               
1   41.0    84.0    41.0    37.0    144.0   183.0   221.0   187.0   952.0
2   35.0    80.0    40.0    34.0    131.0   165.0   219.0   174.0   875.0
3   29.0    65.0    32.0    27.0    102.0   123.0   191.0   145.0   701.0
4   20.0    39.0    20.0    18.0    59.0    64.0    137.0   88.0    432.0
5   15.0    26.0    14.0    13.0    40.0    43.0    96.0    55.0    303.0
6   12.0    18.0    9.0     10.0    24.0    35.0    71.0    26.0    200.0
7   12.0    15.0    7.0     9.0     20.0    32.0    58.0    21.0    174.0
8   12.0    16.0    8.0     9.0     18.0    26.0    59.0    21.0    170.0
9   17.0    22.0    11.0    12.0    27.0    37.0    77.0    40.0    240.0
10  23.0    39.0    19.0    19.0    55.0    54.0    120.0   80.0    408.0
11  31.0    63.0    31.0    28.0    110.0   111.0   180.0   137.0   716.0
12  36.0    71.0    36.0    32.0    131.0   168.0   200.0   161.0   858.0

I need to combine the dataframes so that any blank months in the first dataframe are infilled with the values from the second.

I suppose there are two questions:

- Do I need to add a second index onto the second dataframe in order to join these?
- I suppose I need to do some form of 'if' statement to say only infill future blank months?

A lot of attempts at joins/concat/append are giving me the following:

(2019, 12)  36.0    72.0    38.0    31.0    122.0   152.0   205.0   167.0   821.0
(2020, 1)   42.0    73.0    39.0    35.0    131.0   179.0   205.0   173.0   876.0
(2020, 2)   32.0    71.0    37.0    30.0    113.0   141.0   212.0   151.0   787.0
(2020, 3)   29.0    60.0    32.0    26.0    99.0    120.0   187.0   145.0   700.0
(2020, 4)   20.0    32.0    16.0    17.0    45.0    62.0    108.0   82.0    381.0
(2020, 5)   16.0    28.0    15.0    13.0    37.0    38.0    96.0    71.0    314.0
1           41.0    84.0    41.0    37.0    144.0   183.0   221.0   187.0   952.0
2           35.0    80.0    40.0    34.0    131.0   165.0   219.0   174.0   875.0
3           29.0    65.0    32.0    27.0    102.0   123.0   191.0   145.0   701.0
4           20.0    39.0    20.0    18.0    59.0    64.0    137.0   88.0    432.0
5           15.0    26.0    14.0    13.0    40.0    43.0    96.0    55.0    303.0
6           12.0    18.0    9.0     10.0    24.0    35.0    71.0    26.0    200.0
7           12.0    15.0    7.0     9.0     20.0    32.0    58.0    21.0    174.0
8           12.0    16.0    8.0     9.0     18.0    26.0    59.0    21.0    170.0
9           17.0    22.0    11.0    12.0    27.0    37.0    77.0    40.0    240.0
10          23.0    39.0    19.0    19.0    55.0    54.0    120.0   80.0    408.0
11          31.0    63.0    31.0    28.0    110.0   111.0   180.0   137.0   716.0
12          36.0    71.0    36.0    32.0    131.0   168.0   200.0   161.0   858.0

So I would need the new dataframe to infill the first dataframe based on month.

Any help much appreciated, can’t seem to figure it out.

Use DataFrame.combine_first after reindexing the second DataFrame with DataFrame.reindex against the years from df1:

# build a (year, month) MultiIndex that repeats df2's months for every year in df1
mux = pd.MultiIndex.from_product([df1.index.levels[0], df2.index])
# keep df1's values where they exist; fall back to df2 for the missing months
df = df1.combine_first(df2.reindex(mux, level=1))
print (df)
            A     B     C     D      E      F      G      H      I
2019 1   41.0  84.0  41.0  37.0  144.0  183.0  221.0  187.0  952.0
     2   35.0  80.0  40.0  34.0  131.0  165.0  219.0  174.0  875.0
     3   29.0  65.0  32.0  27.0  102.0  123.0  191.0  145.0  701.0
     4   20.0  39.0  20.0  18.0   59.0   64.0  137.0   88.0  432.0
     5   15.0  26.0  14.0  13.0   40.0   43.0   96.0   55.0  303.0
     6   12.0  18.0   9.0  10.0   24.0   35.0   71.0   26.0  200.0
     7   12.0  15.0   7.0   9.0   20.0   32.0   58.0   21.0  174.0
     8   15.0  15.0   8.0   9.0   18.0   27.0   56.0   21.0  168.0
     9   20.0  21.0  11.0  12.0   26.0   37.0   73.0   41.0  241.0
     10  25.0  39.0  20.0  19.0   51.0   49.0  133.0   74.0  411.0
     11  32.0  65.0  34.0  26.0  110.0  110.0  193.0  147.0  718.0
     12  36.0  72.0  38.0  31.0  122.0  152.0  205.0  167.0  821.0
2020 1   42.0  73.0  39.0  35.0  131.0  179.0  205.0  173.0  876.0
     2   32.0  71.0  37.0  30.0  113.0  141.0  212.0  151.0  787.0
     3   29.0  60.0  32.0  26.0   99.0  120.0  187.0  145.0  700.0
     4   20.0  32.0  16.0  17.0   45.0   62.0  108.0   82.0  381.0
     5   16.0  28.0  15.0  13.0   37.0   38.0   96.0   71.0  314.0
     6   12.0  18.0   9.0  10.0   24.0   35.0   71.0   26.0  200.0
     7   12.0  15.0   7.0   9.0   20.0   32.0   58.0   21.0  174.0
     8   12.0  16.0   8.0   9.0   18.0   26.0   59.0   21.0  170.0
     9   17.0  22.0  11.0  12.0   27.0   37.0   77.0   40.0  240.0
     10  23.0  39.0  19.0  19.0   55.0   54.0  120.0   80.0  408.0
     11  31.0  63.0  31.0  28.0  110.0  111.0  180.0  137.0  716.0
     12  36.0  71.0  36.0  32.0  131.0  168.0  200.0  161.0  858.0
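An equivalent way to express the same idea (a sketch, assuming the same df1 and df2 as above) that makes the "only fill the missing months" step explicit:

mux = pd.MultiIndex.from_product([df1.index.levels[0], df2.index])
# expand df1 to every (year, month) pair, then fill the gaps from df2
df = df1.reindex(mux).fillna(df2.reindex(mux, level=1))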


Code Bug Fix: Why are there zeros between numbers and Dataframe titles?


Somehow some "0"s appear below the column titles of my DataFrame when I append to a dictionary and then concatenate it into a DataFrame:

                                  Open      High       Low     Close
                            0         0         0         0
Time                                                       
2013.10.29 00:00:00 -0.001090 -0.003290 -0.006910 -0.006255
2013.10.30 00:00:00 -0.006144 -0.005078 -0.003908 -0.000512
2013.10.31 00:00:00 -0.000442  0.001646  0.002732 -0.000985
2013.11.01 00:00:00 -0.000842 -0.000017  0.000998  0.001132
2013.11.04 00:00:00  0.003941  0.005085  0.005387  0.009340

    df={} 

    for name in series.columns: 
        # .... do some series manipulations and convert it to numpy arrays and to lists and so on...
        # => create "list_" which is a list of numbers


        df[name]=pd.DataFrame(list_.copy())  

    df=pd.concat(df,axis=1)  
    return df 

Solved: the zeros are the default column name (0) of each single-column DataFrame, which becomes an extra level in the concatenated columns. Building each entry as a Series instead avoids it:


    df={} 

    for name in series.columns: 
        # .... do some series manipulations and convert it to numpy arrays and to lists and so on...
        # => create "list_" which is a list of numbers


        df[name]=pd.Series(list_.copy())   # <--- Here, use a series instead :)

    df=pd.concat(df,axis=1)  
    return df 
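A minimal sketch of the difference (with hypothetical data) that shows where the 0 comes from:

import pandas as pd

list_ = [1.0, 2.0, 3.0]
as_frames = pd.concat({'Open': pd.DataFrame(list_)}, axis=1)
as_series = pd.concat({'Open': pd.Series(list_)}, axis=1)
print(as_frames.columns)  # MultiIndex([('Open', 0)]) -> the stray 0 level
print(as_series.columns)  # Index(['Open'])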


Code Bug Fix: Count regex matches in one column by values in another column with pandas


I am working with pandas and have a dataframe that contains a list of sentences and people who said them, like this:

 sentence                 person
 'hello world'              Matt
 'cake, delicious cake!'    Matt
 'lovely day'               Maria
 'i like cake'             Matt
 'a new day'                Maria
 'a new world'              Maria

I want to count non-overlapping matches of regex strings in sentence (e.g. cake, world, day) by the person. Note each row of sentence may contain more than one match (e.g cake):

person        'day'        'cake'       'world'
Matt            0            3             1
Maria           2            0             1

So far I am doing this:

rows_cake = df[df['sentence'].str.contains(r"cake")]
counts_cake = rows_cake.value_counts()

However, this str.contains gives me the rows containing cake, but not the individual instances of cake.

I know I can use str.count(r"cake") on rows_cake. However, in practice my dataframe is extremely large (> 10 million rows) and the regexes I am using are quite complex, so I am looking for a more efficient solution if possible.

Maybe you should first get each sentence itself and then use re to do your optimized regex work, like this:

for row in df.itertuples(index=False):
   do_some_regex_stuff(row[0], row[1])#in this case row[0] is a sentence. row[1] is person

As far as I know itertuples is quite fast (Notes no.1 here). So the only optimization problem you have is with the regex itself.
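For what it's worth, do_some_regex_stuff is only a placeholder above; one possible fill-in (a sketch using re.findall and a Counter, assuming df has the sentence and person columns from the question) could look like this:

import re
from collections import Counter, defaultdict

patterns = {'day': re.compile(r'day'),
            'cake': re.compile(r'cake'),
            'world': re.compile(r'world')}
counts = defaultdict(Counter)

def do_some_regex_stuff(sentence, person):
    # count non-overlapping matches of each compiled pattern in this sentence
    for name, pat in patterns.items():
        counts[person][name] += len(pat.findall(sentence))

for row in df.itertuples(index=False):
    do_some_regex_stuff(row[0], row[1])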

I came up with a rather simple solution, but I can't claim it to be the fastest or most efficient.

import pandas as pd
import numpy as np

# to be used with read_clipboard()
'''
sentence    person
'hello world'   Matt
'cake, delicious cake!' Matt
'lovely day'    Maria
'i like cake'   Matt
'a new day' Maria
'a new world'   Maria
'''

df = pd.read_clipboard()
# print(df)

Output:

                  sentence person
0            'hello world'   Matt
1  'cake, delicious cake!'   Matt
2             'lovely day'  Maria
3            'i like cake'   Matt
4              'a new day'  Maria
5            'a new world'  Maria


# if the list of keywords is fix and relatively small
keywords = ['day', 'cake', 'world']

# for each keyword and each string, counting the occourance
for key in keywords:
    df[key] = [(len(val.split(key)) - 1) for val in df['sentence']]

# print(df)

Output:

                 sentence person  day  cake  world
0            'hello world'   Matt    0     0      1
1  'cake, delicious cake!'   Matt    0     2      0
2             'lovely day'  Maria    1     0      0
3            'i like cake'   Matt    0     1      0
4              'a new day'  Maria    1     0      0
5            'a new world'  Maria    0     0      1


# create a simple pivot with what data you needed
df_pivot = pd.pivot_table(df, 
values=['day', 'cake', 'world'], 
columns=['person'], 
aggfunc=np.sum).T

# print(df_pivot)

Final Output:

        cake  day  world
person
Maria      0    2      1
Matt       3    0      1

Open to suggestions if this seems to be a good approach, especially given the volume of data. Eager to learn.

Since this primarily involves strings, I would suggest taking the computation out of Pandas; plain Python is faster than Pandas in most cases when it comes to string manipulation:

#read in data
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')

#create a dictionary of persons and sentences : 
from collections import defaultdict, ChainMap
d = defaultdict(list)
for k,v in zip(df.person, df.sentence):
    d[k].append(v)


d = {k:",".join(v) for k,v in d.items()}

#search words
strings = ("cake", "world", "day")

#get count of words and create a dict
m = defaultdict(list)
for k,v in d.items():
    for st in strings:
        m[k].append({st:v.count(st)})

res = {k:dict(ChainMap(*v)) for k,v in m.items()}


print(res)
{'Matt': {'day': 0, 'world': 1, 'cake': 3},
 'Maria': {'day': 2, 'world': 1, 'cake': 0}}

output = pd.DataFrame(res).T

       day  world   cake
Matt    0     1     3
Maria   2     1     0

Test the speeds and see which one is better; it would be useful for me and others as well.
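For comparison, a vectorized pandas sketch of the same counting, separate from the answers above: Series.str.count counts non-overlapping regex matches per row, and a groupby on person sums them.

import pandas as pd

df = pd.DataFrame({
    'sentence': ['hello world', 'cake, delicious cake!', 'lovely day',
                 'i like cake', 'a new day', 'a new world'],
    'person': ['Matt', 'Matt', 'Maria', 'Matt', 'Maria', 'Maria'],
})

patterns = {'day': r'day', 'cake': r'cake', 'world': r'world'}
counts = pd.DataFrame({name: df['sentence'].str.count(pat)
                       for name, pat in patterns.items()})
print(counts.groupby(df['person']).sum())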


Code Bug Fix: How to merge multiple dataframe columns within a common dataframe in pandas in fastest way possible?


I need to perform the following operations on a pandas dataframe df inside a for loop with 50 iterations or more:

Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4

The columns which are common to all 5 dataframes (df, df1, df2, df3 and df4) are A, B, C and D.

EDIT

The shapes of the dataframes all differ: df is the master dataframe with the largest number of rows, and the other 4 dataframes each have fewer rows than df (and differ from each other). So while merging the columns, the rows of the two dataframes involved need to be matched first.

Input df
A  B  C  D   X    Y    Z    W
1  2  3  4  nan  nan  nan  nan
2  3  4  5  nan  nan  nan  nan
5  9  7  8  nan  nan  nan  nan
4  8  6  3  nan  nan  nan  nan
df1
A  B  C  D   X    Y    Z    W
2  3  4  5  100  nan  nan  nan
4  8  6  3  200  nan  nan  nan
df2
A  B  C  D   X    Y    Z    W
1  2  3  4  nan  50  nan  nan
df3
A  B  C  D   X    Y    Z     W
1  2  3  4  nan  nan  1000  nan
4  8  6  3  nan  nan  2000  nan
df4
A  B  C  D   X    Y    Z    W
2  3  4  5  nan  nan  nan  25
5  9  7  8  nan  nan  nan  35
4  8  6  3  nan  nan  nan  45
Output df
A  B  C  D   X    Y    Z    W
1  2  3  4  nan  50   1000  nan
2  3  4  5  100  nan   nan  25
5  9  7  8  nan  nan   nan  35
4  8  6  3  200  nan  2000  45

Which is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just 1 line of code instead?

Any help will be appreciated. Many thanks in advance.
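One possible single-expression approach (a sketch, assuming A, B, C and D together uniquely identify a row, as in the example data): set those columns as the index on every frame, concatenate, and take the first non-null value per column within each group.

key = ['A', 'B', 'C', 'D']
out = (pd.concat([d.set_index(key) for d in (df, df1, df2, df3, df4)])
         .groupby(level=key, sort=False).first()
         .reset_index())

With sort=False the rows keep the order in which they first appear, i.e. the order of the master dataframe df.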


Code Bug Fix: folium Choropleth colors showing grey only


I'm trying to show happiness levels by country using a folium Choropleth; however, it doesn't work and all countries are just grey. This is what I get:

[image: map output showing every country in grey]

json file: https://github.com/python-visualization/folium/blob/master/examples/data/world-countries.json

csv file: https://drive.google.com/file/d/1aI5tILdPYyx0yjbgPPZcJHEqxbZ4oPBg/view?usp=sharing

and this is my code:

import folium
import pandas as pd


country_geo = 'world-countries.json'

country_data = pd.read_csv('Happiness_Dataset_2016.csv')

bins = list(country_data['Happiness Score'].quantile([0, 0.25, 0.5, 0.75, 1]))

m = folium.Map(location=[0,0], zoom_start=2)


folium.Choropleth(
    geo_data=country_geo,
    name='choropleth',
    data=country_data,
    columns=['Country','Happiness Score'],
    Key_on='feature.properties.name',
    fill_color='BuPu',
    fill_opacity=0.2,
    line_opacity=0.5,

    legend_name='Happiness Rates (%)',
    bins =bins,

    reset=True
).add_to(m)
# folium.LayerControl().add_to(m)
m
m.save('worldmap.html')

Here is the error, in this line:

Key_on='feature.properties.name',

Modify it as:

key_on='feature.properties.name',

and you get:

[image: the choropleth now renders with the expected colors]


Code Bug Fix: Read CSV/Excel files from SFTP file, make some changes in those files using Pandas, and save back


I want to read some CSV/Excel files from a secure SFTP folder, make some changes in those files (fixed changes in each file, like removing column 2), upload them to a Postgres DB, and also upload them to a different SFTP path, all in Python.

What's the best way to do this?

I have made a connection to the SFTP server using the pysftp library and am reading the files:

import pysftp
import pandas as pd

myHostname = "*****"
myUsername = "****"
myPassword = "***8"
cnopts =pysftp.CnOpts()
cnopts.hostkeys = None  

sftp=pysftp.Connection(host=myHostname, username=myUsername, 
password=myPassword,cnopts=cnopts)
print ("Connection succesfully stablished ... ")
sftp.chdir('test/test')
#sftp.pwd
a=[]
for i in sftp.listdir_attr():
    with sftp.open(i.filename) as f:
        df=pd.read_csv(f)

How should I proceed with the upload to DB and making those changes to the CSV permanent?

You have the download part done.

For the upload part, see How to Transfer Pandas DataFrame to .csv on SFTP using Paramiko Library in Python? – While it’s for Paramiko, pysftp Connection.open method behaves identically to Paramiko SFTPClient.open, so the code is the same.

Full code can be like:

with sftp.open("/remote/path/data.csv", "r+", bufsize=32768) as f:
    # Download CSV contents from SFTP to memory
    df = pd.read_csv(f)

    # Modify as you need (just an example)
    df.at[0, 'Name'] = 'changed'

    # Upload the in-memory data back to SFTP
    f.seek(0)
    df.to_csv(f, index=False)
    # Truncate the remote file in case the new version of the contents is smaller
    f.truncate(f.tell())

The above updates the same file. If you want to upload to a different file, use this:

# Download CSV contents from SFTP to memory
with sftp.open("/remote/path/source.csv", "r") as f:
    df = pd.read_csv(f)

# Modify as you need (just an example)
df.at[0, 'Name'] = 'changed'

# Upload the in-memory data back to SFTP
with sftp.open("/remote/path/target.csv", "w", bufsize=32768) as f:
    df.to_csv(f, index=False)

For the purpose of bufsize, see:
Writing to a file on SFTP server opened using pysftp “open” method is slow


Obligatory warning: Do not set cnopts.hostkeys = None, unless you do not care about security. For the correct solution see Verify host key with pysftp.

That’s several questions in one question 🙂

I would suggest going with this approach (a rough sketch follows below):

  1. Make a local copy of the file (not sure how big it is, but there is no point in shuffling it around between your local machine and the SFTP server); you can use the get method.
  2. Make operations on your data with pandas, then dump it back to CSV with the to_csv method.
  3. Load the data into Postgres using either pandas.io or pure SQLAlchemy. Check the docs here.
  4. Upload the file to the destination you want with the put method.
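A rough sketch of those steps (paths, the table name and the Postgres URL are placeholders; the connection variables are the ones from the question):

import pandas as pd
import pysftp
from sqlalchemy import create_engine

cnopts = pysftp.CnOpts()  # verify host keys properly in real use (see the warning above)

with pysftp.Connection(host=myHostname, username=myUsername,
                       password=myPassword, cnopts=cnopts) as sftp:
    # 1. make a local copy of the remote file
    sftp.get('test/test/input.csv', 'input.csv')

    # 2. modify it with pandas (e.g. drop the second column) and dump it back to CSV
    df = pd.read_csv('input.csv')
    df = df.drop(df.columns[1], axis=1)
    df.to_csv('output.csv', index=False)

    # 3. load the data into Postgres via SQLAlchemy
    engine = create_engine('postgresql://user:password@host:5432/dbname')
    df.to_sql('my_table', engine, if_exists='replace', index=False)

    # 4. upload the modified file to the destination SFTP path
    sftp.put('output.csv', 'test/other/output.csv')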