Python is a magical language. In fact, it has been one of the fastest-growing programming languages in the world in recent years. It has proved its practicality in many areas of software development and data science. The whole Python ecosystem and its libraries make it a sound choice for users around the world, beginners and advanced alike. One reason for its success and popularity is its powerful libraries, which make it dynamic and fast.
In this article, we will look at some Python libraries for data science tasks other than the usual pandas, scikit-learn, and matplotlib. Libraries like pandas and scikit-learn are the first that come to mind for machine learning tasks, but it is always helpful to know and learn about the other Python libraries in this area.
Extracting data, especially from web pages, is one of the important tasks of a data scientist. Wget is a free, non-interactive utility for downloading files from the Internet. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can work in the background even if the user is not logged in. So the next time you want to download a website or all the images on a page, wget can help you.
$ pip install wget
import wget
url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
filename = wget.download(url)
100% [................................................] 3841532 / 3841532
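To make the download step concrete, here is a rough standard-library sketch of what a call like wget.download(url) has to do: pick a file name from the URL and stream the response to disk. The helper names filename_from_url and download, and the download.bin fallback, are my own assumptions for illustration and are not part of the wget package.

```python
import os
import shutil
import urllib.request
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Use the last path segment of the URL as the file name (hypothetical helper)."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or default


def download(url, out_dir="."):
    """Stream the resource at `url` into `out_dir` and return the saved path."""
    target = os.path.join(out_dir, filename_from_url(url))
    with urllib.request.urlopen(url) as response, open(target, "wb") as fh:
        shutil.copyfileobj(response, fh)  # copy in chunks, not all at once
    return target


# Usage (needs network access, so it is left commented out):
# saved = download("http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3")
```

Unlike wget, this sketch shows no progress bar and does not retry on failure.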
Pendulum is for anyone who has been frustrated by working with dates and times in Python. It is a Python package that eases datetime manipulation and is a drop-in replacement for Python's native datetime class. For further information, refer to the Pendulum documentation.
$ pip install pendulum
import pendulum
dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')
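For comparison, here is a small sketch of the same two midnights built with only the standard library. It assumes fixed UTC offsets (UTC-5 for Toronto and UTC-8 for Vancouver, which hold on 1 January 2012) in place of Pendulum's named time zones, and the variable hours_apart is added for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard-time offsets that are valid on 1 January 2012.
# Unlike named time zones, these do not follow daylight saving changes.
TORONTO = timezone(timedelta(hours=-5), "EST")
VANCOUVER = timezone(timedelta(hours=-8), "PST")

dt_toronto = datetime(2012, 1, 1, tzinfo=TORONTO)
dt_vancouver = datetime(2012, 1, 1, tzinfo=VANCOUVER)

# Subtracting two aware datetimes compares the underlying instants.
hours_apart = (dt_vancouver - dt_toronto).total_seconds() / 3600
print(hours_apart)  # 3.0
```

Having to look up and hard-code the offsets is exactly the kind of chore that a library with time zone names built in takes away.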
Most classification algorithms work best when the number of samples in each class is roughly the same, that is, when the data is balanced. But in real life, most data sets are imbalanced, which affects the learning phase and the subsequent predictions of machine learning algorithms. Fortunately, the imbalanced-learn library was created to address this problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. The next time you encounter an imbalanced data set, you can try this library.
pip install -U imbalanced-learn
# or
conda install -c conda-forge imbalanced-learn
Refer to the documentation for usage and examples.
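To give a feel for what rebalancing means, here is a minimal standard-library sketch of random oversampling, one of the simplest strategies: minority-class samples are duplicated at random until every class is as large as the biggest one. The function random_oversample and the toy data are assumptions made for illustration. This is not imbalanced-learn's API, which offers this strategy as RandomOverSampler alongside more elaborate ones such as SMOTE.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())  # size of the largest class
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: five samples of class 0 and a single sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Oversampling should only be applied to the training split, otherwise duplicated samples leak into the evaluation data.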
Cleaning up text data in NLP tasks often requires replacing keywords in sentences or extracting keywords from sentences. Normally these operations can be done with regular expressions, but that becomes cumbersome once the number of search terms runs into the thousands. Python's FlashText module, which is based on the FlashText algorithm, provides an apt alternative for such situations. The best part of FlashText is that its run time is independent of the number of search terms. You can learn more in the FlashText documentation.
$ pip install flashtext
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
keywords_found
# ['New York', 'Bay Area']
keyword_processor.add_keyword('New Delhi', 'NCR region')
new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
new_sentence
# 'I love New York and NCR region.'
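Why can the run time be independent of the number of search terms? The usual explanation is that the keywords are stored in a trie, so the sentence is walked once and each step is a dictionary lookup. Below is a simplified word-level sketch of that idea. The names build_trie, extract_keywords, and END are my own, and the real FlashText implementation works character by character and handles many more edge cases.

```python
END = object()  # sentinel key marking the end of a stored phrase


def build_trie(mapping):
    """Store {phrase: clean_name} pairs in a nested-dict trie keyed by word."""
    trie = {}
    for phrase, clean_name in mapping.items():
        node = trie
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[END] = clean_name
    return trie


def extract_keywords(sentence, trie):
    """Scan the sentence once, returning clean names of the longest matches."""
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    found, i = [], 0
    while i < len(words):
        node, j = trie, i
        match, match_end = None, i
        # Follow the trie as long as the next word continues a stored phrase.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j
        if match is None:
            i += 1
        else:
            found.append(match)
            i = match_end  # skip past the matched phrase
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
found = extract_keywords("I love Big Apple and Bay Area.", trie)
print(found)  # ['New York', 'Bay Area']
```

The work per sentence depends on the sentence length and the longest phrase, not on how many phrases the trie holds, whereas a naive approach would try every search term in turn.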
The name may sound strange, but fuzzywuzzy is a very useful library for string matching. It makes it easy to compute measures such as string comparison ratios and token ratios. It is also handy for matching records that are stored in different databases.
$ pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple Ratio
fuzz.ratio("this is a test", "this is a test!")
# 97

# Partial Ratio
fuzz.partial_ratio("this is a test", "this is a test!")
# 100
More interesting examples can be found in the GitHub repo.
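To see where numbers like 97 and 100 come from, here is a rough sketch of both ratios using only difflib from the standard library. The functions simple_ratio and partial_ratio are my own simplified versions. The partial ratio here brute-forces every window of the longer string, which is an assumption made for clarity, and fuzzywuzzy's scores can differ slightly depending on which backend it uses.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two whole strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long
    window of the longer string, on a 0-100 scale."""
    shorter, longer = sorted((a, b), key=len)
    n = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
        for i in range(len(longer) - n + 1)
    )
    return round(100 * best)


print(simple_ratio("this is a test", "this is a test!"))   # 97
print(partial_ratio("this is a test", "this is a test!"))  # 100
```

The simple ratio is penalised by the extra exclamation mark, while the partial ratio finds a window that matches the shorter string exactly.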
Time series analysis is one of the most common problems in machine learning. PyFlux is an open-source Python library built for time series problems. The library offers a good range of modern time series models, including but not limited to ARIMA, GARCH, and VAR models. In short, PyFlux provides a probabilistic approach to time series modeling and is worth trying.
pip install pyflux
Refer to the relevant documentation for examples of usage.
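For readers new to these models, here is a tiny standard-library sketch of the simplest autoregressive case, AR(1), which is the "AR" building block in ARIMA: each value is a fraction phi of the previous value plus noise. The functions simulate_ar1 and fit_ar1 are hypothetical helpers written for this illustration and have nothing to do with PyFlux's API. The fit is plain least squares, not the probabilistic estimation that PyFlux provides.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Generate x[t] = phi * x[t-1] + Gaussian noise, starting from 0."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, 1.0))
    return series


def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    prev, curr = series[:-1], series[1:]
    numerator = sum(p * c for p, c in zip(prev, curr))
    denominator = sum(p * p for p in prev)
    return numerator / denominator


series = simulate_ar1(phi=0.7, n=5000)
phi_hat = fit_ar1(series)
print(round(phi_hat, 2))  # should land close to the true value 0.7
```

A point estimate like this says nothing about uncertainty, which is where a probabilistic library earns its keep.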
Visualizing results is an important part of data science, and being able to see your results is a big advantage. ipyvolume is a Python library for visualizing 3D volumes and glyphs (for example, 3D scatter plots) in the Jupyter notebook with minimal configuration and effort. It is, however, currently in the pre-1.0 stage. A good analogy is that ipyvolume's volshow is to 3D arrays what matplotlib's imshow is to 2D arrays. You can read more about it in the ipyvolume documentation.
Using pip:
$ pip install ipyvolume
Using Conda/Anaconda:
$ conda install -c conda-forge ipyvolume
pip install dash==0.29.0  # The core dash backend
pip install dash-html-components==0.13.2  # HTML components
pip install dash-core-components==0.36.0  # Supercharged components
pip install dash-table==3.1.3  # Interactive DataTable component (new!)
A typical Dash example is a highly interactive graph driven by a drop-down menu: when the user selects a value in the drop-down, the application code dynamically exports data from Google Finance into a pandas DataFrame.
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. The gym library is a collection of test problems, also called environments, that you can use to train your reinforcement learning algorithms. These environments have a shared interface, which allows you to write general algorithms.
pip install gym
A typical first example runs the CartPole-v0 environment for 1000 steps, rendering the environment at each step.
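Here is a minimal sketch of such a loop, written against the shared reset/step interface so that it runs without Gym installed. CoinFlipEnv is a hypothetical stand-in environment and run_random_agent is my own helper name. The commented lines at the end show how the same loop would be pointed at CartPole-v0, assuming a Gym version that still ships that environment.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with a Gym-style reset/step interface."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial observation

    def sample_action(self):
        return self.rng.choice([0, 1])

    def step(self, action):
        self.t += 1
        observation = self.rng.choice([0, 1])
        reward = 1.0 if action == observation else 0.0
        done = self.t >= self.max_steps
        return observation, reward, done, {}


def run_random_agent(env, sample_action, steps=1000):
    """Take `steps` random actions, resetting whenever an episode ends."""
    env.reset()
    total_reward, episodes = 0.0, 0
    for _ in range(steps):
        result = env.step(sample_action())
        # Classic Gym returns (obs, reward, done, info); newer releases
        # return (obs, reward, terminated, truncated, info).
        done = result[2] if len(result) == 4 else (result[2] or result[3])
        total_reward += result[1]
        if done:
            episodes += 1
            env.reset()
    return total_reward, episodes


env = CoinFlipEnv(max_steps=10)
total_reward, episodes = run_random_agent(env, env.sample_action, steps=1000)
print(episodes)  # 100

# With Gym installed, the same loop would be used like this:
# import gym
# env = gym.make("CartPole-v0")
# run_random_agent(env, env.action_space.sample, steps=1000)
```

Because the loop only relies on reset and step, the agent code does not change when the environment does, which is the point of the shared interface. Rendering is left out here; with a real Gym environment it is a call to env.render() inside the loop.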
You can learn more about the available environments in the Gym documentation.
These are the Python libraries that I have found useful for data science, beyond the common ones such as numpy and pandas. If you know of other libraries that could be added to the list, please mention them in the comments below. Do not forget to try them out.
This article is original content from the Yunqi Community and may not be reproduced without permission.