Tuesday, December 31, 2013

Ceph in 2013, a year in review [feedly]


Shared via feedly // published on Ceph // visit site
Ceph in 2013, a year in review

Wow, what a ride this year has been! Ceph has come a long way, and it doesn't show any signs of slowing down. As the new year looms I thought it might be good to reflect a bit on 2013 and some of the notable achievements from the Ceph community.

I would group the accomplishments this year into three main categories: Community, Commercial, and Development. The year saw a great blend of all three, including three major releases, a user committee, passing the $13 million mark in funding, numerous integrations, and tons more. Read on for details!

Community

Arguably the most notable and important part of Ceph is our amazing community, and they have really stepped it up this year. We have seen contributions grow at a fever pitch, and the infrastructure surrounding them has had to keep pace. Our infrastructure has seen changes like the updated Ceph wiki, our new metrics dashboard, tons of documentation work, and the continuing work on ceph.com. All of this infrastructure takes quite a bit of time and effort to keep the lights on, and the community is starting to take a much larger role in ensuring that things continue to run smoothly. If you are interested in volunteering to help, make sure to drop a note to the community mailing list.

Speaking of the community mailing list, this year saw the birth of the Ceph User Committee to help manage the coordination and infrastructure of community-facing activities. This committee will primarily work to spread the word of Ceph through in-person meetings, increased web content, and infrastructure support. Anyone is welcome to join the Ceph User Committee, and people with all skill sets are needed. Whether you're a developer, designer, organizer, or just interested in having a beer with your fellow Ceph users, there are things you can help with so feel free to drop a note on the list to say hi.

While the community has really started to self organize, there are also a number of gatherings that the core team has spent a lot of effort setting up. Both the Ceph Day in-person events and the online Ceph Developer Summit have been well attended and continue to grow. While the Ceph Days are not new to this year, the program was expanded to start happening on a more regular basis. In the coming year we hope to expand it again into a monthly (or close to it) roadshow. Look for a stop in your neck of the woods! The Ceph Developer Summit, however, was new this year with the Dumpling release. The concentrated collaboration sessions to discuss all aspects of upcoming Ceph development have resulted in some amazing contributions from the community.

Another community effort started as Inktank holding "office hours" to help new Ceph users with questions. This effort expanded shortly thereafter into "Geek on Duty," adding several community volunteers to ensure that anyone can come ask questions about Ceph and know they won't be met with radio silence. Together these community efforts have really helped Ceph grow into a vibrant and exciting project.

Commercial

While the Open Source community has been amazing, we have also seen a huge uptake in the commercial community. Inktank has been leading the charge (almost 300% customer growth, steady feature growth and stability in the code, just over $13 million in funding, and a commercial offering), but it has definitely not been the only company working hard to build a business around Ceph. There is a wealth of amazing partner, customer, and commercial use springing up. This includes cloud services from Dreamhost, hardware partner work from folks like Penguin Computing, cloud work from SUSE, a couple of great university projects from Dell, and many others. It is amazing to watch the number of businesses that are adopting, and in turn helping to grow, Ceph.

Another sector that has been helping Ceph grow is the financial industry. Starting with the FinTech Innovation Lab in New York City, Inktank was able to put Ceph in front of banks, venture capitalists, and entrepreneurs heavily involved in the financial industry. The feedback and momentum from that program have helped Ceph mature and expand in ways that would otherwise have been very difficult.

Development

Beyond the intangibles of community growth there has also been a huge amount of progress in the code. We have seen a very concerted effort to integrate Ceph with all manner of other software. Whether it's specific applications like Xen or iSCSI support, or larger integration projects like OpenNebula or Ganeti, the number of things utilizing a Ceph backend is growing at an astonishing rate.

In addition to the number of projects that use Ceph, it is also much easier to get Ceph itself. Packaging, orchestration, and deployment have all experienced massive improvements this year. While Puppet, Chef, Ansible, Salt, and Juju concoctions have all come a long way, the award for "most improved" definitely goes to Ceph-Deploy. No matter what method of deployment you choose, the process of getting a Ceph cluster up and running on your hardware du jour has become much easier, and as a result we have seen rates of adoption growing over time.
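For illustration, a minimal ceph-deploy bootstrap looks roughly like the sketch below. The hostnames and disk argument are hypothetical, and the exact OSD subcommands vary by ceph-deploy release, so treat this as a sketch rather than a recipe:

$ ceph-deploy new mon1                      # write an initial ceph.conf and monitor keyring
$ ceph-deploy install mon1 osd1 osd2        # install Ceph packages on the target hosts
$ ceph-deploy mon create-initial            # bootstrap the monitor(s) and gather keys
$ ceph-deploy osd create osd1:sdb osd2:sdb  # prepare and activate one OSD per host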

The core Ceph project has also experienced quite a bit of growth this year. Sage moved the project to a three-month development cycle, which has proven to be an excellent driving force for project momentum. With this aggressive release cadence the project was able to polish three major releases this year (Cuttlefish, Dumpling, and Emperor), and work on Firefly is already underway for a February release next year!

Summary

2013 has certainly been an amazing year for the Ceph project, and we can't wait to see what 2014 will bring! With Firefly arriving in February, movement towards a Ceph Foundation, and expanded community participation efforts, it certainly promises to be exciting. If you would like to join us on the epic journey the community is always happy to welcome new passionate people. Drop a line on the mailing lists or reach out to the user committee and we'll be happy to get you started! Happy new year from the Ceph team, see you in 2014!

scuttlemonkey out


The Cloudcast #126 - 2013 in Review & 2014 Predictions [feedly]


Shared via feedly // published on The Cloudcast (.net) - Weekly Cloud Computing Podcast // visit site
The Cloudcast #126 - 2013 in Review & 2014 Predictions
Aaron and Brian wrap up 2013 with thoughts on the year in Cloud Computing and make a few predictions for 2014. Topics include AWS, OpenStack, PaaS, DevOps, SDN and any other buzz word we can think of. Music Credit: Nine Inch Nails (nin.com)


Project Karcygwins and Virtualised Storage Performance [feedly]


Shared via feedly // published on Latest blog entries // visit site
Project Karcygwins and Virtualised Storage Performance

Introduction

Over the last few years we have witnessed a revolution in terms of storage solutions. Devices capable of achieving millions of Input/Output Operations per Second (IOPS) are now available off-the-shelf. At the same time, Central Processing Unit (CPU) speeds remain largely constant. This means that the overhead of processing storage requests is actually affecting the delivered throughput. In a world of virtualisation, where extra processing is required in order to securely pass requests from virtual machines (VM) to storage domains, this overhead becomes more evident.

This is the first time that such overhead has been a concern. Until recently, the time spent within I/O devices was much longer than the time spent processing a request within CPUs. Kernel and driver developers were mainly worried about: (1) not blocking while waiting for devices to complete; and (2) sending requests optimised for specific device types. While the former was addressed by techniques such as Direct Memory Access (DMA), the latter was solved by elevator algorithms such as Completely Fair Queueing (CFQ).

Today, with the wide adoption of Solid-State Drives (SSDs) and the further development of low-latency storage solutions such as those built on top of PCI Express (PCIe) and Non-Volatile Memory (NVM) technologies, the main concern lies in not losing any unnecessary time in processing requests. Within the Xen Project community, development has already started on allowing scalable storage traffic from several VMs. Linux kernel maintainers and storage manufacturers are also working on similar issues. In the meantime, XenServer Engineering delivered Project Karcygwins, which allowed a better understanding of the current bottlenecks, when they are evident, and what can be done to overcome them.

Project Karcygwins

Karcygwins was originally intended as three separate projects (Karthes, Cygni and Twins). Due to their topics being closely related, they were merged. Those three projects were proposed based on subjects believed to be affecting virtualised storage throughput.

Project Karthes aimed at assessing and mitigating the cost of mapping (and unmapping) memory between domains. When a VM issues an I/O request, the storage driver domain (dom0 in XenServer) requires access to certain memory areas in the guest domain. After the request is served, these areas need to be released (or unmapped), which is also an expensive operation due to the flushes required in different cache tables. Karthes was proposed to investigate the cost of these operations, how they impact the delivered throughput and what can be done to mitigate them.

Project Cygni aimed at allowing requests larger than 44 KiB to be passed between a guest and a storage driver domain. Until recently, Xen's blkif protocol defined a fixed array of data segments per request. This array had room for 11 segments, each corresponding to a 4 KiB memory page (hence the 44 KiB). The protocol has since been updated to support indirect I/O operations, where the request's segments point to pages that themselves contain further segment descriptors. This change allowed for much larger requests at a small expense.

Project Twins aimed at evaluating the benefits of using two communication rings between dom0 and a VM. Currently, only one ring exists and it is used both for requests from the guest and responses from the back end. With two rings, requests and responses can each be stored in their own ring. This new strategy allows for more in-flight data and better use of caching.

Due to initial findings, the main focus of Karcygwins stayed on Project Karthes. The code allowing for requests larger than 44 KiB, however, was included throughout the measurements to address the goals proposed for Project Cygni. The idea of using split rings (Project Twins) was postponed and will be investigated at a later stage.

Visualising the Overhead

When a user installs a virtualisation platform, one of the first questions to be raised is: "what is the performance overhead?". When it comes to storage performance, a straightforward way to quantify this overhead is to measure I/O throughput on a bare metal Linux installation and repeat the measurement (on the same hardware) from a Linux VM. This can promptly be done with a generic tool like dd for a variety of block sizes. It is a simple test that does not cover concurrent workloads or greater IO depths.
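As a sketch, such a block-size sweep can be scripted as follows. The device path is hypothetical, and iflag=direct bypasses the page cache so the device itself is measured rather than memory:

$ for bs in 4k 16k 64k 256k 1M 4M; do
    echo -n "$bs: "
    dd if=/dev/sdb of=/dev/null bs=$bs count=1000 iflag=direct 2>&1 | tail -1
  done

dd prints its summary line (including MB/s) on stderr, hence the redirection before picking out the last line.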

karcyg-fig0.png

Looking at the plot above we can see that, on a 7.2k RPM SATA Western Digital Blue WD5000AAKX, read requests as large as 16 KiB can reach the maximum disk throughput at just over 120 MB/s (red line). When repeating the same test from a VM (green and blue lines), however, we see that the throughput for small requests is much lower. They eventually reach the same 120 MB/s mark, but only with larger requests.

The green line represents the data path where blkback is directly plugged into the back end storage. blkback is the kernel module that receives requests from the VM. While this is the fastest virtualisation path in the Xen world, it lacks certain software-level features such as thin provisioning, cloning, snapshotting and the capability of migrating guests without centralised storage.

The blue line represents the data path where requests go through tapdisk2. This is a user space application that runs in dom0 and implements the VHD format. It also has an NBD plugin for migration of guests without centralised storage, and it allows for thin provisioning, cloning and snapshotting of Virtual Disk Images (VDIs). Because requests traverse more components before reaching the disk, it is understandably slower.

Using Solid-State Drives and Fast RAID Arrays

The shape of the plot above is not the same for all types of disks, though. Modern disk setups can achieve considerably higher data rates before their throughput flattens out.

karcyg-fig1.png

Looking at the plot above, we can see a similar test executed from dom0 on two different back end types. The red line represents the throughput obtained from a RAID0 formed by two SSDs (Intel DC S3700). The blue line represents the throughput obtained from a RAID0 formed by two SAS disks (Seagate ST). Both arrays were measured independently and are connected to the host through a PERC H700 controller. While the Seagate SAS array achieves its maximum throughput at around 370 MB/s when using 48 KiB requests, the Intel SSD array continues to speed up even with requests as large as 4 MiB. Focusing on each array separately, it is possible to compare these dom0 measurements with measurements obtained from a VM. The plot below isolates the Seagate SAS array.

karcyg-fig2.png

Similar to what is observed on the measurements taken on a single Western Digital, the throughput measured from a VM is smaller than that of dom0 when requests are not big enough. In this case, the blkback data path (the pink line) allows the VM to reach the same throughput offered by the array (370 MB/s) with requests larger than 116 KiB. The other data paths (orange, cyan and brown lines) represent user space alternatives that reach different bottlenecks and even with large requests cannot match the throughput measured from dom0.

It is interesting to observe that some user space implementations vary considerably in terms of performance. When using qdisk as the back end along with the blkfront driver from Linux kernel 3.11.0 (the orange line), the throughput is higher for requests of sizes such as 256 KiB (compared to other user space alternatives -- the blkback data path remains faster). The main difference in this particular setup is the support for persistent grants. This technique, implemented in 3.11.0, reuses memory grants and drastically reduces the map and unmap operations. It requires, however, an additional copy operation within the guest. The trade-off may have different implications when varying factors such as hardware architecture and workload types. More on that in the next section.

karcyg-fig3.png

When repeating these measurements on the Intel SSD array, a new issue came to light. Because the array delivers higher throughput with no signs of abating as larger requests are issued, none of the virtualisation technologies is capable of matching the throughput measured from dom0. While this behaviour will probably differ with other workloads, this is what has been observed when using a single I/O thread with the queue depth set to one. In a nutshell, 2 MiB read requests from dom0 achieve 900 MB/s worth of throughput, while a similar measurement from one VM will only reach 300 MB/s when using user space back ends. This is a pathological example chosen for this particular hardware architecture to show how bad things can get.

Understanding the Overhead

In order to understand why the overhead is so evident in some cases, it is necessary to take a step back. The measurements taken on slower disks show that all virtualisation technologies are somewhat slower than what is observed in dom0. On such disks, this difference disappears as requests grow in size. What happens at that point is that the actual disk becomes "maxed out" and cannot respond faster no matter the request size. At the same time, much of the work done at the virtualisation layers does not get slower in proportion to the amount of data associated with a request. For example, notifications between domains are unlikely to take longer simply because requests are bigger. This is exactly why there is no visible overhead with large enough requests on certain disks.

However, the question remains: what is consuming CPU time and causing such a visible overhead in the example previously presented? There are mainly two techniques that can be used to answer that question: profiling and tracing. Profiling collects instruction pointer samples at regular intervals (every so many CPU events); analysing millions of such samples reveals the hot code paths where time is being spent. Tracing, on the other hand, measures the exact time passed between two events.
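As a generic illustration of the former (not necessarily the tooling used in this project), instruction pointer sampling on Linux can be done with perf; the device path below is hypothetical:

$ perf record -F 997 -- dd if=/dev/xvdb of=/dev/null bs=64k count=10000 iflag=direct
$ perf report    # ranks functions by how many instruction pointer samples landed in them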

For this particular analysis, the tracing technique and the blkback data path were chosen. To measure the amount of time spent between events, the code was modified and several RDTSC instructions were inserted. These instructions read the Time Stamp Counter (TSC) and are relatively cheap while providing very accurate data. On modern hardware, TSCs are constant and consistent across the cores of a host. This means that measurements from different domains (i.e. dom0 and guests) can be matched to obtain the time passed between events, for example from blkfront kicking blkback. The diagram below shows where trace points have been inserted.

karcyg-fig4.png

In order to gather meaningful results, 100 requests have been issued in succession. Domains have been pinned to the same NUMA node in the host and turbo capabilities were disabled. The TSC readings were collected for each request and analysed both individually and as an average. The individual analysis revealed interesting findings such as a "warm up" period where the first requests are always slower. This was attributed to caching and scheduling effects. It also showed that some requests were randomly faster than others in certain parts of the path. This was attributed to CPU affinity. For the average analysis, the 20 fastest and slowest requests were initially discarded. This produced more stable and reproducible results. The plots below show these results.
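As an aside, the trimmed average described above is straightforward to reproduce. A sketch, assuming a hypothetical file latencies.txt with one measurement per line for the 100 requests:

$ sort -n latencies.txt | head -n 80 | tail -n 60 | \
    awk '{ sum += $1 } END { print sum / NR }'

Sorting and keeping lines 21 through 80 drops the 20 smallest and 20 largest values before averaging the middle 60.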

karcyg-fig5.png

karcyg-fig6.png

Without persistent grants, the cost of mapping and unmapping memory across domains is clearly a significant factor as requests grow in size. With persistent grants, the extra copy on the front end adds up and results in a slower overall path. Roger Pau Monne, however, showed that persistent grants can improve aggregate throughput from multiple VMs, since they reduce contention on the grant tables. Matt Wilson, following on from discussions at the Xen Developer Summit 2013, produced patches that should also ease grant table contention.

Conclusions and Next Steps

In summary, Project Karcygwins yielded an understanding of several key elements of storage performance for both Xen and XenServer:

  • The time spent in processing requests (in CPU) definitely matters as disks get faster
  • Throughput is visibly affected for single-threaded I/O on low-latency storage
  • Kernel-only data paths can be significantly faster
  • The cost of mapping (and unmapping) grants is the most significant bottleneck at this time

It also raised awareness of these issues within the Linux and Xen Project communities, with the results shared in a series of presentations and discussions.

Next, new research projects are scheduled (or already underway) to:

  • Look into new ideas for low-latency virtualised storage
  • Investigate bottlenecks and alternatives for aggregate workloads
  • Reduce the overall CPU utilisation of processing requests in user space

Have a happy 2014 and thanks for reading!


Citrix Partner Hubbub: "Training update featuring XenDesktop 7" [feedly]

Citrix Partner Hubbub: "Training update featuring XenDesktop 7"
http://feedly.com/e/EEL96cHE

Learn how to deploy and manage app and desktop solutions with Citrix XenDesktop 7 by taking two updated five-day, instructor-led courses, available at select Citrix Authorized Learning Centers (CALCs). The course enhancements include changes to the courseware manual, updates to student resources and…

Read More

Yes, It's that Time of Year: 2014 Predictions [feedly]

Yes, It's that Time of Year: 2014 Predictions
http://feedly.com/e/kmCeUFRX

It's traditional to close out the year by making predictions for the year ahead, and I'm as prone as anyone else to prognosticate and risk being proved wrong 12 months later.

With that disclaimer in mind, here are my top five predictions for DevOps, cloud and IT in 2014:

The Science Behind the 2013 Puppet Labs DevOps Survey Of Practice [feedly]

The Science Behind the 2013 Puppet Labs DevOps Survey Of Practice
http://feedly.com/e/hS3VWpwJ

(This is a post written by Gene Kim and Jez Humble)

Last year, we both had the privilege of working with Puppet Labs to develop the 2012 DevOps Survey Of Practice. It was especially exciting for us, because we were able to benchmark the performance of over 4,000 IT organizations and gain an understanding of what behaviors result in their incredible performance. This continues the research into high performing IT organizations that Gene has been doing since 1999.

In this blog post, we will discuss the research hypotheses that we're setting out to test in the 2013 DevOps Survey Of Practice, explain the mechanics of how these types of cross-population studies actually work (so you can help this research effort or even start your own), then describe the key findings that came out of the 2012 study.

But first off, if you're even remotely interested in DevOps, go take the 2013 Puppet Labs DevOps Survey here! The survey closes on January 15, 2014, so hurry! It only takes about ten minutes.

2013 DevOps Survey Research Goals

Last year's study (which we'll describe in more detail below) found that high performing organizations that were employing DevOps practices were massively outperforming their peers: they were doing 30x more frequent code deploys, and had deployment lead times measured in minutes or hours (versus lower performers, who required weeks, months or quarters to complete their deployments).

The high performers also had far better deployment outcomes: their changes and deployments had twice the change success rates, and when the changes failed, they could restore service 12x faster.

The goal of the 2013 study is to gain a better understanding of exactly what practices are required to achieve this high performance. Our hypothesis is that the following are required, and we'll be looking to independently evaluate the effect of each of these practices on performance:

  • small teams with high trust that span the entire value stream: Dev, QA, IT Operations and Infosec
  • shared goals and shared pain that span the entire value stream
  • small development batch sizes
  • presence of continuous, automated integration and testing
  • emphasis on creating a culture of learning, experimentation and innovation
  • emphasis on creating resilient systems

We are also testing two other hypotheses that one of us (Gene) is especially excited about, because it's something he's wanted to do ever since 1999!

Lead time: In manufacturing, lead time is the time required to turn raw materials into finished goods. There is a deeply held belief in the Lean community that lead time is the single best predictor of quality, customer satisfaction and employee happiness. We are testing this hypothesis for the DevOps value stream in the 2013 survey instrument.

Organizational performance: Last year, we confirmed that DevOps practices correlate with substantially improved IT performance (e.g., deploy frequencies, lead times, change success rates, MTTR). This year, we will be testing whether improved IT performance correlates with improved business performance. In this year's study, we've inserted three questions that are known to correlate with organizational performance, which in turn is known to correlate with business performance (e.g., competitiveness in the marketplace, return on assets, etc.).

Our dream headline would be, "high performing organizations not only do 30x more frequent code deployments than their peers, but they also outperform the S&P 500 by 3x as measured by shareholder return and return on assets."

Obviously, there are many other variables that contribute to business performance besides Dev and Ops performance (e.g., profitability, market segment, market share, etc.). However, in our minds, the reliance upon IT performance is obvious: as Chris Little said, "Every organization is an IT business, regardless of what business they think they're in."

When IT does poorly, the business will do poorly. And when IT helps the organization win, those organizations will out-perform their competitors in the marketplace.

(This hypothesis forms the basis of the hedge fund that Erik wants to create in the last chapter of "The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win", where they would make long or short bets, based on the known operating characteristics of the IT organization.)

The Theory Behind Cross-Population Studies and Survey Instruments

Like last year, this year's DevOps survey is a cross-population study, designed to explore the link between organizational performance and organizational practices and cultural norms.

What is a cross-population study? It's a statistical research technique designed to uncover what factors (e.g., practices, cultural norms, etc.) correlate with outcomes (e.g., IT performance). Cross-population studies are often used in medical research to answer questions like, "is cigarette smoking a significant factor in early mortality?"

Properly designed cross-population studies are considered a much more rigorous approach to testing which practices actually work than, say, interviewing people about what they think worked, ROI stories from vendors, or collecting "known best practices."

When doing survey design, we might state our hypotheses in the following form: "we believe that IT organizations which have high trust have higher IT performance." In other words, the higher the trust levels in the IT organization, the higher the performance.

We then put this question in the survey instrument and analyze the results. If we were to plot the results on a graph, we would put the dependent variable (i.e., performance) on the Y-axis, and the independent variable (i.e., presence of high trust) on the X-axis.

We would then test to see if there is a correlation between the two. Shown below is an example of what it looks like when the two variables have low or no correlation, and one that has a significant positive correlation.

If we were to find a significant correlation, such as displayed on the right, we could then assert that "the higher your organization's trust levels, in general, the higher your IT performance."

(Graph adapted from Wikipedia entry on Correlation and Dependence.)
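To make the mechanics concrete, the correlation coefficient behind such a plot can be computed directly. A minimal sketch in awk, assuming a hypothetical file survey.tsv with one organization per line, a trust score in column 1 and an IT performance score in column 2:

$ awk '{ n++; sx += $1; sy += $2; sxx += $1*$1; syy += $2*$2; sxy += $1*$2 }
       END { r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy));
             print "Pearson r =", r }' survey.tsv

Values of r near 0 correspond to the plot with little or no correlation; values near 1 to the one with a significant positive correlation.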

The 2012 DevOps Survey

In this section, we will describe the key findings that came out of the 2012 DevOps Survey, as well as give a brief discussion of the research hypotheses that went into the survey design.

In the DevOps community, we have long asserted that certain practices enable organizations to simultaneously deliver a fast flow of features to market while providing world-class stability, reliability and security.

We designed the survey to validate this, and tested a series of technical practices to determine which of them correlated with high performance.

The survey ran for 30 days, and we had 4,039 completed responses. (This is an astonishingly high number, by the way. When Kurt Milne and Gene Kim did similar studies in 2006, each study typically required $200K to do the survey design, gather responses from a couple hundred people, and then perform the analysis.)

You can find the slides that Gene Kim, Jez Humble and James Turnbull presented at the 2013 Velocity Conference here, and the full Puppet Labs infographics and results here.


The first surprise was how much the high performing organizations were outperforming their non-high-performing peers:

  • Agility metrics
    • 30x more frequent code deployments
    • 8,000x faster lead time than their peers
  • Reliability metrics
    • 2x the change success rate
    • 12x faster MTTR

In other words, they were more agile: they were deploying code 30x more frequently, and the lead time required to go from "code committed" to "successfully running in production" was completed 8,000x faster — high performers had lead times measured in minutes or hours, while lower performers had lead times measured in weeks, months or even quarters.

Not only were the high performers doing more work, but they had far better outcomes: when the high performers deployed changes and code, they were twice as likely to be completed successfully (i.e., without causing a production outage or service impairment), and when the change failed and resulted in an incident, the time required to resolve the incident was 12x faster.

We were astonished and delighted with this finding: it showed not only that it is possible to break the core, chronic conflict, but also seemed to confirm that, just as in manufacturing, agility and reliability go hand in hand. In other words, lead time correlates with both agility and reliability.

(I will post more on my personal interpretations of the 2012 DevOps Survey Of Practice in a future post.)

Conclusion

We hope this gives you a good idea of why we've worked so hard on the 2012 and 2013 DevOps Survey, as well as how to conduct your own cross-population studies. Please let us know if you have any questions or if there's anything we can do for you.

And of course, help us understand which DevOps and Continuous Delivery practices actually work by taking 10 minutes to participate in the 2013 Puppet Labs DevOps Survey here by January 15, 2014!

Thank you! –Gene Kim and Jez Humble

The post The Science Behind the 2013 Puppet Labs DevOps Survey Of Practice appeared first on IT Revolution.

Move Virtual Desktops to Cloud [feedly]


Shared via feedly // published on CitrixTV RSS Feed // visit site
Move Virtual Desktops to Cloud
Citrix and Cisco expand desktop virtualization solutions to bring you additional choice and flexibility.


2013 : A Year to Remember [feedly]


Shared via feedly // published on blog.xen.org // visit site
2013 : A Year to Remember

2013 has been a year of changes for the Xen Community. I wanted to share my five personal highlights of the year. But before I do this, I wanted to thank everyone who contributed to the Xen Project in 2013 and the years before. Open Source is about bringing together technology and people: without your contributions, the Xen Project would not be a thriving and growing open source project.

Xen Project joins Linux Foundation

The biggest community story of 2013 was the move of Xen to the Linux Foundation in April. For me, this journey started in December 2011, when I won in-principle agreement from Citrix to find a neutral, non-profit home for Xen. This took longer than I hoped: even when the decision was made to become a Linux Foundation Collaborative Project, it took many months of hard work to get everything off the ground. Was it worth it? The answer is a definite yes: besides all the buzz and media interest in April 2013, interest in and usage of Xen increased in the remainder of 2013. The Xen Project became a first-class citizen within the open source community, which it was not really before.

Wiki Page Visits

Monthly visits by users to the Xen Project wiki doubled after moving Xen to the Linux Foundation.

Of course, the ripples of this change will be felt for many years to come. Some of them are covered in the other four highlights of 2013. I personally believe that the Xen Project Advisory Board (which is made up of 14 major companies that fund the project) will have a positive impact on the community going forward. This will become apparent next year, when initiatives that are funded by the Advisory Board – such as an independently hosted test infrastructure, more coordinated marketing and PR, growing the Xen talent pool and many others – will kick into gear.

Developer Community Growth

Besides growth in website visits, we have also seen a marked increase in developer list conversations in 2013.

In 2013, we also saw significant growth of our developer community. This shows in a number of different metrics, such as conversations on the developer list, the number of contributors to the project (an increase of 11% compared to 2012), and the number of patches submitted. It also means that in 2014 we will have to look at some challenges associated with this growth: developer list traffic in November 2013, for example, was beyond 4,500 messages (compared to 2,700 in January 2013), which is too much for many of our developers.

Shorter Release Cycles

Another notable change, started in late 2012, was a shortening of the release cadence for the Xen Hypervisor and a better approach to release planning. I wanted to thank George Dunlap – our Xen release coordinator – for driving these changes. Let's look at release times since Xen 4.0: it took 11 months to develop Xen 4.1, 18 months for Xen 4.2, 10 months for Xen 4.3 and 6 or 7 months for Xen 4.4 (planned to release in February 2014). The goal is to release Xen twice a year, while increasing the number of features that go into each release. If you look at the list of planned Xen 4.4 features, we are well on track to achieving this goal.

Innovation, Innovation, Innovation

In 2013 the Xen Project started to innovate in many different technology areas. This is reflected in the many presentations that were given at the Xen Project Developer Summit. Besides the usual improvements to performance and scalability, I wanted to pick out some personal highlights.

  • Xen 4.3 saw some real advances in cloud security: very timely, given that cloud and internet security was a very hot topic in 2013. Next year, we will look at making many of these features easier to use and integrate them better into Linux distros.
  • Another notable change is PVH guest support (coming to Xen 4.4 for Linux and FreeBSD). PVH combines the best elements of HVM and PV into a mode which allows Xen to take advantage of many of the hardware virtualization features without needing to emulate an entire physical server. This will allow for increased efficiency, as well as reduced footprint in Linux, FreeBSD and other operating systems. A special thank you to Mukesh Rathor from Oracle, who developed this groundbreaking technology.
  • Of course, we also have to mention Xen on ARM support, which first appeared in Xen 4.3 and will be hardened for Xen 4.4. This support is helping to expand Xen Hypervisor usage into new market segments. In October 2013, we saw the first prototypes of Android running on top of Xen at remarkable speed. But more on this later. A special thank you to Stefano Stabellini and Ian Campbell for driving this effort.
  • Support for VMware guests: just before Christmas, Verizon posted a patch series for review that will allow users to run Linux, Windows and other guest images that were built for VMware products, unchanged, within Xen. These features will not make it into Xen 4.4, but should be available later in 2014.
  • Intel and Samsung showed groundbreaking work in GPU virtualization at the last Xen Project Developer Summit, which has the potential to extend Xen into new market segments.

Of course, not all of these innovations will make it into Xen 4.4: some will appear in Xen 4.5 or later.

New Frontiers of Virtualization

One of the things that surprised me personally is that we are seeing the Xen Hypervisor adopted in many new (and unexpected) market segments. Examples are: automotive and in-vehicle infotainment, mobile use-cases, Network Function Virtualization, set-top boxes and other embedded applications. This will, without doubt, be a theme of 2014. It may seem counterintuitive, but I believe that expanding the use of Xen to new frontiers will create benefits and opportunities in server virtualization and cloud computing. It also proves that Xen is an extremely flexible platform that can be customized for many different applications.

In any case, thank you all for making 2013 an exceptional year!

And a Happy New Year to all of you!


Real size of a Ceph RBD image [feedly]


Shared via feedly // published on Ceph // visit site
Real size of a Ceph RBD image

RBD images are thin-provisioned, thus you don't always know the real size of an image. Moreover, Ceph doesn't provide any simple facility to check the real size of an image. This blog post took its inspiration from the Ceph mailing list.

Create an image:

$ rbd create -s 1024 toto

The magic formula using block differential:

$ rbd diff rbd/toto | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
0 MB

Further testing:

$ rbd map toto

$ rbd showmapped
id pool image snap device
2  rbd  toto  -    /dev/rbd1

$ dd if=/dev/zero of=/dev/rbd1 bs=1M count=10 oflag=direct
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 6.91826 s, 1.5 MB/s

So we wrote 10M; we should get 10 MB in the output :).

$ rbd diff rbd/toto | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
10 MB

Thanks to Olivier Bonvalet for the AWK command.


Benchmarking Ceph erasure code plugins [feedly]


Shared via feedly // published on Ceph // visit site
Benchmarking Ceph erasure code plugins

The erasure code implementation in Ceph relies on the jerasure library. It is packaged into a plugin that is dynamically loaded by erasure coded pools.
The ceph_erasure_code_benchmark tool is implemented to help benchmark the competing erasure code plugin implementations and to find the best parameters for a given plugin. It shows the jerasure technique cauchy_good with a packet size of 3072 to be the most efficient on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz when compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5). The test was done assuming each object is spread over six OSDs and two extra OSDs are used for parity (K=6 and M=2).

  • Encoding: 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 10GB/s
  • Recovering the loss of two OSDs: 3.2GB/s

The processing is done on the primary OSDs and is therefore distributed across the Ceph cluster. Encoding and decoding are an order of magnitude faster than typical storage hardware throughput.

Ceph is compiled from sources with:

./autogen.sh ; ./configure ; make  

which compiles the ceph_erasure_code_benchmark tool.
The results of the erasure code bench script (which relies on ceph_erasure_code_benchmark) were produced on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz and compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5).

CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark \
PLUGIN_DIRECTORY=src/.libs \
  qa/workunits/erasure-code/bench.sh

They can be interpreted as follows:

seconds   KB       plugin   k m work.  iter.  size     eras.
0.612510  1048576  example  2 1 encode 1024   1048576  0
0.317254  1048576  example  2 1 decode 1024   1048576  1

The first line used the example plugin to encode 1048576 KB (1GB) in 0.612510 seconds, which is ~1.7GB/s. The measure was done by iterating 1024 times over the encoding of a 1048576-byte (1MB) buffer. The second line used the example plugin to decode 1048576 KB (1GB) when 1 chunk has been erased (last column) in 0.317254 seconds, which is ~3.1GB/s. The measure was done by iterating 1024 times over the decoding of a 1048576-byte (1MB) buffer that was encoded once.
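The GB/s figures can be derived mechanically from such output. A sketch with awk, assuming the rows above were saved to a hypothetical bench.out file (seconds in column 1, KB in column 2, workload in column 6):

$ awk 'NR > 1 { printf "%s %s: %.2f GB/s\n", $3, $6, ($2 / 1024 / 1024) / $1 }' bench.out
example encode: 1.63 GB/s
example decode: 3.15 GB/s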
When using the jerasure Ceph plugin and the Reed-Solomon technique to sustain the loss of two OSDs (i.e. K=6 and M=2), the results are:

seconds   KB       plugin    k m work.  iter.  size     eras.
0.103921  1048576  jerasure  6 2 decode 1024   1048576  1
0.277644  1048576  jerasure  6 2 decode 1024   1048576  2
0.238322  1048576  jerasure  6 2 encode 1024   1048576  0

The first line shows that if 1 OSD is lost (erased), it can be recovered at a rate of 10GB/s (1/0.103921). If 2 OSDs are lost, recovering both of them can be done at a rate of 3.6GB/s (1/0.277644). Encoding can be done at a rate of 4.2GB/s (1/0.238322).
The corresponding jerasure technique is cauchy_good with a packet size of 3072:

--parameter erasure-code-packetsize=3072 \
--parameter erasure-code-technique=cauchy_good

For profiling, a single call was examined, and the number of iterations was reduced from 1024 to 10 because valgrind makes the run significantly slower:

valgrind --tool=callgrind src/ceph_erasure_code_benchmark \
  --plugin jerasure \
  --workload encode \
  --iterations 10 \
  --size 1048576 \
  --parameter erasure-code-k=6 \
  --parameter erasure-code-m=2 \
  --parameter erasure-code-directory=.libs \
  --parameter erasure-code-technique=cauchy_good \
  --parameter erasure-code-packetsize=3072

It shows that 97% of the time is spent in table lookups.


Profiling CPU usage of a ceph command (callgrind) [feedly]


Shared via feedly // published on Ceph // visit site
Profiling CPU usage of a ceph command (callgrind)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'  

The crushtool test mode is used to profile the crush implementation with:

valgrind --tool=callgrind \
         --callgrind-out-file=crush.callgrind \
         src/crushtool \
         -i src/test/cli/crushtool/one-hundered-devices.crushmap \
         --test --show-bad-mappings

The resulting crush.callgrind file can then be analyzed with

kcachegrind crush.callgrind  


Any Ceph command can be profiled in this way.


RBD image bigger than your Ceph cluster [feedly]


Shared via feedly // published on Ceph // visit site
RBD image bigger than your Ceph cluster

Some experiments with gigantic overprovisioned RBD images.

First, create a large image, let's say 1 PB:

$ rbd create --size 1073741824 huge
$ rbd info huge
rbd image 'huge':
  size 1024 TB in 268435456 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.8a14.2ae8944a
  format: 1

Problems arise as soon as you attempt to delete the image. Try to remove it:

$ time rbd rm huge
Removing image: 100% complete...done.

real    1944m40.850s
user    475m37.192s
sys     475m51.184s

Keeping an index of every existing object would be terribly inefficient, since maintaining it would kill performance. The major downside of not having one is that, when shrinking or deleting an image, RBD must look for every object that could exist above the shrink size.

In dumpling or later RBD can do this in parallel controlled by --rbd-concurrent-management-ops (undocumented option), which defaults to 10.
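For example, the option can be supplied directly on the command line (a sketch, assuming a dumpling or later client):

$ rbd rm huge --rbd-concurrent-management-ops 20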


You still have another option: if you've never written to the image, you can just delete the image's header object. You can find it by listing the objects in the pool; something like rados -p <your-pool> ls | grep <block_name_prefix> will do the trick. After this, removing the RBD image will take a second.

$ rados -p rbd ls
huge.rbd
rbd_directory

$ rados -p rbd rm huge.rbd
$ time rbd rm huge
2013-12-10 09:35:44.168695 7f9c4a87d780 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
Removing image: 100% complete...done.

real    0m0.024s
user    0m0.008s
sys     0m0.008s


Profiling CPU usage of a ceph command (gperftools) [feedly]


Shared via feedly // published on Ceph // visit site
Profiling CPU usage of a ceph command (gperftools)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'  

The crushtool test mode is used to profile the crush implementation with:

LD_PRELOAD=/usr/lib/libprofiler.so.0 \
CPUPROFILE=crush.prof src/crushtool \
  -i src/test/cli/crushtool/one-hundered-devices.crushmap \
  --test --show-bad-mappings

as instructed in the cpu profiler documentation. The resulting crush.prof file can then be analyzed with

google-pprof --ignore=vector --focus=bucket_choose \
  --gv ./src/crushtool crush.prof

and displays the result as a graphical call graph.

Any Ceph command can be profiled in this way.


New Ceph Wiki is Live [feedly]


Shared via feedly // published on Ceph // visit site
New Ceph Wiki is Live

For those who have used the wiki in recent history, you may have noticed that it has been sitting in a read-only state for a little bit around the holidays. Today the wiki is back in action and better than ever! While we are still using MindTouch, we have moved to the SaaS version, which allows us to offload the physical infrastructure and gain a few nice features as well.

Logging In

While the new version is quite nice to look at, there are a few things that I would like to point out. First, when you log in you may notice that it redirects you to wikilogin.ceph.com; this is normal. We are running our own custom OAuth plugin that will allow you to continue using your Google credentials as before. The first time you log in it will ask you to choose a new user name; you can plug in your preferred user name or a new one, it doesn't matter. The previous content and edits have been archived and are not assigned to any existing users. You should only have to do this once. If you have problems please contact community@inktank.com or ping scuttlemonkey on IRC and I'll make sure to get you squared away.

Content and Functionality

With respect to the content and functionality there are a few things worth pointing out. If you take a look at some of the guide content, there are a few different types (tabs) that you will see: "Guide Content," "How-To," and "Reference." These are pre-defined page templates that help to classify and aggregate content in the appropriate places for easy consumption. Every user should be able to create pages and use the template that best suits the content. If you have questions, let me know.

Some of the new content features we have been discussing are slowly being added and will continue to be tweaked. The basics for the Chum Bucket have started, but the sorting and tagging have not yet been added. Look for these in a future update.

Ultimately there is a lot of content that could still be added and this is where we need help from the community! If you are interested in helping out feel free to dive right in or ask the community team where you can be of the greatest help.

Getting Acquainted with MindTouch

While many projects choose MediaWiki, we have decided to go with MindTouch for a while to see if things like their advanced knowledge base, polished UI, and automated content management tools might be a bit nicer in the long run. We realize that this may be a bit of a learning curve for some people and as such are providing a few resources if you wish to explore this new tool:

Documentation – MindTouch documentation and support resources can be found at https://help.mindtouch.us.

Training Videos – MindTouch training plans and Self-Training videos can be found at https://help.mindtouch.us/Support/Training.

Getting Started with MindTouch – Use these FAQs to get started with MindTouch. They cover a wide variety of topics, and MindTouch is always improving its material.
https://help.mindtouch.us/01MindTouch_TCS/User_Guide/001_Getting_Started

As always, if you have questions, concerns, or anything for the good of the cause, feel free to contact the community team or scuttlemonkey on IRC.

scuttlemonkey out


OpenStack, Ceph RBD and QoS [feedly]


Shared via feedly // published on Ceph // visit site
OpenStack, Ceph RBD and QoS

The Havana cycle introduced a QoS feature on both Cinder and Nova. Quick tour of this excellent implementation.

Both QEMU and KVM natively support rate limiting. This is implemented through libvirt and exposed as an extra XML element called iotune within the <disk> section.

QoS options are:

  • total_bytes_sec: the total allowed bandwidth for the guest per second
  • read_bytes_sec: sequential read limitation
  • write_bytes_sec: sequential write limitation
  • total_iops_sec: the total allowed IOPS for the guest per second
  • read_iops_sec: random read limitation
  • write_iops_sec: random write limitation

It is wonderful that OpenStack implemented such an (easy?) feature in both Nova and Cinder. It is also a sign that OpenStack is getting more feature-rich and complete in the existing core projects. Having such a facility is extremely useful for several reasons. First of all, not all storage backends support QoS. Ceph, for instance, doesn't have any built-in QoS feature whatsoever. Moreover, the limitation is applied directly at the hypervisor layer, so your storage solution doesn't even need to have such a feature. Another good point is that, from an operator's perspective, it is quite nice to be able to offer different levels of service. Operators can now offer different types of volumes based on a certain QoS, and customers will be charged accordingly.
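The same libvirt knob can also be driven directly with virsh, outside of OpenStack. A quick sketch against a hypothetical guest and device:

$ virsh blkdeviotune myguest vdc --read-iops-sec 2000 --write-iops-sec 1000
$ virsh blkdeviotune myguest vdc    # without limit arguments, prints the current settings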


Test it!

First create the QoS in Cinder:

$ cinder qos-create high-iops consumer="front-end" read_iops_sec=2000 write_iops_sec=1000
+----------+----------------------------------------------------------+
| Property | Value                                                    |
+----------+----------------------------------------------------------+
| consumer | front-end                                                |
| id       | c38d72f8-f4a4-4999-8acd-a17f34b040cb                     |
| name     | high-iops                                                |
| specs    | {u'write_iops_sec': u'1000', u'read_iops_sec': u'2000'}  |
+----------+----------------------------------------------------------+

Create a new volume type:

$ cinder type-create high-iops
+--------------------------------------+-----------+
| ID                                   | Name      |
+--------------------------------------+-----------+
| 9c746ca5-eff8-40fe-9a96-1cdef7173bd0 | high-iops |
+--------------------------------------+-----------+

Then associate the volume type with the QoS:

$ cinder qos-associate c38d72f8-f4a4-4999-8acd-a17f34b040cb 9c746ca5-eff8-40fe-9a96-1cdef7173bd0

$ cinder create --display-name slow --volume-type slow 1
+---------------------+--------------------------------------+
| Property            | Value                                |
+---------------------+--------------------------------------+
| attachments         | []                                   |
| availability_zone   | nova                                 |
| bootable            | false                                |
| created_at          | 2013-12-02T12:59:33.177875           |
| display_description | None                                 |
| display_name        | high-iop                             |
| id                  | 743549c1-c7a3-4e86-8e99-b51df4cf7cdc |
| metadata            | {}                                   |
| size                | 1                                    |
| snapshot_id         | None                                 |
| source_volid        | None                                 |
| status              | creating                             |
| volume_type         | high-iop                             |
+---------------------+--------------------------------------+

Eventually attach the volume to an instance:

$ nova volume-attach cirrOS 743549c1-c7a3-4e86-8e99-b51df4cf7cdc /dev/vdc
+----------+--------------------------------------+
| Property | Value                                |
+----------+--------------------------------------+
| device   | /dev/vdc                             |
| serverId | 7fff1d37-efc4-46b9-8681-3e6b1086c453 |
| id       | 743549c1-c7a3-4e86-8e99-b51df4cf7cdc |
| volumeId | 743549c1-c7a3-4e86-8e99-b51df4cf7cdc |
+----------+--------------------------------------+

Expected result:

While attaching the device you should see the following XML being generated in the nova-volume debug log. Dumping the domain XML with virsh works as well.

2013-12-11 14:12:05.874 DEBUG nova.virt.libvirt.config [req-232cf5eb-a79b-42d5-a183-2f4758e8d8eb admin admin] Generated XML
<disk type="network" device="disk">
  <driver name="qemu" type="raw" cache="none"/>
  <source protocol="rbd" name="volumes/volume-743549c1-c7a3-4e86-8e99-b51df4cf7cdc">
    <host name="192.168.251.100" port="6790"/>
  </source>
  <auth username="volumes">
    <secret type="ceph" uuid="95c98032-ad65-5db8-f5d3-5bd09cd563ef"/>
  </auth>
  <target bus="virtio" dev="vdc"/>
  <serial>2e589abc-a008-4433-89ae-1bb142b139e3</serial>
  <iotune>
    <read_iops_sec>2000</read_iops_sec>
    <write_iops_sec>1000</write_iops_sec>
  </iotune>
</disk>

Important note: rate-limiting is currently broken in Havana; however, the bug has already been reported and a fix submitted/accepted. The same patch has also been proposed as a potential backport for Havana.


Basho in 2013 [feedly]


Shared via feedly // published on Basho // visit site
Basho in 2013

December 30, 2013

2013 was a huge year for Basho Technologies and before we dive into 2014, we thought we'd take a moment to reflect on how far we've come.

Case Studies

2013 was the year of the Riak User. We love hearing about all the amazing ways companies across various industries are using Riak, and this year we were able to share dozens of exciting case studies.

For even more Riak Users, check out the Users Page.

Releases

We released Riak 1.3, Riak 1.4, and the Technical Preview of Riak 2.0 this year. These releases added such features as Active Anti-Entropy, revamped Riak Control, queryability improvements, Riak Data Types, and much more. Be on the lookout for the general release of Riak 2.0 early next year.

This year we also open sourced Riak CS with the 1.3 release and released Riak CS 1.4. These releases added multi-part upload, Riak CS Control, and integration with OpenStack.

RICON

This year, we expanded RICON, Basho's distributed systems conference, into both RICON East and RICON West. Both were sold-out conferences featuring speakers from bitly, Comcast, Google, Netflix, Salesforce, State Farm Insurance, The Weather Company, Turner Broadcasting, Twitter, and many more.

Partners

We drastically increased the number of Basho partners in 2013. For a full list of partners, check out the Partnerships Page. Some key ones to note include Tokyo Electron Device, SoftLayer, and Seagate.

Community

Our amazing community team hosted over 200 meetups around the world this year. On top of that, they also attended dozens of industry events to spread the word about Basho. Keep an eye on the Events Page to see where we'll be in 2014.

2013 was a busy year but, with some exciting announcements coming, we look forward to an even busier 2014. Happy New Year!

Basho


Two podcast appearances in December [feedly]


Shared via feedly // published on Blog // visit site
Two podcast appearances in December

I had some fun being a guest on two different podcasts in December, talking about both Apache CloudStack and CumuLogic, and I figured I'd share the recordings here.

First up, Digital Nibbles, hosted by Reuven Cohen and Allyson Klein, was a two-guest show, with Duncan Johnston-Watt (from CloudSoft) and me splitting the time. Duncan was up first (and he's worth the listen), but you can jump to minute 25 to hear my interview. We talked primarily about CumuLogic, but also spent some time discussing CloudStack.

Next, Aaron Delp interviewed me for episode 125 of The Cloudcast. This was my second time on this particular podcast, and it was just as fun the second time around. Aaron really gets the cloud market, so it was great to discuss some of the ins-and-outs of both Apache CloudStack's community and CumuLogic's potential with him.

Thanks to both of these podcasts for inviting me to share a bit about my move to CumuLogic. Hopefully the brief introduction I gave was enough for people to understand why I made the move, and why I'm excited about CumuLogic's future!
